Say Goodbye to Delays: Expedite Your Experience with Stream Query

Introducing Stream Query

Experiencing delays while using Large Language Models (LLMs) such as GPT-4 can be frustrating. Those 5-10 seconds of waiting might not seem like much, but they add up, testing your patience and that of your users. While we can’t reduce the actual response time of generative LLMs, we’ve improved how quickly responses appear to arrive, significantly improving user satisfaction by addressing one of the most notable pain points: the wait for a complete response.

We’re thrilled to unveil Stream Query, a new API endpoint that returns the search results first and then streams the generative summary in small chunks. As soon as the first chunk of the summary is ready, it’s sent your way, allowing you to start digesting the information immediately. This approach virtually eliminates perceived latency, offering a smooth and continuous interaction that keeps pace with the speed of thought.

The conventional method of waiting for an entire response introduces unnecessary delays. Stream Query changes the game by delivering the response in digestible segments as they’re generated. This is especially advantageous when dealing with LLMs like GPT-4, where response times can stretch to between 5 and 10 seconds. By streaming content in real time, we drastically cut down perceived waiting times, enhancing overall responsiveness and enriching the user experience.

We provide much more than just an API endpoint; we’ve also developed complementary concatenation tools designed to seamlessly integrate Stream Query results into a fluid user experience. At Vectara, we prioritize simplicity and efficiency, ensuring that our developers always come first. Our tools are crafted to streamline your workflow, allowing you to effortlessly merge streaming data for optimal performance and user satisfaction.

Getting Started with Stream Query

API: Stream Query is a separate API endpoint that takes the same request parameters as the Standard Query API. The /v1/stream-query endpoint enables streaming for REST; use it instead of the standard /v1/query endpoint. For gRPC, view the protobuf definition at serving.proto.

When you send a request, the Stream Query API returns the response in parts, or chunks, as they become available, rather than waiting for the entire response to be generated. Each part of the streamed response has a future_id field, which is a unique identifier for that particular chunk. The API keeps streaming additional parts while it processes the request. Once the entire summary has been streamed, you get a final response with the done field set to true. If you enabled the Factual Consistency Score, its value appears after this field.

Example Stream Query Request

In this example, we issue a query to the /v1/stream-query endpoint.
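As a rough sketch of what such a request could look like in Python, the snippet below builds (but does not send) a POST request to /v1/stream-query. Only the endpoint path comes from this post. The host name, the header names (customer-id, x-api-key), and the body fields (query, numResults, corpusKey, summary) are assumptions based on the Standard Query API and should be checked against the API documentation. The helper name build_stream_query_request is hypothetical.

```python
import json
import urllib.request

# Assumption: the host is illustrative; only the /v1/stream-query path is from the post.
API_URL = "https://api.vectara.io/v1/stream-query"


def build_stream_query_request(customer_id, corpus_id, api_key, query_text):
    """Build a POST request for the streaming endpoint without sending it.

    The body layout and header names are assumed to mirror the Standard
    Query API; verify them against the API documentation before use.
    """
    body = {
        "query": [
            {
                "query": query_text,
                "numResults": 10,
                "corpusKey": [
                    {"customerId": customer_id, "corpusId": corpus_id}
                ],
                # Requesting a summary is what makes streaming worthwhile.
                "summary": [{"maxSummarizedResults": 5, "responseLang": "en"}],
            }
        ]
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "customer-id": str(customer_id),  # assumed header name
            "x-api-key": api_key,             # assumed header name
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; the response could then be read incrementally rather than all at once.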

Example Stream Query Response

You first receive the search results, streamed without a summary.

After the search results finish streaming, the summary section starts. The first chunk contains the first half of the summary, and “done: false” indicates that there’s more content in subsequent chunks.

At this point, the consumer should store the chunk and optionally show it to the user, while it waits for subsequent chunks. The second and final chunk might look like this:

The consumer should append this second chunk’s text to the stored text of the first chunk. Because “done” is set to “true”, the consumer can consider the query completely resolved at this point.

Once streaming is done, you need to concatenate the chunks to form the complete summary. Alternatively, you can use the open source tools listed below, which include the concatenation functionality and make your life easier.
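If you handle concatenation yourself, the logic can be sketched as follows. This is a minimal illustration, not the implementation used by our tools. It assumes each summary chunk has already been parsed into a dictionary with a "text" key and a "done" key. Those key names are assumptions based on the description above, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def accumulate_summary(chunks: Iterable[Dict]) -> Iterator[str]:
    """Yield the summary-so-far after each chunk, stopping once done is true."""
    text = ""
    for chunk in chunks:
        # Append this chunk's text to what has been stored so far.
        text += chunk.get("text", "")
        yield text
        if chunk.get("done", False):
            break  # The final chunk has arrived; ignore anything after it.


def full_summary(chunks: Iterable[Dict]) -> str:
    """Return the complete summary once the stream reports done."""
    summary = ""
    for summary in accumulate_summary(chunks):
        pass  # A UI could re-render the partial summary here.
    return summary


if __name__ == "__main__":
    example_chunks = [
        {"text": "Streaming returns ", "done": False},
        {"text": "partial results early.", "done": True},
    ]
    print(full_summary(example_chunks))
    # prints "Streaming returns partial results early."
```

Yielding the running text after every chunk lets an interface show the partial summary right away, while full_summary gives callers that only need the final text a single value.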

You can find more details on our API documentation.

Console: Stream Query is the default setting in the console, so you can see its impact immediately. Select your corpus on the left-hand side of the console, then navigate to the query tab. The console not only demonstrates Stream Query’s efficiency but also provides code snippets for API requests, ensuring a smooth integration into your workflow.

Open source tools:

Stream-Query-Client: In a few lines of code, you can now add streaming to your JS applications. Use this to easily make requests to our streaming API and set up how you’d like to process responses.

React-Chatbot: Our chatbot now defaults to streaming responses. The useChat hook has been updated as well, allowing you to quickly implement custom chat interfaces with streaming enabled.

Create-UI: The UIs you generate will now come with streaming by default. For your convenience, toggling it on and off is as easy as modifying a single variable, so you can continue making generated UIs truly your own.

By introducing Stream Query, we’re not just optimizing technology; we’re redefining user experience. Say goodbye to delays and welcome a world where waiting is a thing of the past!

As always, we’d love to hear your feedback! Connect with us on our forums or on our Discord.

Sign up for a free account to see how Vectara can help you easily leverage retrieval-augmented generation in your GenAI apps.