A common method for text generation involves making a single API call, for example, using a generate function. This approach, while straightforward, presents a significant drawback for interactive applications: users must wait until the entire response is generated before seeing any output. For short responses, this might be acceptable, but for longer ones, the experience can feel slow and unresponsive. A much better user experience is to display the response as it's being generated, token by token. This is known as streaming.
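For reference, a blocking call might look like the following sketch. It assumes the toolkit exposes a generate function alongside generate_stream and that the returned object carries a content attribute; treat the exact names as illustrative rather than definitive:

from kerb.generation import generate, ModelName  # assumes a blocking generate() exists

# Blocking call: nothing is shown until the full response is ready
response = generate(
    "Explain the concept of async/await in Python in 2 sentences.",
    model=ModelName.GPT_4O_MINI,
)
print(response.content)  # attribute name assumed for illustration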
Streaming drastically reduces the perceived latency of your application. Instead of waiting several seconds for a full paragraph, the user sees the first words appear almost instantly and can begin reading while the rest of the response is generated. This makes the application feel faster and more dynamic, which is particularly important for chatbots, coding assistants, and other real-time tools.
The difference between a standard call and a streaming call can be visualized by their timelines. A standard call waits for all processing to finish before returning data, while a streaming call returns data in chunks throughout the processing period.
Timeline comparison between a standard API call and a streaming call. Streaming delivers the first piece of content much earlier, improving responsiveness.
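To see the difference concretely, you can measure the time to the first chunk against the total generation time. The sketch below uses the generate_stream function introduced in the next section; the exact numbers will vary by model, prompt, and network conditions:

import time
from kerb.generation import generate_stream, ModelName

start = time.time()
first_chunk_at = None
content = ""

for chunk in generate_stream("Summarize what streaming is.", model=ModelName.GPT_4O_MINI):
    if first_chunk_at is None:
        first_chunk_at = time.time()  # moment the first piece of content arrives
    content += chunk.content

end = time.time()
print(f"Time to first chunk: {first_chunk_at - start:.3f}s")
print(f"Total generation time: {end - start:.3f}s")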
generate_stream for Real-Time Responses

To handle streaming, the toolkit provides the generate_stream function. Instead of returning a single GenerationResponse object after the full generation is complete, generate_stream returns a Python generator. You can iterate over this generator to receive chunks of the response as they become available.
Here is a basic example of how to use it:
from kerb.generation import generate_stream, ModelName

prompt = "Explain the concept of async/await in Python in 2 sentences."

print("Streaming Response:")
full_content = ""

for chunk in generate_stream(prompt, model=ModelName.GPT_4O_MINI):
    # Print each piece of content to the console immediately
    print(chunk.content, end="", flush=True)
    full_content += chunk.content

print("\n\n--- Generation Complete ---")
print(f"Final assembled content: {full_content}")
In this code, the for loop processes each StreamChunk object as it arrives from the API. We print the content of each chunk immediately, using end="" to prevent newlines and flush=True to ensure the output is displayed right away. At the same time, we concatenate the content into the full_content variable to have the complete response available at the end.
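This pattern is common enough that you may want to wrap it in a small helper. The function below is a sketch that prints chunks as they arrive and returns the assembled text; it relies only on generate_stream and the chunk.content attribute shown above:

from kerb.generation import generate_stream, ModelName

def stream_to_console(prompt_or_messages, model=ModelName.GPT_4O_MINI):
    """Print a streamed response as it arrives and return the full text."""
    parts = []
    for chunk in generate_stream(prompt_or_messages, model=model):
        print(chunk.content, end="", flush=True)
        parts.append(chunk.content)
    print()  # final newline
    return "".join(parts)

answer = stream_to_console("Explain list comprehensions in one sentence.")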
The generate_stream function works just as well with conversational history. You can pass a list of Message objects, and the model's response will be streamed back. This pattern is ideal for building chatbots that feel interactive and engaging.
from kerb.generation import generate_stream, ModelName
from kerb.core import Message
from kerb.core.types import MessageRole

messages = [
    Message(role=MessageRole.SYSTEM, content="You are a concise Python tutor."),
    Message(role=MessageRole.USER, content="What is a decorator?"),
]

print("Assistant: ", end="", flush=True)

# The response is streamed token by token
for chunk in generate_stream(messages, model=ModelName.GPT_4O_MINI):
    print(chunk.content, end="", flush=True)

print()  # Final newline
This approach provides an immediate visual feedback loop for the user, as the assistant's response appears to be "typed out" in real time.
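Extending this into a simple multi-turn loop only requires appending each streamed reply back into the history. The sketch below reuses the Message and MessageRole types from the example above; it assumes a MessageRole.ASSISTANT value exists for the model's replies:

from kerb.generation import generate_stream, ModelName
from kerb.core import Message
from kerb.core.types import MessageRole

messages = [Message(role=MessageRole.SYSTEM, content="You are a concise Python tutor.")]

for user_input in ["What is a decorator?", "Show a one-line example."]:
    messages.append(Message(role=MessageRole.USER, content=user_input))
    print(f"User: {user_input}")
    print("Assistant: ", end="", flush=True)

    reply = ""
    for chunk in generate_stream(messages, model=ModelName.GPT_4O_MINI):
        print(chunk.content, end="", flush=True)
        reply += chunk.content
    print()

    # Keep the assistant's reply in the history for the next turn
    # (MessageRole.ASSISTANT is assumed here)
    messages.append(Message(role=MessageRole.ASSISTANT, content=reply))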
Sometimes, you may want to do more than just print the content of each chunk. For example, you might want to log each chunk for analytics, check for specific keywords as they arrive, or update a user interface. The generate_stream function accepts an optional callback argument for these scenarios.
The callback is a function that will be executed for every chunk received from the stream.
import time
from kerb.generation import generate_stream, ModelName

chunks_received = []

def process_chunk(chunk):
    """A callback function to process each chunk."""
    chunks_received.append({
        "content": chunk.content,
        "timestamp": time.time(),
        "finish_reason": chunk.finish_reason
    })
    # You could also log to a file, update a UI, etc.

prompt = "List 5 Python design patterns."

print("Streaming with a callback...")
full_response = ""

for chunk in generate_stream(
    prompt,
    model=ModelName.GPT_4O_MINI,
    callback=process_chunk
):
    print(chunk.content, end="", flush=True)
    full_response += chunk.content

print("\n\n--- Analysis from Callback ---")
print(f"Total chunks received: {len(chunks_received)}")

if len(chunks_received) > 1:
    time_span = chunks_received[-1]["timestamp"] - chunks_received[0]["timestamp"]
    print(f"Total streaming duration: {time_span:.3f}s")
    print(f"Average time between chunks: {time_span / len(chunks_received):.4f}s")
Using a callback is a clean way to separate the logic of handling each chunk from the main flow of your application. It helps keep your code organized, especially as the actions you perform on each chunk become more complex.
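For example, a callback that watches for a specific keyword can live entirely outside the display loop. The sketch below flags the first chunk that mentions "singleton"; the callback signature matches the one used above, and the keyword check is purely illustrative:

from kerb.generation import generate_stream, ModelName

keyword_seen = {"singleton": False}

def watch_for_keyword(chunk):
    """Flag the keyword the first time it appears in the stream."""
    if not keyword_seen["singleton"] and "singleton" in chunk.content.lower():
        keyword_seen["singleton"] = True

for chunk in generate_stream(
    "List 5 Python design patterns.",
    model=ModelName.GPT_4O_MINI,
    callback=watch_for_keyword,
):
    print(chunk.content, end="", flush=True)

print(f"\nMentioned 'singleton': {keyword_seen['singleton']}")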
Streaming is highly recommended for any application involving direct user interaction; the improvement in perceived performance is significant. Common use cases include chatbots, coding assistants, and other real-time tools where users read the output as it is generated.
By mastering streaming, you can build LLM applications that are not only powerful but also feel fast, responsive, and intuitive to your users.