While many applications process and generate text, modern LLMs are increasingly multimodal, capable of understanding inputs beyond language, such as images. Integrating vision capabilities allows you to build applications that can "see" and interpret visual contexts, opening up a new class of powerful use cases.
This section introduces how to process and analyze images using vision-language models (VLMs). You will learn how to prepare images for API calls, send them to different models through a unified interface, and guide the model's analysis with targeted prompts to perform tasks like object recognition, text extraction, and structured data generation.
Vision-language models do not process image files directly. Instead, they require the image data to be encoded into a format that can be transmitted via an API, typically Base64. This process converts the binary data of an image into a string representation. The toolkit handles this encoding automatically, but understanding the process is useful.
The image_to_base64 function provides a straightforward way to perform this conversion.
from kerb.multimodal import image_to_base64
# Assume you have an image file named 'product-diagram.jpg'
image_path = "product-diagram.jpg"
# Encode the image to a Base64 string
encoded_image = image_to_base64(image_path)
print(f"Base64 encoded string (first 60 chars): {encoded_image[:60]}...")
This encoded string, along with your text prompt, forms the multimodal input that gets sent to the vision model.
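For illustration, the snippet below shows roughly what an OpenAI-style multimodal message looks like once the image is embedded as a Base64 data URL. This is a sketch of the provider format, not part of the toolkit's API; the functions introduced next construct the equivalent request for you.
# Rough shape of an OpenAI-style multimodal message (illustrative only).
# The toolkit builds the provider-specific equivalent automatically.
multimodal_message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe the object shown in this image."},
        {
            "type": "image_url",
            # The Base64 string is embedded as a data URL
            "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"},
        },
    ],
}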
Different providers, such as OpenAI, Anthropic, and Google, have their own specific API formats for handling multimodal inputs. The toolkit simplifies this with a single function, analyze_image_with_vision_model, which abstracts away the provider-specific details. This allows you to write code once and switch between different vision models with minimal changes.
The typical workflow is straightforward: you supply an image path and a text prompt, the function encodes the image to Base64, formats the request according to the selected model's requirements, and returns the model's analysis.
Let's start with a basic task: asking a model to describe an image. We provide the path to an image and a text prompt guiding the analysis.
from kerb.multimodal import analyze_image_with_vision_model, VisionModel
# Assume 'product-diagram.jpg' is an image of a product schematic
image_path = "product-diagram.jpg"
prompt_text = "Describe the object shown in this image."
# Analyze the image using OpenAI's GPT-4o model
analysis_result = analyze_image_with_vision_model(
image_path=image_path,
prompt=prompt_text,
model=VisionModel.GPT4O
)
print(analysis_result.description)
The model parameter, using the VisionModel enum, determines which provider and model to use. You could switch to Anthropic's Claude 3.5 Sonnet by simply changing model=VisionModel.CLAUDE_3_5_SONNET.
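For example, you can run the same prompt against more than one model and compare the outputs, which is a quick way to evaluate which provider handles your images best. The sketch below assumes both enum members shown are available in your installation.
# Run the same prompt against two different vision models and compare the results
for vision_model in [VisionModel.GPT4O, VisionModel.CLAUDE_3_5_SONNET]:
    result = analyze_image_with_vision_model(
        image_path=image_path,
        prompt=prompt_text,
        model=vision_model,
    )
    print(f"--- {vision_model} ---")
    print(result.description)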
The quality of the analysis depends heavily on the text prompt you provide. By being specific, you can guide the model to perform a variety of sophisticated tasks.
Vision models are effective at Optical Character Recognition (OCR), allowing you to extract text from images like scanned documents, screenshots, or photographs.
# Assume 'product-specs.jpg' contains text with product specifications
ocr_prompt = "Extract all text from this image. Transcribe it exactly as it appears."
ocr_result = analyze_image_with_vision_model(
image_path="product-specs.jpg",
prompt=ocr_prompt,
model=VisionModel.GPT4O
)
print("Extracted Text:\n")
print(ocr_result.description)
This is particularly useful for building RAG systems that need to index text from a combination of digital and scanned documents.
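As a sketch of that idea, the loop below runs the same OCR prompt over a folder of scanned pages and collects the extracted text for later indexing. The directory name and the downstream indexing step are placeholders.
from pathlib import Path

from kerb.multimodal import analyze_image_with_vision_model, VisionModel

extracted_pages = []

# Run the OCR prompt over every scanned page in a (hypothetical) folder
for page_path in sorted(Path("scanned_docs").glob("*.jpg")):
    result = analyze_image_with_vision_model(
        image_path=str(page_path),
        prompt="Extract all text from this image. Transcribe it exactly as it appears.",
        model=VisionModel.GPT4O,
    )
    # Keep the source file alongside the text so it can be cited later
    extracted_pages.append({"source": page_path.name, "text": result.description})

# extracted_pages can now be chunked and added to your document index
print(f"Extracted text from {len(extracted_pages)} pages")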
One of the most powerful applications of VLMs is extracting structured data from images. By instructing the model to return its findings in a specific format like JSON, you can create a reliable data processing pipeline.
Imagine you have an image of a business card. You can ask the model to extract the contact information into a JSON object.
# Assume 'business-card.jpg' is an image of a business card
structured_prompt = """
Analyze the business card in the image and extract the following information.
Return the output as a valid JSON object with these keys:
- name
- title
- company
- phone
- email
- website
"""
structured_result = analyze_image_with_vision_model(
image_path="business-card.jpg",
prompt=structured_prompt,
model=VisionModel.GPT4O
)
# Parse the model's output (assumes it returned raw JSON; see the more defensive parser below)
import json
contact_info = json.loads(structured_result.description)
print(json.dumps(contact_info, indent=2))
This technique can be applied to invoices, receipts, forms, or any image containing structured information.
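In practice, some models wrap JSON output in Markdown code fences or add a short preamble, so a small defensive parser is safer than calling json.loads directly. The helper below is a sketch, not part of the toolkit.
import json
import re


def parse_json_response(text: str) -> dict:
    """Parse JSON from a model response, tolerating Markdown code fences."""
    # Strip ```json ... ``` fences if the model added them
    match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
    if match:
        text = match.group(1)
    return json.loads(text)


contact_info = parse_json_response(structured_result.description)
print(json.dumps(contact_info, indent=2))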
You can also provide multiple images in a single call to perform comparative analysis. The function accepts a list of image paths for the image_path parameter. This is useful for tasks like comparing product versions, identifying changes between two diagrams, or finding the odd one out in a set of images.
# Assume you have two images of a chart from different quarters
image_paths = ["sales-q1.jpg", "sales-q2.jpg"]
comparison_prompt = "These two images show sales charts for Q1 and Q2. " \
"Compare them and describe the main trend you observe."
comparison_result = analyze_image_with_vision_model(
image_path=image_paths,
prompt=comparison_prompt,
model=VisionModel.GPT4O
)
print(comparison_result.description)
By providing both images in the same context, the model can directly reference and compare them to provide a more insightful analysis than if it had processed them separately.