By Wei Ming T. on Jan 10, 2025
Moondream is an advanced vision language model that combines powerful image processing capabilities with natural language understanding, enabling a seamless interaction between visual data and text-based queries.
With features such as captioning, object detection, visual querying, and experimental functionalities like gaze detection, Moondream offers a comprehensive toolkit for developers working on AI-powered projects. Its compact design ensures accessibility, making it suitable for macOS and other platforms, even with limited computational resources.
Moondream is available in two variants:
This guide will walk you through setting up Moondream, installing necessary dependencies, and running a sample script to leverage its full capabilities.
Before running Moondream, ensure your system meets the following requirements:
Run the following commands in your terminal to install the required libraries:
pip install transformers torch einops accelerate pyvips pillow
VIPS (libvips) is a high-performance image processing library required for some Moondream features. On macOS, install it with Homebrew:
brew install vips
Ensure the accelerate library is at version 0.26.0 or newer:
pip install 'accelerate>=0.26.0'
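Before moving on, it can help to confirm that the modules the sample script relies on are actually importable. The following is a minimal sketch using only the standard library; the REQUIRED_MODULES list and the missing_modules helper are illustrative names, not part of Moondream, and the list assumes Pillow is installed under its import name PIL.

```python
import importlib.util

# Import names for the packages the sample script relies on.
# Note that Pillow is imported as "PIL" (assumed to be installed separately).
REQUIRED_MODULES = ["transformers", "torch", "einops", "accelerate", "pyvips", "PIL"]


def missing_modules(names):
    """Return the subset of `names` that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are installed.")
```

This only checks that each module can be located; it does not import them, so it will not catch a pyvips installation that cannot find the underlying VIPS library.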
Here is the full Python script for running Moondream:
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
# Load the Moondream model with MPS (Metal Performance Shaders) support
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "mps"}  # Enable GPU acceleration using MPS
)
# Load an image
image = Image.open("group_photo.jpg")
# Generate captions
print("Short caption:")
print(model.caption(image, length="short")["caption"])
print("\nNormal caption:")
for t in model.caption(image, length="normal", stream=True)["caption"]:
    # Streaming generation example
    print(t, end="", flush=True)
print("\n")
# Visual Querying
query = "How many people are in the image?"
print(f"\nVisual query: '{query}'")
print(model.query(image, query)["answer"])
# Object Detection
object_type = "face"
print(f"\nObject detection: '{object_type}'")
objects = model.detect(image, object_type)["objects"]
print(f"Found {len(objects)} {object_type}(s)")
# Pointing
point_type = "person"
print(f"\nPointing: '{point_type}'")
points = model.point(image, point_type)["points"]
print(f"Found {len(points)} {point_type}(s)")
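The script above hard-codes "mps" as the device, which suits Apple hardware with Metal support. On other machines, one option is to pick the device at runtime. The sketch below is an assumption-laden example rather than something from the Moondream documentation: pick_device and detect_device are hypothetical helper names, and the fallback order (mps, then cuda, then cpu) is just one reasonable choice.

```python
def pick_device(has_mps: bool, has_cuda: bool) -> str:
    """Choose a device name, preferring MPS, then CUDA, then CPU."""
    if has_mps:
        return "mps"
    if has_cuda:
        return "cuda"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU if torch is absent."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    has_mps = bool(mps_backend is not None and mps_backend.is_available())
    has_cuda = torch.cuda.is_available()
    return pick_device(has_mps, has_cuda)


if __name__ == "__main__":
    device = detect_device()
    print(f"Selected device: {device}")
    # The result could then be passed as device_map={"": device}
    # in place of the hard-coded "mps" value.
```

Keeping the decision logic in pick_device, separate from the PyTorch queries, makes it easy to check the fallback order without any GPU present.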
The device_map={"": "mps"} parameter enables GPU acceleration using Metal Performance Shaders.
Save the script as moondream.py.
Place an image (for example, group_photo.jpg) in the same directory as the script. Any photo works; this example uses an ordinary group photo of people.
Run the script:
python moondream.py
Then, it should output something like the following:
Short caption:
A family of nine, dressed in blue and black, gathers in a park, with the father in a blue shirt and the mother in a black dress, surrounded by lush greenery.
Normal caption:
A family of nine, consisting of a man, a woman, and their children, is gathered together in a park setting. The man is positioned on the left, wearing a blue shirt, and the woman is on the right, wearing a blue shirt. The children are seated in the center, with one girl wearing a blue shirt and glasses, and another girl wearing a dark blue or navy shirt. The woman on the right is wearing a dark blue or navy shirt. The man on the left is wearing a blue shirt, and the woman on the right is wearing a dark blue or navy shirt. The man on the left is wearing a blue shirt, and the woman on the right is wearing a dark blue or navy shirt. The man on the right is wearing a light blue shirt, and the woman on the left is wearing a dark blue or navy shirt. The background features a mix of green and brown foliage, creating a natural and serene atmosphere.
Visual query: 'How many people are in the image?'
There are nine people in the image.
Object detection: 'face'
Found 1 face(s)
Pointing: 'person'
Found 10 person(s)
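The sample script only prints how many objects and points were found. To draw or crop the results, the coordinates need to be mapped onto the image. The sketch below assumes each detected object is a dictionary with x_min, y_min, x_max, and y_max keys holding values normalized to the range 0 to 1, and each point a dictionary with x and y keys; verify this against the output of your model revision before relying on it. The helper names to_pixel_box and to_pixel_point are hypothetical.

```python
def to_pixel_box(obj, width, height):
    """Convert a normalized bounding box dict to integer pixel coordinates.

    Assumes `obj` has x_min, y_min, x_max, y_max in the range 0..1.
    Returns (left, top, right, bottom).
    """
    return (
        round(obj["x_min"] * width),
        round(obj["y_min"] * height),
        round(obj["x_max"] * width),
        round(obj["y_max"] * height),
    )


def to_pixel_point(point, width, height):
    """Convert a normalized point dict (x, y in 0..1) to integer pixel coordinates."""
    return (round(point["x"] * width), round(point["y"] * height))


if __name__ == "__main__":
    # Made-up values for illustration only, not real model output.
    sample_box = {"x_min": 0.25, "y_min": 0.5, "x_max": 0.75, "y_max": 1.0}
    sample_point = {"x": 0.5, "y": 0.25}
    print(to_pixel_box(sample_box, 400, 200))      # (100, 100, 300, 200)
    print(to_pixel_point(sample_point, 400, 200))  # (200, 50)
```

With a PIL image, the width and height are available as image.size, and the resulting tuple matches the (left, top, right, bottom) order that image.crop expects.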
The device_map={"": "mps"} setting enables Metal Performance Shaders for GPU acceleration.
Running Moondream is straightforward with the right setup. With its vision-language capabilities, Moondream is a useful tool for developers looking to integrate AI into their projects.
Follow this guide to set up Moondream on your system and explore its full potential. For more information, refer to the official documentation.
© 2025 ApX Machine Learning. All rights reserved.