Moondream is a vision language model that combines image understanding with natural language processing, so you can ask text-based questions about visual data.

Its features include captioning, object detection, visual querying, and experimental functionality such as gaze detection. Its compact size makes it practical on macOS and other platforms, even with limited computational resources.

Moondream is available in two variants:

Moondream 2B: The larger, full-feature model with 2 billion parameters, designed for general-purpose applications requiring robust performance.
Moondream 0.5B: A compact version with 500 million parameters, optimized for edge devices or scenarios with constrained hardware resources.

This guide walks you through setting up Moondream, installing the necessary dependencies, and running a sample script that exercises captioning, visual querying, object detection, and pointing.

Prerequisites

Before running Moondream, ensure your system meets the following requirements:

Python 3.8+ Installed

Use the latest Python version compatible with Moondream.

Required Libraries

Install the necessary Python packages and system dependencies.

Step 1: Install Required Dependencies

Run the following commands in your terminal to install the required libraries:

Install Python Dependencies

pip install transformers torch einops accelerate pyvips pillow

Install VIPS via Homebrew

VIPS is a high-performance image processing library required for some Moondream features.

brew install vips

Upgrade accelerate

Ensure the accelerate library is at version 0.26.0 or newer:

pip install 'accelerate>=0.26.0'
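Before moving on, it can help to confirm that the packages actually landed in the Python environment you plan to use. The following is a minimal sketch that relies only on the standard library; the helper name installed_versions is made up for this example, and pillow is listed on the assumption that it is needed for the PIL import in the script below.

```python
from importlib import metadata

# Distribution names as passed to pip in this guide.
# "pillow" is an assumption: it provides the PIL import used by the script.
REQUIRED = ["transformers", "torch", "einops", "accelerate", "pyvips", "pillow"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be installed with pip before continuing.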

Step 2: Load and Run the Moondream Model

Here is the full Python script for running Moondream:

from transformers import AutoModelForCausalLM
from PIL import Image

# Load the Moondream model with MPS (Metal Performance Shaders) support
model = AutoModelForCausalLM.from_pretrained(
    "vikhyatk/moondream2",
    revision="2025-01-09",
    trust_remote_code=True,
    device_map={"": "mps"}  # Enable GPU acceleration using MPS
)

# Load an image
image = Image.open("group_photo.jpg")

# Generate captions
print("Short caption:")
print(model.caption(image, length="short")["caption"])

print("\nNormal caption:")
for t in model.caption(image, length="normal", stream=True)["caption"]:
    # Streaming generation example
    print(t, end="", flush=True)
print("\n")

# Visual Querying
query = "How many people are in the image?"
print(f"\nVisual query: '{query}'")
print(model.query(image, query)["answer"])

# Object Detection
object_type = "face"
print(f"\nObject detection: '{object_type}'")
objects = model.detect(image, object_type)["objects"]
print(f"Found {len(objects)} {object_type}(s)")

# Pointing
point_type = "person"
print(f"\nPointing: '{point_type}'")
points = model.point(image, point_type)["points"]
print(f"Found {len(points)} {point_type}(s)")

Explanation of the Script

Model Initialization

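The script loads the Moondream model through Hugging Face's AutoModelForCausalLM. The revision argument pins a specific model release, and trust_remote_code=True lets transformers run the custom model code shipped in the model repository.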
The device_map={"": "mps"} parameter enables GPU acceleration using Metal Performance Shaders.

Image Captioning

Generate short or normal-length captions for the input image.
Streaming generation is supported for real-time caption output.
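If you want the streamed caption both printed live and kept as a string for later use, one option is to collect the chunks while printing them. This is a sketch only: print_stream is a hypothetical helper, and fake_caption_stream is a stand-in for the streamed caption so the example runs without loading the model.

```python
def print_stream(chunks):
    """Print text chunks as they arrive and return the full text."""
    parts = []
    for chunk in chunks:
        print(chunk, end="", flush=True)
        parts.append(chunk)
    print()  # finish the line once the stream ends
    return "".join(parts)


def fake_caption_stream():
    """Hypothetical stand-in for model.caption(..., stream=True)["caption"]."""
    yield from [" A family", " gathers", " in a park."]


caption = print_stream(fake_caption_stream())
```

With the real model, the same helper could be given model.caption(image, length="normal", stream=True)["caption"] in place of the stand-in generator.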

Visual Querying

Ask questions about the image, such as "How many people are in the image?"
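To ask several questions about the same image, the query call can be wrapped in a small loop. The sketch below assumes only what the script already uses, namely that query(image, question) returns a dictionary with an "answer" key. ask_all and StubModel are made-up names; the stub exists so the example runs without downloading the model.

```python
def ask_all(model, image, questions):
    """Ask each question about the image and return {question: answer}."""
    answers = {}
    for question in questions:
        # Strip whitespace because answers may start with a leading space.
        answers[question] = model.query(image, question)["answer"].strip()
    return answers


class StubModel:
    """Hypothetical stand-in so the sketch runs without model weights."""

    def query(self, image, question):
        return {"answer": f" (answer to: {question})"}


questions = ["How many people are in the image?", "Where was the photo taken?"]
answers = ask_all(StubModel(), image=None, questions=questions)
for question, answer in answers.items():
    print(f"{question} -> {answer}")
```

In the real script, pass the loaded model and the PIL image instead of the stub and None.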

Object Detection

Detect specific objects in the image, such as "face."

Pointing

Identify and locate specific elements, like "person," in the image.
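The script above only counts the detections and points. To draw or crop them, you need pixel coordinates. The sketch below assumes that detect returns boxes with normalized x_min, y_min, x_max, and y_max keys and that point returns normalized x and y keys, each in the range 0 to 1. That layout is an assumption about this model revision, so print one result and check it before relying on this. The helper names and sample values are made up.

```python
def boxes_to_pixels(objects, width, height):
    """Scale normalized boxes to (left, top, right, bottom) pixel tuples."""
    return [
        (
            round(obj["x_min"] * width),
            round(obj["y_min"] * height),
            round(obj["x_max"] * width),
            round(obj["y_max"] * height),
        )
        for obj in objects
    ]


def points_to_pixels(points, width, height):
    """Scale normalized points to (x, y) pixel tuples."""
    return [(round(p["x"] * width), round(p["y"] * height)) for p in points]


# Made-up sample values in the assumed normalized format.
sample_objects = [{"x_min": 0.25, "y_min": 0.25, "x_max": 0.5, "y_max": 0.75}]
sample_points = [{"x": 0.5, "y": 0.25}]

pixel_boxes = boxes_to_pixels(sample_objects, width=800, height=600)
pixel_points = points_to_pixels(sample_points, width=800, height=600)
print(pixel_boxes)   # [(200, 150, 400, 450)]
print(pixel_points)  # [(400, 150)]
```

With a real image, image.width and image.height from PIL supply the dimensions.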

Step 3: Running the Script

Save the above Python script as moondream.py.
Place the target image (group_photo.jpg) in the same directory as the script. Any photo works; this example uses an ordinary group photo of people.
Run the script:

python moondream.py

The output should look something like this:

Short caption:
 A family of nine, dressed in blue and black, gathers in a park, with the father in a blue shirt and the mother in a black dress, surrounded by lush greenery.

Normal caption:
 A family of nine, consisting of a man, a woman, and their children, is gathered together in a park setting. The man is positioned on the left, wearing a blue shirt, and the woman is on the right, wearing a blue shirt. The children are seated in the center, with one girl wearing a blue shirt and glasses, and another girl wearing a dark blue or navy shirt. The woman on the right is wearing a dark blue or navy shirt. The man on the left is wearing a blue shirt, and the woman on the right is wearing a dark blue or navy shirt. The man on the left is wearing a blue shirt, and the woman on the right is wearing a dark blue or navy shirt. The man on the right is wearing a light blue shirt, and the woman on the left is wearing a dark blue or navy shirt. The background features a mix of green and brown foliage, creating a natural and serene atmosphere.

Visual query: 'How many people are in the image?'
 There are nine people in the image.

Object detection: 'face'
Found 1 face(s)

Pointing: 'person'
Found 10 person(s)

Additional Notes

GPU Support

The device_map={"": "mps"} setting places the model on the Mac's GPU through Metal Performance Shaders (MPS), PyTorch's GPU backend on macOS.
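If the same script needs to run on machines without MPS, one approach is to choose the device at runtime and fall back to CUDA or the CPU. This is a sketch rather than part of the original script: pick_device and detect_device are hypothetical helpers, and the torch import is guarded so the example still runs where PyTorch is not installed.

```python
def pick_device(mps_available, cuda_available):
    """Prefer MPS, then CUDA, then CPU."""
    if mps_available:
        return "mps"
    if cuda_available:
        return "cuda"
    return "cpu"


def detect_device():
    """Query PyTorch for available backends; default to CPU without torch."""
    try:
        import torch
    except ImportError:
        return "cpu"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = bool(mps_backend and mps_backend.is_available())
    return pick_device(mps_ok, torch.cuda.is_available())


device = detect_device()
print(f"Using device: {device}")
# The result could then replace the hard-coded value: device_map={"": device}
```

Whether Moondream performs acceptably on the CPU fallback is not covered by this guide.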

Handling Large Models

Model checkpoints such as moondream2 are several gigabytes in size. Make sure you have enough free disk space for the download and enough RAM to load the weights.
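A rough lower bound on the memory needed for the weights is the parameter count multiplied by the bytes per parameter. The sketch below assumes 16-bit weights (2 bytes per parameter), which this guide does not confirm, and it ignores activations and other runtime overhead, so treat the numbers as estimates only. weight_size_gb is a made-up helper name.

```python
def weight_size_gb(num_params, bytes_per_param=2):
    """Approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts as stated for the two variants in this guide.
variants = [("Moondream 2B", 2_000_000_000), ("Moondream 0.5B", 500_000_000)]
for name, params in variants:
    print(f"{name}: ~{weight_size_gb(params):.1f} GB at 16-bit precision")
```

Under these assumptions the 2B variant needs about 4 GB for weights and the 0.5B variant about 1 GB.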

Streaming Output

Passing stream=True to caption returns the text incrementally, so the script can print the caption as it is generated instead of waiting for the full result.

Conclusion

Running Moondream is straightforward once the dependencies are in place. With captioning, visual querying, object detection, and pointing available from a single model, it is a practical tool for developers adding vision-language features to their projects.

Follow this guide to set up Moondream on your system and explore its full potential. For more information, refer to the official documentation.

How To Run MoonDream Vision Model (MacOS)