How to Install and Run QwQ-32B

By Ryan A. on Mar 6, 2025

Guest Author

QwQ-32B is the latest reasoning-focused large language model in the Qwen model series, with 32 billion parameters. Unlike conventional instruction-tuned models, QwQ-32B has been optimized to think and reason more effectively, making it a powerful tool for logical reasoning, coding, and mathematical problem-solving.

What makes QwQ-32B unique is its use of reinforcement learning (RL) to enhance reasoning capabilities, an approach similar to that of models like DeepSeek-R1, which integrates multi-stage training for advanced problem-solving. The Qwen team claims that QwQ-32B can rival DeepSeek-R1 despite being significantly smaller.

For more information, refer to the official QwQ-32B announcement from the Qwen team.

Performance Benchmarks

QwQ-32B outperforms its earlier preview release, QwQ-32B-Preview, across multiple benchmarks, as shown below:

Benchmark        QwQ-32B-Preview   QwQ-32B
AIME24           50                79.5
LiveCodeBench    50                63.4
LiveBench        40.25             73.1
IFEval           40.35             83.9
BFCL             17.59             66.4

Additionally, compared to other models such as the DeepSeek-R1-Distill series and OpenAI o1-mini, QwQ-32B holds up well despite its relatively smaller size; the official announcement includes a full benchmark comparison chart.

System Requirements

For users with high-end hardware, running QwQ-32B via Hugging Face Transformers provides full access to the model's capabilities. On lower-end consumer GPUs, you will need to run a 4-bit quantized version instead. As a rough estimate, 32 billion parameters in 16-bit precision amount to about 64 GB for the weights alone (before activations and the KV cache), while 4-bit quantization brings the weights down to roughly 16-20 GB, which fits on a single 24 GB card.

Full Model Requirements

Hardware     Recommended Configuration
Nvidia GPU   4x RTX 4090 (24 GB each)
Mac M-Chip   MacBook Pro (M3 Max, 128 GB RAM)

4-bit Quantized Version Requirements

Hardware     Recommended Configuration
Nvidia GPU   RTX 4090 (24 GB)
Mac M-Chip   MacBook Pro (M2, 32 GB RAM)

Run using Ollama (Easier Method)

For a simpler setup, you can use Ollama, which provides a 4-bit quantized version of QwQ-32B. This method involves far less configuration and is ideal for users without high-end GPUs.

Step 1: Install Ollama

Run the following command (the install script targets Linux; on macOS, you can instead download the installer from ollama.com):

curl -fsSL https://ollama.com/install.sh | sh

Step 2: Pull and Run QwQ-32B

ollama run qwq:32b
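
Ollama also exposes a local HTTP API (on port 11434 by default) once its server is running, so you can query QwQ-32B programmatically. Below is a minimal sketch using Python's requests library against Ollama's /api/generate endpoint; the prompt is just a placeholder, and it assumes the model has already been pulled and the Ollama server is running in the background.

# Minimal sketch: query the locally running Ollama server over its HTTP API.
# Assumes `ollama run qwq:32b` has already pulled the model and the Ollama
# server is listening on the default port 11434.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "qwq:32b",
        "prompt": "How many prime numbers are there between 1 and 20?",
        "stream": False,  # return the full answer in a single JSON object
    },
    timeout=600,  # reasoning models can take a while on long chains of thought
)
response.raise_for_status()
print(response.json()["response"])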

Run with Hugging Face Transformers

Step 1: Install Dependencies

Run the following command to install the necessary libraries. QwQ-32B uses the Qwen2 architecture, so make sure your transformers installation is reasonably up to date:

pip install torch transformers accelerate

Step 2: Load the Model and Tokenizer

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/QwQ-32B"

# Download the weights from the Hugging Face Hub, keep the checkpoint's native
# precision (torch_dtype="auto"), and shard the model across available devices.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
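
For the lower-end GPUs described in the 4-bit requirements table above, you can load a quantized version of the model instead of the full-precision load shown in this step. Below is a minimal sketch using Transformers' bitsandbytes integration; it assumes the bitsandbytes package is installed (pip install bitsandbytes) and a CUDA GPU is available, and the quantization settings shown are illustrative rather than an official recommendation.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "Qwen/QwQ-32B"

# Quantize the weights to 4-bit NF4 at load time so the model can fit
# on a single 24 GB consumer GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)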

Step 3: Running Inference

Here's a sample query to test the model:

prompt = "Hello world!"

messages = [{"role": "user", "content": prompt}]

# Wrap the prompt in the chat template the model was trained with.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Reasoning models emit long chains of thought, so allow a generous token budget.
generated_ids = model.generate(**model_inputs, max_new_tokens=32768)

# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
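
Because QwQ-32B can produce a long chain of thought before its final answer, waiting for the full generation to finish can feel unresponsive. One option, sketched below, is to stream tokens to the console as they are generated using Transformers' TextStreamer; it reuses the model, tokenizer, and model_inputs from the steps above.

from transformers import TextStreamer

# Print tokens to stdout as they are generated instead of waiting for the full output.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

model.generate(
    **model_inputs,
    max_new_tokens=32768,
    streamer=streamer,
)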

Conclusion

QwQ-32B demonstrates competitive performance against models with significantly more parameters, highlighting the potential of reinforcement learning in enhancing reasoning capabilities. By achieving results comparable to larger models like DeepSeek-R1 while maintaining a relatively smaller size, QwQ-32B represents a step forward in making high-performance reasoning models more accessible to a broader range of users.

© 2025 ApX Machine Learning. All rights reserved.
