By Wei Ming T. on Dec 9, 2024
Llama 3.3 is a 70B-parameter model tuned for multilingual dialogue that rivals much larger models such as Llama 3.1 405B on many benchmarks. It is compact relative to those models, but it still needs a capable GPU workstation for good performance. This guide covers everything you need to set up and run Llama 3.3 on Ubuntu Linux with Ollama.
| Variant | VRAM | Recommended Hardware | Best Use Case |
|---|---|---|---|
| latest | 43 GB | NVIDIA A5000/A6000 | Production environments requiring latest features |
| 70b | 43 GB | NVIDIA A5000/A6000 | General purpose usage |
| 70b-instruct-fp16 | 141 GB | Multi-GPU with NVLink | Research requiring maximum precision |
| 70b-instruct-q2_K | 26 GB | NVIDIA RTX 3090/4090 | Home users, basic inference |
| 70b-instruct-q3_K_M | 34 GB | NVIDIA A5000/A6000 | Production deployments |
| 70b-instruct-q3_K_S | 31 GB | NVIDIA RTX 4090/A5000 | Balanced performance/quality |
| 70b-instruct-q4_0 | 40 GB | NVIDIA A6000 | Higher quality inference |
| 70b-instruct-q4_1 | 44 GB | NVIDIA A6000 | High-quality inference |
| 70b-instruct-q4_K_M | 43 GB | NVIDIA A6000 | Production quality inference |
| 70b-instruct-q4_K_S | 40 GB | NVIDIA A6000 | Balanced inference speed/quality |
| 70b-instruct-q5_0 | 49 GB | NVIDIA A6000 | Near-FP16 quality |
| 70b-instruct-q5_1 | 53 GB | NVIDIA A100 | High-precision inference |
| 70b-instruct-q5_K_M | 50 GB | NVIDIA A6000/A100 | Production quality inference |
| 70b-instruct-q6_K | 58 GB | NVIDIA A100 | High-precision inference |
| 70b-instruct-q8_0 | 75 GB | Multiple A100s | Maximum quality inference |
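Before choosing a variant, confirm how much VRAM your GPU actually exposes. On NVIDIA systems this is a one-liner (assuming the NVIDIA driver and its nvidia-smi utility are installed):
# Report each GPU's model name and total memory; compare against the VRAM column above
nvidia-smi --query-gpu=name,memory.total --format=csv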
Run the following command to install Ollama:
curl -fsSL https://ollama.com/install.sh | sh
Verify the installation:
ollama --version
For manual installation, download and extract the standalone package:
curl -L https://ollama.com/download/ollama-linux-amd64.tgz -o ollama-linux-amd64.tgz
sudo tar -C /usr -xzf ollama-linux-amd64.tgz
Start the server:
ollama serve
Then, in a separate terminal, verify that Ollama is running:
ollama -v
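With a manual install, the convenience script's systemd service is not created for you. Below is a minimal unit file sketch adapted from the Ollama Linux install docs; it assumes the binary sits at /usr/bin/ollama and that a dedicated ollama system user exists (substitute your own user if you prefer). Save it as /etc/systemd/system/ollama.service:
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
# Run the API server; adjust the path if you extracted Ollama elsewhere
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
Then reload systemd and enable the service:
sudo systemctl daemon-reload
sudo systemctl enable --now ollama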
For systems with AMD GPUs, also download and extract the additional ROCm package:
curl -L https://ollama.com/download/ollama-linux-amd64-rocm.tgz -o ollama-linux-amd64-rocm.tgz
sudo tar -C /usr -xzf ollama-linux-amd64-rocm.tgz
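After extracting the ROCm build, you can sanity-check that the ROCm stack sees your GPU. This assumes the ROCm drivers are already installed; rocm-smi ships with them:
# List AMD GPUs visible to ROCm along with utilization and VRAM usage
rocm-smi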
For ARM64 architectures, use this package:
curl -L https://ollama.com/download/ollama-linux-arm64.tgz -o ollama-linux-arm64.tgz
sudo tar -C /usr -xzf ollama-linux-arm64.tgz
Download the Llama 3.3 model using Ollama's pull command:
ollama pull llama3.3
To download a specific variant from the table above, append its tag after a colon:
ollama pull llama3.3:70b-instruct-q3_K_M
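After a pull completes, it is worth checking what is stored locally and how much disk space each variant takes:
# List downloaded models with their sizes and tags
ollama list
# Inspect a model's parameters, quantization, and context length
ollama show llama3.3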
Start using the model interactively:
ollama run llama3.3
Example interaction:
User: What is the capital of Japan?
Assistant: The capital of Japan is Tokyo.
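You can also pass a prompt directly on the command line for one-off, non-interactive use, which is convenient in shell scripts:
# Run a single prompt and print the model's reply to stdout
ollama run llama3.3 "What is the capital of Japan?"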
To run Ollama as a server without the desktop application:
ollama serve
For development builds:
./ollama serve
Then, in a separate shell, run a model:
./ollama run llama3.3
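To confirm the server is up and listening on its default port (11434), hit the version endpoint:
# Returns a small JSON object containing the running Ollama version
curl http://localhost:11434/api/version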
Ollama provides a REST API for running and managing models. The examples below send a text-generation request and a chat request to the local server:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Why is the sky blue?"
}'
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.3",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ]
}'
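Both endpoints stream newline-delimited JSON by default. For scripting it is often easier to request a single, complete response; the request body also accepts an options object for sampling parameters (the values below are only illustrative):
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.3",
  "prompt": "Why is the sky blue?",
  "stream": false,
  "options": { "temperature": 0.7, "num_ctx": 4096 }
}'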
For a complete list of API endpoints and options, refer to the Ollama API documentation.
Installation issues: make sure curl is installed before running the install script:
sudo apt update && sudo apt install curl -y
Performance lags: switch to a smaller quantized variant (e.g., 70b-instruct-q3_K_M) or make sure inference is actually running on the GPU.
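A quick way to check where a loaded model is running is ollama ps; the PROCESSOR column should show 100% GPU rather than a CPU/GPU split:
# Show loaded models, their memory footprint, and CPU/GPU placement
ollama ps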
Service logs: when Ollama runs as a systemd service, inspect its logs with:
journalctl -e -u ollama
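If the service misbehaves after a configuration change, restarting it often resolves the issue:
# Restart the systemd-managed Ollama server
sudo systemctl restart ollama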
With these detailed instructions, you can confidently install and customize Llama 3.3 for your hardware and use case. Ollama simplifies deployment, allowing you to focus on leveraging Llama 3.3's powerful capabilities for multilingual dialogue and beyond.