Llama 3.2 3B: Specifications and GPU VRAM Requirements

Llama 3.2 3B

Open source

Open weights

Parameters

Context length

128K

Modality

Text

Architecture

Dense

License

Llama 3.2 Community License

Release date

25 Sept 2024

Training data cutoff

Dec 2023

Technical specifications

Attention structure

Grouped-Query Attention

Hidden dimension size

3072

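Number of layers

Attention heads

Key-value heads

Activation function

Normalization

Position embedding

RoPE

System requirements

VRAM requirements for different quantization methods and context sizes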

Llama 3.2 3B

Llama 3.2 3B is a compact, text-only generative language model developed by Meta and released in pre-trained and instruction-tuned variants. It is part of the Llama 3.2 family, which also includes a 1-billion-parameter text model and larger multimodal variants. The model is designed for efficient deployment in resource-constrained environments such as edge and mobile devices. Its main use is in assistant and agentic applications, covering tasks such as summarization, instruction following, rewriting, and knowledge retrieval. It officially supports eight languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

Llama 3.2 3B is an auto-regressive transformer. It uses Grouped-Query Attention (GQA), in which several query heads share each key-value head; this shrinks the key-value cache and improves inference throughput without a proportional increase in hardware demands. Training combined pruning with knowledge distillation from larger Llama models: output logits from Llama 3.1 8B and 70B served as token-level targets during pre-training, which helped recover performance lost to pruning. Post-training alignment of the instruction-tuned versions uses supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Quantized variants apply 4-bit groupwise quantization to transformer block weights and 8-bit per-token dynamic quantization to activations, which suits runtimes such as PyTorch's ExecuTorch framework.
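The 4-bit groupwise weight quantization mentioned above can be illustrated with a minimal sketch. It assumes a symmetric scheme with one floating-point scale per group and a group size of 32; neither detail is stated in the text, and the function names are hypothetical, so this should be read as one plausible instance of the technique rather than the scheme Meta ships.

```python
from typing import List, Tuple

GROUP_SIZE = 32  # assumed group size; not given in the text above


def quantize_groupwise_int4(
    weights: List[float], group_size: int = GROUP_SIZE
) -> Tuple[List[int], List[float]]:
    """Symmetric 4-bit quantization with one scale per group of weights."""
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to 7 so that every code
        # fits in the signed 4-bit range [-8, 7].
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(-8, min(7, q)))
    return codes, scales


def dequantize_groupwise(
    codes: List[int], scales: List[float], group_size: int = GROUP_SIZE
) -> List[float]:
    """Recover approximate weights by multiplying each code by its group scale."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]
```

Because each group carries its own scale, an outlier weight only coarsens the resolution of the 32 values around it rather than the whole tensor, which is the usual motivation for groupwise over per-tensor quantization.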

Llama 3.2 3B targets on-device use, balancing computational efficiency against output quality. Its context window of 128,000 tokens allows it to process long inputs for tasks such as document summarization and extended conversations. The full-precision models support this context length, whereas the quantized versions are typically configured for an 8,000-token context. The design prioritizes low-latency inference, which makes the model suitable for applications that need fast responses on limited hardware, such as mobile writing assistants and customer service tools. The pre-trained variants also serve as a base for further fine-tuning on natural language generation tasks.
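The gap between the 128K and 8K context settings is largely a memory question, since the key-value cache grows linearly with context length. The sketch below estimates that cache size. The layer count (28), key-value head count (8), query head count (24), and head dimension (128) are assumptions taken from outside this page, which leaves those fields blank; 16-bit cache values and reading "128K" as 131,072 tokens are assumptions as well.

```python
GIB = 1024 ** 3


def kv_cache_bytes(
    context_tokens: int,
    num_layers: int = 28,      # assumed; not listed on this page
    num_kv_heads: int = 8,     # assumed; not listed on this page
    head_dim: int = 128,       # assumed; not listed on this page
    bytes_per_value: int = 2,  # assumed 16-bit cache entries
) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # The leading 2 counts one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_tokens


if __name__ == "__main__":
    for tokens in (8_192, 131_072):
        gqa = kv_cache_bytes(tokens) / GIB
        # Hypothetical comparison: every one of the assumed 24 query heads
        # keeps its own key-value head (no grouping).
        mha = kv_cache_bytes(tokens, num_kv_heads=24) / GIB
        print(f"{tokens} tokens: GQA {gqa:.3f} GiB, ungrouped {mha:.3f} GiB")
```

Under these assumptions the cache takes about 0.875 GiB at 8,192 tokens and 14 GiB at 131,072 tokens, and sharing key-value heads cuts the cache to a third of the ungrouped figure.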

About Llama 3.2

Meta's Llama 3.2 family introduces vision models that pair image encoders with language models for multimodal text and image processing. It also includes lightweight text-only variants optimized for on-device deployment, with a 128K-token context length.

Other Llama 3.2 models

Llama 3.2 1B

Evaluation benchmarks

Rankings apply to local LLMs.

Overall rank

#56

Category	Benchmark	Score	Rank
Refactoring	Aider Refactoring	0.26	19
Coding	Aider Coding	0.26	23
Graduate-Level QA	GPQA	0.33	25
General Knowledge	MMLU	0.33	36

Coding rank

#47

GPU requirements

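Required VRAM depends on the quantization method selected for the model weights and on the context size, which ranges from 1,024 to roughly 125k tokens.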
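As a rough guide to how weight quantization affects VRAM, the memory taken by the weights alone is approximately the parameter count times the bits per weight. The sketch below assumes 3.21 billion parameters (a figure not given on this page) and ignores quantization scales, activations, the key-value cache, and framework overhead, so real requirements will be higher. The names used are hypothetical.

```python
ASSUMED_PARAMS = 3_210_000_000  # assumed parameter count; not stated on this page

# Bits stored per weight for a few common formats (illustrative labels).
BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}

GIB = 1024 ** 3


def weight_bytes(num_params: int, bits_per_weight: int) -> float:
    """Bytes needed to hold the weights alone at the given precision."""
    return num_params * bits_per_weight / 8


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        size_gib = weight_bytes(ASSUMED_PARAMS, bits) / GIB
        print(f"{name}: about {size_gib:.2f} GiB for weights only")
```

A fuller estimate would add the key-value cache for the chosen context size plus a margin for activations and runtime buffers.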

Resources

Official documentation, release notes, download weights, source code
