趋近智
参数
72B
上下文长度
32.768K
模态
Text
架构
Dense
许可证
Tongyi Qianwen LICENSE AGREEMENT
发布日期
7 Jun 2024
知识截止
-
注意力结构
Grouped-Query Attention
隐藏维度大小
8192
层数
80
注意力头
128
键值头
8
激活函数
SwigLU
归一化
RMS Normalization
位置嵌入
ROPE
不同量化方法和上下文大小的显存要求
Qwen2-72B is a significant iteration within the Qwen2 large language model series, developed by Alibaba. This model is engineered to handle a diverse array of natural language processing tasks, encompassing both comprehension and generation, alongside proficiency in coding and mathematical problem-solving. It functions as a foundational model, intended for further specialized fine-tuning to address particular application domains.
The architectural foundation of Qwen2-72B is the Transformer, augmented with several advancements to enhance computational efficiency and model performance. Key innovations include the adoption of the SwiGLU activation function and the implementation of Group Query Attention (GQA), which optimizes the attention mechanism for reduced memory footprint and accelerated inference. Furthermore, the model incorporates an enhanced tokenizer, designed to process a wide spectrum of natural languages and programming code effectively. Notably, Qwen2-72B maintains a dense model architecture, distinguishing it from Mixture-of-Experts (MoE) configurations found in other variants within the broader Qwen2 family.
From a functional perspective, Qwen2-72B demonstrates capabilities across multiple critical areas. It is designed to excel in tasks requiring sophisticated natural language understanding, robust language generation, and adeptness in coding and mathematical reasoning. While positioned as a base model, it provides a strong pre-trained foundation suitable for post-training methodologies such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). This design philosophy supports its application in scenarios demanding extensive multilingual understanding, complex code manipulation, or advanced mathematical computation.
The Alibaba Qwen2 model family comprises large language models built upon the Transformer architecture. It includes both dense and Mixture-of-Experts (MoE) variants, designed for diverse language tasks. Technical features include Grouped Query Attention and support for extended context lengths up to 131,072 tokens, optimizing memory footprint for inference.
排名适用于本地LLM。
排名
#30
基准 | 分数 | 排名 |
---|---|---|
Refactoring Aider Refactoring | 0.56 | 9 |
Coding Aider Coding | 0.56 | 12 |
Professional Knowledge MMLU Pro | 0.64 | 17 |
Graduate-Level QA GPQA | 0.42 | 21 |
General Knowledge MMLU | 0.42 | 28 |