| Attribute | Value |
|---|---|
| Parameters | 111B |
| Context length | 256K |
| Modality | Text |
| Architecture | Dense |
| License | CC-BY-NC |
| Release date | 13 Mar 2025 |
| Training data cutoff | Jun 2024 |
| Attention structure | Grouped Query Attention |
| Hidden dimension | - |
| Layers | - |
| Attention heads | - |
| Key-value heads | - |
| Activation function | SwiGLU |
| Normalization | - |
| Positional embedding | RoPE (local layers); none (global layers) |
Cohere Command A is a large-scale generative model engineered for high-performance enterprise workflows, particularly those involving tool use, agents, and retrieval-augmented generation (RAG). Developed to provide a high-throughput alternative for production environments, the model maintains a significant parameter count of 111 billion while optimizing for deployment on common dual-GPU hardware configurations. Its design focuses on business-critical accuracy and speed, supporting a standard context window of 256,000 tokens to facilitate the processing of extensive corporate documentation and long-form conversational histories.
The model's architecture is a decoder-only transformer that utilizes several sophisticated structural innovations to balance local and global context. It features a hybrid attention mechanism where three-quarters of the layers employ sliding window attention for efficient local modeling, while every fourth layer uses a full global attention mechanism to maintain long-range dependencies. Technical specifications include the use of Grouped Query Attention (GQA) for optimized inference throughput, SwiGLU activation functions for improved gradient flow, and the omission of bias terms to stabilize the training process. Positional information is handled via Rotary Positional Embeddings (RoPE) in local attention layers, whereas global layers utilize a position-agnostic approach.
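The interleaving described above (three sliding-window layers followed by one full-attention layer) can be sketched as follows. The layer count and window size used here are illustrative assumptions, since the card lists these hyperparameters as unpublished:

```python
# Sketch of the hybrid attention layout: three sliding-window layers
# followed by one full global-attention layer. NUM_LAYERS and
# SLIDING_WINDOW are illustrative assumptions, not published values.
NUM_LAYERS = 64
SLIDING_WINDOW = 4096

def attention_kind(layer_idx: int) -> str:
    """Every fourth layer (indices 3, 7, 11, ...) uses full global attention."""
    return "global" if (layer_idx + 1) % 4 == 0 else "sliding_window"

def visible_tokens(layer_idx: int, query_pos: int) -> range:
    """Causal positions a query token at query_pos can attend to in this layer."""
    if attention_kind(layer_idx) == "global":
        return range(0, query_pos + 1)            # full causal context
    lo = max(0, query_pos + 1 - SLIDING_WINDOW)   # bounded local window
    return range(lo, query_pos + 1)

layout = [attention_kind(i) for i in range(NUM_LAYERS)]
assert layout.count("global") == NUM_LAYERS // 4  # one quarter of layers are global
```

The global layers keep long-range dependencies reachable at 256K-token contexts, while the sliding-window layers keep attention cost linear in sequence length for the bulk of the stack.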
Optimized for global enterprise deployment, Command A is trained across 23 languages, including major business languages such as English, French, Spanish, Chinese, and Arabic. The model is specifically aligned for conversational tool use, allowing it to interact with external APIs, databases, and search engines with high precision. This alignment, achieved through supervised fine-tuning and preference optimization, makes it particularly effective for multi-step agentic reasoning and financial data manipulation where extracting numerical details from complex contexts is required.
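The conversational tool use described above can be illustrated with a minimal tool definition in the common JSON-schema convention. The tool name, its parameters, and the call shape below are hypothetical examples, not part of Cohere's published specification:

```python
import json

# Hypothetical tool definition in the widely used JSON-schema tool convention;
# the name "lookup_invoice" and its fields are illustrative only.
lookup_invoice = {
    "type": "function",
    "function": {
        "name": "lookup_invoice",
        "description": "Fetch an invoice record by ID from the billing database.",
        "parameters": {
            "type": "object",
            "properties": {
                "invoice_id": {
                    "type": "string",
                    "description": "Invoice identifier, e.g. INV-1042.",
                },
            },
            "required": ["invoice_id"],
        },
    },
}

# A model aligned for tool use emits a structured call rather than free text;
# the application executes it and feeds the result back into the conversation.
model_tool_call = {
    "name": "lookup_invoice",
    "arguments": json.dumps({"invoice_id": "INV-1042"}),
}
args = json.loads(model_tool_call["arguments"])
assert args["invoice_id"] == "INV-1042"
```

This round-trip of structured call, application-side execution, and result injection is what the multi-step agentic alignment targets: the model must extract exact identifiers and numbers from context to populate the arguments correctly.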
Ranking: #106
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Summarization | ProLLM Summarization | 0.86 | 6 |
| Coding | Aider Coding | 0.12 | 9 |