Specification | Value |
---|---|
Parameters | 111B |
Context Length | 256K |
Modality | Text |
Architecture | Dense |
License | CC-BY-NC |
Release Date | 13 Mar 2025 |
Knowledge Cutoff | - |
Attention Structure | Grouped-Query Attention |
Hidden Dimension Size | - |
Number of Layers | - |
Attention Heads | - |
Key-Value Heads | - |
Activation Function | SwiGLU |
Normalization | - |
Position Embedding | Rotary Position Embedding (RoPE) |
Cohere Command A is a large language model engineered for enterprise applications that demand high performance, security, and computational efficiency. The model is designed to excel in business-critical tasks such as tool use, retrieval-augmented generation (RAG), agentic workflows, and multilingual use cases. It is notably efficient to serve, with Cohere stating it can run on as few as two GPUs, which lowers the hardware cost of private deployments. Command A is trained to perform effectively across 23 languages, making it applicable in diverse global business environments.
The architectural foundation of Command A is an optimized decoder-only transformer. It uses an interleaved attention pattern: three consecutive layers of sliding-window attention with Rotary Positional Embeddings (RoPE) provide efficient local context modeling, while every fourth layer applies global attention without positional embeddings, allowing unrestricted token interactions across the full sequence. Further architectural choices include grouped-query attention to raise throughput, input and output embeddings shared to conserve memory, the omission of bias terms to stabilize training, and SwiGLU activation functions.
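As a rough illustration of this interleaved pattern, the sketch below builds a per-layer attention schedule in which every fourth layer is a global-attention layer without positional embeddings and the remaining layers use sliding-window attention with RoPE. The layer count and window size are illustrative assumptions, not Cohere's published configuration.

```python
# Minimal sketch of the interleaved attention pattern described above.
# The 3:1 repeating block, the window size, and the layer count are
# illustrative assumptions, not Command A's actual hyperparameters.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayerSpec:
    layer_idx: int
    attention: str            # "sliding_window" or "global"
    positional_encoding: str  # "rope" or "none"
    window_size: Optional[int]  # only meaningful for sliding-window layers


def build_layer_pattern(num_layers: int = 64, window_size: int = 4096) -> List[LayerSpec]:
    """Every fourth layer uses global attention with no positional embedding;
    the other three layers in each block use sliding-window attention with RoPE."""
    pattern = []
    for i in range(num_layers):
        if (i + 1) % 4 == 0:
            pattern.append(LayerSpec(i, "global", "none", None))
        else:
            pattern.append(LayerSpec(i, "sliding_window", "rope", window_size))
    return pattern


if __name__ == "__main__":
    # Print the schedule for a small 8-layer example.
    for spec in build_layer_pattern(num_layers=8):
        print(spec)
```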
Command A is optimized for throughput and long-context reasoning. It supports a context length of 256,000 tokens, enabling it to process extensive documents in enterprise workflows. The model is also designed for conversational interaction, producing responses in a conversational style and optionally formatting them with Markdown for clarity. It is particularly adept at extracting and manipulating numerical information in financial settings and is trained for conversational tool use, allowing it to call external systems such as APIs and databases.
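The snippet below sketches how such conversational tool use can be wired up with the Hugging Face chat template. The Hub ID (`CohereLabs/c4ai-command-a-03-2025`, an access-gated repository) and the `get_stock_price` tool are assumptions for illustration; the template serializes the tool's signature into the prompt so the model can respond with a structured tool call rather than free text.

```python
# Hedged sketch of conversational tool use via the transformers chat template.
# The model ID and the get_stock_price tool are illustrative assumptions.
from transformers import AutoTokenizer

MODEL_ID = "CohereLabs/c4ai-command-a-03-2025"  # assumed Hub ID (gated repo)


def get_stock_price(ticker: str) -> float:
    """Return the latest trading price for a stock ticker.

    Args:
        ticker: The exchange ticker symbol, e.g. "NVDA".
    """
    ...  # hypothetical tool body; only the signature and docstring reach the model


tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

messages = [
    {"role": "user", "content": "What is NVIDIA trading at right now?"}
]

# The chat template converts the tool's signature into a schema embedded in the
# prompt, so the model can emit a structured call to get_stock_price.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_stock_price],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```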
Rankings below are relative to other local LLMs.
Category | Benchmark | Score | Rank |
---|---|---|---|
Summarization | ProLLM Summarization | 0.86 | 🥈 2 |
General Knowledge | MMLU | 0.81 | 🥉 3 |
StackUnseen | ProLLM Stack Unseen | 0.23 | 10 |
Agentic Coding | LiveBench Agentic | 0.05 | 14 |
Coding | LiveBench Coding | 0.54 | 15 |
Reasoning | LiveBench Reasoning | 0.36 | 19 |
Mathematics | LiveBench Mathematics | 0.46 | 20 |
Data Analysis | LiveBench Data Analysis | 0.50 | 21 |
Overall Rank: #25
Coding Rank: #32