| Specification | Value |
|---|---|
| Parameters | 111B |
| Context Length | 256K |
| Modality | Text |
| Architecture | Dense |
| License | CC-BY-NC |
| Release Date | 13 Mar 2025 |
| Knowledge Cutoff | Jun 2024 |
| Attention Structure | Grouped Query Attention (GQA) |
| Hidden Dimension Size | - |
| Number of Layers | - |
| Attention Heads | - |
| Key-Value Heads | - |
| Activation Function | SwiGLU |
| Normalization | - |
| Position Embedding | Rotary Position Embedding (RoPE) |
Cohere Command A is a large-scale generative model engineered for high-performance enterprise workflows, particularly those involving tool use, agents, and retrieval-augmented generation (RAG). Developed to provide a high-throughput alternative for production environments, the model maintains a significant parameter count of 111 billion while optimizing for deployment on common dual-GPU hardware configurations. Its design focuses on business-critical accuracy and speed, supporting a standard context window of 256,000 tokens to facilitate the processing of extensive corporate documentation and long-form conversational histories.
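As a rough illustration of the dual-GPU deployment point, the sketch below estimates weights-only memory for a 111B-parameter dense model at common precisions. It ignores KV-cache and activation memory, so it is a back-of-envelope check rather than an official sizing guide.

```python
# Rough, weights-only memory estimate for a 111B-parameter dense model.
# Real deployments also need KV-cache and activation memory, which this
# back-of-envelope sketch deliberately ignores.
PARAMS = 111e9

for name, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{name:>9}: ~{gib:,.0f} GiB of weight memory")

# fp16/bf16: ~207 GiB -> exceeds a 2 x 80 GB GPU pair for weights alone
#      int8: ~103 GiB -> fits across two 80 GB GPUs
#      int4:  ~52 GiB -> fits on a single 80 GB GPU (weights only)
```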
The model's architecture is a decoder-only transformer that utilizes several sophisticated structural innovations to balance local and global context. It features a hybrid attention mechanism where three-quarters of the layers employ sliding window attention for efficient local modeling, while every fourth layer uses a full global attention mechanism to maintain long-range dependencies. Technical specifications include the use of Grouped Query Attention (GQA) for optimized inference throughput, SwiGLU activation functions for improved gradient flow, and the omission of bias terms to stabilize the training process. Positional information is handled via Rotary Positional Embeddings (RoPE) in local attention layers, whereas global layers utilize a position-agnostic approach.
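The interleaved layer pattern described above can be summarized with a minimal Python sketch. The layer count and window size used here are illustrative placeholders, not Command A's published values.

```python
# Illustrative sketch of the interleaved attention pattern described above:
# three of every four layers use sliding-window attention with RoPE, and every
# fourth layer attends globally with no positional embedding. All numbers
# below (layer count, window size) are placeholders, not published values.
from dataclasses import dataclass


@dataclass
class LayerSpec:
    index: int
    attention: str           # "sliding_window" or "full"
    position_encoding: str   # "rope" or "none"
    window: int | None       # local attention span in tokens, if any


def build_layer_plan(num_layers: int, window: int = 4096) -> list[LayerSpec]:
    """Return the per-layer attention plan for the hybrid local/global stack."""
    plan = []
    for i in range(num_layers):
        if (i + 1) % 4 == 0:
            # Every fourth layer: full global attention, position-agnostic.
            plan.append(LayerSpec(i, "full", "none", None))
        else:
            # Local layer: sliding-window attention with rotary embeddings.
            plan.append(LayerSpec(i, "sliding_window", "rope", window))
    return plan


if __name__ == "__main__":
    for spec in build_layer_plan(8):
        print(spec)
```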
Optimized for global enterprise deployment, Command A is trained across 23 languages, including major business languages such as English, French, Spanish, Chinese, and Arabic. The model is specifically aligned for conversational tool use, allowing it to interact with external APIs, databases, and search engines with high precision. This alignment, achieved through supervised fine-tuning and preference optimization, makes it particularly effective for multi-step agentic reasoning and financial data manipulation where extracting numerical details from complex contexts is required.
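For a sense of what conversational tool use looks like in practice, the following hypothetical sketch shows a generic JSON-style tool definition and a structured tool call that an agent loop would parse, execute, and feed back to the model. The schema and field names are illustrative assumptions, not Cohere's actual API surface.

```python
# Hypothetical illustration of the conversational tool-use pattern the model
# is aligned for. The tool name, fields, and call format are made up for
# illustration and do not reflect Cohere's actual API.
import json

# A tool the host application exposes to the model.
tools = [{
    "name": "query_financials",
    "description": "Look up a numeric figure in an indexed financial report.",
    "parameters": {
        "type": "object",
        "properties": {
            "company": {"type": "string"},
            "metric": {"type": "string"},
            "fiscal_year": {"type": "integer"},
        },
        "required": ["company", "metric", "fiscal_year"],
    },
}]

# A structured tool call the model might emit, which the agent loop parses,
# executes against the real data source, and returns as an observation.
model_tool_call = json.dumps({
    "tool": "query_financials",
    "arguments": {"company": "ExampleCorp", "metric": "net_revenue", "fiscal_year": 2024},
})

print(json.loads(model_tool_call)["arguments"])
```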
| Category | Benchmark | Score | Rank |
|---|---|---|---|
| Summarization | ProLLM Summarization | 0.86 | 6 |
| Coding | Aider Coding | 0.12 | 9 |
Overall Rank: #106
Coding Rank: #90