Parameters
40B
Context Length
2,048 tokens
Modality
Text
Architecture
Dense
License
Apache 2.0
Release Date
5 Jun 2023
Knowledge Cutoff
Feb 2023
Attention
Attention Structure
Multi-Query Attention
Attention Heads
64
Key-Value Heads
1
Attention Head Dimension
64
Position Embedding
RoPE
RoPE Theta
-
Sliding Window Attention
-
Sliding Window Size
-
Normalization
Layer Normalization
Activation Function
-
Dimensions
Hidden Dimension Size
8,192
Number of Layers
60
FFN Intermediate Size (Dense)
-
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
65,024
Falcon-40B is a 40-billion parameter causal decoder-only language model developed by the Technology Innovation Institute (TII). This foundational model was trained on one trillion tokens, primarily derived from the RefinedWeb dataset, which is a high-quality, filtered, and deduplicated web corpus, enhanced with additional curated data. The model's core objective is causal language modeling, which involves predicting the subsequent token in a given sequence. It is designed to serve as a robust base model for a variety of natural language processing applications.
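The causal language modeling objective described above can be sketched in a few lines. This is a minimal illustration, not TII's training code: the function name `next_token_loss` and the toy inputs are hypothetical, and a real model would produce the logits from the token sequence itself.

```python
import math

def next_token_loss(logits, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits: list of per-position score lists, one score per vocabulary entry.
    token_ids: the token sequence the logits were computed from.
    """
    total = 0.0
    # Position t is scored against the token that actually follows it,
    # so the last position has no target and is skipped.
    for t in range(len(token_ids) - 1):
        scores = logits[t]
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[token_ids[t + 1]]
    return total / (len(token_ids) - 1)

# Toy example (hypothetical values): vocabulary of 3 tokens, sequence [0, 2, 1].
# All-zero logits mean the "model" predicts every token with equal probability.
toy_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
uniform_loss = next_token_loss(toy_logits, [0, 2, 1])
print(uniform_loss)
```

With uniform predictions the loss equals the natural log of the vocabulary size; training lowers it by shifting probability toward the token that actually comes next.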
Falcon-40B's architecture is adapted from GPT-3, with modifications aimed at efficiency and performance. It uses rotary positional embeddings (RoPE) to encode token positions and multi-query attention (MQA), implemented with FlashAttention. In MQA, all attention heads share a single key projection and a single value projection, which shrinks the key-value cache and significantly improves inference scalability without impacting pretraining efficiency. The decoder block computes attention and the Multi-Layer Perceptron (MLP) in parallel and uses two layer norms to stabilize training.
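As a rough illustration of the multi-query attention described above, the following NumPy sketch gives each head its own query projection while all heads share one key and one value projection. It is a toy under stated assumptions, not Falcon's implementation: the function name, the tiny dimensions, and the random weights are hypothetical, and FlashAttention, RoPE, and batching are omitted.

```python
import numpy as np

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Causal multi-query attention for a single sequence.

    x:   (seq_len, d_model) input activations
    w_q: (d_model, n_heads * head_dim) per-head query projections
    w_k: (d_model, head_dim) the single shared key projection
    w_v: (d_model, head_dim) the single shared value projection
    """
    seq_len, _ = x.shape
    head_dim = w_k.shape[1]
    q = (x @ w_q).reshape(seq_len, n_heads, head_dim)  # one query per head
    k = x @ w_k                                        # shared by all heads
    v = x @ w_v                                        # shared by all heads

    # scores[h, i, j]: how strongly position i attends to position j in head h.
    scores = np.einsum("ihd,jd->hij", q, k) / np.sqrt(head_dim)
    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)         # hide future positions

    # Softmax over the key axis.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    out = np.einsum("hij,jd->ihd", weights, v)         # (seq_len, n_heads, head_dim)
    return out.reshape(seq_len, n_heads * head_dim)

# Toy dimensions chosen for readability, not Falcon-40B's real sizes.
rng = np.random.default_rng(0)
seq_len, d_model, n_heads, head_dim = 5, 16, 4, 4
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, n_heads * head_dim))
w_k = rng.normal(size=(d_model, head_dim))
w_v = rng.normal(size=(d_model, head_dim))
out = multi_query_attention(x, w_q, w_k, w_v, n_heads)
print(out.shape)
```

Because `k` and `v` have no head axis, an inference-time cache stores `head_dim` values per token for keys and for values, rather than `n_heads * head_dim` as in standard multi-head attention.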
Falcon-40B's architecture is optimized for inference speed and scalable deployment. It is a raw, pretrained model and is intended to be fine-tuned for specific tasks. Its applications span natural language generation and understanding, including content creation, machine translation, sentiment analysis, and language tutoring. The model is strongest in English, German, Spanish, and French, and has limited capabilities in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish.
The TII Falcon model family comprises causal decoder-only language models (7B, 40B). Their architecture, adapted from GPT-3, integrates rotary positional embeddings, Multi-Query Attention for inference efficiency, and FlashAttention for accelerated operations. Models are trained on the RefinedWeb dataset.
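The rotary positional embeddings mentioned above can be illustrated with a short NumPy sketch. It uses the "rotate-half" layout and a frequency base of 10000; both are assumptions made for illustration (the spec sheet lists no RoPE theta), and `apply_rope` is a hypothetical helper rather than a function from the Falcon codebase.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate feature pairs of x by position-dependent angles.

    x:         (seq_len, head_dim) with even head_dim
    positions: (seq_len,) integer token positions
    base:      frequency base; 10000 is an assumed, commonly used value
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    inv_freq = base ** (-np.arange(half) / half)       # (half,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    # Apply a 2-D rotation to each (x1[i], x2[i]) pair.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 8))
k = rng.normal(size=(1, 8))

def score(m, n):
    """Attention score between a query at position m and a key at position n."""
    rotated_q = apply_rope(q, np.array([m]))
    rotated_k = apply_rope(k, np.array([n]))
    return (rotated_q @ rotated_k.T)[0, 0]

# Both pairs are two positions apart, so the two scores should match.
print(score(3, 1), score(10, 8))
```

The property the sketch demonstrates is that the query-key score depends only on the distance between the two positions, not on their absolute values.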
No evaluation benchmarks are available for Falcon-40B.
Overall Rank
-
Coding Rank
-
Total Score
72 / 100
Falcon-40B exhibits a strong transparency profile for a foundational model, particularly regarding its architectural modifications and the composition of its training data. The release of the RefinedWeb extract and the eventual adoption of the Apache 2.0 license demonstrate a commitment to open-source principles. However, the model's transparency is hampered by its initial restrictive licensing and a lack of granular detail regarding training compute costs and evaluation reproducibility.
Architectural Provenance
Falcon-40B is explicitly documented as a causal decoder-only model based on the GPT-3 architecture with significant, well-documented modifications. These include the use of Rotary Positional Embeddings (RoPE), Multi-Query Attention (MQA) for inference efficiency, and FlashAttention. The model's technical specifications (60 layers, 8192 embedding dimension, 64 heads) are publicly available via the official Hugging Face model card and the 'Falcon Series' technical paper on arXiv. However, it loses points for the lack of a full, peer-reviewed primary paper at the time of its initial peak popularity, though the arXiv report eventually filled many gaps.
Dataset Composition
TII provided a relatively high level of transparency regarding the training data, disclosing that the model was trained on 1 trillion tokens. They released a detailed breakdown of the composition: RefinedWeb-English (75%), RefinedWeb-Europe (7%), Books (6%), Conversations (5%), Code (5%), and Technical data (2%). They also released a 600B token extract of the RefinedWeb dataset for public audit. It falls short of a perfect score because the 'curated' portions (Books, Code, etc.) are described generally (e.g., 'massive web crawl', 'Reddit') without specific source lists or full public access to the non-web components.
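To make the stated mix concrete, the percentages above can be converted into approximate token counts. This is simple arithmetic on the figures quoted in the paragraph; the variable names are illustrative.

```python
TOTAL_TOKENS = 1_000_000_000_000  # "1 trillion tokens", as stated above

# Percent shares as listed in the paragraph above.
MIX_PERCENT = {
    "RefinedWeb-English": 75,
    "RefinedWeb-Europe": 7,
    "Books": 6,
    "Conversations": 5,
    "Code": 5,
    "Technical": 2,
}

# Integer arithmetic keeps the per-source counts exact.
tokens_per_source = {
    name: TOTAL_TOKENS * pct // 100 for name, pct in MIX_PERCENT.items()
}

for name, tokens in tokens_per_source.items():
    print(f"{name:<20} {tokens // 10**9:>4}B tokens")
```

The two RefinedWeb components together account for 82% of the mix, so the curated sources whose provenance is described only generally make up the remaining 18%.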
Tokenizer Integrity
The tokenizer is publicly accessible via Hugging Face and the 'transformers' library. It has a clearly stated vocabulary size of 65,024 tokens and uses a BPE-based approach. Documentation confirms it was trained on the RefinedWeb dataset, ensuring alignment with the training data. The vocabulary includes extra values for downstream adaptations, which is a level of detail often omitted by other providers.
Parameter Density
The model is clearly defined as a dense architecture with 40 billion total parameters. Unlike MoE models, there is no ambiguity between total and active parameters. The architectural breakdown (layers, heads, dimensions) is fully disclosed in technical documentation. It loses minor points for not providing a granular breakdown of parameter distribution between attention and FFN layers in the primary model card, though this can be inferred from the code.
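The attention/FFN split that the model card omits can be approximated from the spec-sheet values (hidden size 8,192, 60 layers, vocabulary 65,024, head dimension 64, one key-value head). The sketch below is an estimate under explicit assumptions, not an official breakdown: it assumes the query and output projections are each `d_model x d_model`, assumes an MLP hidden size of four times the hidden size (the spec sheet lists no FFN intermediate size), and ignores biases and normalization weights.

```python
def estimate_parameters(d_model, n_layers, vocab_size, head_dim, n_kv_heads, ffn_mult=4):
    """Rough weight count for a dense decoder, ignoring biases and norm weights.

    Assumptions (not taken from the spec sheet): the query and output
    projections are d_model x d_model, and the MLP hidden size is
    ffn_mult * d_model.
    """
    q_and_out = 2 * d_model * d_model
    k_and_v = 2 * d_model * n_kv_heads * head_dim
    mlp = 2 * d_model * (ffn_mult * d_model)  # up-projection and down-projection
    embeddings = vocab_size * d_model
    return {
        "attention": n_layers * (q_and_out + k_and_v),
        "mlp": n_layers * mlp,
        "embeddings": embeddings,
        "total": n_layers * (q_and_out + k_and_v + mlp) + embeddings,
    }

est = estimate_parameters(
    d_model=8192, n_layers=60, vocab_size=65024, head_dim=64, n_kv_heads=1
)
for part, count in est.items():
    print(f"{part:<11} {count / 1e9:6.2f}B")
```

Under these assumptions the total lands close to the nominal 40B, with the MLP blocks holding the large majority of the weights.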
Training Compute
TII disclosed the hardware used (384 A100 40GB GPUs on AWS SageMaker) and the training duration (two months). They also mentioned the use of a custom distributed training codebase ('Gigatron') and 3D parallelism. However, it lacks a specific calculation of the total GPU-hours, a detailed carbon footprint analysis, or an official cost estimate, which are requirements for the highest scores in this category.
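The missing GPU-hour figure can be bounded from the disclosed numbers. The sketch below is a back-of-the-envelope estimate, not an official figure: it takes "two months" as 60 days, assumes all 384 GPUs ran for the whole period, and uses the common rule of thumb of about 6 FLOPs per parameter per training token.

```python
N_GPUS = 384         # A100 40GB GPUs, as stated above
TRAINING_DAYS = 60   # assumption: "two months" taken as 60 days
N_PARAMS = 40e9      # nominal parameter count
N_TOKENS = 1e12      # training tokens

# Upper bound: assumes every GPU was busy for the entire period.
gpu_hours = N_GPUS * TRAINING_DAYS * 24

# Rule of thumb for dense transformers: ~6 FLOPs per parameter per token.
training_flops = 6 * N_PARAMS * N_TOKENS

# Average sustained throughput per GPU implied by the two numbers above.
flops_per_gpu_second = training_flops / (gpu_hours * 3600)

print(f"GPU-hours (upper bound): {gpu_hours:,}")
print(f"Training FLOPs (approx.): {training_flops:.2e}")
print(f"Implied per-GPU throughput: {flops_per_gpu_second / 1e12:.0f} TFLOP/s")
```

A carbon or cost estimate would additionally need power draw, datacenter efficiency, and pricing, none of which appear in the disclosed material.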
Benchmark Reproducibility
While Falcon-40B was heavily marketed on its Open LLM Leaderboard performance, the initial release lacked a comprehensive technical report detailing the exact evaluation prompts and few-shot settings used for all internal benchmarks. Third-party verification is available via the Open LLM Leaderboard, but the authors did not provide a dedicated, reproducible evaluation suite at launch, which limits transparency. The score is further reduced by industry-wide concerns about overlap between web-scale training data and benchmark content.
Identity Consistency
Falcon-40B demonstrates high identity consistency. It does not suffer from the 'identity crisis' seen in some fine-tuned models that claim to be GPT-4. It correctly identifies its version and origin when prompted in its 'Instruct' variant. TII has been clear about the distinction between the base and instruct versions.
License Clarity
The model is currently licensed under the permissive Apache 2.0 license, which is clearly stated and allows for commercial use. However, the score is tempered by the model's history: it was initially released under a restrictive, custom 'Falcon LLM License' that required royalties for commercial use over a certain threshold. While TII corrected this quickly due to community pressure, the initial ambiguity and the shift in terms prevent a perfect score.
Hardware Footprint
Hardware requirements are well-documented by both the official model card and the community. TII provides clear guidance that the model requires ~90GB of VRAM for FP16, and documentation for 8-bit (~45GB) and 4-bit (~27GB) quantization is readily available. The impact of quantization on memory is explicitly addressed, though more detailed documentation on specific accuracy tradeoffs for different quantization levels would be beneficial.
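The quoted memory figures can be sanity-checked against the size of the weights alone. This sketch assumes a nominal 40e9 parameters and counts only weight storage; the helper name `weight_memory_gb` is illustrative.

```python
N_PARAMS = 40e9  # nominal parameter count

BITS_PER_WEIGHT = {"fp16": 16, "int8": 8, "int4": 4}

def weight_memory_gb(n_params, bits):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits / 8 / 1e9

weights_gb = {
    name: weight_memory_gb(N_PARAMS, bits) for name, bits in BITS_PER_WEIGHT.items()
}

for name, gb in weights_gb.items():
    print(f"{name}: {gb:.0f} GB for weights")
```

The quoted requirements (~90GB, ~45GB, ~27GB) sit above these weight-only figures. The gap is plausibly explained by activations, the key-value cache, quantization metadata, and framework overhead, though the exact split depends on the runtime and is not stated in the source.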
Versioning Drift
The model follows a basic versioning scheme (Falcon-40B vs Falcon-40B-Instruct), but it lacks a rigorous semantic versioning system or a detailed public changelog for weight updates. While the weights on Hugging Face are timestamped, there is no formal mechanism for tracking silent updates or performance drift over time beyond community-led benchmarks.