Falcon-40B

Open Source

Open Weights

Parameters

40B

Context Length

Modality

Text

Architecture

Dense

License

Apache 2.0

Release Date

5 Jun 2023

Knowledge Cutoff

Feb 2023

System Requirements

VRAM requirements for different quantization methods and context sizes

1,024 tokens: 85.52 GB VRAM

Consumer: 4x RTX 4090 (24GB VRAM each)

Datacenter: 2x NVIDIA A100 (80GB VRAM each)

Apple Silicon: 1x Apple M3 Max (128GB unified memory)

2,048 tokens: 85.53 GB VRAM

Consumer: 4x RTX 4090 (24GB VRAM each)

Datacenter: 2x NVIDIA A100 (80GB VRAM each)

Apple Silicon: 1x Apple M3 Max (128GB unified memory)
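
The two totals above differ by only about 0.01 GB even though the context doubles, which is what a small key-value cache would produce. The sketch below estimates that cache under stated assumptions: one shared key/value head (following the multi-query description on this page), 60 layers, a head dimension of 64, and 2-byte (FP16) cache entries. The function name `kv_cache_bytes` is illustrative, and the page's own figures may be computed differently.

```python
def kv_cache_bytes(tokens, layers=60, kv_heads=1, head_dim=64, bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector per layer per token."""
    # Assumed values: kv_heads=1 (multi-query), head_dim=64, FP16 entries.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return tokens * per_token

for tokens in (1024, 2048):
    size_gb = kv_cache_bytes(tokens) / 1e9
    print(f"{tokens} tokens: {size_gb:.4f} GB of KV cache")
```

Under these assumptions each additional 1,024 tokens adds roughly 0.016 GB, the same order as the step between the two totals. With one key/value head per query head, the step would be far larger.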

Evaluation Benchmarks

No evaluation benchmarks for Falcon-40B available.

About Falcon-40B

Falcon-40B is a 40-billion-parameter causal decoder-only language model developed by the Technology Innovation Institute (TII). It was trained on one trillion tokens, drawn primarily from RefinedWeb, a filtered and deduplicated web corpus, and supplemented with curated data. The training objective is causal language modeling: predicting the next token given the preceding tokens. As a base model, it is intended as a starting point for a range of natural language processing applications.
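
The causal language modeling objective described above can be sketched as a next-token cross-entropy. This is a minimal NumPy illustration, not TII's training code; the function name `next_token_loss` and the toy shapes are assumptions made for the example.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy for predicting token t+1 from the logits at position t.

    logits: (seq_len, vocab_size) scores; token_ids: (seq_len,) integer tokens.
    """
    predictions = logits[:-1]   # the last position has no next token to predict
    targets = token_ids[1:]     # targets are the inputs shifted left by one
    # Numerically stable log-softmax over the vocabulary.
    shifted = predictions - predictions.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(6, 10))   # 6 positions, toy vocabulary of 10
toy_tokens = rng.integers(0, 10, size=6)
print(next_token_loss(toy_logits, toy_tokens))
```

A model that assigns equal scores to every token gets a loss of log(vocab_size), which is a useful sanity check for an untrained network.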

The architecture is adapted from GPT-3, with modifications aimed at training and inference efficiency. It uses rotary positional embeddings (RoPE) to encode token positions, multiquery attention (MQA), and FlashAttention, an exact attention implementation that reduces memory traffic. In MQA, all attention heads share a single key head and a single value head, which shrinks the key-value cache and improves inference scalability without impacting pretraining efficiency. Each decoder block computes attention and the Multi-Layer Perceptron (MLP) in parallel rather than in sequence, with two layer norms per block to stabilize training.
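
The multi-query attention and parallel block structure described above can be sketched in NumPy as follows. This is an illustration under simplifying assumptions, not Falcon's implementation: weights are random, RoPE, FlashAttention, and the attention output projection are omitted, layer norms have no learned scale or bias, and ReLU stands in for the real activation. All function and variable names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (learned scale/bias omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_query_attention(x, w_q, w_k, w_v, n_heads):
    """Causal attention with n_heads query heads sharing one key/value head."""
    seq_len, d_model = x.shape
    head_dim = d_model // n_heads

    # Queries are split per head: (n_heads, seq_len, head_dim).
    q = (x @ w_q).reshape(seq_len, n_heads, head_dim).transpose(1, 0, 2)
    # Keys and values are projected once and reused by every head: (seq_len, head_dim).
    k = x @ w_k
    v = x @ w_v

    scores = q @ k.T / np.sqrt(head_dim)             # (n_heads, seq_len, seq_len)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)       # hide later positions

    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    out = weights @ v                                 # (n_heads, seq_len, head_dim)
    return out.transpose(1, 0, 2).reshape(seq_len, d_model)

def parallel_block(x, attn_fn, mlp_fn):
    """Attention and MLP both read the block input; their outputs share one residual add.

    Each branch gets its own layer norm call. Without learned parameters the two
    calls are numerically identical here; in a trained model they would differ.
    """
    return x + attn_fn(layer_norm(x)) + mlp_fn(layer_norm(x))

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 5, 16, 4
head_dim = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, head_dim))    # one key head, not n_heads
w_v = rng.normal(size=(d_model, head_dim))    # one value head, not n_heads
w_up = rng.normal(size=(d_model, 4 * d_model))
w_down = rng.normal(size=(4 * d_model, d_model))

attn = lambda h: multi_query_attention(h, w_q, w_k, w_v, n_heads)
mlp = lambda h: np.maximum(h @ w_up, 0.0) @ w_down
y = parallel_block(x, attn, mlp)
print(y.shape)  # (5, 16)
```

The key and value projections here are `n_heads` times smaller than the query projection, which is where the inference-time cache saving comes from.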

Falcon-40B's attention design targets efficient inference, which helps throughput and scalability in deployment. As a raw pretrained model, it is meant to be fine-tuned for specific tasks. Typical applications span natural language generation and understanding, including content creation, machine translation, sentiment analysis, and language tutoring. The model shows strong proficiency in English, German, Spanish, and French, and limited capability in Italian, Portuguese, Polish, Dutch, Romanian, Czech, and Swedish.

Technical Specifications

Attention

Attention Structure

Multi-Query Attention

Attention Heads

Key-Value Heads

Attention Head Dimension

Position Embedding

RoPE

RoPE Theta

Sliding Window Attention

Sliding Window Size

Sliding Window Ratio

Linear Attention

Linear Attention Ratio

Normalization

Layer Normalization

Activation Function

Dimensions

Hidden Dimension Size

8,192

Number of Layers

60

FFN Intermediate Size (Dense)

Multi-Token Prediction Heads

Tokenizer

Vocabulary Size

65,024

Model Integrity

Total Score

B+

72 / 100

Upstream

23.0 / 30

Model

28.0 / 40

Downstream

20.5 / 30

Falcon-40B Model Integrity Report

Total Score

72 / 100

B+

Audit Note

Falcon-40B exhibits a strong transparency profile for a foundational model, particularly regarding its architectural modifications and the composition of its training data. The release of the RefinedWeb extract and the eventual adoption of the Apache 2.0 license demonstrate a commitment to open-source principles. However, the model's transparency is hampered by its initial restrictive licensing and a lack of granular detail regarding training compute costs and evaluation reproducibility.

Upstream

23.0 / 30

Architectural Provenance

7.5 / 10

Falcon-40B is explicitly documented as a causal decoder-only model based on the GPT-3 architecture with significant, well-documented modifications. These include Rotary Positional Embeddings (RoPE), Multi-Query Attention (MQA) for inference efficiency, and FlashAttention. The model's technical specifications (60 layers, an embedding dimension of 8192, and a head dimension of 64) are publicly available via the official Hugging Face model card and the 'Falcon Series' technical paper on arXiv. However, it loses points for the lack of a full, peer-reviewed primary paper at the time of its initial peak popularity, though the arXiv report eventually filled many gaps.

Dataset Composition

7.0 / 10

TII provided a relatively high level of transparency regarding the training data, disclosing that the model was trained on 1 trillion tokens. They released a detailed breakdown of the composition: RefinedWeb-English (75%), RefinedWeb-Europe (7%), Books (6%), Conversations (5%), Code (5%), and Technical data (2%). They also released a 600B token extract of the RefinedWeb dataset for public audit. It falls short of a perfect score because the 'curated' portions (Books, Code, etc.) are described generally (e.g., 'massive web crawl', 'Reddit') without specific source lists or full public access to the non-web components.
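
As a quick check on the breakdown above, the percentages can be converted into approximate token counts. This sketch assumes the shares apply to the full 1 trillion training tokens; the variable names are illustrative.

```python
TOTAL_TOKENS = 1_000_000_000_000  # 1 trillion training tokens, as stated above

# Percentage shares as listed in the dataset breakdown.
mix_percent = {
    "RefinedWeb-English": 75,
    "RefinedWeb-Europe": 7,
    "Books": 6,
    "Conversations": 5,
    "Code": 5,
    "Technical": 2,
}

# The shares should account for the whole corpus.
assert sum(mix_percent.values()) == 100

tokens_per_source = {name: TOTAL_TOKENS * pct // 100 for name, pct in mix_percent.items()}
for name, count in tokens_per_source.items():
    print(f"{name}: {count / 1e9:.0f}B tokens")
```

Under that assumption the web-derived portions account for about 820B tokens and the curated portions for about 180B.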

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly accessible via Hugging Face and the 'transformers' library. It has a clearly stated vocabulary size of 65,024 tokens and uses a BPE-based approach. Documentation confirms it was trained on the RefinedWeb dataset, ensuring alignment with the training data. The vocabulary includes extra values for downstream adaptations, which is a level of detail often omitted by other providers.

Model

28.0 / 40

Parameter Density

8.0 / 10

The model is clearly defined as a dense architecture with 40 billion total parameters. Unlike MoE models, there is no ambiguity between total and active parameters. The architectural breakdown (layers, heads, dimensions) is fully disclosed in technical documentation. It loses minor points for not providing a granular breakdown of parameter distribution between attention and FFN layers in the primary model card, though this can be inferred from the code.
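
The attention/FFN split mentioned above can be roughly estimated from the published dimensions. The sketch below is an approximation under stated assumptions, not an official breakdown: the FFN intermediate size is taken as 4x the hidden size (not stated on this page), attention uses one shared key/value head of dimension 64, biases and layer-norm weights are ignored, and the embedding matrix is counted once. The function name `estimate_parameters` is hypothetical.

```python
def estimate_parameters(hidden=8192, layers=60, vocab=65024,
                        kv_heads=1, head_dim=64, ffn_multiplier=4):
    """Rough dense-decoder parameter count; biases and layer norms are ignored."""
    attn_q = hidden * hidden                       # query projection
    attn_kv = 2 * hidden * kv_heads * head_dim     # shared key and value projections
    attn_out = hidden * hidden                     # attention output projection
    mlp = 2 * hidden * (ffn_multiplier * hidden)   # up and down projections (assumed 4x)
    embedding = vocab * hidden                     # token embedding matrix

    attention_total = layers * (attn_q + attn_kv + attn_out)
    mlp_total = layers * mlp
    return {
        "attention": attention_total,
        "mlp": mlp_total,
        "embedding": embedding,
        "total": attention_total + mlp_total + embedding,
    }

counts = estimate_parameters()
for part, value in counts.items():
    print(f"{part}: {value / 1e9:.2f}B")
```

Under these assumptions the estimate lands near 40.9 billion parameters, with close to four fifths of the weights in the MLP layers.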

Training Compute

6.0 / 10

TII disclosed the hardware used (384 A100 40GB GPUs on AWS SageMaker) and the training duration (two months). They also mentioned the use of a custom distributed training codebase ('Gigatron') and 3D parallelism. However, it lacks a specific calculation of the total GPU-hours, a detailed carbon footprint analysis, or an official cost estimate, which are requirements for the highest scores in this category.
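
The missing GPU-hour figure can be bounded from the disclosed numbers. The sketch below treats "two months" as 60 days and assumes all 384 GPUs ran continuously, so it is an upper bound rather than a reported value. The compute estimate uses the common 6 x parameters x tokens rule of thumb for dense transformers, which is an approximation and not a figure published by TII. Function names are illustrative.

```python
def gpu_hours_upper_bound(gpus=384, days=60):
    """GPU-hours if every GPU ran for the whole period (days=60 is an assumption)."""
    return gpus * days * 24

def training_flops(parameters=40e9, tokens=1e12):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * parameters * tokens

print(f"GPU-hours (upper bound): {gpu_hours_upper_bound():,}")
print(f"Approximate training FLOPs: {training_flops():.2e}")
```

This gives at most about 553,000 A100 GPU-hours and on the order of 2.4e23 training FLOPs.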

Benchmark Reproducibility

5.0 / 10

While Falcon-40B was heavily marketed on the strength of its Open LLM Leaderboard performance, the initial release lacked a comprehensive technical report detailing the exact evaluation prompts and few-shot settings used for all internal benchmarks. Third-party verification is available via the Open LLM Leaderboard, but the authors did not provide a dedicated, reproducible evaluation suite at launch, which limits transparency. The score is further reduced by industry-wide concerns about overlap between web-scale training data and benchmark test sets.

Identity Consistency

9.0 / 10

Falcon-40B demonstrates high identity consistency. It does not suffer from the 'identity crisis' seen in some fine-tuned models that claim to be GPT-4. It correctly identifies its version and origin when prompted in its 'Instruct' variant. TII has been clear about the distinction between the base and instruct versions.

Downstream

20.5 / 30

License Clarity

7.5 / 10

The model is currently licensed under the permissive Apache 2.0 license, which is clearly stated and allows for commercial use. However, the score is tempered by the model's history: it was initially released under a restrictive, custom 'Falcon LLM License' that required royalties for commercial use over a certain threshold. While TII corrected this quickly due to community pressure, the initial ambiguity and the shift in terms prevent a perfect score.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented by both the official model card and the community. TII provides clear guidance that the model requires ~90GB of VRAM for FP16, and documentation for 8-bit (~45GB) and 4-bit (~27GB) quantization is readily available. The impact of quantization on memory is explicitly addressed, though more detailed documentation on specific accuracy tradeoffs for different quantization levels would be beneficial.
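
The quoted figures can be compared against a weights-only lower bound, computed as parameter count times bytes per parameter. The sketch assumes exactly 40 billion parameters and decimal gigabytes; the names `BYTES_PER_PARAMETER` and `weight_memory_gb` are illustrative.

```python
# Storage cost per parameter at each precision.
BYTES_PER_PARAMETER = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(parameters=40e9, precision="fp16"):
    """Memory for the weights alone, in decimal GB (no activations or KV cache)."""
    return parameters * BYTES_PER_PARAMETER[precision] / 1e9

for precision in BYTES_PER_PARAMETER:
    print(f"{precision}: {weight_memory_gb(precision=precision):.0f} GB for weights")
```

The weights alone come to 80, 40, and 20 GB. The gap to the quoted ~90, ~45, and ~27 GB is presumably runtime overhead such as activations, the key-value cache, and quantization metadata.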

Versioning Drift

5.0 / 10

The model follows a basic versioning scheme (Falcon-40B vs Falcon-40B-Instruct), but it lacks a rigorous semantic versioning system or a detailed public changelog for weight updates. While the weights on Hugging Face are timestamped, there is no formal mechanism for tracking silent updates or performance drift over time beyond community-led benchmarks.

Resources

Official Documentation

Download Weights

Source Code

About Falcon

The TII Falcon model family comprises causal decoder-only language models (7B, 40B). Their architecture, adapted from GPT-3, integrates rotary positional embeddings, Multi-Query Attention for inference efficiency, and FlashAttention for accelerated operations. Models are trained on the RefinedWeb dataset.
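
Rotary positional embeddings, mentioned above, can be sketched in a few lines of NumPy. This is a generic illustration rather than Falcon's code: it assumes the half-split feature pairing and a base of 10000, neither of which is confirmed on this page, and the function name `apply_rope` is hypothetical.

```python
import numpy as np

def apply_rope(x, base=10000.0):
    """Rotate feature pairs of x by position-dependent angles.

    x: (seq_len, head_dim) with even head_dim; row i is the vector at position i.
    base=10000.0 is an assumed default, not a value taken from this page.
    """
    seq_len, head_dim = x.shape
    half = head_dim // 2
    # One rotation frequency per feature pair, decreasing geometrically.
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.outer(np.arange(seq_len), inv_freq)   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    # Pair feature j with feature j + half and apply a 2D rotation to each pair.
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
queries = rng.normal(size=(6, 8))
rotated = apply_rope(queries)
print(rotated.shape)
```

Because each pair is only rotated, vector norms are unchanged, and the dot product between a rotated query and a rotated key depends on their relative offset rather than their absolute positions.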
