Parameters: 7B
Context Length: 2,048 tokens
Modality: Text
Architecture: Dense
License: Apache 2.0
Release Date: 5 Jun 2023
Knowledge Cutoff: -
Attention
Attention Structure: Multi-Query Attention
Attention Heads: 71
Key-Value Heads: 1
Attention Head Dimension: -
Position Embedding: RoPE
RoPE Theta: -
Sliding Window Attention: No
Sliding Window Size: -
Normalization: Layer Normalization
Activation Function: -
Dimensions
Hidden Dimension Size: 4,544
Number of Layers: 32
FFN Intermediate Size (Dense): -
Multi-Token Prediction Heads: -
Tokenizer
Vocabulary Size: 65,024
Falcon-7B is a 7-billion-parameter causal decoder-only language model developed by the Technology Innovation Institute (TII). It is intended as an efficient foundation model for a broad range of natural language processing tasks, covering both language understanding and generation. The design targets research and commercial use, giving developers and practitioners an open-source option.
Architecturally, Falcon-7B builds on the transformer framework with modifications aimed at performance and efficiency. A central design choice is Multi-Query Attention (MQA), in which all attention heads share a single key and value projection, which speeds up inference and reduces memory overhead. Traditional multi-head attention, by contrast, uses separate key and value projections for each head. The model also uses FlashAttention, a memory-efficient attention implementation that accelerates both training and inference. Positional information is encoded with Rotary Positional Embeddings (RoPE). Each decoder block runs its attention and Multi-Layer Perceptron (MLP) components in parallel, sharing a single layer normalization.
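To make the memory saving from Multi-Query Attention concrete, the sketch below estimates the key-value cache size for one sequence using the configuration listed on this page (32 layers, 71 attention heads, hidden size 4,544, 2,048-token context). The head dimension of 64 is derived here as 4,544 / 71 rather than taken from the spec sheet, FP16 storage (2 bytes per value) is an assumption, and the function name is purely illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for a single sequence."""
    # Factor of 2: one tensor for keys and one for values in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

HIDDEN, N_HEADS, N_LAYERS, CONTEXT = 4544, 71, 32, 2048
HEAD_DIM = HIDDEN // N_HEADS  # derived value (assumption): 4544 / 71 = 64

# Multi-Query Attention: a single shared key-value head.
mqa = kv_cache_bytes(N_LAYERS, 1, HEAD_DIM, CONTEXT)
# Hypothetical multi-head baseline: one key-value head per query head.
mha = kv_cache_bytes(N_LAYERS, N_HEADS, HEAD_DIM, CONTEXT)

print(f"MQA: {mqa / 2**20:.0f} MiB, MHA: {mha / 2**20:.0f} MiB")
```

Under these assumptions the cache shrinks by a factor equal to the number of query heads (71), which is where the inference-time memory benefit comes from.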
Falcon-7B was trained on 1,500 billion (1.5 trillion) tokens, drawn primarily from the RefinedWeb corpus and augmented with curated datasets, and generates coherent, contextually relevant text. Its architectural optimizations target efficient inference, which suits deployments where response time matters. Common use cases include text generation, chatbots, summarization, and question answering. The model is released under the Apache 2.0 license, which permits commercial use and supports both integration into products and continued research.
The TII Falcon model family comprises causal decoder-only language models (7B, 40B). Their architecture, adapted from GPT-3, integrates rotary positional embeddings, Multi-Query Attention for inference efficiency, and FlashAttention for accelerated operations. Models are trained on the RefinedWeb dataset.
No evaluation benchmarks are available for Falcon-7B.
Overall Rank: -
Coding Rank: -
Total Score: 74 / 100
Falcon-7B exhibits a strong transparency profile, particularly regarding its architectural design and the composition of its primary training dataset, RefinedWeb. The model's transition to a standard Apache 2.0 license and its clear self-identification are major strengths. However, it lacks detailed environmental impact data and centralized evaluation code, which limits full third-party reproducibility of its training and performance claims.
Architectural Provenance
Falcon-7B is well-documented as a causal decoder-only transformer model. TII provides specific details on architectural modifications including Multi-Query Attention (MQA), FlashAttention, and Rotary Positional Embeddings (RoPE). The use of a parallel attention/MLP decoder block with a single layer normalization is explicitly detailed in the model card and the 'RefinedWeb' technical paper. While the training code 'Gigatron' is mentioned, it is not fully open-sourced, which prevents a perfect score.
Dataset Composition
The model's training data is extensively documented through the RefinedWeb paper and model card. TII discloses a breakdown of the 1,500B tokens: RefinedWeb-English (79%), Books (7%), Conversations (6%), Code (3%), RefinedWeb-French (3%), and Technical (2%). They also released a 600B token extract of RefinedWeb for public verification. However, the specific 'curated corpora' inspired by The Pile are described generally rather than with exhaustive file-level lists.
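As a quick consistency check on the breakdown above, the following sketch converts the stated percentages into approximate token counts. The percentages and the 1,500B total come from the text; because the percentages are rounded, the resulting counts are estimates and may differ slightly from any officially reported per-source figures.

```python
TOTAL_TOKENS_B = 1500  # total training tokens, in billions (from the text)

# Share of each source, in percent (from the text).
MIX_PERCENT = {
    "RefinedWeb-English": 79,
    "Books": 7,
    "Conversations": 6,
    "Code": 3,
    "RefinedWeb-French": 3,
    "Technical": 2,
}

# The listed shares should account for the whole corpus.
assert sum(MIX_PERCENT.values()) == 100

# Approximate tokens per source, in billions.
tokens_b = {name: TOTAL_TOKENS_B * pct / 100 for name, pct in MIX_PERCENT.items()}

for name, count in tokens_b.items():
    print(f"{name:20s} ~{count:,.0f}B tokens")
```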
Tokenizer Integrity
The tokenizer is publicly available on Hugging Face with a stated vocabulary size of 65,024 tokens. It uses a BPE-based approach and is shared across the Falcon-7B and 40B models. Documentation confirms it was trained on the RefinedWeb corpus, ensuring alignment with the pretraining data. Language support for English and French is explicitly stated and matches the tokenizer's design.
Parameter Density
The model is clearly defined as a dense architecture with 7 billion total parameters. Detailed hyperparameters are provided: 32 layers, a hidden dimension of 4544, and 71 attention heads. Unlike MoE models, there is no ambiguity regarding active vs. total parameters, and the architectural breakdown is consistent across all official documentation.
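The listed hyperparameters can be sanity-checked against the 7B headline figure with a rough parameter count. This is a sketch under several assumptions that are not stated on this page: an FFN intermediate size of 4x the hidden dimension, no bias terms in the linear layers, input and output embeddings that share weights, and one layer normalization (weight plus bias) per block and one after the final block. The function name is illustrative.

```python
def falcon_like_param_count(vocab, hidden, n_layers, n_heads,
                            n_kv_heads=1, ffn_mult=4):
    """Rough parameter count for a dense MQA decoder (assumptions in lead-in)."""
    head_dim = hidden // n_heads
    embed = vocab * hidden                                # tied embeddings (assumed)
    qkv = hidden * (hidden + 2 * n_kv_heads * head_dim)   # queries + shared K and V
    attn_out = hidden * hidden                            # attention output projection
    mlp = 2 * hidden * (ffn_mult * hidden)                # up and down projections
    layer_norm = 2 * hidden                               # weight and bias
    per_layer = qkv + attn_out + mlp + layer_norm
    final_norm = 2 * hidden
    return embed + n_layers * per_layer + final_norm

total = falcon_like_param_count(vocab=65024, hidden=4544, n_layers=32, n_heads=71)
print(f"Estimated parameters: {total / 1e9:.2f}B")
```

Under these assumptions the estimate lands just under 7 billion, consistent with the "7B" label; a different FFN width or untied embeddings would shift the result.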
Training Compute
TII discloses that the model was trained on 384 A100 40GB GPUs on AWS SageMaker over approximately two weeks. They provide the parallelism strategy (2D: PP=2, DP=192) and the custom 'Gigatron' codebase name. However, they do not provide a specific carbon footprint calculation or a detailed breakdown of total energy consumption in kWh, which are requirements for a high score in this category.
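The disclosed figures allow a back-of-the-envelope compute estimate. The sketch below uses the common 6 x parameters x tokens rule of thumb for training FLOPs, treats "approximately two weeks" as exactly 14 days, and compares against an A100 peak of 312 TFLOP/s for dense BF16; all three are assumptions made for illustration, so the resulting utilization is indicative only.

```python
GPUS = 384
PIPELINE_PARALLEL, DATA_PARALLEL = 2, 192
# The 2D parallelism layout should account for every GPU.
assert PIPELINE_PARALLEL * DATA_PARALLEL == GPUS

PARAMS = 7e9           # model parameters (from the text)
TOKENS = 1.5e12        # training tokens (from the text)
DAYS = 14              # assumption: "approximately two weeks"
A100_PEAK_FLOPS = 312e12  # assumption: dense BF16 peak per GPU

# Rule-of-thumb training cost: about 6 FLOPs per parameter per token.
train_flops = 6 * PARAMS * TOKENS

gpu_hours = GPUS * DAYS * 24
achieved_per_gpu = train_flops / (gpu_hours * 3600)
utilization = achieved_per_gpu / A100_PEAK_FLOPS

print(f"Training FLOPs: {train_flops:.2e}")
print(f"GPU-hours:      {gpu_hours:,}")
print(f"Per-GPU rate:   {achieved_per_gpu / 1e12:.0f} TFLOP/s "
      f"({utilization:.0%} of assumed peak)")
```

A figure like GPU-hours is also the starting point for the energy and carbon estimates that the disclosure omits, though those would need power-draw and grid data not given here.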
Benchmark Reproducibility
While Falcon-7B results are prominently featured on the Open LLM Leaderboard, the specific evaluation code and exact prompt templates used by TII for their internal reporting are not fully centralized in a single reproducible repository. The 'RefinedWeb' paper discusses evaluation philosophy but lacks a 'one-click' reproduction script for all claimed metrics. Discrepancies in scores across different leaderboard versions have been noted by the community.
Identity Consistency
Falcon-7B demonstrates high identity consistency. It correctly identifies itself as a model developed by TII and does not exhibit the 'identity crisis' seen in some models that claim to be GPT-4. The versioning is clear, and the model card explicitly distinguishes between the base and instruct variants.
License Clarity
The model is currently released under the Apache 2.0 license, which is a standard, well-understood open-source license allowing for commercial use without royalties. Although there was initial confusion regarding a custom TII license at launch, the transition to Apache 2.0 was swift and clearly documented across all official platforms (Hugging Face, TII website).
Hardware Footprint
VRAM requirements are well-documented by both TII and the community, with approximately 15-16GB required for FP16 inference. Documentation for 4-bit and 8-bit quantization is available, noting that 4-bit can run on roughly 5.3GB of VRAM. While TII provides general guidance, detailed context-length scaling curves and specific accuracy-tradeoff benchmarks for various quantization levels are primarily provided by third parties rather than official documentation.
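The weight-only portion of these figures follows from simple arithmetic, sketched below for a 7-billion-parameter model. The helper name is illustrative, and the calculation deliberately ignores activations, the key-value cache, framework buffers, and quantization metadata, which presumably account for the gap between these lower bounds and the quoted 15-16GB and 5.3GB totals.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 7e9  # nominal parameter count from the text

for label, bits in (("FP16", 16), ("8-bit", 8), ("4-bit", 4)):
    print(f"{label:6s} weights: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```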
Versioning Drift
The model uses basic versioning on Hugging Face (e.g., commit hashes and tags), but lacks a formal, detailed changelog or semantic versioning for the weights themselves. Although 'Falcon-180B' and 'Falcon-2' were released later, the original 7B model has not received documented iterative updates that would let users track performance drift or changes in safety tuning over time.