Parameters
3B
Context Length
32,768 tokens
Modality
Text
Architecture
Dense
License
TII Falcon-LLM License 2.0
Release Date
17 Dec 2024
Knowledge Cutoff
-
Attention
Attention Structure
Grouped Query Attention
Attention Heads
12
Key-Value Heads
4
Attention Head Dimension
256
Position Embedding
RoPE
RoPE Theta
1000042
Sliding Window Attention
-
Sliding Window Size
-
Normalization
RMS Normalization
Activation Function
SwiGLU
Dimensions
Hidden Dimension Size
1,536
Number of Layers
22
FFN Intermediate Size (Dense)
-
Multi-Token Prediction Heads
-
Tokenizer
Vocabulary Size
131,072
Falcon-3B is a member of the Falcon 3 family of decoder-only large language models developed by the Technology Innovation Institute (TII). With 3 billion parameters, this variant targets efficient deployment on resource-constrained hardware such as laptops and single GPUs. It is intended for natural language processing tasks that involve reasoning, language understanding, instruction following, code generation, and mathematics. The model supports four languages: English, French, Spanish, and Portuguese.
Falcon-3B is a transformer-based, causal decoder-only model. It uses Grouped Query Attention (GQA), in which groups of query heads share a smaller set of key-value (KV) heads; this reduces KV-cache memory consumption and speeds up inference. The model uses SwiGLU as its activation function and RMSNorm for normalization, both of which help keep training stable. Positions are encoded with Rotary Positional Embeddings (RoPE) to support long-context understanding. The model also uses FlashAttention 2 for faster attention computation and has a large vocabulary of roughly 131K tokens (131,072), which is intended to improve compression and downstream performance.
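To make the KV-cache saving concrete, here is a minimal sketch of the cache-size arithmetic. It assumes the figures reported in the transparency notes on this page (22 decoder blocks, 12 query heads, 4 KV heads, head dimension 256) and 16-bit cache entries. The function name `kv_cache_bytes` is hypothetical, and real inference runtimes add overhead that is not modelled here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector per layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

# Figures as reported elsewhere on this page (assumed here, not re-verified).
LAYERS, QUERY_HEADS, KV_HEADS, HEAD_DIM = 22, 12, 4, 256
CONTEXT = 32_768

gqa = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT)
# Hypothetical baseline: full multi-head attention keeps one KV head per query head.
mha = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, CONTEXT)

GIB = 1024 ** 3
print(f"GQA cache at 32K tokens: {gqa / GIB:.2f} GiB")   # 2.75 GiB
print(f"MHA cache at 32K tokens: {mha / GIB:.2f} GiB")   # 8.25 GiB
print(f"Reduction factor: {mha / gqa:.1f}x")             # 3.0x
```

Under these assumptions, sharing 4 KV heads across 12 query heads cuts the cache to one third of the multi-head baseline, which matters most at the full 32K context.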
Falcon-3B and its instruction-tuned counterpart were derived from the larger Falcon3-7B-Base model through pruning and knowledge distillation, which yields a compact model that retains much of the parent's capability. The base variant supports a context length of 8K tokens, and the instruction-tuned variant extends this to 32K tokens, so it can process longer and more complex inputs. This makes Falcon-3B a candidate for applications that must run where compute and memory are limited.
The TII Falcon model family comprises causal decoder-only language models (7B, 40B). Their architecture, adapted from GPT-3, integrates rotary positional embeddings, Multi-Query Attention for inference efficiency, and FlashAttention for accelerated operations. Models are trained on the RefinedWeb dataset.
No evaluation benchmarks are available for Falcon-3B.
Overall Rank
-
Coding Rank
-
Total Score
66 / 100
Falcon-3B scores well on architectural transparency: its derivation from larger models and its structural parameters are clearly documented. It also scores well on identity consistency, and the availability of several quantization formats gives practical guidance for deployment on consumer hardware. The main transparency gaps are the composition of its training datasets and the environmental impact of its compute requirements, neither of which is reported in detail.
Architectural Provenance
Falcon-3B is explicitly documented as a transformer-based causal decoder-only model. TII provides specific details on its derivation, noting it was pruned and 'healed' from the larger Falcon3-7B-Base model using knowledge distillation. Key architectural components are disclosed, including the use of Grouped Query Attention (GQA) with 12 query heads and 4 KV heads, SwiGLU activation, RMSNorm, and Rotary Positional Embeddings (RoPE) with a specific base value (1000042) to support its 32K context window. While a full peer-reviewed paper for the Falcon 3 series is less accessible than for Falcon 1, the technical specifications on Hugging Face and the official TII blog provide a clear lineage and structural breakdown.
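As an illustration of why a large RoPE base helps with long contexts, the sketch below computes the rotation frequencies of the standard RoPE formulation, base^(-2i/d), using the head dimension (256) and base value (1000042) quoted on this page. The helper names are hypothetical, and the way feature dimensions are paired differs between implementations, so this is not a copy of any particular library's code.

```python
import math

def rope_inverse_frequencies(head_dim, theta):
    """Return the head_dim/2 rotation frequencies used by standard RoPE."""
    return [theta ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def rotate_pair(x0, x1, position, inv_freq):
    """Rotate one (x0, x1) feature pair by the angle position * inv_freq."""
    angle = position * inv_freq
    return (x0 * math.cos(angle) - x1 * math.sin(angle),
            x0 * math.sin(angle) + x1 * math.cos(angle))

HEAD_DIM = 256      # head dimension reported on this page
THETA = 1_000_042   # RoPE base reported on this page

inv_freq = rope_inverse_frequencies(HEAD_DIM, THETA)
# Number of token positions needed for the slowest pair to complete one rotation.
slowest_wavelength = 2 * math.pi / inv_freq[-1]

print(len(inv_freq), inv_freq[0])
print(f"slowest wavelength: {slowest_wavelength:,.0f} tokens")
```

With these values the slowest-rotating pair has a wavelength far longer than the 32K window, so distant positions within the window still receive distinct rotation angles.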
Dataset Composition
The model's training involved a two-stage process: large-scale pretraining of the 7B parent on 14 trillion tokens, followed by a 'healing' phase of about 100 gigatokens (100 billion tokens) for the 3B variant. TII identifies the data categories as web, code, STEM, and multilingual content (English, French, Spanish, Portuguese). However, percentage breakdowns of these components are not provided, and the exact sources, beyond the RefinedWeb dataset associated with earlier Falcon models, are not publicly listed. The post-training dataset is described as 1.2 million samples covering STEM, conversations, and safety, but the specific datasets used for this alignment are not named or accessible for audit.
Tokenizer Integrity
The tokenizer is publicly available via the Hugging Face repository and is well-documented. It features a vocabulary size of 131,072 tokens, which is a significant expansion over previous Falcon versions to improve compression and multilingual performance. The tokenizer approach is consistent with the claimed language support (EN, FR, ES, PT). Technical details such as the use of FlashAttention 2 for optimized computation are also verified in the model's configuration files.
Parameter Density
The model is a dense architecture with 3 billion total parameters. TII provides a detailed architectural breakdown, including 22 decoder blocks, a hidden dimension of 1536, and a head dimension of 256. Unlike MoE models, there is no ambiguity regarding active vs. total parameters. The impact of quantization is also addressed through the official release of GGUF, AWQ, and 1.58-bit variants, providing transparency into how parameter density translates to different precision formats.
Training Compute
TII discloses the hardware used for the distillation and healing process (1024 H100 GPU chips). However, the total GPU hours for the 3B variant's specific training run are not explicitly stated, nor is there a calculated carbon footprint or detailed energy consumption report for this specific model. While the scale of the infrastructure is clear, the lack of duration and environmental metrics prevents a higher score.
Benchmark Reproducibility
TII provides scores for standard benchmarks like MMLU-PRO (29.7), MATH (19.9), and IFEval (54.4). They specify the use of the 'lm-evaluation-harness' and note that they report raw scores without 'fewshot_as_multiturn' to distinguish their results from competitors. However, the exact prompts, few-shot examples, and full evaluation code are not provided in a standalone reproducible repository, leading to reported discrepancies between internal TII scores and those on independent leaderboards like the Open LLM Leaderboard.
Identity Consistency
The model exhibits high identity consistency, correctly identifying itself as a member of the Falcon 3 family developed by TII. It does not suffer from the 'identity crisis' seen in some fine-tuned models that claim to be GPT-4 or Llama. Versioning is clear in the naming convention (Falcon3-3B-Instruct), and the model card explicitly outlines its intended use cases and limitations.
License Clarity
The model is released under the 'TII Falcon-LLM License 2.0'. This is a custom license based on Apache 2.0 but includes specific 'Acceptable Use' restrictions and requirements for attribution (e.g., 'built using Falcon LLM technology'). While the terms are publicly accessible and relatively clear, the use of a non-standard, custom license rather than a pure OSI-approved license like Apache 2.0 or MIT introduces some legal complexity for commercial users.
Hardware Footprint
Hardware requirements are well-documented for various deployment scenarios. TII and community documentation provide VRAM estimates for FP16 (~7.3GB) and various quantized versions (INT8, INT4, 1.58-bit). The model is specifically marketed for consumer-grade hardware like laptops, and the documentation accurately reflects the memory scaling required for its 32K context window. The availability of multiple quantization formats (GGUF, AWQ) with associated performance notes aids transparency.
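For a rough sense of how precision affects the footprint, the following sketch estimates memory for the weights alone. It assumes a nominal 3 billion parameters and idealised bit widths; the function name is hypothetical, and real GGUF or AWQ files also store scales and metadata, so actual sizes will differ.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Memory for the weights alone, ignoring KV cache, activations, and runtime overhead."""
    return num_params * bits_per_weight / 8 / 1024 ** 3

NOMINAL_PARAMS = 3e9  # nominal "3B"; the exact parameter count is an assumption here

# Idealised bit widths per weight for each format (assumed, not measured).
formats = {"FP16": 16, "INT8": 8, "INT4": 4, "1.58-bit": 1.58}
estimates = {name: weight_memory_gib(NOMINAL_PARAMS, bits)
             for name, bits in formats.items()}

for name, gib in estimates.items():
    print(f"{name:>8}: {gib:.2f} GiB")
```

The FP16 estimate from this sketch comes out below the ~7.3GB figure quoted above. That gap would be consistent with the quoted figure covering more than the nominal weights (for example runtime overhead), although this page does not state how the figure was derived.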
Versioning Drift
The model uses a clear naming convention for its initial release, but there is no public, centralized changelog or semantic versioning system to track updates to the weights or underlying datasets over time. While the release date (December 2024) is clear, the lack of a formal mechanism to notify users of silent updates or performance drift limits the score.
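In the absence of a changelog, one way a user could detect silent changes to downloaded weights is to record a checksum at download time and compare it later. The sketch below shows that idea with Python's standard library; the helper names are hypothetical, a small temporary file stands in for a weight shard, and this is not a mechanism that TII provides.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight shards need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def has_drifted(path, recorded_digest):
    """True when the file's current digest differs from the one recorded earlier."""
    return sha256_of_file(path) != recorded_digest

# Demonstration with a stand-in file; in practice `path` would be a downloaded weight shard.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"stand-in for model weights")
    demo_path = tmp.name

recorded = sha256_of_file(demo_path)
unchanged = has_drifted(demo_path, recorded)   # file not modified yet
with open(demo_path, "ab") as handle:
    handle.write(b" (silently updated)")
changed = has_drifted(demo_path, recorded)     # contents now differ
os.remove(demo_path)

print(unchanged, changed)  # False True
```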