
Falcon-3B

Parameters

3B

Context Length

32,768 tokens

Modality

Text

Architecture

Dense

License

TII Falcon-LLM License 2.0

Release Date

17 Dec 2024

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped Query Attention

Attention Heads

12

Key-Value Heads

4

Attention Head Dimension

256

Position Embedding

RoPE

RoPE Theta

1000042

Sliding Window Attention

-

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

1,536

Number of Layers

22

FFN Intermediate Size (Dense)

-

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

131,072

Architecture Diagram

Input Tokens → Token Embedding (Position: RoPE · Hidden: 1.5k · Context: 32.8k) → 22 layers × [RMSNorm (pre-attention) → Grouped Query Attention (12 Q / 4 KV heads, head dim 256) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU) → residual add] → Final RMSNorm → Output Logits

Falcon-3B

Falcon-3B is a member of the Falcon 3 family of decoder-only large language models developed by the Technology Innovation Institute (TII). With 3 billion parameters, it is designed to run on resource-constrained hardware such as laptops and single GPUs. It targets reasoning, language understanding, instruction following, code generation, and mathematics, and it supports four languages: English, French, Spanish, and Portuguese.

Falcon-3B is a transformer-based, causal decoder-only model. It uses Grouped Query Attention (GQA), in which each group of query heads shares a single key-value head; this shrinks the Key-Value (KV) cache and speeds up inference. The model uses SwiGLU as its activation function and RMSNorm for normalization, and it encodes position with Rotary Positional Embeddings (RoPE) to support long contexts. It also uses FlashAttention 2 for faster attention computation and has a large vocabulary of about 131,000 tokens, which improves compression and downstream performance.
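To make the KV cache saving concrete, the following is a minimal sketch, not TII's implementation. It takes the 22 decoder blocks, 12 query heads, 4 KV heads, and head dimension of 256 cited in the integrity report on this page, and assumes 16-bit cache values and a single 32,768-token sequence. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: every layer stores one key tensor and one value tensor.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Figures quoted elsewhere on this page; 2 bytes per value (FP16/BF16) is an assumption.
NUM_LAYERS = 22
NUM_QUERY_HEADS = 12
NUM_KV_HEADS = 4
HEAD_DIM = 256
SEQ_LEN = 32_768

# GQA caches only the shared KV heads.
gqa = kv_cache_bytes(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, SEQ_LEN)
# Hypothetical multi-head baseline: one KV head per query head.
mha = kv_cache_bytes(NUM_LAYERS, NUM_QUERY_HEADS, HEAD_DIM, SEQ_LEN)

print(f"GQA cache: {gqa / 2**30:.2f} GiB")
print(f"MHA cache: {mha / 2**30:.2f} GiB")
print(f"Reduction factor: {mha / gqa:.1f}x")
```

Under these assumptions the reduction factor equals the ratio of query heads to KV heads, because every other term in the product is unchanged.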

Falcon-3B and its instruction-tuned counterpart were produced by pruning the larger Falcon3-7B-Base model and applying knowledge distillation, yielding a compact model. The base variant supports a context length of 8,000 tokens; the instruction-tuned variant extends this to 32,000 tokens, so it can process longer and more complex inputs. This makes Falcon-3B a candidate for applications where computational resources are constrained.
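The page does not say which distillation objective TII used. As an illustration only, here is a sketch of the generic temperature-scaled formulation, in which the student is trained to match the teacher's softened output distribution through a KL divergence. The function names, the temperature of 2.0, and the toy logits are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * (math.log(pt) - math.log(ps))
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return kl * temperature ** 2


# Toy logits over a 4-token vocabulary (hypothetical values).
teacher = [4.0, 1.0, 0.5, -2.0]
student = [2.5, 1.5, 0.0, -1.0]
print(f"distillation loss: {distillation_loss(student, teacher):.4f}")
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.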

About Falcon

The TII Falcon model family comprises causal decoder-only language models (7B, 40B). Their architecture, adapted from GPT-3, integrates rotary positional embeddings, Multi-Query Attention for inference efficiency, and FlashAttention for accelerated operations. Models are trained on the RefinedWeb dataset.


Evaluation Benchmarks

No evaluation benchmarks for Falcon-3B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B

66 / 100

Falcon-3B Model Integrity Report

Audit Note

Falcon-3B demonstrates a strong commitment to architectural transparency, providing clear documentation on its derivation from larger models and its specific structural parameters. The model excels in identity consistency and provides helpful guidance for deployment on consumer hardware through various quantization formats. However, significant transparency gaps remain regarding the specific composition of its training datasets and the detailed environmental impact of its compute requirements.

Upstream

20.5 / 30

Architectural Provenance

7.5 / 10

Falcon-3B is explicitly documented as a transformer-based causal decoder-only model. TII provides specific details on its derivation, noting it was pruned and 'healed' from the larger Falcon3-7B-Base model using knowledge distillation. Key architectural components are disclosed, including the use of Grouped Query Attention (GQA) with 12 query heads and 4 KV heads, SwiGLU activation, RMSNorm, and Rotary Positional Embeddings (RoPE) with a specific base value (1000042) to support its 32K context window. While a full peer-reviewed paper for the Falcon 3 series is less accessible than for Falcon 1, the technical specifications on Hugging Face and the official TII blog provide a clear lineage and structural breakdown.
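To show what a larger RoPE base changes, the sketch below computes the per-pair rotation frequencies from the standard RoPE formula, base^(-2i/d), for the head dimension of 256 and the base of 1000042 quoted on this page. The comparison base of 10000 is the commonly used default and is included only for contrast. The function names are hypothetical.

```python
import math


def rope_inverse_frequencies(head_dim, base):
    """One rotation frequency per pair of dimensions: base^(-2i/d)."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim, base):
    """Tokens needed for the slowest-rotating pair to complete one full turn."""
    slowest = rope_inverse_frequencies(head_dim, base)[-1]
    return 2 * math.pi / slowest


HEAD_DIM = 256          # head dimension quoted on this page
FALCON3_BASE = 1000042  # RoPE base quoted on this page
DEFAULT_BASE = 10000    # common default, used here only for comparison

for base in (DEFAULT_BASE, FALCON3_BASE):
    wavelength = longest_wavelength(HEAD_DIM, base)
    print(f"base={base}: longest wavelength is about {wavelength:,.0f} tokens")
```

A larger base slows the rotation of the low-frequency pairs, so positions far apart in a long sequence remain distinguishable.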

Dataset Composition

4.5 / 10

The model's training involved a two-stage process: a large-scale pretraining of the 7B parent on 14 trillion tokens, followed by a 100-gigatoken 'healing' phase for the 3B variant. TII identifies the data categories as web, code, STEM, and multilingual content (English, French, Spanish, Portuguese). However, specific percentage breakdowns of these components are not provided, and the exact sources beyond the 'RefinedWeb' legacy are not publicly listed. The post-training dataset is described as 1.2 million samples covering STEM, conversations, and safety, but the specific datasets used for this alignment are not named or accessible for audit.

Tokenizer Integrity

8.5 / 10

The tokenizer is publicly available via the Hugging Face repository and is well-documented. It features a vocabulary size of 131,072 tokens, which is a significant expansion over previous Falcon versions to improve compression and multilingual performance. The tokenizer approach is consistent with the claimed language support (EN, FR, ES, PT). Technical details such as the use of FlashAttention 2 for optimized computation are also verified in the model's configuration files.

Model

26.0 / 40

Parameter Density

8.0 / 10

The model is a dense architecture with 3 billion total parameters. TII provides a detailed architectural breakdown, including 22 decoder blocks, a hidden dimension of 1536, and a head dimension of 256. Unlike MoE models, there is no ambiguity regarding active vs. total parameters. The impact of quantization is also addressed through the official release of GGUF, AWQ, and 1.58-bit variants, providing transparency into how parameter density translates to different precision formats.

Training Compute

4.0 / 10

TII discloses the hardware used for the distillation and healing process (1024 H100 GPU chips). However, the total GPU hours for the 3B variant's specific training run are not explicitly stated, nor is there a calculated carbon footprint or detailed energy consumption report for this specific model. While the scale of the infrastructure is clear, the lack of duration and environmental metrics prevents a higher score.

Benchmark Reproducibility

5.0 / 10

TII provides scores for standard benchmarks like MMLU-PRO (29.7), MATH (19.9), and IFEval (54.4). They specify the use of the 'lm-evaluation-harness' and note that they report raw scores without 'fewshot_as_multiturn' to distinguish their results from competitors. However, the exact prompts, few-shot examples, and full evaluation code are not provided in a standalone reproducible repository, leading to reported discrepancies between internal TII scores and those on independent leaderboards like the Open LLM Leaderboard.

Identity Consistency

9.0 / 10

The model exhibits high identity consistency, correctly identifying itself as a member of the Falcon 3 family developed by TII. It does not suffer from the 'identity crisis' seen in some fine-tuned models that claim to be GPT-4 or Llama. Versioning is clear in the naming convention (Falcon3-3B-Instruct), and the model card explicitly outlines its intended use cases and limitations.

Downstream

19.5 / 30

License Clarity

7.0 / 10

The model is released under the 'TII Falcon-LLM License 2.0'. This is a custom license based on Apache 2.0 but includes specific 'Acceptable Use' restrictions and requirements for attribution (e.g., 'built using Falcon LLM technology'). While the terms are publicly accessible and relatively clear, the use of a non-standard, custom license rather than a pure OSI-approved license like Apache 2.0 or MIT introduces some legal complexity for commercial users.

Hardware Footprint

7.5 / 10

Hardware requirements are well-documented for various deployment scenarios. TII and community documentation provide VRAM estimates for FP16 (~7.3GB) and various quantized versions (INT8, INT4, 1.58-bit). The model is specifically marketed for consumer-grade hardware like laptops, and the documentation accurately reflects the memory scaling required for its 32K context window. The availability of multiple quantization formats (GGUF, AWQ) with associated performance notes aids transparency.
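As a rough cross-check of these figures, weight memory can be approximated as parameter count times bits per weight. The sketch below assumes a nominal 3 billion parameters and counts weights only, so its estimates come out below the ~7.3GB FP16 figure quoted above, which presumably also covers runtime overhead. Real quantized formats store extra metadata such as scales, which this ignores. The function name is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Weights-only memory estimate in GiB (no activations, KV cache, or metadata)."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 3_000_000_000  # nominal "3B" from this page, not the exact count

# Bits per weight for the precisions named on this page.
PRECISIONS = {"FP16": 16, "INT8": 8, "INT4": 4, "1.58-bit": 1.58}

estimates = {
    name: weight_memory_gib(NUM_PARAMS, bits) for name, bits in PRECISIONS.items()
}

for name, gib in estimates.items():
    print(f"{name:>8}: {gib:.2f} GiB")
```

KV cache memory for long contexts comes on top of these numbers and grows linearly with sequence length.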

Versioning Drift

5.0 / 10

The model uses a clear naming convention for its initial release, but there is no public, centralized changelog or semantic versioning system to track updates to the weights or underlying datasets over time. While the release date (December 2024) is clear, the lack of a formal mechanism to notify users of silent updates or performance drift limits the score.
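The section and total scores in this report are consistent with a plain sum of the ten criterion scores. The sketch below checks that arithmetic; treating the aggregation as an unweighted sum is an assumption inferred from the numbers, and the page does not document how the letter grade is derived.

```python
# Criterion scores as listed in this report, each out of 10.
SCORES = {
    "Upstream": {
        "Architectural Provenance": 7.5,
        "Dataset Composition": 4.5,
        "Tokenizer Integrity": 8.5,
    },
    "Model": {
        "Parameter Density": 8.0,
        "Training Compute": 4.0,
        "Benchmark Reproducibility": 5.0,
        "Identity Consistency": 9.0,
    },
    "Downstream": {
        "License Clarity": 7.0,
        "Hardware Footprint": 7.5,
        "Versioning Drift": 5.0,
    },
}

# Unweighted sum per section, then across sections.
section_totals = {name: sum(items.values()) for name, items in SCORES.items()}
total_score = sum(section_totals.values())

for name, subtotal in section_totals.items():
    print(f"{name}: {subtotal} / {10 * len(SCORES[name])}")
print(f"Total: {total_score} / 100")
```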
