Falcon3-3B

Parameters

3B

Context Length

32,768 tokens

Modality

Text

Architecture

Dense

License

TII Falcon-LLM License 2.0

Release Date

17 Dec 2024

Knowledge Cutoff

-

Technical Specifications

Attention

Attention Structure

Grouped-Query Attention

Attention Heads

12

Key-Value Heads

4

Attention Head Dimension

256

Position Embedding

RoPE

RoPE Theta

1,000,042

Sliding Window Attention

No

Sliding Window Size

-

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

3,072

Number of Layers

22

FFN Intermediate Size (Dense)

9,216

Multi-Token Prediction Heads

-

Tokenizer

Vocabulary Size

131,072

Architecture Diagram

Input Tokens → Token Embedding (position: RoPE; hidden: 3,072; context: 32,768; vocab: 131,072) → 22 × [RMSNorm (pre-attention) → Grouped-Query Attention (12 Q / 4 KV heads, head dim 256) → residual add → RMSNorm (pre-FFN) → Feed-Forward Network (SwiGLU, intermediate 9,216) → residual add] → Final RMSNorm → Output Logits

Falcon3-3B

The Falcon3-3B model is part of the Falcon 3 family of open foundation models developed by the Technology Innovation Institute (TII). It is designed to balance performance and efficiency so that it can run on a range of hardware, including smaller devices, and it targets science, mathematics, and code generation tasks. The Falcon 3 series includes base models for general-purpose generative tasks and instruct models for conversational applications.

Falcon3-3B is a causal decoder-only transformer with 22 decoder blocks. It uses Grouped Query Attention (GQA) with 12 query heads and 4 key-value heads and a wide head dimension of 256; because several query heads share each key-value head, the key-value cache is smaller and inference is cheaper. The model uses SwiGLU as its activation function, RMSNorm for normalization, and Rotary Position Embeddings (RoPE) with a high base value (theta = 1,000,042) to handle extended context. It also uses FlashAttention-2 to reduce memory use and speed up attention computation.
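The effect of GQA on inference memory can be sketched with a little arithmetic. The snippet below is an illustrative estimate, not TII's code: it assumes 2-byte (fp16/bf16) cache entries and a single sequence, and the 12-KV-head baseline is a hypothetical multi-head configuration used only for comparison.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Figures from the description above: 22 blocks, 4 KV heads, head dim 256.
gqa = kv_cache_bytes(num_layers=22, num_kv_heads=4, head_dim=256, seq_len=32_768)

# Hypothetical baseline: each of the 12 query heads keeps its own keys and values.
mha = kv_cache_bytes(num_layers=22, num_kv_heads=12, head_dim=256, seq_len=32_768)

print(f"GQA cache at 32,768 tokens: {gqa / 2**30:.2f} GiB")
print(f"MHA cache at 32,768 tokens: {mha / 2**30:.2f} GiB")
print(f"Reduction factor: {mha / gqa:.0f}x")
```

Under these assumptions, sharing 4 key-value heads among 12 query heads cuts the cache to one third of the multi-head baseline.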

The instruct variant of Falcon3-3B supports a context length of up to 32,768 tokens, while the base version supports 8,192 tokens. The model targets reasoning, language understanding, instruction following, and mathematical problem-solving, and it was trained to support four languages: English, French, Spanish, and Portuguese. Quantized versions, including int4, int8, and 1.58-bit BitNet, are available and reduce memory requirements for resource-constrained environments.

About Falcon 3

The TII Falcon 3 model family comprises open-source, decoder-only language models (1B-10B parameters) designed for efficiency. Key innovations include an extended 32K token context window, Grouped-Query Attention (GQA), and specialized versions for scientific and code-oriented applications. Some variants integrate Mamba-based architectures.


Evaluation Benchmarks

No evaluation benchmarks for Falcon3-3B available.

Rankings

Overall Rank

-

Coding Rank

-

Model Integrity

Total Score

B+

71 / 100

Falcon3-3B Model Integrity Report

Audit Note

Falcon3-3B exhibits strong transparency regarding its architectural design and hardware requirements, providing detailed specifications and accessible weights. Its primary transparency weaknesses lie in the lack of a comprehensive technical paper detailing dataset proportions and specific compute costs. While the model is highly verifiable in its structure and licensing, more granular disclosure of training data sources and evaluation prompts would be required for an exemplary rating.

Upstream

21.5 / 30

Architectural Provenance

8.0 / 10

The model is explicitly identified as a transformer-based causal decoder-only architecture. TII provides specific details on its derivation, noting it was pruned and 'healed' from the larger Falcon3-7B-Base model using knowledge distillation. Architectural specifics are well-documented, including the use of 22 decoder blocks, Grouped Query Attention (GQA) with 12 query and 4 KV heads, SwiGLU activation, RMSNorm, and Rotary Position Embeddings (RoPE) with a specific high value (1000042) for context handling.
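To illustrate what a high RoPE base value means for context handling, here is a minimal sketch using the standard RoPE frequency formula. It is an assumption that Falcon3-3B follows this exact formulation; only the head dimension (256) and the base value (1,000,042) come from the text above.

```python
import math


def rope_inverse_frequencies(head_dim, theta):
    """Rotation frequency for each dimension pair: theta ** (-2i / head_dim)."""
    return [theta ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


inv_freq = rope_inverse_frequencies(head_dim=256, theta=1_000_042)

# Wavelength, in token positions, of the fastest and slowest rotating pairs.
shortest = 2 * math.pi / inv_freq[0]
longest = 2 * math.pi / inv_freq[-1]

print(f"{len(inv_freq)} frequency pairs")
print(f"shortest wavelength: {shortest:.2f} positions")
print(f"longest wavelength: {longest:,.0f} positions")
```

With a base this large, the slowest-rotating pairs have wavelengths far longer than the 32,768-token context, so distant positions still receive distinct encodings.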

Dataset Composition

4.5 / 10

While TII discloses the total token count for the Falcon 3 family (14 trillion) and the specific amount used for 'healing' the 3B variant (100 billion tokens), the breakdown of the dataset is only described in general categories: web, code, STEM, and high-quality multilingual data. Specific proportions, source names beyond the legacy 'RefinedWeb', and detailed filtering/cleaning methodologies for this specific version are not publicly detailed in a technical paper.

Tokenizer Integrity

9.0 / 10

The tokenizer is publicly accessible via the Hugging Face repository and is well-documented. It features a vocabulary size of 131,072 tokens, which is a significant increase from previous versions. The approach is consistent with the claimed support for English, French, Spanish, and Portuguese, and the tokenizer files (tokenizer.json, tokenizer_config.json) are available for direct inspection and verification.

Model

29.0 / 40

Parameter Density

8.5 / 10

The model's parameter count is clearly stated as 3 billion. As a dense model, all parameters are active during inference. TII provides a detailed architectural breakdown including the number of layers (22), head dimensions (256), and attention configurations (GQA), which allows for precise verification of the parameter density claims.
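As a rough illustration of how such a verification might be done, the sketch below counts weights for a GQA + SwiGLU decoder without bias terms. The layer count, head counts, and head dimension come from the text above, and the FFN size and vocabulary size from the specification table. The hidden size of 3,072 and the untied input/output embeddings are assumptions made for this example.

```python
def estimate_parameters(hidden, layers, q_heads, kv_heads, head_dim, ffn, vocab,
                        tied_embeddings=False):
    """Rough weight count for a bias-free GQA + SwiGLU decoder."""
    q_proj = hidden * q_heads * head_dim
    kv_proj = 2 * hidden * kv_heads * head_dim  # key and value projections
    o_proj = q_heads * head_dim * hidden
    mlp = 3 * hidden * ffn                      # gate, up, and down projections
    norms = 2 * hidden                          # pre-attention and pre-FFN RMSNorm
    per_layer = q_proj + kv_proj + o_proj + mlp + norms
    embeddings = vocab * hidden * (1 if tied_embeddings else 2)
    return layers * per_layer + embeddings + hidden  # last term: final RMSNorm


total = estimate_parameters(
    hidden=3072,  # assumption: not stated in the paragraph above
    layers=22, q_heads=12, kv_heads=4, head_dim=256,
    ffn=9216, vocab=131_072,
)
print(f"Estimated size: {total / 1e9:.2f}B parameters")
```

Under these assumptions the estimate lands a little above 3.2 billion, which is consistent with the stated 3B class.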

Training Compute

5.0 / 10

TII discloses the hardware used for training (1,024 H100 GPUs). However, it does not state the GPU-hours spent on the 3B variant's distillation and healing phase, and it provides no carbon footprint calculation or energy consumption report specific to this model's development cycle.

Benchmark Reproducibility

6.0 / 10

TII provides benchmark results on standard sets such as MMLU-Pro, MATH, and IFEval, and specifies the use of the 'lm-evaluation-harness' framework. However, although TII mentions 'internal pipeline' settings, the exact prompts and few-shot configurations are not fully documented in a public technical report, which has led to some community-noted discrepancies with standard leaderboard evaluations.

Identity Consistency

9.5 / 10

The model demonstrates high identity consistency, correctly identifying itself as part of the Falcon 3 family in its system prompts and documentation. There is no evidence of the model claiming to be a competitor's product (e.g., GPT-4), and it maintains a clear versioning identity across its base and instruct variants.

Downstream

20.5 / 30

License Clarity

7.5 / 10

The model is released under the 'TII Falcon-LLM License 2.0'. This is a custom license based on Apache 2.0 but includes specific requirements, such as mandatory attribution for derivative works ('built using AI technology from TII'). While the terms are legally clear and allow for commercial use, it is not a standard OSI-approved license, which adds a layer of complexity for users.

Hardware Footprint

8.0 / 10

Hardware requirements are well-documented, with specific VRAM estimates provided for various quantization levels (FP16, INT8, INT4). For example, FP16 is noted to require approximately 7-8GB of VRAM. The availability of official quantized versions (GGUF, AWQ, GPTQ) and documentation on their impact on memory makes the hardware footprint highly transparent.
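The weight-only part of these estimates follows from simple arithmetic, sketched below. This is a back-of-the-envelope calculation rather than the method behind the documented figures: it uses the nominal 3 billion parameter count, and it ignores quantization metadata (scales and zero points), the key-value cache, activations, and runtime overhead.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


NOMINAL_PARAMS = 3e9  # "3 billion", as stated on this page

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

The FP16 weights alone come to about 6 GB here, so the 7-8 GB figure plausibly accounts for the cache and runtime overhead on top of the weights.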

Versioning Drift

5.0 / 10

The model follows a clear release versioning (Falcon 3 series), but a detailed, granular changelog for weight updates or specific 'drift' documentation is lacking. While the release date and initial version are clear, there is no established public system for tracking silent updates or performance changes over time beyond the initial Hugging Face commit history.
