Falcon-3B

Open Source

Open Weights

Parameters

Context Length

32K (32,768 tokens)

Modality

Text

Architecture

Dense

License

TII Falcon-LLM License 2.0

Release Date

17 Dec 2024

Knowledge Cutoff

System Requirements

VRAM requirements for different quantization methods and context sizes

1,024 tokens: 7.82 GB VRAM

Consumer: 1x RTX 4090 (24GB VRAM)
Datacenter: 1x NVIDIA A100 (80GB VRAM)
Apple Silicon: 1x Apple M3 Max (128GB VRAM)

32,768 tokens: 8.36 GB VRAM

Consumer: 1x RTX 4090 (24GB VRAM)
Datacenter: 1x NVIDIA A100 (80GB VRAM)
Apple Silicon: 1x Apple M3 Max (128GB VRAM)
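
The two figures above (7.82 GB at 1,024 tokens and 8.36 GB at 32,768 tokens) suggest that memory grows only slowly with context size. As a rough sketch, assuming the growth is linear between those two points (the page does not say how the figures were derived or which quantization they refer to), the requirement for an intermediate context size can be interpolated and compared against a device's memory. The function names here are hypothetical.

```python
# Figures quoted in the listing above: (context tokens, estimated VRAM in GB).
LOW = (1_024, 7.82)
HIGH = (32_768, 8.36)


def estimate_vram_gb(context_tokens: int) -> float:
    """Linearly interpolate VRAM between the two published data points.

    Assumption: memory grows linearly with context length in this range.
    """
    if not LOW[0] <= context_tokens <= HIGH[0]:
        raise ValueError("context_tokens is outside the published range")
    slope = (HIGH[1] - LOW[1]) / (HIGH[0] - LOW[0])  # GB per extra token
    return LOW[1] + slope * (context_tokens - LOW[0])


def fits(context_tokens: int, device_memory_gb: float) -> bool:
    """Return True if the interpolated estimate fits in the device's memory."""
    return estimate_vram_gb(context_tokens) <= device_memory_gb


if __name__ == "__main__":
    for tokens in (1_024, 8_192, 32_768):
        print(f"{tokens:>6} tokens: ~{estimate_vram_gb(tokens):.2f} GB")
    print("Fits in 24 GB at 32,768 tokens:", fits(32_768, 24))
```

Under this assumption the difference is 0.54 GB over 31,744 tokens, or about 17 KB per additional token if a GB is taken as 10^9 bytes, so all three listed devices have ample headroom at either context size.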

Evaluation Benchmarks

No evaluation benchmarks are available for Falcon-3B.

Rankings

Overall Rank

Coding Rank

About Falcon-3B

Falcon-3B is a member of the Falcon 3 family of decoder-only large language models developed by the Technology Innovation Institute (TII). With 3 billion parameters, this variant is designed to run on resource-constrained hardware such as laptops and single GPUs. It targets reasoning, language understanding, instruction following, code generation, and mathematics, and it supports four languages: English, French, Spanish, and Portuguese.

Falcon-3B is a transformer-based, causal decoder-only model. It uses Grouped Query Attention (GQA), in which each group of query heads shares a single key head and value head; this shrinks the Key-Value (KV) cache and speeds up inference. The model uses SwiGLU as its activation function, RMSNorm for normalization, and Rotary Positional Embeddings (RoPE) to encode token positions, which helps with longer contexts. It also uses FlashAttention 2 for faster attention computation and a large vocabulary of 131,000 tokens, which improves text compression and downstream performance.
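
To make the KV-cache saving from GQA concrete, here is a minimal sketch of the arithmetic. The layer count, head counts, and head dimension below are illustrative placeholders, not Falcon-3B's actual configuration (this page leaves those fields blank), and the helper names are hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence.

    Each layer stores two tensors (keys and values), each of shape
    [num_kv_heads, seq_len, head_dim]. bytes_per_value=2 assumes 16-bit storage.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


def kv_head_for_query_head(query_head, num_query_heads, num_kv_heads):
    """Map a query head index to the KV head shared by its group."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


if __name__ == "__main__":
    # Illustrative values only (assumptions, not taken from the model card).
    layers, q_heads, head_dim, seq_len = 24, 12, 128, 32_768
    for label, kv_heads in (("MHA", q_heads), ("GQA", 4), ("MQA", 1)):
        gib = kv_cache_bytes(layers, kv_heads, head_dim, seq_len) / 2**30
        print(f"{label}: {kv_heads:>2} KV heads -> {gib:.3f} GiB")
    print([kv_head_for_query_head(h, q_heads, 4) for h in range(q_heads)])
```

The cache scales linearly with the number of KV heads, so sharing one KV head among a group of three query heads cuts the cache to a third of the multi-head size. Multi-Query Attention is the limiting case with a single KV head.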

Falcon-3B and its instruction-tuned counterpart were produced by pruning and knowledge distillation from the larger Falcon3-7B-Base model, which yields a compact, efficient model. The base variant supports a context length of 8,000 tokens, while the instruction-tuned variant extends this to 32,000 tokens, so it can handle longer and more complex inputs. This makes Falcon-3B a reasonable choice for applications that must run where compute and memory are limited.
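
The page does not say which distillation objective TII used. As a generic illustration of knowledge distillation only, the sketch below computes a common form of the loss: the KL divergence between temperature-softened teacher and student token distributions, scaled by the square of the temperature. The function names and the temperature value are assumptions made for the example.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_total for x in scaled]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic textbook formulation; not taken from the Falcon training recipe.
    """
    log_p = log_softmax(teacher_logits, temperature)  # teacher
    log_q = log_softmax(student_logits, temperature)  # student
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    print("matching student:", distillation_loss(teacher, teacher))
    print("mismatched student:", distillation_loss(teacher, [0.1, 1.0, 2.0]))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what pushes the smaller model toward the larger model's behavior during training.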

Technical Specifications

Attention

Attention Structure

Grouped Query Attention

Attention Heads

Key-Value Heads

Attention Head Dimension

Position Embedding

RoPE

RoPE Theta

Sliding Window Attention

Sliding Window Size

Sliding Window Ratio

Linear Attention

Linear Attention Ratio

Normalization

RMS Normalization

Activation Function

SwiGLU

Dimensions

Hidden Dimension Size

1,536

Number of Layers

FFN Intermediate Size (Dense)

Multi-Token Prediction Heads

Tokenizer

Vocabulary Size

Model Integrity

Total Score

66 / 100

Upstream

20.5 / 30

Model

26.0 / 40

Downstream

19.5 / 30

About Falcon

The TII Falcon model family comprises causal decoder-only language models (7B, 40B). Their architecture, adapted from GPT-3, integrates rotary positional embeddings, Multi-Query Attention for inference efficiency, and FlashAttention for accelerated operations. Models are trained on the RefinedWeb dataset.