Parameters: 180B
Context Length: 2,048 tokens
Modality: Text
Architecture: Dense
License: Falcon-180B TII License and Acceptable Use Policy
Release Date: 23 Sept 2023
Knowledge Cutoff: Dec 2022
Attention Structure: Multi-Query Attention
Hidden Dimension Size: 12,288
Number of Layers: 60
Attention Heads: 96
Key-Value Heads: 1
Activation Function: GELU
Normalization: Layer Normalization
Position Embedding: RoPE
VRAM requirements for different quantization methods and context sizes
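Exact figures depend on the runtime and quantization format, but a back-of-the-envelope estimate follows directly from the spec table above: weight memory is roughly parameters × bytes per weight, and the KV cache grows with context length. The sketch below walks through that arithmetic; the per-weight byte counts and the fp16/8-bit/4-bit labels are illustrative assumptions, and the numbers ignore framework overhead and activation memory.

```python
# Back-of-the-envelope VRAM estimate for Falcon-180B, using the spec table above.
# Illustrative approximations only; real deployments add runtime overhead,
# activation buffers, and quantization-format metadata.

N_PARAMS = 180e9              # 180B parameters
N_LAYERS = 60
N_HEADS = 96
N_KV_HEADS = 1                # Multi-Query Attention: a single shared K/V head
HIDDEN = 12288
HEAD_DIM = HIDDEN // N_HEADS  # 128

BYTES_PER_WEIGHT = {"fp16": 2.0, "8-bit": 1.0, "4-bit": 0.5}  # assumed averages

def weight_gib(quant: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return N_PARAMS * BYTES_PER_WEIGHT[quant] / 2**30

def kv_cache_gib(context_tokens: int, bytes_per_value: float = 2.0) -> float:
    """Approximate KV cache: 2 (K and V) x layers x kv_heads x head_dim per token."""
    per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * bytes_per_value
    return context_tokens * per_token / 2**30

if __name__ == "__main__":
    for quant in BYTES_PER_WEIGHT:
        total = weight_gib(quant) + kv_cache_gib(2048)
        print(f"{quant}: ~{total:.0f} GiB for weights plus a 2,048-token KV cache")
```

Because MQA keeps only one key/value head, the cache for the full 2,048-token context works out to roughly 60 MiB, so weight memory dominates the footprint at every quantization level.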
The Falcon-180B model, developed by the Technology Innovation Institute (TII), is a large-scale causal decoder-only language model for advanced natural language processing tasks. It is an evolution of Falcon-40B, scaled up significantly in parameter count. The model is intended as a foundation for applications that require sophisticated language understanding and generation, including text generation, conversational AI, and summarization, and it is designed to be fine-tuned for specialized use cases; a separate chat-optimized variant, fine-tuned on instruction datasets, is also available.
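As a minimal sketch of how the base checkpoint is typically loaded for text generation, assuming the Hugging Face transformers library, access to the gated tiiuae/falcon-180B repository, and enough GPU memory (see the VRAM notes above); the prompt, dtype, and device-mapping choices are illustrative, not prescribed by TII:

```python
# Minimal text-generation sketch with Hugging Face transformers.
# Assumes access to the gated tiiuae/falcon-180B checkpoint; the chat-tuned
# variant is published separately and can be substituted for the model id.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-180B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision; quantized loading is also common
    device_map="auto",           # shard the 180B weights across available GPUs
)

inputs = tokenizer(
    "Falcon-180B is a causal decoder-only model that", return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_k=10)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```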
Architecturally, Falcon-180B implements an optimized transformer design, drawing inspiration from the GPT-3 framework while incorporating several key modifications. A notable feature is Multi-Query Attention (MQA), which improves scalability and inference performance by having all attention heads share a single key and value projection. The model also uses Rotary Position Embeddings (RoPE) to encode positional information within sequences and incorporates FlashAttention for efficient attention computation. Its decoder blocks employ a parallel attention/multi-layer perceptron (MLP) structure with two layer norms, which contributes to processing efficiency. Training was conducted on 3.5 trillion tokens, drawn primarily (approximately 85%) from TII's RefinedWeb dataset and supplemented by curated corpora of technical papers, conversations, and code. Pretraining used up to 4,096 A100 GPUs and accumulated around 7,000,000 GPU hours, running on Gigatron, a custom distributed training codebase that combines a 3D parallelism strategy with ZeRO optimization.
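To make the Multi-Query Attention idea concrete, the sketch below shows a single attention layer in which 96 query heads share one key/value projection, using the dimensions from the spec table. It is a simplified, assumed illustration (no RoPE, FlashAttention, or KV caching), not the actual Falcon implementation.

```python
# Simplified Multi-Query Attention: many query heads, one shared K/V head.
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    def __init__(self, hidden: int = 12288, n_heads: int = 96):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = hidden // n_heads  # 128 for Falcon-180B
        self.q_proj = nn.Linear(hidden, hidden, bias=False)              # per-head query projections
        self.kv_proj = nn.Linear(hidden, 2 * self.head_dim, bias=False)  # one shared key/value head
        self.out_proj = nn.Linear(hidden, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)  # (b, heads, t, d)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)                         # (b, t, d) each
        # The single K/V head is broadcast across all query heads: this is the "sharing" in MQA.
        k = k.unsqueeze(1)  # (b, 1, t, d)
        v = v.unsqueeze(1)  # (b, 1, t, d)
        scores = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5                   # (b, heads, t, t)
        causal = torch.tril(torch.ones(t, t, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~causal, float("-inf"))                         # causal masking
        out = torch.softmax(scores, dim=-1) @ v                                     # (b, heads, t, d)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))
```

Because the shared K/V head stores only one key and one value vector per token, the inference-time KV cache shrinks by roughly a factor of the number of heads compared to standard multi-head attention.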
Falcon-180B is built for robust performance across a broad range of language tasks. Its design supports work that requires deep understanding and reasoning, such as research assistance, code generation, and knowledge-based querying. Training on a large, diverse corpus lets the model store and retrieve information effectively, making it well suited to question answering and to summarizing complex topics. This versatility allows it to serve as a general-purpose foundation across a wide array of domains.
The TII Falcon model family comprises causal decoder-only language models (Falcon-7B, Falcon-40B, and Falcon-180B). Their architecture, adapted from GPT-3, integrates rotary positional embeddings, Multi-Query Attention for inference efficiency, and FlashAttention for accelerated attention operations. The models are trained primarily on the RefinedWeb dataset.
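The parallel attention/MLP layout with two layer norms mentioned above can be sketched as follows: both branches read from their own layer norm over the same residual input, and their outputs are summed back into the residual stream rather than applied sequentially as in a standard GPT block. This is a simplified, assumed rendering that reuses the MultiQueryAttention sketch from the previous example, not Falcon's actual code.

```python
# Parallel attention/MLP decoder block with two layer norms (simplified sketch).
import torch
import torch.nn as nn

class ParallelDecoderBlock(nn.Module):
    def __init__(self, hidden: int = 12288, mlp_ratio: int = 4):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden)   # layer norm feeding the attention branch
        self.ln_mlp = nn.LayerNorm(hidden)    # layer norm feeding the MLP branch
        self.attn = MultiQueryAttention(hidden)  # from the MQA sketch above
        self.mlp = nn.Sequential(
            nn.Linear(hidden, mlp_ratio * hidden, bias=False),
            nn.GELU(),
            nn.Linear(mlp_ratio * hidden, hidden, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual stream plus the two branches, computed in parallel from the same input.
        return x + self.attn(self.ln_attn(x)) + self.mlp(self.ln_mlp(x))
```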
Ranking applies to Local LLMs. No evaluation benchmarks are available for Falcon-180B.
Overall Rank: -
Coding Rank: -