Parameters: 180B
Context Length: 2,048 tokens
Modality: Text
Architecture: Dense
License: Falcon-180B TII License and Acceptable Use Policy
Release Date: 23 Sept 2023
Knowledge Cutoff: Dec 2022
Attention Structure: Multi-Query Attention
Hidden Dimension Size: 14848
Number of Layers: 80
Attention Heads: 232
Key-Value Heads: 1
Activation Function: GELU
Normalization: Layer Normalization
Position Embedding: RoPE
VRAM requirements for different quantization methods and context sizes
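As a rough guide, weight memory scales with the parameter count times the bytes per parameter of the chosen format, plus a comparatively small KV cache thanks to Multi-Query Attention. The Python sketch below is a back-of-envelope estimate under assumed byte widths (2 bytes for FP16/BF16, 1 for INT8, 0.5 for 4-bit), not a measured benchmark; it ignores activation memory, runtime overhead, and quantization metadata.

```python
# Back-of-envelope VRAM estimate for Falcon-180B: weights plus KV cache.
# Illustrative only: real usage also depends on the runtime, activation
# buffers, and quantization overhead (scales, zero points, group sizes).

N_PARAMS = 180e9   # total parameters
N_LAYERS = 80      # decoder layers
HEAD_DIM = 64      # hidden size 14848 / 232 attention heads
N_KV_HEADS = 1     # multi-query attention: one shared key/value head

# Assumed bytes per parameter for common weight formats.
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

def kv_cache_bytes(context_len: int, batch: int = 1, dtype_bytes: int = 2) -> float:
    # One K and one V tensor per layer: [batch, n_kv_heads, context, head_dim].
    # Multi-query attention keeps this term tiny compared with the weights.
    return 2 * N_LAYERS * batch * N_KV_HEADS * context_len * HEAD_DIM * dtype_bytes

for fmt, bpp in BYTES_PER_PARAM.items():
    weights_gib = N_PARAMS * bpp / 1024**3
    cache_gib = kv_cache_bytes(2048) / 1024**3
    print(f"{fmt:>10} @ 2,048-token context: ~{weights_gib + cache_gib:,.0f} GiB")
```

Under these assumptions the weights alone come to roughly 335 GiB in FP16/BF16, about 168 GiB in INT8, and about 84 GiB in 4-bit formats, with the KV cache adding well under 1 GiB at the full 2,048-token context.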
The Falcon-180B model, developed by the Technology Innovation Institute (TII), represents a large-scale causal decoder-only language model designed for advanced natural language processing tasks. It is an evolution of the Falcon 40B model, significantly scaled in parameter count. The model aims to serve as a foundational component for various applications requiring sophisticated language understanding and generation capabilities, including text generation, conversational AI, and summarization. This model has been specifically engineered to facilitate further fine-tuning for specialized use cases, with a separate chat-optimized variant available that has been fine-tuned on instruction datasets.
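For text generation, the model can be loaded through standard causal-LM tooling; the short sketch below uses the Hugging Face transformers library as an illustration. The checkpoint names tiiuae/falcon-180B (and tiiuae/falcon-180B-chat for the instruction-tuned variant), the bfloat16 dtype, and the automatic device map are assumptions of this sketch; the checkpoint is gated behind the TII license, and half-precision inference requires several hundred GB of GPU memory.

```python
# Minimal text-generation sketch with Hugging Face transformers.
# Assumes the tiiuae/falcon-180B checkpoint (gated: requires accepting the
# TII license) and enough GPU memory to hold the model in bfloat16.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-180B"  # use "tiiuae/falcon-180B-chat" for the chat variant

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory
    device_map="auto",           # shard layers across available GPUs
)

prompt = "Summarize the key ideas behind multi-query attention:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```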
Architecturally, Falcon-180B implements an optimized transformer design, drawing inspiration from the GPT-3 framework while incorporating key innovations. A notable feature is the adoption of Multi-Query Attention (MQA), which improves scalability and inference performance by having all attention heads share a single key and value projection. The model also uses Rotary Position Embeddings (RoPE) to encode positional information within sequences and incorporates FlashAttention for efficient attention computation. Its decoder blocks employ a parallel attention/multilayer perceptron (MLP) structure with two layer norms, contributing to its processing efficiency. Training was conducted on a dataset of 3.5 trillion tokens, primarily derived from TII's RefinedWeb dataset (approximately 85%), supplemented by curated corpora including technical papers, conversations, and code. This pretraining, which used up to 4,096 A100 GPUs and consumed around 7,000,000 GPU hours, leveraged a custom distributed training codebase named Gigatron, employing a 3D parallelism strategy combined with ZeRO optimization.
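To make the parallel attention/MLP layout and the shared key/value projection concrete, the following PyTorch sketch implements a simplified Falcon-style decoder block. It is a schematic under stated assumptions, not TII's implementation: rotary embeddings, FlashAttention kernels, KV caching, and tensor parallelism are all omitted, and the two layer norms feed the attention and MLP paths from the same residual input.

```python
# Simplified Falcon-style decoder block: parallel attention + MLP with two
# layer norms, and multi-query attention (all query heads share one K/V head).
# Schematic only: RoPE, FlashAttention, KV caching, and dropout are omitted.
import math
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    def __init__(self, hidden: int, n_heads: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, hidden // n_heads
        self.q_proj = nn.Linear(hidden, hidden, bias=False)
        # A single key head and a single value head, shared by all query heads;
        # this is what shrinks the KV cache and speeds up inference.
        self.kv_proj = nn.Linear(hidden, 2 * self.head_dim, bias=False)
        self.out_proj = nn.Linear(hidden, hidden, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k, v = self.kv_proj(x).split(self.head_dim, dim=-1)
        k = k.unsqueeze(1)  # (b, 1, t, head_dim), broadcast across query heads
        v = v.unsqueeze(1)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))
        out = scores.softmax(dim=-1) @ v        # (b, n_heads, t, head_dim)
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))

class ParallelDecoderBlock(nn.Module):
    def __init__(self, hidden: int, n_heads: int):
        super().__init__()
        self.ln_attn = nn.LayerNorm(hidden)  # layer norm feeding the attention path
        self.ln_mlp = nn.LayerNorm(hidden)   # layer norm feeding the MLP path
        self.attn = MultiQueryAttention(hidden, n_heads)
        self.mlp = nn.Sequential(
            nn.Linear(hidden, 4 * hidden, bias=False),
            nn.GELU(),
            nn.Linear(4 * hidden, hidden, bias=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention and MLP run in parallel from the same input; both results
        # are added back to the residual stream.
        return x + self.attn(self.ln_attn(x)) + self.mlp(self.ln_mlp(x))

# Smoke test with toy dimensions (the real model is vastly larger).
block = ParallelDecoderBlock(hidden=256, n_heads=4)
print(block(torch.randn(2, 16, 256)).shape)  # torch.Size([2, 16, 256])
```

In the actual model, rotary embeddings would be applied to the query and key tensors before the attention product, and the single shared key/value pair is what the inference KV cache stores per layer.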
Falcon-180B is engineered for robust performance across a spectrum of language-based activities. Its design supports tasks that necessitate deep understanding and logical reasoning, such as complex research, code generation, and knowledge-based querying. The extensive training on a diverse corpus enables the model to effectively store and retrieve information, making it suitable for question answering systems and generating summaries of complex topics. The model's inherent versatility allows it to adapt to and perform effectively in a wide array of domains, supporting its utility as a powerful tool for diverse applications.
The TII Falcon model family comprises causal decoder-only language models (7B, 40B, and 180B). Their architecture, adapted from GPT-3, integrates rotary positional embeddings, Multi-Query Attention for inference efficiency, and FlashAttention for accelerated operations. The models are trained primarily on the RefinedWeb dataset.
Rankings apply to local LLMs.
No evaluation benchmarks are available for Falcon-180B.