Parameters
1.3B
Context Length
2,048 tokens
Modality
Text
Architecture
Dense
License
Apache-2.0
Release Date
29 Feb 2024
Knowledge Cutoff
Nov 2023
Attention Structure
Multi-Head Attention
Hidden Dimension Size
2048
Number of Layers
24
Attention Heads
16
Key-Value Heads
16
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
CroissantLLM Base is a 1.3 billion parameter decoder-only transformer model designed to provide balanced bilingual proficiency in French and English. Unlike many contemporary large language models that treat non-English languages as secondary through minor data inclusion, CroissantLLM was pre-trained using a strictly balanced 1:1 ratio of French and English data. This data curation choice aims to mitigate linguistic bias and ensure that French cultural and technical knowledge is represented with the same fidelity as English. The model was trained on 3 trillion tokens, a substantial corpus that exceeds the training volume of many larger open-source models in its class.
Technically, the model is built upon the Llama architecture, incorporating established components such as Rotary Positional Encodings (RoPE) and RMSNorm to stabilize deep network activations. To optimize for the bilingual use case, the developers introduced a custom SentencePiece-based tokenizer trained on a high-quality mix of French, English, and code data. This tokenizer achieves significantly lower fertility rates for French text compared to standard multilingual tokenizers, improving both computational efficiency and the model's ability to capture linguistic nuances. The architecture features 24 layers with a hidden dimension of 2048 and 16 attention heads, following a dense structure without the use of mixture-of-experts.
CroissantLLM Base is engineered for high performance on consumer-grade hardware, making it suitable for deployment on local devices such as personal computers and mobile systems. Its training history is highly transparent, with the researchers releasing extensive details on the pre-training data and providing access to checkpoints throughout the training process. The model serves as a foundation for various downstream tasks, particularly translation and content generation in French-centric environments, where its specialized vocabulary and balanced training provide a distinct advantage over models trained on predominantly English-centric datasets.
CroissantLLM is a bilingual French-English language model developed by French research institutions. The model is trained on a curated mix of French and English data to provide language understanding while preserving French linguistic heritage. It is designed for low-resource inference on consumer-grade hardware.
No evaluation benchmark results are available for CroissantLLM Base.
Overall Rank
-
Coding Rank
-
Total Score
82 / 100
CroissantLLM Base exhibits an exemplary transparency profile, characterized by the rare release of intermediate training checkpoints and the full disclosure of its bilingual dataset composition. The model's adherence to the FMTI framework and the provision of custom evaluation code for its bilingual benchmarks set a high standard for open-source initiatives. Its primary strengths lie in its architectural clarity and permissive licensing, though more structured versioning documentation would further enhance its long-term reproducibility.
Architectural Provenance
CroissantLLM is explicitly documented as a 1.3B parameter decoder-only transformer based on the Llama architecture. The technical report and GitHub repository provide extensive details on the architectural components, including the use of Rotary Positional Encodings (RoPE) and RMSNorm. The training methodology is fully described as from-scratch pre-training on 3 trillion tokens with a specific 1:1 bilingual data ratio, and the researchers have released dozens of intermediate checkpoints (every 5,000 steps), enabling verifiable tracking of the model's development.
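The cadence of intermediate checkpoints (every 5,000 steps up to the final 190k-step model) can be enumerated programmatically. The sketch below assumes a hypothetical `step-<N>` branch naming scheme; the actual revision names on Hugging Face may differ and should be checked against the repository.

```python
def checkpoint_revision(step: int) -> str:
    """Hypothetical branch name for an intermediate checkpoint.
    The real naming scheme on the Hugging Face repo may differ."""
    return f"step-{step}"

# Checkpoints are reported every 5,000 steps up to the final 190k-step model.
revisions = [checkpoint_revision(s) for s in range(5_000, 190_001, 5_000)]
assert len(revisions) == 38  # literally "dozens" of checkpoints

# Loading one (requires network access and `transformers`; revision name is an assumption):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "croissantllm/CroissantLLMBase", revision=checkpoint_revision(100_000)
# )
```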
Dataset Composition
The training data is well-documented with a clear 50/50 split between French and English. The researchers released the 'Croissant Dataset' on Hugging Face, which includes a manually curated French split. The composition is broken down into specific categories such as web (Common Crawl), code (The Stack), and legal documents. While the exact filtering code for every subset isn't fully centralized in a single step-by-step guide, the high-level sources and proportions are publicly verifiable through the technical paper and the released dataset itself.
Tokenizer Integrity
The model uses a custom SentencePiece-based tokenizer with a vocabulary size of 64,000 tokens, specifically trained on a balanced mix of French, English, and code. Detailed fertility rate comparisons (tokens per word) against other models like Llama 2 and Bloom are provided in the paper, demonstrating its optimization for French. The tokenizer files are publicly available on Hugging Face for inspection and verification.
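The fertility metric described above is straightforward to compute for any tokenizer. The sketch below uses a toy 3-character-chunk tokenizer purely for illustration; with `transformers` installed, one would pass the real tokenizer's `tokenize` method instead.

```python
def trigram_tokenize(text: str) -> list[str]:
    """Toy subword tokenizer: split every word into 3-character chunks."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

def measure_fertility(tokenize, text: str) -> float:
    """Fertility = average tokens produced per whitespace-separated word."""
    return len(tokenize(text)) / len(text.split())

# With a real tokenizer (requires network access), pass e.g.:
#   tok = AutoTokenizer.from_pretrained("croissantllm/CroissantLLMBase")
#   measure_fertility(tok.tokenize, french_corpus)
print(measure_fertility(trigram_tokenize, "bonjour tout le monde"))  # 2.0
```

Lower fertility on French text means fewer tokens per word, which translates directly into cheaper inference and a longer effective context for French inputs.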
Parameter Density
The model is a dense architecture with 1.3 billion parameters. The architectural breakdown is precise: 24 layers, a hidden dimension of 2048, and 16 attention heads. There is no ambiguity regarding active vs. total parameters as it is not a Mixture-of-Experts (MoE) model. The parameter count is consistent across all official documentation and repository files.
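The documented shape (24 layers, hidden dimension 2048, vocabulary 64,000) can be sanity-checked against the 1.3B headline figure. The sketch below assumes a standard Llama-style layout with a SwiGLU intermediate size of 5632 and tied input/output embeddings; neither assumption is confirmed by the documentation, so the result is only a rough cross-check.

```python
def llama_param_count(vocab=64_000, hidden=2048, layers=24,
                      intermediate=5632, tied_embeddings=True) -> int:
    """Rough parameter count for a Llama-style dense decoder.
    `intermediate` and `tied_embeddings` are assumptions, not published values."""
    embed = vocab * hidden
    attn = 4 * hidden * hidden           # Q, K, V, O projections
    mlp = 3 * hidden * intermediate      # SwiGLU: gate, up, down projections
    norms = 2 * hidden                   # two RMSNorm weights per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tied_embeddings:
        total += vocab * hidden          # separate LM head
    return total

print(f"{llama_param_count() / 1e9:.2f}B")  # ≈ 1.36B under these assumptions
```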
Training Compute
The training was conducted on the Jean Zay supercomputer using 512 NVIDIA A100 80GB GPUs. The total training duration and step count (190k steps) are disclosed. While a specific carbon footprint calculation wasn't prominently featured in the primary model card, the hardware specifications and training time provide sufficient data for independent estimation, and the project's adherence to the FMTI transparency framework includes compute disclosure.
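Given the disclosed hardware and token count, the independent estimation mentioned above can be sketched with the standard ~6·N·D approximation for dense transformer training FLOPs. The A100 BF16 peak and the 40% model FLOPs utilization (MFU) below are assumptions, not disclosed figures.

```python
def training_flops(params: float, tokens: float) -> float:
    """Standard ~6*N*D estimate of total training FLOPs for a dense transformer."""
    return 6 * params * tokens

def gpu_days(total_flops: float, n_gpus=512, peak_flops=312e12, mfu=0.40) -> float:
    """Wall-clock days; peak is A100 BF16 dense throughput, MFU is an assumption."""
    seconds = total_flops / (n_gpus * peak_flops * mfu)
    return seconds / 86_400

flops = training_flops(1.3e9, 3e12)   # ~2.3e22 FLOPs
print(f"{flops:.2e} FLOPs, ~{gpu_days(flops):.1f} days on 512 A100s")
```

A few days of wall-clock training on 512 A100s is consistent in order of magnitude with the disclosed 190k-step run, though the true duration depends on the actual MFU achieved.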
Benchmark Reproducibility
The researchers introduced 'FrenchBench' to evaluate the model and released the evaluation code on GitHub. They provide results for standard benchmarks like MMLU and Arc-Challenge alongside their custom bilingual evaluations. However, while the code is public, the exact prompt templates for every single sub-task in FrenchBench require some digging through the repository to fully replicate, preventing a perfect score.
Identity Consistency
The model correctly identifies its version and bilingual nature in documentation. As a base model, it does not have the 'identity' issues often seen in instruction-tuned models (e.g., claiming to be GPT-4). It is transparent about its limitations as a foundation model that requires few-shot prompting or fine-tuning for specific tasks.
License Clarity
The model, code, and datasets are released under the Apache 2.0 license, which is a standard, permissive open-source license. There are no conflicting proprietary terms or 'non-commercial only' restrictions. The licensing is clearly stated on the Hugging Face model card and in the GitHub repository.
Hardware Footprint
The model is specifically marketed for consumer-grade hardware. Documentation confirms it can run on a single GPU with 8GB-16GB VRAM depending on precision. Quantized versions (4-bit, 8-bit) are available via the community and MLC-LLM, with clear guidance on the memory savings. The scaling of memory with context length is predictable given the standard Llama architecture.
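The memory scaling described above can be estimated directly from the published architecture. The sketch below covers weights at different precisions plus the per-sequence KV cache for a standard multi-head Llama layout (16 KV heads, i.e. no grouped-query sharing); activation overhead and framework buffers are ignored.

```python
def weight_memory_gib(params=1.3e9, bytes_per_param=2.0) -> float:
    """Weights alone: 2 bytes/param = fp16/bf16, 1 = int8, 0.5 = int4."""
    return params * bytes_per_param / 2**30

def kv_cache_gib(context: int, layers=24, hidden=2048, bytes_per_value=2) -> float:
    """KV cache for one sequence: K and V, each `hidden` wide, per layer per token."""
    return 2 * layers * hidden * context * bytes_per_value / 2**30

for bpp in (2.0, 1.0, 0.5):
    print(f"{bpp} B/param -> {weight_memory_gib(bytes_per_param=bpp):.2f} GiB weights")
print(f"KV cache @ 2048-token context: {kv_cache_gib(2048):.3f} GiB")
```

At fp16 the weights fit in roughly 2.4 GiB and the full-context KV cache adds under half a GiB, which is consistent with the consumer-hardware positioning.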
Versioning Drift
The project uses clear versioning for its checkpoints and final release. The Hugging Face repository maintains a commit history and the paper documents the 'final' 190k step version. However, a formal semantic versioning changelog for post-release updates is less structured, relying mostly on GitHub/Hugging Face commit messages rather than a dedicated 'Versions' document.