Parameters
1.3B
Context Length
2,048 tokens
Modality
Text
Architecture
Dense
License
Apache-2.0
Release Date
29 Feb 2024
Knowledge Cutoff
Nov 2023
Attention Structure
Multi-Head Attention
Hidden Dimension Size
2048
Number of Layers
24
Attention Heads
16
Key-Value Heads
16
Activation Function
SwiGLU
Normalization
RMS Normalization
Position Embedding
Rotary Position Embedding (RoPE)
CroissantLLM Base is a 1.3 billion parameter decoder-only transformer model designed to provide balanced bilingual proficiency in French and English. Unlike many contemporary large language models that treat non-English languages as secondary through minor data inclusion, CroissantLLM was pre-trained using a strictly balanced 1:1 ratio of French and English data. This data curation choice aims to mitigate linguistic bias and ensure that French cultural and technical knowledge is represented with the same fidelity as English. The model was trained on 3 trillion tokens, a substantial corpus that exceeds the training volume of many larger open-source models in its class.
Technically, the model is built upon the Llama architecture, incorporating established components such as Rotary Positional Encodings (RoPE) and RMSNorm to stabilize deep network activations. To optimize for the bilingual use case, the developers introduced a custom SentencePiece-based tokenizer trained on a high-quality mix of French, English, and code data. This tokenizer achieves significantly lower fertility rates for French text compared to standard multilingual tokenizers, improving both computational efficiency and the model's ability to capture linguistic nuances. The architecture features 24 layers with a hidden dimension of 2048 and 16 attention heads, following a dense structure without the use of mixture-of-experts.
CroissantLLM Base is engineered for high performance on consumer-grade hardware, making it suitable for deployment on local devices such as personal computers and mobile systems. Its training history is highly transparent, with the researchers releasing extensive details on the pre-training data and providing access to checkpoints throughout the training process. The model serves as a foundation for various downstream tasks, particularly translation and content generation in French-centric environments, where its specialized vocabulary and balanced training provide a distinct advantage over models trained on predominantly English-centric datasets.
CroissantLLM is a bilingual French-English language model developed by French research institutions. The model is trained on a curated mix of French and English data to provide language understanding while preserving French linguistic heritage. It is designed for low-resource inference on consumer-grade hardware.
No evaluation benchmark results are available for CroissantLLM Base.
Overall Rank
-
Coding Rank
-
Total Score
82 / 100
CroissantLLM Base exhibits an exemplary transparency profile, characterized by the rare release of intermediate training checkpoints and the full disclosure of its bilingual dataset composition. The model's adherence to the FMTI framework and the provision of custom evaluation code for its bilingual benchmarks set a high standard for open-source initiatives. Its primary strengths lie in its architectural clarity and permissive licensing, though more structured versioning documentation would further enhance its long-term reproducibility.
Architectural Provenance
CroissantLLM is explicitly documented as a 1.3B parameter decoder-only transformer based on the Llama architecture. The technical report and GitHub repository provide extensive details on the architectural components, including the use of Rotary Positional Encodings (RoPE) and RMSNorm. The training methodology is fully described as from-scratch pre-training on 3 trillion tokens with a specific 1:1 bilingual data ratio, and the researchers have released dozens of intermediate checkpoints (every 5,000 steps), enabling verifiable tracking of the model's development.
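The cadence of intermediate checkpoints (every 5,000 steps up to the final 190k-step model) can be enumerated programmatically. The sketch below assumes a hypothetical `step-<N>` branch naming scheme; the actual revision names on Hugging Face may differ and should be checked against the repository.

```python
def checkpoint_revision(step: int) -> str:
    """Hypothetical branch name for an intermediate checkpoint.
    The real naming scheme on the Hugging Face repo may differ."""
    return f"step-{step}"

# Checkpoints are reported every 5,000 steps up to the final 190k-step model.
revisions = [checkpoint_revision(s) for s in range(5_000, 190_001, 5_000)]
assert len(revisions) == 38  # literally "dozens" of checkpoints

# Loading one (requires network access and `transformers`; revision name is an assumption):
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(
#     "croissantllm/CroissantLLMBase", revision=checkpoint_revision(100_000)
# )
```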
Dataset Composition
The training data is well-documented with a clear 50/50 split between French and English. The researchers released the 'Croissant Dataset' on Hugging Face, which includes a manually curated French split. The composition is broken down into specific categories such as web (Common Crawl), code (The Stack), and legal documents. While the exact filtering code for every subset isn't fully centralized in a single step-by-step guide, the high-level sources and proportions are publicly verifiable through the technical paper and the released dataset itself.
Tokenizer Integrity
The model uses a custom SentencePiece-based tokenizer with a vocabulary size of 64,000 tokens, specifically trained on a balanced mix of French, English, and code. Detailed fertility rate comparisons (tokens per word) against other models like Llama 2 and Bloom are provided in the paper, demonstrating its optimization for French. The tokenizer files are publicly available on Hugging Face for inspection and verification.
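The fertility metric described above is straightforward to compute for any tokenizer. The sketch below uses a toy 3-character-chunk tokenizer purely for illustration; with `transformers` installed, one would pass the real tokenizer's `tokenize` method instead.

```python
def trigram_tokenize(text: str) -> list[str]:
    """Toy subword tokenizer: split every word into 3-character chunks."""
    return [w[i:i + 3] for w in text.split() for i in range(0, len(w), 3)]

def measure_fertility(tokenize, text: str) -> float:
    """Fertility = average tokens produced per whitespace-separated word."""
    return len(tokenize(text)) / len(text.split())

# With a real tokenizer (requires network access), pass e.g.:
#   tok = AutoTokenizer.from_pretrained("croissantllm/CroissantLLMBase")
#   measure_fertility(tok.tokenize, french_corpus)
print(measure_fertility(trigram_tokenize, "bonjour tout le monde"))  # 2.0
```

Lower fertility on French text means fewer tokens per word, which translates directly into cheaper inference and a longer effective context for French inputs.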
Parameter Density
The model is a dense architecture with 1.3 billion parameters. The architectural breakdown is precise: 24 layers, a hidden dimension of 2048, and 16 attention heads. There is no ambiguity regarding active vs. total parameters as it is not a Mixture-of-Experts (MoE) model. The parameter count is consistent across all official documentation and repository files.
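The documented shape (24 layers, hidden dimension 2048, vocabulary 64,000) can be sanity-checked against the 1.3B headline figure. The sketch below assumes a standard Llama-style layout with a SwiGLU intermediate size of 5632 and tied input/output embeddings; neither assumption is confirmed by the documentation, so the result is only a rough cross-check.

```python
def llama_param_count(vocab=64_000, hidden=2048, layers=24,
                      intermediate=5632, tied_embeddings=True) -> int:
    """Rough parameter count for a Llama-style dense decoder.
    `intermediate` and `tied_embeddings` are assumptions, not published values."""
    embed = vocab * hidden
    attn = 4 * hidden * hidden           # Q, K, V, O projections
    mlp = 3 * hidden * intermediate      # SwiGLU: gate, up, down projections
    norms = 2 * hidden                   # two RMSNorm weights per layer
    total = embed + layers * (attn + mlp + norms) + hidden  # + final norm
    if not tied_embeddings:
        total += vocab * hidden          # separate LM head
    return total

print(f"{llama_param_count() / 1e9:.2f}B")  # ≈ 1.36B under these assumptions
```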
Training Compute
The training was conducted on the Jean Zay supercomputer using 512 NVIDIA A100 80GB GPUs. The total training duration and step count (190k steps) are disclosed. While a specific carbon footprint calculation wasn't prominently featured in the primary model card, the hardware specifications and training time provide sufficient data for independent estimation, and the project's adherence to the FMTI transparency framework includes compute disclosure.
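Given the disclosed hardware and token count, the independent estimation mentioned above can be sketched with the standard ~6·N·D approximation for dense transformer training FLOPs. The A100 BF16 peak and the 40% model FLOPs utilization (MFU) below are assumptions, not disclosed figures.

```python
def training_flops(params: float, tokens: float) -> float:
    """Standard ~6*N*D estimate of total training FLOPs for a dense transformer."""
    return 6 * params * tokens

def gpu_days(total_flops: float, n_gpus=512, peak_flops=312e12, mfu=0.40) -> float:
    """Wall-clock days; peak is A100 BF16 dense throughput, MFU is an assumption."""
    seconds = total_flops / (n_gpus * peak_flops * mfu)
    return seconds / 86_400

flops = training_flops(1.3e9, 3e12)   # ~2.3e22 FLOPs
print(f"{flops:.2e} FLOPs, ~{gpu_days(flops):.1f} days on 512 A100s")
```

A few days of wall-clock training on 512 A100s is consistent in order of magnitude with the disclosed 190k-step run, though the true duration depends on the actual MFU achieved.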
Benchmark Reproducibility
The researchers introduced 'FrenchBench' to evaluate the model and released the evaluation code on GitHub. They provide results for standard benchmarks like MMLU and Arc-Challenge alongside their custom bilingual evaluations. However, while the code is public, the exact prompt templates for every single sub-task in FrenchBench require some digging through the repository to fully replicate, preventing a perfect score.
Identity Consistency
The model correctly identifies its version and bilingual nature in documentation. As a base model, it does not have the 'identity' issues often seen in instruction-tuned models (e.g., claiming to be GPT-4). It is transparent about its limitations as a foundation model that requires few-shot prompting or fine-tuning for specific tasks.
License Clarity
The model, code, and datasets are released under the Apache 2.0 license, which is a standard, permissive open-source license. There are no conflicting proprietary terms or 'non-commercial only' restrictions. The licensing is clearly stated on the Hugging Face model card and in the GitHub repository.
Hardware Footprint
The model is specifically marketed for consumer-grade hardware. Documentation confirms it can run on a single GPU with 8GB-16GB VRAM depending on precision. Quantized versions (4-bit, 8-bit) are available via the community and MLC-LLM, with clear guidance on the memory savings. The scaling of memory with context length is predictable given the standard Llama architecture.
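The memory scaling described above can be estimated directly from the published architecture. The sketch below covers weights at different precisions plus the per-sequence KV cache for a standard multi-head Llama layout (16 KV heads, i.e. no grouped-query sharing); activation overhead and framework buffers are ignored.

```python
def weight_memory_gib(params=1.3e9, bytes_per_param=2.0) -> float:
    """Weights alone: 2 bytes/param = fp16/bf16, 1 = int8, 0.5 = int4."""
    return params * bytes_per_param / 2**30

def kv_cache_gib(context: int, layers=24, hidden=2048, bytes_per_value=2) -> float:
    """KV cache for one sequence: K and V, each `hidden` wide, per layer per token."""
    return 2 * layers * hidden * context * bytes_per_value / 2**30

for bpp in (2.0, 1.0, 0.5):
    print(f"{bpp} B/param -> {weight_memory_gib(bytes_per_param=bpp):.2f} GiB weights")
print(f"KV cache @ 2048-token context: {kv_cache_gib(2048):.3f} GiB")
```

At fp16 the weights fit in roughly 2.4 GiB and the full-context KV cache adds under half a GiB, which is consistent with the consumer-hardware positioning.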
Versioning Drift
The project uses clear versioning for its checkpoints and final release. The Hugging Face repository maintains a commit history and the paper documents the 'final' 190k step version. However, a formal semantic versioning changelog for post-release updates is less structured, relying mostly on GitHub/Hugging Face commit messages rather than a dedicated 'Versions' document.