Managing code with Git is standard practice, but applying the same approach directly to the multi-terabyte datasets and multi-gigabyte model checkpoints common in LLMOps quickly becomes impractical. Standard Git repositories are not designed to handle such large binary artifacts efficiently. Cloning repositories becomes prohibitively slow, storage costs balloon due to Git's history tracking of large files, and performance degrades significantly. This necessitates specialized strategies for versioning these large assets while maintaining reproducibility and traceability.
The goal remains consistent with standard MLOps: link specific versions of code, data, and models together to ensure experiments and deployments are reproducible. However, the mechanisms must adapt to the scale involved.
Why Standard Git Falters
Git tracks changes by storing snapshots of files, and its packfiles delta-compress text files (like source code) very effectively. For large binary files (datasets, model weights), however:
- Repository Bloat: Each version of a large file stored directly in Git adds significantly to the repository size. Even minor changes can result in storing near-duplicates.
- Performance Degradation: Operations like `git clone`, `git checkout`, and `git push`/`git pull` must transfer these large files, leading to extremely long wait times, especially for distributed teams and CI/CD systems.
- Inefficient Diffing: Git's delta compression algorithms are often ineffective for binary files, meaning it stores nearly full copies even for small changes.
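To see the bloat concretely, here is a small shell sketch (the file name and sizes are arbitrary) that commits two versions of a large binary to a plain Git repository and then inspects the resulting pack size:

```bash
git init bloat-demo && cd bloat-demo
dd if=/dev/urandom of=weights.bin bs=1M count=500     # simulate a 500 MB checkpoint
git add weights.bin && git commit -m "checkpoint v1"
dd if=/dev/urandom of=weights.bin bs=1M count=500     # an "updated" checkpoint
git add weights.bin && git commit -m "checkpoint v2"
git gc                                                # repack to measure true on-disk size
git count-objects -vH                                 # size-pack is roughly 1 GB: both versions
                                                      # are stored almost in full
```

Every clone of this repository now transfers both copies, even though consumers typically only need the latest one.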
Strategies for Versioning Large Artifacts
Several approaches have emerged to address these challenges, often involving storing large files outside the Git repository while keeping metadata or pointers within Git.
1. Git Large File Storage (LFS)
Git LFS is a Git extension that replaces large files in your repository with small text pointer files. The actual file contents are stored on a separate LFS server (self-hosted or provided by GitHub, GitLab, and other platforms).
- Mechanism: During `git add`, LFS intercepts large files (based on configuration), stores them on the LFS server, and adds a pointer file to the Git repository. During `git checkout` or `git pull`, LFS uses the pointer file to download the corresponding large file from the LFS server (a command sketch follows this list).
- Pros: Relatively seamless integration with existing Git workflows. Users interact with Git commands largely as usual. Widely supported by Git hosting platforms.
- Cons: Still relies on Git for managing the pointer files, which can add overhead. Performance can depend heavily on the LFS server bandwidth and location. Doesn't inherently manage data pipelines or dependencies. Can become costly based on storage and bandwidth usage on the LFS server. Might not scale effectively for extremely large (multi-PB) datasets or scenarios with very frequent large file updates.
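A minimal sketch of the workflow (the tracked patterns and file names are illustrative; most Git hosting platforms provide the LFS endpoint automatically):

```bash
git lfs install                            # one-time setup: registers the LFS filters in Git config
git lfs track "*.safetensors" "*.bin"      # patterns handed off to LFS, recorded in .gitattributes
git add .gitattributes model.safetensors
git commit -m "Track model weights with Git LFS"
git push                                   # file contents go to the LFS server; only small pointer
                                           # files (spec version, sha256 OID, size) enter Git history
```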
2. Data Version Control (DVC)
DVC is an open source tool specifically designed for data versioning, ML pipeline management, and experiment tracking. It operates alongside Git.
- Mechanism: DVC stores metadata about datasets and models (often as small `.dvc` files containing hashes and remote storage locations) in your Git repository. The actual data artifacts are stored in external storage (S3, GCS, Azure Blob, HDFS, local storage, etc.) using content-addressable storage, meaning files are identified by their content hash, which avoids duplication. `dvc add` tracks a file or directory, `dvc push` uploads it to remote storage, and `dvc pull` downloads it based on the `.dvc` file in the current Git commit (a workflow sketch follows below).
- Pros: Explicitly designed for data versioning. Storage agnostic. Deduplicates data effectively via content hashing. Integrates pipeline definitions (DAGs) and metrics tracking. Scales well for large datasets.
- Cons: Introduces a separate tool and commands (`dvc add`, `dvc push`, `dvc pull`) alongside Git, which adds a learning curve. The workflow is slightly different from pure Git/LFS.
Conceptual flow: Git tracks code and the DVC metadata files, while DVC manages pushing and pulling large artifacts to and from remote storage based on the content hashes stored in the `.dvc` files.
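A minimal end-to-end sketch of this workflow (the directory name, remote name, and bucket are hypothetical):

```bash
dvc init                                              # set up DVC inside an existing Git repo
dvc add data/pretraining_corpus                       # hash the data, write data/pretraining_corpus.dvc
git add data/pretraining_corpus.dvc data/.gitignore
git commit -m "Track pretraining corpus with DVC"
dvc remote add -d storage s3://my-bucket/dvc-store    # default remote for artifact storage
dvc push                                              # upload the content-addressed files

# Later, on another machine or in CI: check out a commit (or tag),
# then fetch exactly the data version its .dvc files reference
git checkout v1.2.0
dvc pull
```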
3. Lakehouse Table Formats (Delta Lake, Apache Iceberg, Apache Hudi)
These formats, often used within data lakehouses, provide ACID transactions, versioning (time travel), and schema evolution capabilities primarily for large tabular datasets managed by engines like Spark, Trino, or Flink.
- Mechanism: Data is stored in underlying object storage (like S3) in formats like Parquet, but the table format maintains transaction logs that define the state of the table at different points in time. You can query the table "as of" a specific timestamp or version.
- Pros: Excellent for managing large, evolving structured or semi-structured datasets used in training. Provides atomicity and consistency. Enables querying historical data states easily. Integrates well with data processing ecosystems.
- Cons: Primarily designed for table-like data, less suited for versioning unstructured data blobs or monolithic model checkpoints directly (though metadata about these could potentially be stored in tables). Requires integration with compatible query engines.
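For example, with Delta Lake on a Spark session configured with the Delta extensions (the table name is hypothetical), historical table states can be listed and queried directly in SQL:

```bash
spark-sql -e "
  DESCRIBE HISTORY training_corpus;                                    -- list table versions
  SELECT COUNT(*) FROM training_corpus VERSION AS OF 42;               -- query an exact version
  SELECT COUNT(*) FROM training_corpus TIMESTAMP AS OF '2025-01-15';   -- or a point in time
"
```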
4. Native Object Storage Versioning
Cloud object storage services (AWS S3, Google Cloud Storage, Azure Blob Storage) offer built-in versioning capabilities.
- Mechanism: When enabled, modifying or deleting an object creates a new version instead of overwriting the old one. Each version has a unique ID, and previous versions can be listed and restored.
- Pros: Simple to enable. Managed by the storage provider. Provides a basic rollback capability.
- Cons: Versioning is tied to the object itself, not explicitly linked to code commits without additional tracking mechanisms. Managing relationships between code versions and specific object versions often requires manual effort or custom tooling. Can lead to significant storage costs if lifecycle policies (e.g., deleting old versions) are not configured appropriately. Doesn't provide sophisticated diffing or pipeline awareness.
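With AWS S3, for instance, the workflow looks like this (the bucket, key, and version ID are placeholders):

```bash
# Turn on versioning for the bucket (a one-time operation)
aws s3api put-bucket-versioning --bucket my-llm-artifacts \
    --versioning-configuration Status=Enabled

# Every overwrite of this key now creates a new, independently addressable version
aws s3 cp models/model.safetensors s3://my-llm-artifacts/checkpoints/model.safetensors

# List the versions of the object and download an older one by its version ID
aws s3api list-object-versions --bucket my-llm-artifacts \
    --prefix checkpoints/model.safetensors
aws s3api get-object --bucket my-llm-artifacts \
    --key checkpoints/model.safetensors --version-id <VERSION_ID> restored_model.safetensors
```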
5. Artifact Repositories and Metadata Tracking
MLOps platforms and experiment tracking tools (MLflow, Weights & Biases, Neptune.ai, Vertex AI Experiments, SageMaker Experiments) often include artifact tracking capabilities.
- Mechanism: These tools log artifacts (datasets, models) associated with specific experiment runs or code versions. They store metadata linking the run ID (often tied to a Git commit) with the artifact's location (e.g., an S3 URI) and potentially its hash.
- Pros: Integrates versioning with the broader experiment management context. Provides a UI for browsing artifacts and their associated runs/metrics. Flexible.
- Cons: Focus is more on logging and linking rather than direct version control operations like diffing or merging data. Relies heavily on consistent logging practices within training/evaluation scripts. Might not enforce versioning as strictly as dedicated tools like DVC.
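As an illustration with the MLflow CLI (the run ID and paths are placeholders; in practice artifacts are usually logged from the training script via the Python API, with the Git commit recorded as a run tag):

```bash
# Attach a trained checkpoint to an existing run
mlflow artifacts log-artifact --run-id 0a1b2c3d4e --local-file models/ckpt.safetensors

# Later: inspect and retrieve exactly what that run produced
mlflow artifacts list --run-id 0a1b2c3d4e
mlflow artifacts download --run-id 0a1b2c3d4e --dst-path ./retrieved
```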
Versioning Model Checkpoints
Model checkpoints, especially for large foundation models, can range from tens to hundreds of gigabytes or more. The challenges mirror those of versioning large datasets.
- Common Solutions: Git LFS, DVC, and artifact repositories are frequently used. DVC's content-addressable storage is particularly beneficial when checkpoints are saved as multiple shards, since any shards that are identical across versions are deduplicated.
- Checkpoint Strategy: Versioning every single checkpoint during a long training run might be excessive. A common strategy is to version:
- The initial pre-trained model (if applicable).
- Key intermediate checkpoints (e.g., based on evaluation metrics improving or at regular intervals).
- The final trained model.
- Fine-tuned model versions derived from a base model.
- PEFT Considerations: Parameter-Efficient Fine-Tuning (PEFT) methods like LoRA produce much smaller adapter weights (megabytes instead of gigabytes). These are easier to version using standard Git or Git LFS. However, you still need to version or reliably reference the large base model they apply to. Your versioning system must track both the base model version and the adapter version.
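As an illustration (paths and names are made up), a LoRA adapter directory produced by the PEFT library is small enough to commit directly, and its adapter_config.json already records which base model it was trained against:

```bash
ls -lh adapters/support-bot-lora/
#   adapter_config.json          1.2K   # includes "base_model_name_or_path" pinning the base model
#   adapter_model.safetensors     34M   # adapter weights only; the multi-GB base model stays external

git add adapters/support-bot-lora/
git commit -m "Add support-bot LoRA adapter (base model pinned in adapter_config.json)"
```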
Best Practices
- Separate Code and Artifacts: Use Git exclusively for code and configuration files. Use a dedicated solution (LFS, DVC, Object Versioning + Metadata Tracking) for large data and model artifacts.
- Use Content Hashing: Tools like DVC leverage content hashing, which naturally deduplicates identical files, saving storage and bandwidth.
- Tagging and Conventions: Establish clear naming conventions and use Git tags to mark significant commits corresponding to specific dataset versions or trained models used in production releases.
- Automation: Integrate artifact versioning commands (`dvc push`/`dvc pull`, `git lfs push`/`git lfs pull`) into your CI/CD pipelines to ensure consistency and reproducibility. Link CI/CD runs explicitly to the Git commit and the artifact versions used or produced (a skeletal CI sketch follows this list).
- Metadata is Significant: Even when using simpler object storage versioning, maintain robust metadata linking code commits to the specific object versions used for training or deployment. Experiment tracking tools are invaluable here.
- Storage Lifecycle Management: Implement policies for managing old artifact versions in remote storage to control costs, especially when using native object storage versioning or Git LFS. Decide how long historical versions need to be retained.
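A skeletal CI job (tool-agnostic shell; the training script and output paths are hypothetical) that ties the commit, the input data version, and the produced model together might look like this:

```bash
#!/usr/bin/env bash
set -euo pipefail

dvc pull                                      # fetch exactly the data pinned by this commit's .dvc files
python train.py --output models/checkpoint    # hypothetical training entry point

dvc add models/checkpoint                     # hash and stage the newly produced artifact
dvc push                                      # upload it to remote storage
git add models/checkpoint.dvc
git commit -m "ci: model artifact for $(git rev-parse --short HEAD)"
git push                                      # requires CI credentials with write access
```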
Choosing the right approach depends on your team's workflow, the scale of your data/models, your existing infrastructure (cloud provider, on-premise), and the need for integrated pipeline features. Often, a combination of Git for code and a tool like DVC or careful use of an artifact repository provides a robust solution for LLMOps.