Data lakes offer virtually unlimited storage capacity at a low cost, but this flexibility introduces significant organizational challenges. Without a strict governing structure, a data lake can rapidly deteriorate into an unmanageable collection of disconnected files, often referred to as a "data swamp." To maintain data quality and reliability, engineers employ the Medallion Architecture. This design pattern organizes data into three distinct layers (Bronze, Silver, and Gold) based on the quality and validation level of the data stored.
The primary goal of this architecture is to apply data quality incrementally. As data flows through these layers, it becomes cleaner, more structured, and more aggregated. This multi-hop approach allows different personas (data engineers, data scientists, and business analysts) to access data at the stage most appropriate for their specific needs.
The Bronze layer, often called the "Raw Zone" or "Landing Zone," serves as the initial entry point for all data entering the lake. The priority in this layer is write speed and historical retention rather than read performance or data cleanliness.
Data in the Bronze layer is typically an append-only, immutable record of the source system. It retains the original format of the source, such as JSON, CSV, XML, or external database dumps. Engineers should not alter the data content during ingestion. If the source system sends a string in a field that should be an integer, the Bronze layer stores the string. This fidelity ensures that you always have a complete record of what the source system produced.
Core characteristics of Bronze data include:

- Append-only, immutable storage that preserves a complete history of what the source system produced.
- The original source format (JSON, CSV, XML, or database dumps), with content left unaltered during ingestion.
- Lineage metadata columns such as _ingestion_timestamp and _source_file_name to track when and from where each record arrived.

This layer acts as a safety net. If a bug occurs in your downstream transformation logic, you can fix the code and replay the transformation pipeline starting from the Bronze layer without needing to request data from the source system again.
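The ingestion pattern described above can be sketched in plain Python. This is a minimal illustration, not a specific framework API: the function name ingest_to_bronze, the sample fields, and the file name are all hypothetical. Note that the mistyped age field arrives as a string and is stored exactly as received.

```python
import json
from datetime import datetime, timezone

def ingest_to_bronze(raw_lines, source_file_name, bronze_table):
    """Append raw records to the Bronze layer unchanged,
    adding only lineage metadata columns."""
    ingestion_ts = datetime.now(timezone.utc).isoformat()
    for line in raw_lines:
        record = json.loads(line)  # parse, but do not validate or coerce types
        record["_ingestion_timestamp"] = ingestion_ts
        record["_source_file_name"] = source_file_name
        bronze_table.append(record)  # append-only: never update or delete
    return bronze_table

# The first record's 'age' is a string; Bronze keeps it as-is for fidelity.
bronze = ingest_to_bronze(
    ['{"user_id": 1, "age": "42"}', '{"user_id": 2, "age": 35}'],
    "users_2024_01_15.json",
    [],
)
```

Because nothing is rejected or rewritten, this layer stays fast to write and safe to replay.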
The Silver layer represents the validated, enriched version of your data. While Bronze is a dump of everything, Silver is a trusted asset. In this layer, data is filtered, cleaned, and augmented. It usually adopts a high-performance columnar storage format like Apache Parquet, often managed by a transaction log (such as Delta Lake or Apache Iceberg) to handle updates and deletes effectively.
Transitions from Bronze to Silver involve "cleaning" operations. You enforce schemas, handle null values, deduplicate records, and convert data types. For example, a timestamp string in Bronze becomes a true timestamp object in Silver.
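These cleaning operations can be sketched as a single pass over Bronze records. This is a simplified, framework-free illustration (in practice you would use Spark, Delta Lake, or similar); the record fields user_id and signup_ts are hypothetical.

```python
from datetime import datetime

def bronze_to_silver(bronze_records):
    """Promote Bronze records to Silver: enforce types, handle
    null keys, and deduplicate on the business key."""
    seen = set()
    silver = []
    for rec in bronze_records:
        if rec.get("user_id") is None:   # handle nulls: reject incomplete rows
            continue
        if rec["user_id"] in seen:       # deduplicate on the business key
            continue
        seen.add(rec["user_id"])
        silver.append({
            "user_id": int(rec["user_id"]),  # enforce integer type
            # timestamp string in Bronze becomes a true timestamp in Silver
            "signup_ts": datetime.fromisoformat(rec["signup_ts"]),
        })
    return silver

silver = bronze_to_silver([
    {"user_id": "1", "signup_ts": "2024-01-15T09:30:00"},
    {"user_id": "1", "signup_ts": "2024-01-15T09:30:00"},  # duplicate
    {"user_id": None, "signup_ts": "2024-01-16T08:00:00"},  # missing key
])
```

Of the three input records, only one survives: the duplicate and the record with a null key are filtered out, and the remaining row has properly typed columns.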
The Silver layer functions as an Enterprise Data Warehouse view within the lake. It is typically normalized (3rd Normal Form) and contains atomic data rather than aggregations. Data scientists frequently query the Silver layer because it provides clean, granular data required for training machine learning models, without the pre-computed bias of business aggregations.
Typical transformations in the Silver layer:

- Schema enforcement and data type conversion (for example, parsing timestamp strings into true timestamp objects).
- Handling null or malformed values.
- Deduplicating records.
- Light enrichment (such as joining a store_name from a store_id) if essential for basic understanding.

The Gold layer is the curated, aggregated layer designed for specific business use cases. While Silver is organized around data subjects (Customers, Products, Orders), Gold is organized around project-specific requirements (Quarterly Sales Report, Customer Churn Analytics).
Data in the Gold layer is highly transformed. It applies business logic to the granular data found in Silver. This often involves heavy aggregations, such as summing daily sales, calculating moving averages, or determining complex KPIs. The data model here frequently shifts from a normalized structure to a dimensional model, such as a Star Schema, to optimize read performance for BI tools like Tableau, PowerBI, or Looker.
Because the data is pre-computed and aggregated, the query latency is significantly lower. Business analysts and executives consume this data effectively without needing to understand the underlying complexities of the raw ingestions or the cleaning logic.
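A Gold-layer aggregation of this kind can be sketched as follows. The daily-sales example and the field names order_date and amount are assumptions for illustration; real pipelines would write the result to a dimensional table consumed by BI tools.

```python
from collections import defaultdict

def silver_to_gold_daily_sales(silver_orders):
    """Aggregate granular Silver orders into a Gold table of
    total sales per day, pre-computed for fast BI queries."""
    daily = defaultdict(float)
    for order in silver_orders:
        daily[order["order_date"]] += order["amount"]
    # Emit one pre-aggregated row per day, sorted for stable output
    return [{"order_date": d, "total_sales": t} for d, t in sorted(daily.items())]

gold = silver_to_gold_daily_sales([
    {"order_date": "2024-01-15", "amount": 100.0},
    {"order_date": "2024-01-15", "amount": 50.0},
    {"order_date": "2024-01-16", "amount": 75.0},
])
```

A dashboard reading this Gold table scans one row per day instead of re-summing every individual order, which is why query latency drops so sharply at this layer.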
Figure: The progression of data through the Medallion architecture. Data moves from raw ingestion (Bronze) to validated structures (Silver) and finally to aggregated business metrics (Gold).
Implementing this layered approach provides structural isolation. A failure in the raw ingestion process (Bronze) does not immediately break the executive dashboard (Gold), as the Gold tables simply retain their last known good state until the pipeline recovers.
Furthermore, this separation aligns with the "Schema-on-Read" flexibility of data lakes while providing the reliability of "Schema-on-Write" in the upper layers. You write to Bronze with minimal constraints to capture data quickly. You write to Silver and Gold with strict constraints to ensure reliability for downstream consumers.
By decoupling the ingestion format from the consumption format, you optimize each layer for different constraints: Bronze favors write speed and historical fidelity, Silver favors validated, granular data for analysis and model training, and Gold favors low-latency reads for business consumers.
In the following sections, we will analyze how this logical architecture maps to specific physical implementation choices, such as Lambda and Kappa architectures, to handle the timing of how data moves through these layers.
© 2026 ApX Machine Learning