Data lakes serve as the central repository for raw and processed data in modern AI and analytics stacks. This course examines the technical architecture required to build scalable, high-performance data lakes. We define the structural layers, moving from raw ingestion to refined tables suitable for machine learning and reporting. The curriculum addresses storage formats like Apache Parquet and Avro, open table formats such as Apache Iceberg and Delta Lake, and the separation of compute from storage. You will configure metadata catalogs, implement ingestion pipelines, and execute distributed queries. The content focuses on architectural patterns, specifically the Medallion architecture, and provides technical guidance on partition strategies and schema management.
Prerequisites: SQL & programming basics
Level:
Architecture Patterns
Design multi-layered data architectures using the Medallion pattern (Bronze, Silver, Gold).
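The Bronze/Silver/Gold progression can be sketched in plain Python (the records, field names, and cleaning rules here are hypothetical; a production lake would run these steps on a distributed engine over files, not in-memory lists):

```python
# Bronze: raw ingested records, kept exactly as they arrived.
RAW_EVENTS = [
    {"user": "a", "amount": "10", "ts": "2024-01-01"},
    {"user": "a", "amount": "10", "ts": "2024-01-01"},   # duplicate
    {"user": "b", "amount": "oops", "ts": "2024-01-02"}, # malformed value
    {"user": "b", "amount": "5", "ts": "2024-01-02"},
]

def to_silver(bronze):
    """Silver: validate types and drop duplicates."""
    seen, out = set(), []
    for rec in bronze:
        key = (rec["user"], rec["amount"], rec["ts"])
        if key in seen:
            continue
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine bad rows instead
        seen.add(key)
        out.append({"user": rec["user"], "amount": amount, "ts": rec["ts"]})
    return out

def to_gold(silver):
    """Gold: aggregate per-user totals, ready for reporting."""
    totals = {}
    for rec in silver:
        totals[rec["user"]] = totals.get(rec["user"], 0.0) + rec["amount"]
    return totals

gold = to_gold(to_silver(RAW_EVENTS))
print(gold)  # {'a': 10.0, 'b': 5.0}
```

The point of the pattern is that each layer has one job: Bronze preserves the raw feed for replay, Silver enforces quality, and Gold serves consumers.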
Storage Formats
Implement columnar storage using Apache Parquet and manage transactions with open table formats like Iceberg.
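A conceptual sketch of why columnar layout matters, in plain Python (illustrative only; Parquet adds encoding, compression, and per-column statistics on top of this idea, and the sample rows are hypothetical):

```python
# Row-oriented layout: each record stored together.
rows = [
    {"id": 1, "city": "NYC", "temp": 21.5},
    {"id": 2, "city": "LA",  "temp": 28.0},
    {"id": 3, "city": "NYC", "temp": 19.0},
]

# Columnar layout: one contiguous array per field, as Parquet stores data.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query touching only `temp` scans a single column, not whole rows --
# the core reason analytical engines prefer columnar formats.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
print(round(avg_temp, 2))  # 22.83
```

Table formats such as Iceberg then layer transactional metadata over these column-oriented files, so writers can commit atomically without rewriting the data.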
Data Ingestion
Construct batch and streaming pipelines to move data from sources into the lake reliably.
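A minimal batch-ingestion sketch: land raw JSON records in the lake's bronze zone, partitioned hive-style by event date. The directory layout, dataset name, and record shape are assumptions for illustration, not a fixed convention:

```python
import json
import tempfile
from pathlib import Path

def ingest_batch(records, lake_root):
    """Group records by event date and write one file per partition."""
    by_date = {}
    for rec in records:
        by_date.setdefault(rec["ts"][:10], []).append(rec)
    written = []
    for dt, recs in by_date.items():
        # Hive-style partition directory: .../events/dt=YYYY-MM-DD/
        part_dir = Path(lake_root) / "bronze" / "events" / f"dt={dt}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "batch_0001.json"
        path.write_text("\n".join(json.dumps(r) for r in recs))
        written.append(path)
    return written

with tempfile.TemporaryDirectory() as root:
    files = ingest_batch(
        [{"ts": "2024-01-01T08:00", "v": 1}, {"ts": "2024-01-02T09:00", "v": 2}],
        root,
    )
    print([f.parent.name for f in files])  # ['dt=2024-01-01', 'dt=2024-01-02']
```

A streaming variant follows the same shape, except batches arrive continuously and writes must be made idempotent (e.g. deterministic file names per micro-batch) so retries do not duplicate data.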
Query Optimization
Optimize data retrieval speeds using partitioning, file pruning, and distributed query engines.
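Partition pruning can be sketched as a filter over partition paths: given hive-style directories, the engine skips any partition whose `dt=` value falls outside the query's date range and never opens those files (the bucket and file names below are hypothetical):

```python
partitions = [
    "s3://lake/events/dt=2024-01-01/part-0.parquet",
    "s3://lake/events/dt=2024-01-02/part-0.parquet",
    "s3://lake/events/dt=2024-02-01/part-0.parquet",
]

def prune(paths, lo, hi):
    """Keep only files whose dt partition value lies in [lo, hi]."""
    kept = []
    for p in paths:
        dt = next(
            seg.split("=", 1)[1]
            for seg in p.split("/")
            if seg.startswith("dt=")
        )
        if lo <= dt <= hi:  # ISO dates compare correctly as strings
            kept.append(p)
    return kept

print(prune(partitions, "2024-01-01", "2024-01-31"))
# Keeps only the two January partitions.
```

File pruning works the same way one level down: min/max statistics stored per file (or per Parquet row group) let the engine skip files whose value ranges cannot match the predicate.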