Data lakes serve as the central repository for raw and processed data in modern AI and analytics stacks. This course examines the technical architecture required to build scalable, high-performance data lakes. We define the structural layers, moving from raw ingestion to refined tables suitable for machine learning and reporting. The curriculum addresses storage formats like Apache Parquet and Avro, open table formats such as Apache Iceberg and Delta Lake, and the separation of compute from storage. You will configure metadata catalogs, implement ingestion pipelines, and execute distributed queries. The content focuses on architectural patterns, specifically the Medallion architecture, and provides technical guidance on partition strategies and schema management.
Prerequisites: SQL & programming basics
Level:
Architecture Patterns
Design multi-layered data architectures using the Medallion pattern (Bronze, Silver, Gold).
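The Bronze/Silver/Gold progression can be sketched in plain Python (the records, field names, and cleaning rules here are hypothetical; a production lake would run these steps on a distributed engine over files, not in-memory lists):

```python
# Bronze: raw ingested records, kept exactly as they arrived.
RAW_EVENTS = [
    {"user": "a", "amount": "10", "ts": "2024-01-01"},
    {"user": "a", "amount": "10", "ts": "2024-01-01"},   # duplicate
    {"user": "b", "amount": "oops", "ts": "2024-01-02"}, # malformed value
    {"user": "b", "amount": "5", "ts": "2024-01-02"},
]

def to_silver(bronze):
    """Silver: validate types and drop duplicates."""
    seen, out = set(), []
    for rec in bronze:
        key = (rec["user"], rec["amount"], rec["ts"])
        if key in seen:
            continue
        try:
            amount = float(rec["amount"])
        except ValueError:
            continue  # a real pipeline would quarantine bad rows instead
        seen.add(key)
        out.append({"user": rec["user"], "amount": amount, "ts": rec["ts"]})
    return out

def to_gold(silver):
    """Gold: aggregate per-user totals, ready for reporting."""
    totals = {}
    for rec in silver:
        totals[rec["user"]] = totals.get(rec["user"], 0.0) + rec["amount"]
    return totals

gold = to_gold(to_silver(RAW_EVENTS))
print(gold)  # {'a': 10.0, 'b': 5.0}
```

The point of the pattern is that each layer has one job: Bronze preserves the raw feed for replay, Silver enforces quality, and Gold serves consumers.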
Storage Formats
Implement columnar storage using Apache Parquet and manage transactions with open table formats like Iceberg.
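A conceptual sketch of why columnar layout matters, in plain Python (illustrative only; Parquet adds encoding, compression, and per-column statistics on top of this idea, and the sample rows are hypothetical):

```python
# Row-oriented layout: each record stored together.
rows = [
    {"id": 1, "city": "NYC", "temp": 21.5},
    {"id": 2, "city": "LA",  "temp": 28.0},
    {"id": 3, "city": "NYC", "temp": 19.0},
]

# Columnar layout: one contiguous array per field, as Parquet stores data.
columns = {field: [r[field] for r in rows] for field in rows[0]}

# A query touching only `temp` scans a single column, not whole rows --
# the core reason analytical engines prefer columnar formats.
avg_temp = sum(columns["temp"]) / len(columns["temp"])
print(round(avg_temp, 2))  # 22.83
```

Table formats such as Iceberg then layer transactional metadata over these column-oriented files, so writers can commit atomically without rewriting the data.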
Data Ingestion
Construct batch and streaming pipelines to move data from sources into the lake reliably.
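A minimal batch-ingestion sketch: land raw JSON records in the lake's bronze zone, partitioned hive-style by event date. The directory layout, dataset name, and record shape are assumptions for illustration, not a fixed convention:

```python
import json
import tempfile
from pathlib import Path

def ingest_batch(records, lake_root):
    """Group records by event date and write one file per partition."""
    by_date = {}
    for rec in records:
        by_date.setdefault(rec["ts"][:10], []).append(rec)
    written = []
    for dt, recs in by_date.items():
        # Hive-style partition directory: .../events/dt=YYYY-MM-DD/
        part_dir = Path(lake_root) / "bronze" / "events" / f"dt={dt}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "batch_0001.json"
        path.write_text("\n".join(json.dumps(r) for r in recs))
        written.append(path)
    return written

with tempfile.TemporaryDirectory() as root:
    files = ingest_batch(
        [{"ts": "2024-01-01T08:00", "v": 1}, {"ts": "2024-01-02T09:00", "v": 2}],
        root,
    )
    print([f.parent.name for f in files])  # ['dt=2024-01-01', 'dt=2024-01-02']
```

A streaming variant follows the same shape, except batches arrive continuously and writes must be made idempotent (e.g. deterministic file names per micro-batch) so retries do not duplicate data.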
Query Optimization
Optimize data retrieval speeds using partitioning, file pruning, and distributed query engines.
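Partition pruning can be sketched as a filter over partition paths: given hive-style directories, the engine skips any partition whose `dt=` value falls outside the query's date range and never opens those files (the bucket and file names below are hypothetical):

```python
partitions = [
    "s3://lake/events/dt=2024-01-01/part-0.parquet",
    "s3://lake/events/dt=2024-01-02/part-0.parquet",
    "s3://lake/events/dt=2024-02-01/part-0.parquet",
]

def prune(paths, lo, hi):
    """Keep only files whose dt partition value lies in [lo, hi]."""
    kept = []
    for p in paths:
        dt = next(
            seg.split("=", 1)[1]
            for seg in p.split("/")
            if seg.startswith("dt=")
        )
        if lo <= dt <= hi:  # ISO dates compare correctly as strings
            kept.append(p)
    return kept

print(prune(partitions, "2024-01-01", "2024-01-31"))
# Keeps only the two January partitions.
```

File pruning works the same way one level down: min/max statistics stored per file (or per Parquet row group) let the engine skip files whose value ranges cannot match the predicate.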