Ingestion pipelines designed for speed often inadvertently degrade read performance by generating excessive small files. This phenomenon, widely known as the "small file problem," occurs when data is written to storage in increments significantly smaller than the optimal block size of the underlying file system or object store. While individual writes succeed quickly, the accumulation of thousands or millions of tiny files creates severe bottlenecks for downstream analytical queries.
To understand why small files cripple performance, we must look at how distributed query engines interact with object storage systems like Amazon S3, Azure Blob Storage, or Google Cloud Storage.
Object storage is optimized for high throughput (reading large amounts of data continuously) rather than low latency (handling many small requests). When a query engine like Trino or Spark reads a dataset, it performs two distinct types of operations:
- Metadata operations: listing the objects in each partition (LIST) and retrieving file statistics (HEAD).
- Data operations: retrieving the file contents themselves (GET).

Every file incurs a fixed overhead for metadata operations and connection establishment. This overhead is constant regardless of file size. If we model the total time required to read a dataset of size $S$ split into $N$ files, the relationship is:

$$T_{\text{total}} = N \cdot t_{\text{overhead}} + \frac{S}{B}$$

where $t_{\text{overhead}}$ is the fixed per-file cost and $B$ is the effective read bandwidth.

When $N$ is small (large files), the bandwidth term $S/B$ dominates, and the system operates efficiently. When $N$ is large (small files), the $N \cdot t_{\text{overhead}}$ term dominates. The system spends more time waiting for server responses and managing connections than actually transferring data.
Consider the difference in processing overhead between two file layouts for the same 1GB dataset.
Comparison of estimated read times for 1GB of data. The overhead is negligible for large files but becomes the primary bottleneck when the data is fragmented.
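To make the trade-off concrete, the sketch below plugs assumed values into the model above: a fixed per-file overhead of 50ms and an effective read bandwidth of 100MB/s. Both figures are illustrative assumptions, not measurements of any particular storage service.

# Estimated read time: T_total = N * t_overhead + S / B
S = 1 * 1024 * 1024 * 1024      # dataset size: 1GB in bytes
B = 100 * 1024 * 1024           # assumed effective bandwidth: 100MB/s
t_overhead = 0.05               # assumed per-file overhead: 50ms (LIST/HEAD/connection setup)

def read_time(num_files: int) -> float:
    return num_files * t_overhead + S / B

print(f"1 x 1GB file:        {read_time(1):7.1f} s")     # ~10.3 s, dominated by bandwidth
print(f"8,192 x 128KB files: {read_time(8192):7.1f} s")  # ~419.8 s, dominated by overhead

With these assumed numbers, the fragmented layout spends roughly 98% of its time on per-file overhead rather than data transfer.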
The small file problem typically originates in the Bronze (raw) layer during the ingestion phase. Two primary architectural patterns drive this issue.
Real-time ingestion pipelines often utilize micro-batch processing. If a Spark Structured Streaming job runs a trigger every 60 seconds, it commits a new file to the storage layer every minute. Over 24 hours, a single stream generates 1,440 files. If the data volume is low (e.g., 10MB per day), each file averages only ~7KB. Scaling this across 100 concurrent streams results in nearly 150,000 tiny files per day.
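A minimal PySpark sketch of such a micro-batch writer is shown below. The broker address, topic name, and S3 paths are placeholders; the 60-second trigger mirrors the scenario above.

# Structured Streaming job that commits at least one new file per trigger
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "sensor-events")                # placeholder topic
    .load())

(events.writeStream
    .format("parquet")
    .option("path", "s3://lake/bronze/sensor_events/")                  # placeholder path
    .option("checkpointLocation", "s3://lake/_checkpoints/sensor_events/")
    .trigger(processingTime="60 seconds")   # one commit, and new files, every minute
    .start())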
Engineers frequently over-partition data to optimize specific query filters. For example, partitioning a dataset by year, month, day, and hour creates a directory structure that separates data into 24 folders per day. If the data is further partitioned by a high-cardinality column like sensor_id, the data spreads too thinly.
If an ingestion job writes 1GB of data but splits it across 1,000 partition directories, the average file size drops to 1MB. The query engine must then list all 1,000 directories to reconstruct the dataset, significantly increasing the planning phase of the query.
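The sketch below shows the kind of batch write that produces this layout. The source and target paths and the partition column names are illustrative.

# Over-partitioned batch write: one leaf directory per (year, month, day, hour, sensor_id).
# With thousands of sensors, each leaf receives only a few KB of data per run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("over-partitioned-write").getOrCreate()
df = spark.read.parquet("s3://lake/staging/sensor_events/")   # placeholder source path

(df.write
    .mode("append")
    .partitionBy("year", "month", "day", "hour", "sensor_id")
    .parquet("s3://lake/bronze/sensor_events_partitioned/"))  # placeholder target path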
Beyond query execution, small files also strain the metadata catalog (such as the Hive Metastore or AWS Glue). Catalogs must track the location and schema of every file, so an explosion in file count bloats the metadata database, causing slower query planning, longer partition and file listing operations, and heavier load on the catalog service itself.
The standard solution to the small file problem is compaction. This process involves reading a collection of small files and rewriting them into fewer, larger files optimized for reading.
Compaction is typically implemented as a maintenance job that runs asynchronously to the ingestion pipeline. This decouples the requirement for low-latency writes (which inherently produce small files) from the requirement for high-performance reads.
The architecture follows a "write-fast, optimize-later" pattern.
Workflow separating low-latency ingestion from file layout optimization. The compaction job consolidates raw files into read-optimized blocks.
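As a rough sketch of the pattern with plain Parquet (before introducing table formats), the streaming writer keeps committing small files while a separate nightly job, launched by a scheduler such as Airflow or cron, rewrites the previous day's partition into a handful of large files. All paths and the output file count below are illustrative.

# Nightly compaction job, run independently of the ingestion stream
from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-compaction").getOrCreate()

day = (date.today() - timedelta(days=1)).isoformat()
src = f"s3://lake/bronze/sensor_events/event_date={day}/"            # small files from streaming
dst = f"s3://lake/bronze/sensor_events_compacted/event_date={day}/"  # read-optimized output

# Rewrite the partition into a small, fixed number of large files.
# In practice the output file count is derived from the partition's byte size.
spark.read.parquet(src).coalesce(8).write.mode("overwrite").parquet(dst)

# Swapping dst into the readers' path must be atomic; without a table format
# this swap is the fragile step, as discussed later in this section.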
Compaction jobs generally use a "bin-packing" algorithm. The goal is to combine files until the sum of their sizes equals a target threshold (usually between 128MB and 1GB for Parquet files).
If you have ten files of 10MB each and a target size of 100MB, the bin-packing algorithm groups them into a single write task. This minimizes network chatter during the rewrite process and ensures the resulting files are sized correctly for vectorization and compression.
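A simplified version of this grouping logic, in plain Python with a hypothetical list of file sizes, might look like the following. Real compaction services also account for partition boundaries, sort order, and delete files.

# Greedy bin-packing: fill each rewrite task up to a target output size.
TARGET_BYTES = 100 * 1024 * 1024   # 100MB target per output file (illustrative)

def bin_pack(file_sizes, target=TARGET_BYTES):
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            bins.append(current)          # close the current write task
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Ten 10MB files collapse into a single 100MB write task.
tasks = bin_pack([10 * 1024 * 1024] * 10)
print(len(tasks))   # 1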
Modern open table formats like Apache Iceberg and Delta Lake abstract the complexity of file management. They provide built-in procedures to handle compaction, eliminating the need to write custom file-listing logic.
In a standard data lake without table formats, compaction requires careful orchestration to ensure data consistency. You must read the small files, write the new large file, and then atomically swap them or update the metadata pointers. If the job fails halfway, you risk data duplication or loss.
Table formats solve this by using snapshot isolation. A compaction job can rewrite files in the background while readers continue to query the old snapshots. Once the rewrite is complete, the table format atomically commits a new snapshot pointing to the large files.
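Apache Iceberg exposes this as a stored procedure callable from Spark SQL. In the hedged sketch below, the catalog and table names are placeholders, and the exact option names may vary between Iceberg versions.

# Iceberg's rewrite_data_files procedure bin-packs small files in the background
# while readers continue to use the previous snapshot.
# Assumes an active SparkSession (`spark`) configured with an Iceberg catalog.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'binpack',
        options => map('target-file-size-bytes', '536870912')  -- ~512MB target
    )
""")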
For example, in Delta Lake, the OPTIMIZE command performs this function:
-- Standard SQL command to compact files in Delta Lake
OPTIMIZE events_table
WHERE date >= current_date() - 1
ZORDER BY (event_type);
This single command triggers a job that scans the events_table, identifies files below a size threshold, and rewrites them. The optional ZORDER BY clause further improves performance by co-locating similar data within the same set of files, acting as a multi-dimensional sort.
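The same operation can also be triggered from the Delta Lake Python API. This is a minimal sketch assuming the delta-spark package is installed, an active SparkSession named spark, and a registered table named events_table.

# Python equivalent of the OPTIMIZE ... ZORDER BY statement above
from delta.tables import DeltaTable

table = DeltaTable.forName(spark, "events_table")
(table.optimize()
    .where("date >= current_date() - 1")   # restrict to recent partitions
    .executeZOrderBy("event_type"))        # compact files and Z-order by event_type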
To maintain a healthy data lake, apply these design principles:
- Avoid partitioning by high-cardinality columns (such as user_id or transaction_id) unless strictly necessary for data lifecycle management. Prefer partitioning by coarser grains like date and rely on file skipping (min/max statistics) for granular filtering.
- Schedule regular compaction as part of table maintenance, for example with Delta Lake's OPTIMIZE command for data compaction and Z-ordering.