Optimizing storage layouts transforms sluggish queries into high-performance analytical operations. Files within a raw data lake often land as loose collections of JSON or CSV objects. While easy to inspect, these formats incur high input/output (I/O) costs during analysis because the query engine must read every byte to locate relevant records. Converting this data into a columnar format like Apache Parquet, partitioning it strategically, and registering it with a table format like Apache Iceberg drastically reduces the amount of data scanned during query execution.
In this practical session, you will simulate a common data engineering workflow: ingesting raw server logs, converting them into optimized storage, and measuring the structural benefits.
Consider a scenario where a web application generates clickstream data. The raw data arrives in an object store bucket (such as Amazon S3 or Azure Data Lake Storage) in JSON Lines format. Each file contains thousands of records tracking user interactions.
A single record looks like this:
{"event_id": "u89-12", "timestamp": "2023-10-27T10:00:00Z", "event_type": "view_item", "item_id": "sku_123", "country": "US"}
Running a query to count events from "US" on a specific date requires the engine to open every JSON file, parse the text, and filter the results in memory. This is the "scan-heavy" pattern we intend to eliminate.
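To make the baseline concrete, here is a minimal sketch of that scan-heavy query in PySpark; the bucket path is illustrative, and every file must be read and parsed before a single row can be filtered out.
# Baseline query against the raw JSON: no layout information exists,
# so Spark reads and parses every object under the prefix.
from pyspark.sql import functions as F

raw_events = spark.read.json("s3://bucket/raw/events/*.json")

us_events_on_date = (raw_events
    .filter(F.to_date("timestamp") == "2023-10-27")
    .filter(F.col("country") == "US")
    .count())

print(us_events_on_date)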
Our objective is to move this data from a raw "Bronze" state to a structured "Silver" state. The transformation involves three specific changes:

1. Converting the text-based JSON into columnar Apache Parquet files.
2. Partitioning the data on disk by a query-friendly column.
3. Registering the result with Apache Iceberg so a metadata layer tracks exactly which files belong to the table.

The workflow moves unstructured raw data through a processing engine to create a structured, partitioned directory layout managed by metadata.
Using a distributed processing framework like Apache Spark is the standard approach for this volume of data. First, we read the raw JSON files to infer the schema. While JSON is flexible, the schema inference process can be costly if the dataset is massive.
# PySpark Code
raw_path = "s3://bucket/raw/events/*.json"
df = spark.read.json(raw_path)
# Inspect the inferred schema
df.printSchema()
At this stage, the data is uncompressed and text-based. If the dataset is 100 GB in JSON, it essentially requires 100 GB of I/O to read fully.
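One way to sidestep repeated inference on a massive dataset is to declare the schema explicitly before reading. The sketch below mirrors the sample record shown earlier and keeps the timestamp as a string so we can parse it ourselves later.
# Supplying an explicit schema avoids the extra pass over the raw files
# that schema inference requires.
from pyspark.sql.types import StructType, StructField, StringType

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("timestamp", StringType()),
    StructField("event_type", StringType()),
    StructField("item_id", StringType()),
    StructField("country", StringType()),
])

df = spark.read.schema(event_schema).json(raw_path)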
We convert the dataframe to Apache Parquet. Parquet uses columnar storage, meaning values from the same column are stored contiguously. This allows for highly efficient compression algorithms like Snappy or Zstd because data of the same type (e.g., timestamps or integers) compresses better than mixed types.
The mathematical impact of compression is significant. The compression ratio is defined as:

$$\text{Compression Ratio} = \frac{\text{Uncompressed Size}}{\text{Compressed Size}}$$

For repetitive log data, this ratio often ranges from 4:1 to 10:1. This means a 100 GB JSON dataset might occupy only 10-25 GB as Snappy-compressed Parquet, immediately reducing network I/O by 75% or more.
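The conversion itself is a single write. A minimal sketch follows, writing flat (not yet partitioned) Parquet with an explicit codec; the output path is illustrative.
# Convert the JSON-backed dataframe to columnar Parquet.
# Snappy is Spark's default codec; Zstd usually trades a little CPU
# for a noticeably better compression ratio.
(df
    .write
    .mode("overwrite")
    .option("compression", "zstd")
    .format("parquet")
    .save("s3://bucket/silver/events_flat/"))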
Partitioning is the most critical decision in physical layout design. It breaks the data into sub-directories based on distinct values of a column. This enables partition pruning, where the query engine skips entire directories that do not match the query's filter.
However, you must choose the partition column carefully to avoid the "small file problem."
- user_id (millions of unique values) creates millions of tiny directories and files. The overhead of listing these files exceeds the time to read them.
- event_date (one value per day) creates manageable chunks.

For our web logs, we will partition by event_date, which we derive from the raw timestamp before writing.
# Deriving the partition column and writing the data as partitioned Parquet
from pyspark.sql import functions as F

output_path = "s3://bucket/silver/events/"

# The raw records only carry a timestamp, so event_date is derived from it
df = df.withColumn("event_date", F.to_date("timestamp"))

(df
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .format("parquet")
    .save(output_path))
The physical layout on disk now changes from a flat list of files to a hierarchical structure:
/silver/events/event_date=2023-10-27/part-001.parquet
/silver/events/event_date=2023-10-28/part-001.parquet
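You can verify that pruning actually happens by filtering on the partition column and inspecting the physical plan; for a Parquet file source, the matching directories appear under PartitionFilters.
# Read the partitioned layout back and filter on the partition column.
silver = spark.read.parquet(output_path)

pruned = silver.where("event_date = '2023-10-27'")

# The plan's PartitionFilters entry shows that only the
# event_date=2023-10-27 directory will be scanned.
pruned.explain()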
While directory partitioning helps, object stores like S3 are slow at listing large numbers of files, and that listing cost grows with the size of the table. Apache Iceberg addresses this by maintaining a metadata layer (manifest files) that tracks exactly which data files belong to the table. Query planning reads this compact metadata instead of performing an O(N) directory listing over every object.
We can rewrite the previous operation to use the Iceberg format. This requires an Iceberg catalog to be configured in your Spark session.
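A minimal session configuration might look like the sketch below. The catalog name prod matches the table identifier used next; the Hadoop catalog type and the warehouse path are placeholder assumptions, and production setups often point at a Glue, Hive, or REST catalog instead.
# Iceberg catalog configuration, applied when the session is created
# (or equivalently via spark-submit / spark-defaults).
# Requires the iceberg-spark-runtime package on the classpath.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("silver-ingest")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.prod", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.prod.type", "hadoop")
    .config("spark.sql.catalog.prod.warehouse", "s3://bucket/warehouse/")
    .getOrCreate())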
# Writing as an Iceberg table
(df
.writeTo("prod.catalog.web_events")
.partitionedBy(df.event_date)
.createOrReplace())
When you execute a query against this table, the engine reads the Iceberg metadata first.
SELECT count(*)
FROM prod.catalog.web_events
WHERE event_date = '2023-10-27';
Because the metadata explicitly links event_date='2023-10-27' to specific files, the engine ignores all other data without ever touching the storage layer for those irrelevant partitions.
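You can inspect this bookkeeping yourself through Iceberg's metadata tables; the files table lists every data file along with the partition it belongs to.
# Each row describes one data file tracked by the table's metadata,
# including the partition value the planner uses for pruning.
spark.sql("""
    SELECT file_path, partition, record_count
    FROM prod.catalog.web_events.files
""").show(truncate=False)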
The following chart illustrates the latency differences between querying the raw JSON, flat Parquet, and the partitioned Iceberg table for a query filtering on a single date.
Transitioning from Raw JSON to Partitioned Iceberg can reduce query execution time by orders of magnitude due to data skipping and efficient decoding.
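If you want to reproduce a rough comparison yourself, timing the same filtered count against the raw JSON and the Iceberg table is enough to see the gap. Absolute numbers depend on cluster size and data volume, so treat this as a sketch rather than a benchmark.
# Rough wall-clock comparison of the same filtered count on two layouts.
import time
from pyspark.sql import functions as F

def timed_count(frame):
    start = time.time()
    n = frame.count()
    return n, time.time() - start

raw_df = (spark.read.json("s3://bucket/raw/events/*.json")
    .filter(F.to_date("timestamp") == "2023-10-27")
    .filter(F.col("country") == "US"))

iceberg_df = (spark.table("prod.catalog.web_events")
    .filter(F.col("event_date") == "2023-10-27")
    .filter(F.col("country") == "US"))

for name, frame in [("raw JSON", raw_df), ("Iceberg", iceberg_df)]:
    n, seconds = timed_count(frame)
    print(f"{name}: {n} matching rows in {seconds:.1f}s")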
By restructuring the file layout, you have achieved three outcomes:

- Smaller footprint: columnar compression shrinks the dataset to a fraction of its raw JSON size, cutting storage and I/O costs.
- Partition pruning: queries that filter on event_date skip every directory and file outside the requested dates.
- Column projection: the engine reads only the columns a query touches (such as country and event_date), ignoring the heavy payload column.

This "Silver" layer now serves as a foundation for downstream analytics and machine learning features.