Optimizing storage layouts transforms sluggish queries into high-performance analytical operations. Files in a raw data lake often land as loose collections of JSON or CSV objects. While easy to inspect, these formats incur high input/output (I/O) costs during analysis because the query engine must read every byte to locate relevant records. Converting this data into a columnar format like Apache Parquet, partitioning it strategically, and registering it with a table format like Apache Iceberg drastically reduces the amount of data scanned during query execution.

In this practical session, you will simulate a common data engineering workflow: ingesting raw server logs, converting them into optimized storage, and measuring the structural benefits.

## The Scenario: High-Volume Web Logs

Consider a scenario where a web application generates clickstream data. The raw data arrives in an object store bucket (such as Amazon S3 or Azure Data Lake Storage) in JSON Lines format. Each file contains thousands of records tracking user interactions.

A single record looks like this:

```json
{"event_id": "u89-12", "timestamp": "2023-10-27T10:00:00Z", "event_type": "view_item", "item_id": "sku_123", "country": "US"}
```

Running a query to count events from "US" on a specific date requires the engine to open every JSON file, parse the text, and filter the results in memory. This is the "scan-heavy" pattern we intend to eliminate.

## Analyzing the Transformation Pipeline

Our objective is to move this data from a raw "Bronze" state to a structured "Silver" state. The transformation involves three specific changes:

1. **Format Conversion:** Changing JSON (row-based) to Parquet (column-based).
2. **Partitioning:** Organizing files into directories based on high-level predicates (e.g., date).
3. **Cataloging:** Registering the layout with Apache Iceberg to handle metadata.

```dot
digraph G {
  rankdir=TB;
  node [shape=rect, style=filled, fontname="Sans-Serif"];

  subgraph cluster_0 {
    label="Raw Zone (Bronze)"; style=filled; color="#f8f9fa";
    raw [label="Raw JSON Files\n(s3://bucket/raw/events/)", fillcolor="#dee2e6", color="#adb5bd"];
  }

  subgraph cluster_1 {
    label="Transformation Logic"; style=filled; color="#f8f9fa";
    spark [label="Spark Job\n(Read -> Partition -> Write)", fillcolor="#74c0fc", color="#228be6"];
  }

  subgraph cluster_2 {
    label="Optimized Zone (Silver)"; style=filled; color="#f8f9fa";

    subgraph cluster_structure {
      label="Directory Structure"; style=dashed;
      p1 [label="date=2023-10-27/\npart-001.parquet", fillcolor="#63e6be", color="#0ca678"];
      p2 [label="date=2023-10-28/\npart-002.parquet", fillcolor="#63e6be", color="#0ca678"];
    }

    iceberg [label="Iceberg Metadata\n(Snapshots & Manifests)", fillcolor="#ffec99", color="#f59f00"];
  }

  raw -> spark;
  spark -> p1;
  spark -> p2;
  spark -> iceberg [style=dotted];
}
```

The workflow moves raw JSON files through a processing engine to create a structured, partitioned directory layout managed by Iceberg metadata.

## Step 1: Reading and Inspecting Raw Data

Using a distributed processing framework like Apache Spark is the standard approach for this volume of data. First, we read the raw JSON files to infer the schema. While JSON is flexible, schema inference can be costly on a massive dataset because Spark must first scan the data to determine each field's type.

```python
# PySpark: read the raw JSON and let Spark infer the schema
raw_path = "s3://bucket/raw/events/*.json"
df = spark.read.json(raw_path)

# Inspect the inferred schema
df.printSchema()
```

At this stage, the data is uncompressed and text-based. If the dataset is 100 GB in JSON, reading it fully requires roughly 100 GB of I/O.
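One way to keep that first read cheap is to skip inference entirely and supply the schema up front. The sketch below is illustrative rather than part of the pipeline above; it assumes the five fields shown in the sample record.

```python
# A minimal sketch: supply an explicit schema so Spark can skip the
# inference pass over the raw JSON. Field names follow the sample record
# above; adjust the types if your payload differs.
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("timestamp", TimestampType()),
    StructField("event_type", StringType()),
    StructField("item_id", StringType()),
    StructField("country", StringType()),
])

df = spark.read.schema(event_schema).json(raw_path)
```

With the schema pinned, the read becomes a single pass, and the timestamp field arrives as a proper timestamp rather than the plain string that inference would produce.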
## Step 2: Columnar Conversion and Compression

We convert the DataFrame to Apache Parquet. Parquet uses columnar storage, meaning values from the same column are stored contiguously. This allows compression algorithms like Snappy or Zstd to work efficiently, because data of the same type (e.g., timestamps or integers) compresses better than mixed types.

The mathematical impact of compression is significant. The compression ratio $R$ is defined as:

$$ R = \frac{\text{Uncompressed Size}}{\text{Compressed Size}} $$

For repetitive log data, $R$ often ranges from 4:1 to 10:1. This means a 100 GB JSON dataset might occupy only 10-25 GB as Snappy-compressed Parquet, immediately reducing network I/O by 75% or more.

## Step 3: Selecting a Partition Strategy

Partitioning is the most critical decision in physical layout design. It breaks the data into sub-directories based on distinct values of a column. This enables partition pruning, where the query engine skips entire directories that do not match the query's filter.

However, you must choose the partition column carefully to avoid the "small file problem":

- **High Cardinality (Bad):** Partitioning by `user_id` (millions of unique values) creates millions of tiny directories and files. The overhead of listing these files exceeds the time spent reading them.
- **Low Cardinality (Good):** Partitioning by `event_date` (one value per day) creates manageable chunks.

For our web logs, we will partition by `event_date`. The raw records carry only a full `timestamp`, so we derive the date column first and then write:

```python
# Writing the data as partitioned Parquet
from pyspark.sql.functions import to_date, to_timestamp

output_path = "s3://bucket/silver/events/"

# Derive the partition column from the raw ISO-8601 timestamp
df = df.withColumn("event_date", to_date(to_timestamp("timestamp")))

(df
    .write
    .mode("overwrite")
    .partitionBy("event_date")
    .format("parquet")
    .save(output_path))
```

The physical layout on disk now changes from a flat list of files to a hierarchical structure:

```
/silver/events/event_date=2023-10-27/part-001.parquet
/silver/events/event_date=2023-10-28/part-001.parquet
```

## Step 4: Registering with Apache Iceberg

While directory partitioning helps, object stores such as S3 are slow at listing prefixes that contain thousands of files. Apache Iceberg solves this by maintaining a metadata layer (manifest files) that tracks exactly which data files belong to the table. Query planning then reads a small, bounded set of metadata files instead of listing every directory in storage.

We can rewrite the previous operation to use the Iceberg format. This requires an Iceberg catalog to be configured in your Spark session.
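Catalog setup varies by environment (Glue, Hive Metastore, Nessie, and REST catalogs are all common). The snippet below is a minimal sketch, assuming a Hadoop-style catalog named `prod`, a hypothetical warehouse path, and the `iceberg-spark-runtime` package already on the classpath.

```python
# A minimal sketch of configuring an Iceberg catalog named "prod"
# (matching the table identifier used below). The warehouse path is a
# placeholder; production setups more often point at Glue, Hive, Nessie,
# or a REST catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("web-events-silver")
    # Enable Iceberg's SQL extensions (MERGE INTO, snapshot queries, etc.)
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register the "prod" catalog backed by Iceberg
    .config("spark.sql.catalog.prod", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.prod.type", "hadoop")
    # Hypothetical warehouse location for the catalog's tables
    .config("spark.sql.catalog.prod.warehouse", "s3://bucket/warehouse/")
    .getOrCreate()
)
```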
With a catalog registered, the write targets a table identifier instead of a raw path:

```python
# Writing as an Iceberg table
(df
    .writeTo("prod.catalog.web_events")
    .partitionedBy(df.event_date)
    .createOrReplace())
```

When you execute a query against this table, the engine reads the Iceberg metadata first.

```sql
SELECT count(*)
FROM prod.catalog.web_events
WHERE event_date = '2023-10-27';
```

Because the metadata explicitly links `event_date='2023-10-27'` to specific data files, the engine skips every other partition without listing or opening anything in the storage layer for them.

## Performance Comparison

The following chart illustrates the latency differences between querying the raw JSON, flat (unpartitioned) Parquet, and the partitioned Iceberg table for a query filtering on a single date.

```json
{
  "layout": {
    "title": "Query Latency by File Layout (Lower is Better)",
    "xaxis": {"title": "Storage Layout"},
    "yaxis": {"title": "Execution Time (Seconds)"},
    "width": 600, "height": 400, "plot_bgcolor": "#f8f9fa"
  },
  "data": [{
    "type": "bar",
    "x": ["Raw JSON", "Flat Parquet", "Partitioned Iceberg"],
    "y": [120, 45, 5],
    "marker": {"color": ["#fa5252", "#22b8cf", "#40c057"]}
  }]
}
```

Transitioning from raw JSON to partitioned Iceberg can reduce query execution time by more than an order of magnitude due to data skipping and efficient decoding.

## Summary of Optimization

By restructuring the file layout, you have achieved three outcomes:

1. **Reduced Storage Cost:** Parquet compression lowered the storage footprint.
2. **Faster Queries:** Column pruning allows the engine to read only the columns a query needs (e.g., just `country` and `event_date`), skipping the columns it does not reference.
3. **Scan Efficiency:** Partitioning and Iceberg metadata ensure the engine only opens files relevant to the specific date requested.

This "Silver" layer now serves as a foundation for downstream analytics and machine learning features.
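If you want to confirm the new layout for yourself, Iceberg exposes its bookkeeping as queryable metadata tables through the Spark integration. A quick check against the table created above might look like the sketch below.

```python
# A brief verification sketch using Iceberg's metadata tables.

# Data files tracked by the table, with their partition values and sizes.
spark.sql("""
    SELECT file_path, partition, record_count, file_size_in_bytes
    FROM prod.catalog.web_events.files
""").show(truncate=False)

# Snapshot history produced by the createOrReplace() write above.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM prod.catalog.web_events.snapshots
""").show(truncate=False)
```

Each row of the `files` table corresponds to one Parquet file under a single `event_date` partition, which is exactly the granularity the planner relies on when pruning.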