Data storage on disk is linear. Regardless of the logical structure of your tables, the underlying object storage systems, whether Amazon S3, Azure Blob Storage, or Google Cloud Storage, serialize data into a continuous stream of bytes. The efficiency of your data lake depends heavily on how you map your two-dimensional database tables (rows and columns) into this one-dimensional byte stream.
This mapping forces a trade-off between write performance and read performance. In data engineering, we generally categorize file formats into two distinct architectures: row-oriented and columnar. Understanding the mechanical differences between these two is necessary for designing performant data lakes, particularly when moving from raw ingestion (Bronze) to refined analytics (Silver/Gold).
In a row-oriented format, data is stored sequentially by record. If you have a table with columns for user_id, timestamp, and event_type, the file writes all fields for the first row, followed immediately by all fields for the second row.
Common examples of row-oriented formats include CSV, JSON (newline delimited), and Apache Avro.
Consider a dataset containing user transaction logs. In a row-based layout, the bytes on the disk look like this:
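The sketch that follows uses illustrative values for the ID, Date, and Amount columns discussed below; real formats also store delimiters or length prefixes, but the point is that complete records sit next to each other:

```text
[ID: 101 | Date: 2024-01-15 | Amount: 20.00][ID: 102 | Date: 2024-01-15 | Amount: 45.50][ID: 103 | Date: 2024-01-16 | Amount: 12.75] ...
```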
This structure is highly efficient for write operations. When an application generates a new record, the system simply appends the new row to the end of the file. This "append-only" pattern minimizes disk seeking and aligns well with the write patterns of transactional systems (OLTP) or streaming ingestion pipelines.
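As a minimal sketch of this append pattern, the following Python snippet writes a new record to a newline-delimited JSON log; the file name and field values are illustrative assumptions:

```python
import json

# New record produced by the application.
record = {"ID": 104, "Date": "2024-01-16", "Amount": 30.25}

# Row-oriented write: serialize the whole record and append it as one line.
# Existing data is never rewritten, so ingestion stays cheap and sequential.
with open("transactions.ndjson", "a", encoding="utf-8") as log:
    log.write(json.dumps(record) + "\n")
```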
However, this layout introduces significant overhead for analytical queries (OLAP). Analytics typically involve aggregating specific metrics over a large number of records. Suppose you want to calculate the average Amount from the dataset above. To retrieve just the Amount values, the query engine must read the entire file, including the ID and Date fields, because the Amount bytes are interleaved with the other data.
If a table has 100 columns and your query only requires 3 of them, a row-oriented format forces the engine to read and parse 97 columns of irrelevant data. This results in high I/O amplification, where the volume of data read from storage significantly exceeds the volume of data actually needed for computation.
Comparison of how a logical table is serialized into physical storage blocks. Row storage keeps fields from the same record together, while columnar storage groups values from the same column.
Columnar storage transposes the data layout. Instead of writing one row after another, the system groups values by column. All user_id values are stored contiguously, followed by all timestamp values, and then all event_type values.
Common examples of columnar formats include Apache Parquet and Apache ORC.
In the columnar layout, the byte stream looks like this:
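Using the same illustrative values, transposed so that each column's values are contiguous:

```text
[ID: 101, 102, 103, ...][Date: 2024-01-15, 2024-01-15, 2024-01-16, ...][Amount: 20.00, 45.50, 12.75, ...]
```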
This architecture enables a technique called projection pushdown. When a user runs a query like SELECT AVG(Amount) FROM transactions, the query engine can jump directly to the byte offset where the Amount column block begins. It reads only that block and ignores the ID and Date blocks entirely.
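With Apache Parquet, for example, projection pushdown is exposed directly in reader APIs such as PyArrow. The sketch below assumes a file named transactions.parquet containing the columns used above:

```python
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Only the Amount column chunks are read from storage; the reader uses the
# file footer metadata to skip the ID and Date columns entirely.
table = pq.read_table("transactions.parquet", columns=["Amount"])

average_amount = pc.mean(table["Amount"]).as_py()
print(average_amount)
```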
For wide tables commonly found in data lakes, often containing hundreds of columns, this reduction in I/O is substantial. If you select 5 of 100 similarly sized columns, you read roughly 5% of the data. Since network bandwidth and disk I/O are often the bottlenecks in distributed processing, reducing the data volume directly translates to faster query execution and lower cloud egress costs.

Beyond I/O pruning, columnar storage offers superior compression ratios. Compression algorithms work by identifying repetitive patterns in data. In a row-oriented format, data types vary constantly as you traverse the file (e.g., Integer, String, Date, Integer...). This heterogeneity makes it difficult for codecs to find patterns.
In a columnar file, adjacent data is always of the same type. Integers are stored next to integers, and strings next to strings. This homogeneity allows the use of lightweight, highly efficient encoding schemes.
For example, consider a country column where the value "United States" appears 1,000 times consecutively. A columnar format can use Run-Length Encoding (RLE) to store this as a tuple: ("United States", 1000). This occupies negligible space compared to storing the string 1,000 times. We will discuss specific encoding internals like RLE and Dictionary Encoding in the "Apache Parquet Internals" section.
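As a rough illustration of the idea (not Parquet's actual on-disk encoding), run-length encoding can be sketched in a few lines of Python:

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]

country_column = ["United States"] * 1000 + ["Canada"] * 3
print(run_length_encode(country_column))
# [('United States', 1000), ('Canada', 3)]
```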
To make an informed architectural decision, we must quantify the performance implications of these formats. Let us analyze the cost of scanning data under each layout.

Defined variables:

- $N$: the number of rows in the table
- $C$: the total number of columns
- $k$: the number of columns required by the query
- $S$: the average size, in bytes, of a full row

Scan Complexity Comparison:

Row-Oriented Scan: To extract $k$ columns, the system still reads the full row size for every row:

$$\text{Bytes}_{\text{row}} = N \times S$$

Columnar Scan: To extract $k$ columns, the system reads only the fraction of the row size corresponding to those columns. Assuming uniform column size for simplicity:

$$\text{Bytes}_{\text{col}} = N \times S \times \frac{k}{C}$$

As the ratio $k/C$ decreases (i.e., the query becomes more selective regarding columns), the performance gap between row and columnar storage widens.
Relative data scanning requirements. The data scanned in columnar storage grows in proportion to the number of columns selected, whereas row storage scans the entire dataset regardless of the projection.
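To put numbers on these formulas, a small helper can estimate the bytes scanned under each layout; the row count and row size below are assumptions chosen for illustration:

```python
def bytes_scanned(num_rows, row_size_bytes, total_columns, selected_columns, columnar):
    """Approximate bytes read for a scan, assuming uniformly sized columns."""
    if columnar:
        return num_rows * row_size_bytes * selected_columns / total_columns
    return num_rows * row_size_bytes  # row layout reads every column

# 10 million rows of ~1 KB each, selecting 5 of 100 columns.
print(bytes_scanned(10_000_000, 1_000, 100, 5, columnar=False))  # 1e10 bytes (~10 GB)
print(bytes_scanned(10_000_000, 1_000, 100, 5, columnar=True))   # 5e8 bytes (~0.5 GB)
```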
In a modern data lake architecture, you rarely choose one format exclusively. Instead, you select the format based on the stage of the data pipeline.
Use Row-Oriented (Avro/JSON) when:

- The workload is write-heavy, such as streaming ingestion or transactional logging, and records must be appended quickly.
- Consumers need to read or replay complete records rather than individual columns.
- You are landing raw data in the Bronze layer and capture reliability matters more than query speed.
Use Columnar (Parquet/Iceberg) when:

- The workload is read-heavy and analytical, aggregating a few columns across many rows.
- Tables are wide and queries project only a small subset of columns.
- You are serving refined Silver or Gold datasets where scan performance, compression, and storage cost dominate.
The standard pattern in the Medallion architecture is to land data in a row-oriented format to ensure capture reliability, and then run an ETL job to transform and rewrite that data into a columnar format for downstream consumers. This approach provides the durability of row-based logging with the performance of columnar analytics.
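A minimal sketch of this Bronze-to-Silver rewrite using PyArrow, assuming hypothetical file paths and a newline-delimited JSON landing file:

```python
import pyarrow.json as pa_json
import pyarrow.parquet as pq

# Bronze: newline-delimited JSON exactly as it was ingested.
bronze_table = pa_json.read_json("bronze/transactions.ndjson")

# Silver: the same records rewritten as compressed, columnar Parquet
# so downstream queries can prune columns and read far fewer bytes.
pq.write_table(bronze_table, "silver/transactions.parquet", compression="snappy")
```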