Apache Parquet has become the de facto standard for data lake storage because it solves the fundamental inefficiency of text-based formats. While CSV and JSON are easy for humans to read, they are computationally expensive for machines to process. Every time a query engine reads a CSV, it must parse every byte, handle delimiters, and infer data types. Parquet reverses this priority. It is a binary, self-describing format optimized for the machine, specifically designed to minimize Input/Output (I/O) operations and CPU usage during analytical queries.

## The Hierarchical Structure of Parquet

To understand how Parquet achieves high performance, we must look at its internal organization. It does not write data as a continuous stream of values. Instead, it divides data into a distinct hierarchy: files, row groups, column chunks, and pages.

- **Row Group:** The file is horizontally partitioned into row groups. A row group logically contains a set of rows (for example, 10,000 rows or 128 MB of data). This horizontal slicing allows query engines to parallelize operations. One worker node can process the first row group while another processes the second.
- **Column Chunk:** Inside a row group, the data is sliced vertically. Each column in the dataset has a corresponding column chunk. This is where the physical separation of data occurs. If you select only the `user_id` column, the engine reads only the column chunks associated with `user_id`, ignoring chunks for `timestamp`, `event_type`, or `payload`.
- **Page:** The column chunk is further subdivided into pages. The page is the atomic unit of storage. Compression and encoding are applied at the page level. When a database engine reads data, it retrieves it one page at a time.

```dot
digraph G {
  rankdir=TB;
  node [shape=box, style=filled, fontname="Helvetica", fontsize=10, color=white];
  edge [color="#adb5bd"];
  subgraph cluster_file {
    label = "Parquet File";
    style=filled;
    color="#e9ecef";
    fontname="Helvetica";
    Metadata [label="File Metadata (Footer)", fillcolor="#fa5252", fontcolor=white];
    subgraph cluster_rg1 {
      label = "Row Group 1";
      style=filled;
      color="#dee2e6";
      CC1 [label="Column Chunk A\n(Integers)", fillcolor="#228be6", fontcolor=white];
      CC2 [label="Column Chunk B\n(Strings)", fillcolor="#12b886", fontcolor=white];
    }
    subgraph cluster_rg2 {
      label = "Row Group 2";
      style=filled;
      color="#dee2e6";
      CC3 [label="Column Chunk A\n(Integers)", fillcolor="#228be6", fontcolor=white];
      CC4 [label="Column Chunk B\n(Strings)", fillcolor="#12b886", fontcolor=white];
    }
  }
  CC1 -> Page1 [label="contains"];
  CC1 -> Page2 [label="contains"];
  Page1 [label="Page 1\n(Header + Data)", fillcolor="#7950f2", fontcolor=white];
  Page2 [label="Page 2\n(Header + Data)", fillcolor="#7950f2", fontcolor=white];
  CC1 -> Metadata [style=invis];
}
```

*The physical layout of a Parquet file showing how data is segmented into Row Groups and Column Chunks before being split into Pages.*
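This layout is easy to observe with a short script. The sketch below is a minimal example using pyarrow; the file name, column names, and row-group size are illustrative choices, not anything prescribed by the format. It writes a two-column table in several row groups, then reads the footer to list the column chunks in each group, including the per-chunk statistics discussed in the next section.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Two columns: 100,000 integers and 100,000 low-cardinality strings.
table = pa.table({
    "user_id": list(range(100_000)),
    "event_type": ["click", "view", "purchase", "view"] * 25_000,
})

# Cap each row group at 25,000 rows so the file contains several of them.
pq.write_table(table, "events.parquet", row_group_size=25_000)

# The footer (file metadata) describes the whole hierarchy
# without touching any of the data pages.
meta = pq.ParquetFile("events.parquet").metadata
print(meta.num_row_groups)              # 100,000 rows / 25,000 per group -> 4

for i in range(meta.num_row_groups):
    rg = meta.row_group(i)              # one column chunk per column in this group
    for j in range(rg.num_columns):
        chunk = rg.column(j)
        stats = chunk.statistics        # per-chunk min / max / null count
        print(i, chunk.path_in_schema, chunk.num_values, stats.min, stats.max)
```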
## File Metadata and Statistics

The performance gains in Parquet are heavily reliant on metadata. Unlike many formats that place headers at the beginning, Parquet writes file metadata in the footer. This design choice allows data writers to buffer data in memory and write it to disk sequentially. Once all data is written, the writer aggregates the locations and statistics of all row groups and appends them to the end of the file.

The metadata includes the schema, version info, and offsets for every column chunk. More importantly, it stores statistics for each column chunk and page, such as:

- Minimum value
- Maximum value
- Null count

When you execute a query like `SELECT * FROM orders WHERE order_amount > 500`, the query engine reads the footer first. It examines the metadata for the `order_amount` column chunks. If a specific row group has a min of 0 and a max of 100, the engine knows with certainty that no row in that group matches the filter. It skips the entire row group, avoiding the need to read or decompress that data. This technique is significantly faster than scanning every value.

## Encoding Schemes

Parquet reduces storage costs through efficient encoding. Encoding is different from compression (like Snappy or Gzip). Compression is a generic algorithm applied to a byte stream, whereas encoding uses knowledge of the data type and distribution to represent values more compactly.

### Dictionary Encoding

This is the default encoding for many data types. It is highly effective for columns with low cardinality, meaning the number of unique values is small compared to the total number of rows (e.g., country names, product categories, or status flags).

Instead of storing the string "United States" thousands of times, Parquet creates a dictionary at the beginning of the column chunk:

- "United States" $\rightarrow$ 0
- "France" $\rightarrow$ 1
- "Japan" $\rightarrow$ 2

The actual data is then stored as a stream of small integers (indices). If the column data is `["United States", "United States", "France"]`, Parquet stores `[0, 0, 1]`. This reduces the space required for long strings down to a few bits per row.

### Run-Length Encoding (RLE)

Run-Length Encoding (RLE) and Bit-Packing are often used in conjunction with dictionary encoding or for boolean values. RLE compresses sequences of repeated data. If a column contains sorted or repetitive data, RLE is extremely efficient.

Consider a status column where the value "Active" repeats 500 times followed by "Inactive" 200 times.

- Raw storage: stores "Active" 500 times.
- RLE storage: stores the pairs (500, "Active") and (200, "Inactive").

This turns 700 data points into just two tuples. When combined with bit-packing, integers are squeezed into the smallest number of bits required. For example, if the maximum value in a page is 7, Parquet only needs 3 bits per integer ($2^3 = 8$) rather than the standard 32 bits.

```json
{"layout": {"title": "Storage Efficiency by Encoding Type (Simulation)", "xaxis": {"title": "Encoding Method"}, "yaxis": {"title": "Storage Size (KB)"}, "template": "simple_white", "width": 600, "height": 400}, "data": [{"type": "bar", "x": ["Raw Text (CSV)", "Dictionary Encoding", "RLE + Bit Packing"], "y": [1500, 450, 120], "marker": {"color": ["#adb5bd", "#4dabf7", "#37b24d"]}}]}
```

*A comparison of storage footprint reduction when applying Dictionary Encoding and RLE to a highly repetitive dataset.*

## Data Types: Physical vs. Logical

Parquet distinguishes between how data is stored (physical type) and how it is interpreted (logical type). This decoupling keeps the format simple while supporting complex data structures.

There are only a few primitive physical types:

- BOOLEAN
- INT32
- INT64
- INT96 (deprecated; historically used for timestamps)
- FLOAT
- DOUBLE
- BYTE_ARRAY
- FIXED_LEN_BYTE_ARRAY

Logical types map to these primitives. For example, a UTF8 string is physically stored as a BYTE_ARRAY with a logical annotation that tells the reader to interpret the bytes as a UTF-8 string. A DECIMAL value might be stored as a FIXED_LEN_BYTE_ARRAY or INT64 depending on the precision required. This abstraction allows Parquet to evolve and support new types (like UUIDs or time intervals) without changing the underlying binary readers.
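One way to see both layers is to inspect a file's Parquet schema, which reports the physical type and the logical annotation for every leaf column. A minimal sketch with pyarrow; the column names and sample values are placeholders, and the exact printed type names vary slightly between pyarrow versions:

```python
from datetime import date
from decimal import Decimal

import pyarrow as pa
import pyarrow.parquet as pq

# One string column, one date column, and one decimal column.
table = pa.table({
    "name": pa.array(["alice", "bob"]),
    "day": pa.array([date(2024, 1, 1), date(2024, 1, 2)]),
    "price": pa.array([Decimal("19.99"), Decimal("5.00")], type=pa.decimal128(9, 2)),
})
pq.write_table(table, "types.parquet")

# Each column reports its physical storage type (e.g. BYTE_ARRAY, INT32)
# and the logical annotation layered on top (e.g. String, Date, Decimal).
pf = pq.ParquetFile("types.parquet")
for i in range(pf.metadata.num_columns):
    column = pf.schema.column(i)
    print(column.name, column.physical_type, column.logical_type)
```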
Understanding these internals helps in designing schemas. For instance, knowing that dates are stored as plain INT32 values suggests that sorting your data by date before writing it to Parquet will maximize the efficiency of Run-Length Encoding: identical day values end up adjacent, so long runs collapse into a handful of entries, resulting in smaller files and faster queries.
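As a rough illustration of that effect, the sketch below writes the same date column twice, once shuffled and once sorted, and compares the file sizes. The file names, row counts, and the exact size difference are assumptions of this example; the general direction (sorted is smaller) is the point.

```python
import os
import random
from datetime import date, timedelta

import pyarrow as pa
import pyarrow.parquet as pq

# One year of dates, each repeated 500 times, in random order.
days = [date(2024, 1, 1) + timedelta(days=d) for d in range(365) for _ in range(500)]
random.shuffle(days)

# Same values, two layouts: shuffled vs. sorted by date before writing.
pq.write_table(pa.table({"day": days}), "unsorted.parquet")
pq.write_table(pa.table({"day": sorted(days)}), "sorted.parquet")

# Sorted data yields long runs of identical values, which RLE collapses;
# the sorted file is typically a fraction of the size of the shuffled one.
print(os.path.getsize("unsorted.parquet"), os.path.getsize("sorted.parquet"))
```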