Compression in a data lake environment is not merely about saving disk space. While reducing storage costs is a primary benefit, the impact of compression on I/O throughput and query latency is often more significant. When a distributed query engine like Apache Spark or Trino reads data from object storage, the operation involves three distinct stages: reading the bytes from the network, decompressing those bytes in memory, and parsing the data into a usable format.

An uncompressed file requires no CPU time for decompression, which would make it the fastest option if network bandwidth were unlimited, but cloud networks have limits. Conversely, a highly compressed file reduces network transfer time but demands significant CPU cycles to decompress. The goal of selecting a compression algorithm is to balance these opposing forces. You aim to minimize the total time $T_{total}$ defined as:

$$T_{total} = \frac{Size_{compressed}}{Bandwidth_{network}} + \frac{Size_{raw}}{Speed_{decompression}}$$

If decompressing the data costs more CPU time than it saves in network transfer, compression becomes the bottleneck rather than an optimization.
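To make the trade-off concrete, the short sketch below plugs illustrative numbers into the formula. The file size, network bandwidth, compression ratios, and decompression speeds are assumptions chosen for the example, not benchmarks of any particular system.

```python
# Illustrative estimate of T_total for several codecs.
# All numbers below are hypothetical assumptions, not measured benchmarks.

RAW_SIZE_MB = 1024        # 1 GB of raw data
NETWORK_MBPS = 100        # assumed per-task throughput from object storage, MB/s

# codec -> (compression ratio, decompression speed in MB/s of raw output)
codecs = {
    "uncompressed": (1.0, float("inf")),
    "snappy":       (2.0, 550.0),
    "zstd":         (3.1, 400.0),
    "gzip":         (3.2, 100.0),
}

for name, (ratio, decomp_speed) in codecs.items():
    transfer_s = (RAW_SIZE_MB / ratio) / NETWORK_MBPS   # Size_compressed / Bandwidth_network
    decompress_s = RAW_SIZE_MB / decomp_speed           # Size_raw / Speed_decompression
    total_s = transfer_s + decompress_s
    print(f"{name:>12}: transfer {transfer_s:5.2f}s + "
          f"decompress {decompress_s:5.2f}s = {total_s:5.2f}s")
```

Under these assumptions Snappy and Zstd beat both extremes, while Gzip's decompression cost dominates its smaller transfer. Note that the formula treats transfer and decompression as strictly sequential; real engines overlap the two, so this is a rough upper bound rather than a precise prediction.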
Splittability and Parallelism

Before examining specific algorithms, we must address the interaction between compression and distributed processing. Data lakes rely on the ability to split large files into smaller chunks (splits) that can be processed in parallel by different worker nodes.

Certain compression formats, such as a standard Gzip stream applied to a CSV file, are not splittable. To read the last record in a data.csv.gz file, the engine must decompress the entire stream from the beginning to locate the record boundaries. This forces a single CPU core to process the entire file, negating the benefits of distributed computing.

However, this limitation changes when using columnar formats like Apache Parquet or ORC. These formats divide data into internal "row groups" or blocks, and the compression codec is applied to each block independently rather than to the file as a whole. Therefore, a Parquet file using Gzip compression is fully splittable: the query engine reads the file metadata, identifies the byte ranges for each row group, and assigns different groups to different workers.

Common Algorithms in Data Lakes

We focus on three primary codecs widely used in modern data architecture: Snappy, Gzip, and Zstandard (Zstd).

Snappy

Snappy is designed for speed. It does not aim for maximum compression ratios but for very high decompression speeds, often exceeding 500 MB/s on modern processors. It provides a reasonable reduction in size (typically 1.5x to 2x) with minimal CPU overhead.

Because of this balance, Snappy is the default Parquet compression codec in most engines, including Spark. It is ideal for "hot" data layers where query latency is the priority. In interactive analytics, where users expect responses in seconds, the low CPU cost of Snappy ensures that the processor spends its time executing query logic rather than decompressing bytes.

Gzip

Gzip, based on DEFLATE, offers high compression ratios (often 3x to 4x for text data) but incurs a high CPU cost during both compression and decompression. In a data lake, Gzip is best suited for "cold" or archival data where storage costs are the primary concern and access is infrequent.

When used with text formats like JSON or CSV, Gzip creates non-splittable files, so avoid generating large (multi-gigabyte) Gzip-compressed JSON files. Instead, if you must use Gzip with text, size the files to be roughly equal to your target partition size (e.g., 128 MB to 256 MB) so that the lack of internal splitting does not hinder parallelism.

Zstandard (Zstd)

Zstandard is a modern algorithm developed by Meta that provides a compression ratio comparable to Gzip but with decompression speeds closer to Snappy. It features a tunable compression level, ranging from negative values for maximum speed up to 22 for maximum compression.

Zstd has become the standard recommendation for general-purpose data lake storage. It offers a Pareto improvement over Gzip, dominating it in both speed and ratio at standard settings. Many organizations are migrating their Silver and Gold tables to Zstd-compressed Parquet to reduce storage costs without sacrificing query performance.

[Figure: Compression Codec Performance Trade-offs — decompression speed (MB/s) versus compression ratio, plotting Snappy (~550 MB/s, ~2.0x), Gzip (~100 MB/s, ~3.2x), Zstd (~400 MB/s, ~3.1x), and uncompressed data (~800 MB/s, 1.0x).]

Comparison of typical decompression speeds versus compression ratios. Zstd occupies the favorable middle ground, offering high ratios with high speed.

Column-Specific Compression

A distinct advantage of columnar file formats is the ability to apply different encoding and compression strategies to different columns. Since the data in a column is uniform (e.g., all integers or all timestamps), the compression algorithm can exploit this homogeneity.

For example, a column containing low-cardinality string data (like country_code) works exceptionally well with Dictionary Encoding combined with RLE (Run-Length Encoding), followed by general-purpose compression. This pipeline often results in storage footprints significantly smaller than row-oriented equivalents.

When configuring your ingestion jobs (using Spark or Flink), you generally define the compression codec at the table or file level. However, the underlying writer library (such as parquet-mr) applies encodings automatically based on each column's data type and statistics.
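If you do need finer-grained control, some writer libraries expose these settings per column. The sketch below uses pyarrow as one example (assuming it is installed); the table and column names are invented for illustration.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table: a low-cardinality string column plus a numeric measure.
table = pa.table({
    "country_code": ["US", "US", "DE", "DE", "DE", "FR"],
    "revenue":      [10.5, 3.2, 7.8, 1.1, 9.0, 4.4],
})

# Compression is chosen per column here, and dictionary encoding is
# enabled only for the low-cardinality column.
pq.write_table(
    table,
    "sales.parquet",
    compression={"country_code": "zstd", "revenue": "snappy"},
    use_dictionary=["country_code"],
)
```

Writing the same table with a single file-level codec, the more common pattern described above, only requires passing a string, e.g. compression="zstd".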
Selection Strategy

Choosing the right algorithm depends on the data lifecycle stage:

Landing / Bronze Layer: Data often arrives as raw JSON or CSV. If these are archival logs, use Gzip or Zstd to minimize S3 storage costs. Splittability is less of a concern here if the files are merely a staging area for processing.

Curated / Silver & Gold Layers: These layers serve analytical queries; use Parquet or Iceberg tables.
- Default Choice: Snappy. It is widely supported, fast, and requires no tuning.
- Cost Optimization: Zstd. If your data lake grows to petabytes, switching from Snappy to Zstd can save around 30% on storage bills. Verify that your query engines support the Zstd codec; older versions of Hive or Presto may not.

The following logic flow outlines the decision process for selecting a codec based on file format and access pattern.

digraph G {
    rankdir=TB;
    node [shape=box, style=filled, fillcolor="#f8f9fa", fontname="Arial", color="#dee2e6"];
    edge [fontname="Arial", fontsize=10, color="#adb5bd"];

    Start [label="Select Compression", fillcolor="#e7f5ff", color="#74c0fc"];
    Format [label="File Format?", shape=diamond, fillcolor="#fff3bf", color="#fcc419"];
    Text [label="Text (CSV/JSON)"];
    Columnar [label="Columnar (Parquet/ORC)"];
    Access [label="Access Pattern?", shape=diamond, fillcolor="#fff3bf", color="#fcc419"];
    Splittable [label="Splittability Needed?", shape=diamond, fillcolor="#fff3bf", color="#fcc419"];
    Snappy [label="Use Snappy", fillcolor="#d3f9d8", color="#40c057"];
    Zstd [label="Use Zstd", fillcolor="#d3f9d8", color="#40c057"];
    Gzip [label="Use Gzip", fillcolor="#ffc9c9", color="#fa5252"];

    Start -> Format;
    Format -> Text [label="Row-based"];
    Format -> Columnar [label="Analytics"];
    Columnar -> Access;
    Access -> Snappy [label="Hot / Interactive"];
    Access -> Zstd [label="Warm / Storage Sensitive"];
    Text -> Splittable;
    Splittable -> Zstd [label="Yes (via bzip2/lzo or block compression)"];
    Splittable -> Gzip [label="No (Archival)"];
}

Decision tree for selecting compression algorithms based on file format and performance requirements.

By correctly aligning the compression algorithm with the storage format and query requirements, you prevent the CPU from becoming a bottleneck in what should be an I/O-bound architecture.
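As a closing illustration, the same decision logic can be expressed as a small helper function. This is a simplified sketch of the tree above; the function name, parameters, and category strings are invented for this example and would need to be adapted to your own pipeline.

```python
def choose_codec(file_format: str, access_pattern: str = "hot",
                 needs_splitting: bool = False) -> str:
    """Simplified mirror of the decision tree above (illustrative only)."""
    columnar = {"parquet", "orc"}
    if file_format.lower() in columnar:
        # Columnar formats compress each block independently,
        # so any codec remains splittable.
        return "snappy" if access_pattern == "hot" else "zstd"
    # Text formats (CSV/JSON): the codec and file sizing determine parallelism.
    if needs_splitting:
        # A raw Zstd stream is not splittable either; rely on block-level
        # compression or a natively splittable codec such as bzip2.
        return "zstd"
    return "gzip"  # archival text, sized close to the target partition size


print(choose_codec("parquet", access_pattern="warm"))  # -> zstd
print(choose_codec("json"))                            # -> gzip
```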