Ingestion pipelines designed for speed often inadvertently degrade read performance by generating excessive small files. This phenomenon, widely known as the "small file problem," occurs when data is written to storage in increments significantly smaller than the optimal block size of the underlying file system or object store. While individual writes succeed quickly, the accumulation of thousands or millions of tiny files creates severe bottlenecks for downstream analytical queries.
To understand why small files cripple performance, we must look at how distributed query engines interact with object storage systems like Amazon S3, Azure Blob Storage, or Google Cloud Storage.
Object storage is optimized for high throughput (reading large amounts of data continuously) rather than low latency (handling many small requests). When a query engine like Trino or Spark reads a dataset, it performs two distinct types of operations:
- Metadata operations: listing the objects in each partition (LIST) and retrieving file statistics (HEAD).
- Data operations: retrieving the file contents themselves (GET).

Every file incurs a fixed overhead for metadata operations and connection establishment. This overhead is constant regardless of file size. If we model the total time required to read a dataset of size $S$ split into $N$ files, the relationship is:

$$T_{\text{total}} = N \cdot t_{\text{overhead}} + \frac{S}{B}$$

where $t_{\text{overhead}}$ is the fixed per-file cost and $B$ is the effective read bandwidth.

When $N$ is small (large files), the bandwidth term $S/B$ dominates, and the system operates efficiently. When $N$ is large (small files), the $N \cdot t_{\text{overhead}}$ term dominates. The system spends more time waiting for server responses and managing connections than actually transferring data.
Consider the difference in processing overhead between two file layouts for the same 1GB dataset.
Comparison of estimated read times for 1GB of data. The overhead is negligible for large files but becomes the primary bottleneck when the data is fragmented.
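To make the trade-off concrete, the sketch below plugs assumed values into the model above: a fixed per-file overhead of 50ms and an effective read bandwidth of 100MB/s. Both figures are illustrative assumptions, not measurements of any particular storage service.

# Estimated read time: T_total = N * t_overhead + S / B
S = 1 * 1024 * 1024 * 1024      # dataset size: 1GB in bytes
B = 100 * 1024 * 1024           # assumed effective bandwidth: 100MB/s
t_overhead = 0.05               # assumed per-file overhead: 50ms (LIST/HEAD/connection setup)

def read_time(num_files: int) -> float:
    return num_files * t_overhead + S / B

print(f"1 x 1GB file:        {read_time(1):7.1f} s")     # ~10.3 s, dominated by bandwidth
print(f"8,192 x 128KB files: {read_time(8192):7.1f} s")  # ~419.8 s, dominated by overhead

With these assumed numbers, the fragmented layout spends roughly 98% of its time on per-file overhead rather than data transfer.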
The small file problem typically originates in the Bronze (raw) layer during the ingestion phase. Two primary architectural patterns drive this issue.
Real-time ingestion pipelines often utilize micro-batch processing. If a Spark Structured Streaming job runs a trigger every 60 seconds, it commits a new file to the storage layer every minute. Over 24 hours, a single stream generates 1,440 files. If the data volume is low (e.g., 10MB per day), each file averages only ~7KB. Scaling this across 100 concurrent streams results in nearly 150,000 tiny files per day.
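A minimal PySpark sketch of such a micro-batch writer is shown below. The broker address, topic name, and S3 paths are placeholders; the 60-second trigger mirrors the scenario above.

# Structured Streaming job that commits at least one new file per trigger
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
    .option("subscribe", "sensor-events")                # placeholder topic
    .load())

(events.writeStream
    .format("parquet")
    .option("path", "s3://lake/bronze/sensor_events/")                  # placeholder path
    .option("checkpointLocation", "s3://lake/_checkpoints/sensor_events/")
    .trigger(processingTime="60 seconds")   # one commit, and new files, every minute
    .start())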
Engineers frequently over-partition data to optimize specific query filters. For example, partitioning a dataset by year, month, day, and hour creates a directory structure that separates data into 24 folders per day. If the data is further partitioned by a high-cardinality column like sensor_id, the data spreads too thinly.
If an ingestion job writes 1GB of data but splits it across 1,000 partition directories, the average file size drops to 1MB. The query engine must then list all 1,000 directories to reconstruct the dataset, significantly increasing the planning phase of the query.
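The sketch below shows the kind of batch write that produces this layout. The source and target paths and the partition column names are illustrative.

# Over-partitioned batch write: one leaf directory per (year, month, day, hour, sensor_id).
# With thousands of sensors, each leaf receives only a few KB of data per run.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("over-partitioned-write").getOrCreate()
df = spark.read.parquet("s3://lake/staging/sensor_events/")   # placeholder source path

(df.write
    .mode("append")
    .partitionBy("year", "month", "day", "hour", "sensor_id")
    .parquet("s3://lake/bronze/sensor_events_partitioned/"))  # placeholder target path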
Beyond query execution, small files also strain the metadata catalog (such as the Hive Metastore or AWS Glue). Catalogs must track the location and schema of every file, so an explosion in file count bloats the metadata database, causing slower query planning, longer partition and file listing operations, and heavier load on the catalog service itself.
The standard solution to the small file problem is compaction. This process involves reading a collection of small files and rewriting them into fewer, larger files optimized for reading.
Compaction is typically implemented as a maintenance job that runs asynchronously to the ingestion pipeline. This decouples the requirement for low-latency writes (which inherently produce small files) from the requirement for high-performance reads.
The architecture follows a "write-fast, optimize-later" pattern.
Workflow separating low-latency ingestion from file layout optimization. The compaction job consolidates raw files into read-optimized blocks.
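As a rough sketch of the pattern with plain Parquet (before introducing table formats), the streaming writer keeps committing small files while a separate nightly job, launched by a scheduler such as Airflow or cron, rewrites the previous day's partition into a handful of large files. All paths and the output file count below are illustrative.

# Nightly compaction job, run independently of the ingestion stream
from datetime import date, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("nightly-compaction").getOrCreate()

day = (date.today() - timedelta(days=1)).isoformat()
src = f"s3://lake/bronze/sensor_events/event_date={day}/"            # small files from streaming
dst = f"s3://lake/bronze/sensor_events_compacted/event_date={day}/"  # read-optimized output

# Rewrite the partition into a small, fixed number of large files.
# In practice the output file count is derived from the partition's byte size.
spark.read.parquet(src).coalesce(8).write.mode("overwrite").parquet(dst)

# Swapping dst into the readers' path must be atomic; without a table format
# this swap is the fragile step, as discussed later in this section.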
Compaction jobs generally use a "bin-packing" algorithm. The goal is to combine files until the sum of their sizes equals a target threshold (usually between 128MB and 1GB for Parquet files).
If you have ten files of 10MB each and a target size of 100MB, the bin-packing algorithm groups them into a single write task. This minimizes network chatter during the rewrite process and ensures the resulting files are sized correctly for vectorization and compression.
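A simplified version of this grouping logic, in plain Python with a hypothetical list of file sizes, might look like the following. Real compaction services also account for partition boundaries, sort order, and delete files.

# Greedy bin-packing: fill each rewrite task up to a target output size.
TARGET_BYTES = 100 * 1024 * 1024   # 100MB target per output file (illustrative)

def bin_pack(file_sizes, target=TARGET_BYTES):
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if current and current_size + size > target:
            bins.append(current)          # close the current write task
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Ten 10MB files collapse into a single 100MB write task.
tasks = bin_pack([10 * 1024 * 1024] * 10)
print(len(tasks))   # 1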
Modern open table formats like Apache Iceberg and Delta Lake abstract the complexity of file management. They provide built-in procedures to handle compaction, eliminating the need to write custom file-listing logic.
In a standard data lake without table formats, compaction requires careful orchestration to ensure data consistency. You must read the small files, write the new large file, and then atomically swap them or update the metadata pointers. If the job fails halfway, you risk data duplication or loss.
Table formats solve this by using snapshot isolation. A compaction job can rewrite files in the background while readers continue to query the old snapshots. Once the rewrite is complete, the table format atomically commits a new snapshot pointing to the large files.
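Apache Iceberg exposes this as a stored procedure callable from Spark SQL. In the hedged sketch below, the catalog and table names are placeholders, and the exact option names may vary between Iceberg versions.

# Iceberg's rewrite_data_files procedure bin-packs small files in the background
# while readers continue to use the previous snapshot.
# Assumes an active SparkSession (`spark`) configured with an Iceberg catalog.
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'db.events',
        strategy => 'binpack',
        options => map('target-file-size-bytes', '536870912')  -- ~512MB target
    )
""")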
For example, in Delta Lake, the OPTIMIZE command performs this function:
-- Standard SQL command to compact files in Delta Lake
OPTIMIZE events_table
WHERE date >= current_date() - 1
ZORDER BY (event_type);
This single command triggers a job that scans the events_table, identifies files below a size threshold, and rewrites them. The optional ZORDER BY clause further improves performance by co-locating similar data within the same set of files, acting as a multi-dimensional sort.
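The same operation can also be triggered from the Delta Lake Python API. This is a minimal sketch assuming the delta-spark package is installed, an active SparkSession named spark, and a registered table named events_table.

# Python equivalent of the OPTIMIZE ... ZORDER BY statement above
from delta.tables import DeltaTable

table = DeltaTable.forName(spark, "events_table")
(table.optimize()
    .where("date >= current_date() - 1")   # restrict to recent partitions
    .executeZOrderBy("event_type"))        # compact files and Z-order by event_type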
To maintain a healthy data lake, apply these design principles:
- Avoid partitioning by high-cardinality columns (such as user_id or transaction_id) unless strictly necessary for data lifecycle management. Prefer partitioning by coarser grains like date and rely on file skipping (min/max statistics) for granular filtering.
- Schedule regular compaction as part of table maintenance, for example with Delta Lake's OPTIMIZE command for data compaction and Z-ordering.