Ingestion pipelines designed for speed often inadvertently degrade read performance by generating excessive small files. This phenomenon, widely known as the "small file problem," occurs when data is written to storage in increments significantly smaller than the optimal block size of the underlying file system or object store. While individual writes succeed quickly, the accumulation of thousands or millions of tiny files creates severe bottlenecks for downstream analytical queries.

## The Mechanics of Latency and Throughput

To understand why small files cripple performance, we must look at how distributed query engines interact with object storage systems like Amazon S3, Azure Blob Storage, or Google Cloud Storage.

Object storage is optimized for high throughput (reading large amounts of data continuously) rather than low latency (handling many small requests). When a query engine like Trino or Spark reads a dataset, it performs two distinct types of operations:

- **Metadata Operations:** Listing objects in a bucket to determine which files to read (LIST) and retrieving file statistics (HEAD).
- **Data Transfer:** Opening a connection and streaming the bytes (GET).

Every file incurs a fixed overhead for metadata operations and connection establishment, and this overhead is constant regardless of file size. If we model the total time $T_{total}$ required to read a dataset of size $S$ split into $N$ files, the relationship is:

$$T_{total} = N \times T_{overhead} + \frac{S}{\text{Bandwidth}}$$

When $N$ is small (large files), the bandwidth term dominates and the system operates efficiently. When $N$ is large (small files), the $N \times T_{overhead}$ term dominates: the system spends more time waiting for server responses and managing connections than actually transferring data.

Consider the difference in processing overhead between two file layouts for the same 1GB dataset:

| File Configuration | Metadata/Connection Overhead (seconds) | Data Transfer Time (seconds) |
| --- | --- | --- |
| 10,000 × 100KB files | 25 | 5 |
| 8 × 128MB files | 0.5 | 5 |

*Comparison of estimated read times for 1GB of data. The overhead is negligible for large files but becomes the primary bottleneck when the data is fragmented.*

## Common Causes in Ingestion Pipelines

The small file problem typically originates in the Bronze (raw) layer during the ingestion phase. Two primary architectural patterns drive this issue.

### High-Frequency Streaming

Real-time ingestion pipelines often utilize micro-batch processing. If a Spark Structured Streaming job runs a trigger every 60 seconds, it commits a new file to the storage layer every minute. Over 24 hours, a single stream generates 1,440 files. If the data volume is low (e.g., 10MB per day), each file is only ~7KB. Scaling this across 100 concurrent streams results in nearly 150,000 tiny files per day.
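To make the mechanism concrete, here is a minimal PySpark sketch of such a micro-batch writer. The Kafka broker, topic, and storage paths are hypothetical; the relevant detail is the trigger interval, which determines how often new files are committed.

```python
# Minimal sketch of a micro-batch ingestion job (broker, topic, and paths
# are hypothetical). Every trigger that processes data commits new files,
# so a 60-second trigger can produce on the order of 1,440 commits per
# stream per day.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bronze-ingest").getOrCreate()

events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "events")                      # hypothetical topic
    .load()
)

query = (
    events.writeStream
    .format("parquet")
    .option("path", "s3a://datalake/bronze/events/")             # hypothetical path
    .option("checkpointLocation", "s3a://datalake/_chk/events/")  # hypothetical path
    .trigger(processingTime="60 seconds")  # one micro-batch (and new files) per minute
    .start()
)

query.awaitTermination()
```

If end-to-end latency requirements allow, lengthening this trigger interval (or buffering further upstream) is the simplest way to reduce the number of commits, as discussed in the best practices at the end of this section.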
### Aggressive Partitioning

Engineers frequently over-partition data to optimize specific query filters. For example, partitioning a dataset by year, month, day, and hour creates a directory structure that separates data into 24 folders per day. If the data is further partitioned by a high-cardinality column like sensor_id, the data spreads too thinly. If an ingestion job writes 1GB of data but splits it across 1,000 partition directories, the average file size drops to 1MB. The query engine must then list all 1,000 directories to reconstruct the dataset, significantly increasing query planning time.

## Impact on Catalog Synchronization

Beyond query performance, small files negatively impact the metadata catalog (such as Hive Metastore or AWS Glue). Catalogs must track the location and schema of every file, so an explosion in file count bloats the metadata database, causing:

- Slower partition discovery times.
- Timeouts when updating table statistics.
- Increased costs for managed catalog services that charge based on the number of objects stored.

## Mitigation Strategy: Compaction

The standard solution to the small file problem is compaction. This process involves reading a collection of small files and rewriting them into fewer, larger files optimized for reading.

Compaction is typically implemented as a maintenance job that runs asynchronously to the ingestion pipeline. This decouples the requirement for low-latency writes (which inherently produce small files) from the requirement for high-performance reads. The architecture follows a "write-fast, optimize-later" pattern.

```dot
digraph G {
  rankdir=LR;
  node [shape=box, style="filled", fontname="Helvetica"];
  subgraph cluster_ingest {
    label = "Ingestion Phase";
    style = dashed;
    color = "#adb5bd";
    Source [label="Stream Source", fillcolor="#e9ecef"];
    IngestJob [label="Streaming Job", fillcolor="#a5d8ff"];
    SmallFiles [label="Raw Storage\n(Thousands of small files)", shape=cylinder, fillcolor="#ffc9c9"];
    Source -> IngestJob -> SmallFiles;
  }
  subgraph cluster_optimize {
    label = "Optimization Phase";
    style = dashed;
    color = "#adb5bd";
    Compactor [label="Compaction Job\n(Batch)", fillcolor="#b2f2bb"];
    LargeFiles [label="Optimized Storage\n(Target file size ~128MB)", shape=cylinder, fillcolor="#69db7c"];
    SmallFiles -> Compactor -> LargeFiles;
  }
}
```

*Workflow separating low-latency ingestion from file layout optimization. The compaction job consolidates raw files into read-optimized blocks.*

### Bin-Packing Algorithms

Compaction jobs generally use a "bin-packing" algorithm. The goal is to combine files until the sum of their sizes reaches a target threshold (usually between 128MB and 1GB for Parquet files). If you have ten files of 10MB each and a target size of 100MB, the bin-packing algorithm groups them into a single write task. This minimizes network chatter during the rewrite process and ensures the resulting files are sized correctly for vectorization and compression.
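As an illustration only (not the planner of any particular engine), here is a small Python sketch of first-fit-decreasing bin-packing over a list of file sizes; the file names and the 100MB target are placeholders.

```python
# Illustrative bin-packing for compaction planning. File names and sizes are
# hypothetical; real engines also consider partition boundaries, sort order,
# and delete files.

TARGET_BYTES = 100 * 1024 * 1024  # target output size per rewrite task


def plan_compaction(files, target=TARGET_BYTES):
    """Group (name, size) pairs greedily (first-fit-decreasing) so each group's
    total size approaches the target without exceeding it; a file larger than
    the target simply gets its own group."""
    bins = []  # each bin: {"remaining": bytes_left, "files": [names]}
    for name, size in sorted(files, key=lambda f: f[1], reverse=True):
        for b in bins:
            if b["remaining"] >= size:
                b["files"].append(name)
                b["remaining"] -= size
                break
        else:
            # No existing bin has room: open a new one for this file.
            bins.append({"remaining": target - size, "files": [name]})
    return [b["files"] for b in bins]


# Ten 10MB files with a 100MB target collapse into a single rewrite task.
small_files = [(f"part-{i:05d}.parquet", 10 * 1024 * 1024) for i in range(10)]
print(plan_compaction(small_files))  # -> one group containing all ten files
```

Running this on the ten 10MB files from the example above yields a single group, i.e., one large output file instead of ten small reads at query time.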
## Implementation with Open Table Formats

Modern open table formats like Apache Iceberg and Delta Lake abstract the complexity of file management. They provide built-in procedures to handle compaction, eliminating the need to write custom file-listing logic.

In a standard data lake without table formats, compaction requires careful orchestration to ensure data consistency. You must read the small files, write the new large file, and then atomically swap them or update the metadata pointers. If the job fails halfway, you risk data duplication or loss.

Table formats solve this by using snapshot isolation. A compaction job can rewrite files in the background while readers continue to query the old snapshots. Once the rewrite is complete, the table format atomically commits a new snapshot pointing to the large files.

For example, in Delta Lake, the OPTIMIZE command performs this function:

```sql
-- Standard SQL command to compact files in Delta Lake
OPTIMIZE events_table
WHERE date >= current_date() - 1
ZORDER BY (event_type);
```

This single command triggers a job that scans the events_table, identifies files below a size threshold, and rewrites them. The optional ZORDER BY clause further improves performance by co-locating similar data within the same set of files, acting as a multi-dimensional sort.

## Best Practices for Ingestion Design

To maintain a healthy data lake, apply these design principles:

- **Tune Buffer Sizes:** In streaming frameworks like Kafka Connect or Kinesis Firehose, configure buffering hints. Allow the buffer to fill up to at least 64MB or wait for a longer interval (e.g., 5-15 minutes) before flushing to storage, if latency requirements permit.
- **Avoid Over-Partitioning:** Do not partition by columns with high cardinality (like user_id or transaction_id) unless strictly necessary for data lifecycle management. Prefer partitioning by coarser grains like date and rely on file skipping (min/max statistics) for granular filtering.
- **Schedule Regular Maintenance:** If you cannot buffer data at the source, schedule compaction jobs to run immediately after batch loads or periodically for streams.
- **Monitor File Counts:** Set up observability alerts. If the average file size in a table drops below 10MB, trigger an alert to investigate the ingestion configuration (see the sketch after this list).
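One way to implement that last check is sketched below with boto3; the bucket name, prefix, and 10MB threshold are assumptions mirroring the guidance above, not a prescribed tool.

```python
# Rough monitoring sketch: compute the average Parquet file size under a
# table prefix and flag it if it falls below the 10MB threshold. Bucket and
# prefix are hypothetical.
import boto3

THRESHOLD_BYTES = 10 * 1024 * 1024


def average_file_size(bucket: str, prefix: str) -> float:
    """Return the mean size (in bytes) of Parquet data files under a prefix."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    sizes = [
        obj["Size"]
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix)
        for obj in page.get("Contents", [])
        if obj["Key"].endswith(".parquet")  # skip transaction logs and manifests
    ]
    return sum(sizes) / len(sizes) if sizes else 0.0


avg = average_file_size("my-datalake", "bronze/events/")  # hypothetical table location
if 0 < avg < THRESHOLD_BYTES:
    print(f"Average file size {avg / 1024 / 1024:.1f}MB is below 10MB -- "
          "review trigger intervals, buffering hints, and compaction schedules")
```

Note that this check itself issues LIST requests against the bucket, so run it on a schedule (for example, after each compaction job) rather than on every query.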