Raw object storage combined with high-performance file formats like Parquet solves the problem of scalable, low-cost storage. However, this combination introduces a critical gap when compared to traditional relational databases: the lack of transactional guarantees. When you execute a query against a folder of Parquet files, the engine simply reads whatever files exist in that directory. If a data pipeline fails midway through writing a batch of files, or if two users attempt to modify the dataset simultaneously, the query engine may return incomplete or corrupt results.Open Table Formats (OTFs) bridge this gap by adding a metadata layer over the raw data files. Instead of treating a directory as a table, the system relies on a centralized log or metadata tree to define which files belong to the table at any given moment. This approach enables ACID (Atomicity, Consistency, Isolation, Durability) transactions on the data lake.The Metadata LayerIn a standard data lake, the directory structure defines the dataset. In an architecture utilizing an Open Table Format, the state of the table is defined by a specific metadata file. When a query engine requests data, it first reads the metadata to identify the valid list of data files.This decoupling allows for atomic operations. When a new batch of data is ingested, the system writes the new Parquet files to storage, but they remain invisible to readers until a new metadata file is committed. This commit operation is atomic; it swaps the pointer from the old metadata version to the new one. Readers see either the complete previous state or the complete new state, never a partial write.The structure generally follows a hierarchical pattern.digraph G { rankdir=TB; node [shape=box, style=filled, fontname="Arial"]; splines=ortho; nodesep=0.5; ranksep=0.5; catalog [label="Catalog Pointer", fillcolor="#bac8ff", color="#4c6ef5"]; meta [label="Metadata File (v2.json)", fillcolor="#eebefa", color="#be4bdb"]; snap [label="Snapshot List", fillcolor="#d0bfff", color="#7950f2"]; manlist [label="Manifest List", fillcolor="#a5d8ff", color="#228be6"]; manfile [label="Manifest File", fillcolor="#99e9f2", color="#15aabf"]; data1 [label="Data File (001.parquet)", fillcolor="#b2f2bb", color="#40c057"]; data2 [label="Data File (002.parquet)", fillcolor="#b2f2bb", color="#40c057"]; catalog -> meta; meta -> snap; snap -> manlist; manlist -> manfile; manfile -> data1; manfile -> data2; }Hierarchical structure of an Open Table Format showing how a catalog pointer resolves down to individual data files.The components function as follows:Catalog Pointer: Tells the engine where to find the current "head" of the table.Metadata File: Contains table-level information, including the schema, partition definitions, and a list of snapshots.Snapshot: Represents the state of the table at a specific point in time.Manifest List: A list of manifest files that make up a snapshot.Manifest File: A list of actual data files (e.g., s3://bucket/data/file_a.parquet) along with statistics like min/max values for columns in that file.ACID Transactions and ConcurrencyThe primary benefit of this architecture is the implementation of ACID transactions in distributed storage.Atomicity is achieved through the metadata swap. If an INSERT job writes 99 out of 100 files and then crashes, the metadata pointer is never updated. The 99 "orphan" files exist in storage but are ignored by queries because they are not referenced in the active snapshot.Isolation is typically handled via Snapshot Isolation. When a query begins, it locks onto a specific snapshot ID. Even if an ingestion job commits new data milliseconds later, the running query continues to read from the snapshot it started with. This ensures consistent results for long-running analytical queries.Concurrency control mechanisms differ between formats but generally rely on Optimistic Concurrency Control (OCC). If two writers attempt to update the table simultaneously, the first one to commit succeeds. The second writer detects that the underlying version has changed and must retry the operation or fail, depending on the conflict resolution strategy.Time Travel and RollbacksBecause OTFs track the history of snapshots, they enable "Time Travel" capabilities. You can query the data as it existed at a specific timestamp or snapshot ID. This is particularly useful for reproducing machine learning models or auditing changes.To query a previous version of a table using SQL, the syntax often looks like this:-- Select data as of a specific timestamp SELECT * FROM events FOR SYSTEM_TIME AS OF '2023-10-27 10:00:00'; -- Select data as of a specific version ID SELECT * FROM events FOR SYSTEM_VERSION AS OF 123456789;This mechanism also allows for instant rollbacks. If a data engineer accidentally deploys a corrupted pipeline, restoring the table is not a matter of restoring backups. It is simply a metadata operation to point the "current" reference back to the previous snapshot ID.Comparing Primary FormatsThree major specifications dominate the ecosystem: Apache Iceberg, Delta Lake, and Apache Hudi. While they share the same goal, their internal implementations prioritize different workloads.Apache IcebergIceberg focuses on high-performance queries over huge datasets. Its design emphasizes the separation of the physical file layout from the logical table structure. A distinctive feature is Hidden Partitioning. In traditional Hive-style partitioning, if data is partitioned by day, the user must query WHERE date_str = '2023-01-01'. If they query by timestamp WHERE event_time > '2023-01-01 00:00:00', the engine might not recognize the partition relationship and scan the entire table. Iceberg maintains the relationship between the column and the partition transform, allowing the engine to prune partitions correctly regardless of how the query is written.Delta LakeOriginally developed by Databricks, Delta Lake uses a transaction log (the _delta_log folder) containing JSON files that record sequential atomic commits. It functions similarly to a Write-Ahead Log (WAL) in a relational database. Delta Lake relies heavily on file system atomicity or separate coordination services (like DynamoDB on AWS S3) to ensure linear history. It is highly optimized for read performance and integrates tightly with Spark.Apache HudiHudi (Hadoop Upserts Deletes and Incrementals) was built by Uber to handle streaming data ingestion and mutable datasets. It excels in scenarios requiring frequent updates and deletes. Hudi introduces distinct storage types:Copy On Write (COW): Data is stored in columnar format (Parquet). Updates rewrite the files. Better for read-heavy workloads.Merge On Read (MOR): Data is stored in a combination of columnar (Parquet) and row-based (Avro) log files. Updates are appended to logs and merged during the query. Better for write-heavy streaming workloads.The choice between COW and MOR represents a trade-off between write latency and read latency.{"layout": {"title": "Latency Trade-offs: Copy-On-Write vs. Merge-On-Read", "xaxis": {"title": "Workload Type"}, "yaxis": {"title": "Latency (Lower is Better)"}, "barmode": "group", "width": 600, "height": 400, "colorscale": "Viridis"}, "data": [{"x": ["Write Latency", "Read Latency"], "y": [80, 20], "name": "Copy On Write (COW)", "type": "bar", "marker": {"color": "#4dabf7"}}, {"x": ["Write Latency", "Read Latency"], "y": [20, 60], "name": "Merge On Read (MOR)", "type": "bar", "marker": {"color": "#ff6b6b"}}]}Performance characteristics showing that Merge On Read optimizes for faster writes at the expense of slower reads compared to Copy On Write.Managing Small Files and CompactionA common issue in streaming ingestion is the "small file problem." If a stream writes a record every few seconds, you may end up with millions of kilobyte-sized Parquet files. This destroys read performance because the query engine spends more time opening and closing files than reading data.OTFs provide native compaction utilities (often called "bin-packing"). A background process reads these small files and rewrites them into larger, optimal files (typically 128MB to 1GB) without blocking readers. The metadata is then updated to point to the new large files and mark the small ones as logically deleted.This process is idempotent and invisible to end-users. The cost of compaction $C$ generally follows the volume of data written, but the savings during read time $R$ are exponential relative to the number of files reduced.$$ \text{ReadEfficiency} \propto \frac{1}{\text{FileCount}} $$By maintaining strict control over file layouts and metadata, Open Table Formats transform the data lake from a static file repository into a dynamic, transactional database capable of supporting reliable enterprise analytics.