In distributed data ingestion, time is rarely a clean, linear sequence. Network partitions, device connectivity issues, and batch processing delays often result in events arriving at the warehouse significantly later than when they occurred. A high-throughput pipeline must therefore distinguish between two temporal dimensions: Event Time (when the event actually happened) and Processing Time (when the system observed the event).

Relying solely on processing time for analytics introduces significant skew. If a transaction from 23:55 is ingested at 00:05 the following day, grouping by the processing timestamp assigns revenue to the wrong fiscal day. Conversely, strictly ordering by event time requires a strategy for handling data that arrives after a window of analysis has theoretically closed.

## Temporal Skew and Watermarks

The difference between processing time $t_p$ and event time $t_e$ is defined as the skew $S$:

$$S(e) = t_p(e) - t_e(e)$$

In an ideal system, $S(e)$ is near zero. In reality, $S(e)$ is a random variable with a long tail. To manage this, we employ watermarks. A watermark is a heuristic function $W(t_p)$ asserting that, as of processing time $t_p$, the system does not expect to receive any further events with event timestamps older than $W(t_p)$.

When the watermark passes a given timestamp, the ingestion engine can close the corresponding window and materialize the results. Any data arriving with $t_e < W(t_p)$ is classified as late-arriving data.

The following chart visualizes the relationship between event time and processing time. The "Ideal" line represents zero latency, while the "Watermark" line indicates the system's tolerance for delay. Points below the watermark line are processed normally; points above it are considered late and trigger specific handling logic.

```json
{
  "layout": {
    "title": "Event Time vs. Processing Time",
    "xaxis": {"title": "Event Time (t_e)", "showgrid": true, "zeroline": false},
    "yaxis": {"title": "Processing Time (t_p)", "showgrid": true, "zeroline": false},
    "showlegend": true,
    "plot_bgcolor": "#f8f9fa",
    "paper_bgcolor": "#ffffff",
    "font": {"family": "Arial", "color": "#495057"}
  },
  "data": [
    {
      "x": [0, 10, 20, 30, 40, 50],
      "y": [0, 10, 20, 30, 40, 50],
      "mode": "lines",
      "name": "Ideal (Zero Latency)",
      "line": {"color": "#12b886", "dash": "dash"}
    },
    {
      "x": [0, 10, 20, 30, 40, 50],
      "y": [5, 15, 25, 35, 45, 55],
      "mode": "lines",
      "name": "Watermark Heuristic",
      "line": {"color": "#4c6ef5", "width": 3}
    },
    {
      "x": [12, 22, 35],
      "y": [14, 24, 38],
      "mode": "markers",
      "name": "On-Time Events",
      "marker": {"color": "#228be6", "size": 8}
    },
    {
      "x": [15, 25, 8],
      "y": [30, 45, 20],
      "mode": "markers",
      "name": "Late Events (Violation)",
      "marker": {"color": "#fa5252", "size": 10, "symbol": "x"}
    }
  ]
}
```

## Strategies for Handling Late Arrivals

When an event violates the watermark constraint, the ingestion pipeline must execute one of three strategies, chosen based on business requirements and the cost of re-computation.

### 1. Discarding

The simplest approach is to drop data that arrives after the window closes. This is common in monitoring systems, where real-time operational health is more valuable than 100% historical accuracy. If a CPU metric from 10 minutes ago arrives now, it may no longer be actionable.

### 2. Side Output (Dead Letter Queue)

Late data is diverted to a separate storage location, often a "cold" storage bucket or a dedicated `late_arrivals` table. This prevents the main pipeline from stalling or re-triggering expensive computations. Engineers can then run periodic batch jobs (e.g., nightly) to merge these late records into the primary warehouse tables, balancing the need for real-time stability against eventual consistency.
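As a minimal sketch of how the watermark check and the side-output strategy fit together, the snippet below computes the skew, applies an assumed five-minute allowed lateness as the watermark heuristic, and routes late events to a hypothetical `late_arrivals` sink instead of the main table. The names and the tolerance are illustrative assumptions, not any specific engine's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ALLOWED_LATENESS = timedelta(minutes=5)  # assumed tolerance; tune per pipeline

@dataclass
class Event:
    event_time: datetime       # t_e: when the event actually happened
    processing_time: datetime  # t_p: when the pipeline observed it
    payload: dict

def skew(event: Event) -> timedelta:
    """S(e) = t_p(e) - t_e(e); near zero in an ideal system."""
    return event.processing_time - event.event_time

def watermark(processing_time: datetime) -> datetime:
    """Heuristic W(t_p): we do not expect events older than this timestamp."""
    return processing_time - ALLOWED_LATENESS

def route(event: Event) -> str:
    """Return the destination for an event: main partition or side output."""
    if event.event_time >= watermark(event.processing_time):
        return "main_table"    # on time: append to the current partition
    return "late_arrivals"     # late: divert to the DLQ / cold storage table

# Usage: a transaction from 23:55 ingested at 00:05 the next day has 10 minutes
# of skew and is diverted instead of being booked to the wrong fiscal day.
evt = Event(datetime(2023, 10, 1, 23, 55), datetime(2023, 10, 2, 0, 5), {"amount": 42})
print(skew(evt), route(evt))  # -> 0:10:00 late_arrivals
```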
### 3. Updating and Recalculation

In financial or regulatory contexts, data completeness is mandatory. When late data arrives, the system must update previously materialized aggregates. In an MPP data warehouse, this has specific implications for storage micro-partitions.

If the data is partitioned by date, a record for 2023-10-01 arriving on 2023-11-01 forces the warehouse to un-archive the October partition, write the new record, and re-compress the micro-partition. This operation is I/O intensive. To mitigate the performance impact, engineers often use a Lookback Window.

A Lookback Window limits how far back the system will check for updates. For example, an ingestion job running at hour $H$ might query source data for the window $[H-1, H]$ but also check for updates to records in $[H-24, H]$. Records falling outside that lookback are not merged inline; they are routed to the side output instead.

The diagram below outlines a decision flow for routing ingested data based on temporal constraints and partition impact.

```dot
digraph G {
  rankdir=LR;
  node [shape=box, style=filled, fontname="Arial", fontsize=12];
  edge [fontname="Arial", fontsize=10, color="#868e96"];

  start           [label="Incoming Event", fillcolor="#e7f5ff", color="#4dabf7"];
  check_watermark [label="Check Watermark\n(Is t_e < W(t_p)?)", fillcolor="#fff3bf", color="#fcc419", shape=diamond];
  process_normal  [label="Standard Ingestion\nAppend to Current Partition", fillcolor="#d3f9d8", color="#40c057"];
  check_threshold [label="Within Lookback\nThreshold?", fillcolor="#fff3bf", color="#fcc419", shape=diamond];
  rewrite         [label="Trigger Partition\nRewrite/Merge", fillcolor="#ffec99", color="#fab005"];
  side_output     [label="Route to\nCold Storage/DLQ", fillcolor="#ffe3e3", color="#fa5252"];

  start -> check_watermark;
  check_watermark -> process_normal [label="No (On Time)"];
  check_watermark -> check_threshold [label="Yes (Late)"];
  check_threshold -> rewrite [label="Yes"];
  check_threshold -> side_output [label="No (Too Old)"];
}
```

## Bitemporal Modeling

To audit what we knew and when we knew it, advanced warehouses implement bitemporal modeling: storing both the Event Time and the Ingestion Time for every record.

Consider a user updating their address. The move happened on Monday (Event Time), but the system received the update on Wednesday (Ingestion Time). By storing both, you can reconstruct the state of the database as it appeared on Tuesday (showing the old address) or query the current truth (showing the new address).

In Snowflake or BigQuery, this is typically implemented by adding an `_ingestion_timestamp` column to every table and using it as a secondary clustering key. Incremental loads can then filter efficiently on ingestion windows, while analytical queries filter on event timestamps for business logic.
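To make the address example concrete, here is a minimal as-of reconstruction over records that carry both timestamps. The in-memory list and the column names (`event_time`, `ingestion_time`) stand in for a warehouse table; in practice this would be a query filtering on the ingestion timestamp, but the selection logic is the same under those assumptions.

```python
from datetime import datetime
from typing import Optional

# Each version of the user's address carries both temporal dimensions.
# The move happened Monday (event time) but was ingested Wednesday.
address_history = [
    {"address": "12 Old Road", "event_time": datetime(2023, 9, 1),
     "ingestion_time": datetime(2023, 9, 1)},
    {"address": "99 New Lane", "event_time": datetime(2023, 10, 2),   # Monday
     "ingestion_time": datetime(2023, 10, 4)},                        # Wednesday
]

def address_as_of(as_of_ingestion: datetime) -> Optional[str]:
    """Reconstruct what the warehouse believed at a given ingestion time:
    keep only rows already ingested, then take the latest by event time."""
    visible = [r for r in address_history if r["ingestion_time"] <= as_of_ingestion]
    if not visible:
        return None
    return max(visible, key=lambda r: r["event_time"])["address"]

# Querying "as of Tuesday" still shows the old address, because the update
# had not been ingested yet; querying later shows the current truth.
print(address_as_of(datetime(2023, 10, 3)))  # -> 12 Old Road (Tuesday's view)
print(address_as_of(datetime(2023, 10, 5)))  # -> 99 New Lane (current truth)
```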
## The Lambda vs. Kappa Trade-off

How late data is handled often dictates the choice between Lambda and Kappa architectures.

- **Lambda Architecture:** Uses a speed layer for real-time views and a batch layer for comprehensive processing. Late data is handled naturally by the batch layer, which periodically reprocesses the entire dataset. This is effective but requires maintaining two codebases.
- **Kappa Architecture:** Treats everything as a stream. Late data is handled by reprocessing the stream or by emitting "retractions" (negative records that cancel previously published results); a minimal sketch of this mechanism closes the section. This approach is operationally simpler but requires a stream processing engine capable of managing complex state over long periods.

In modern MPP warehouses, the distinction blurs. With features like Dynamic Tables (Snowflake) or Materialized Views (BigQuery), the warehouse effectively acts as a streaming engine that handles micro-batch updates. The engine applies the retraction or update logic automatically when late data enters the underlying source tables, provided the refresh logic is defined correctly.

Effectively managing late-arriving data requires a precise definition of latency requirements. If business stakeholders demand 99.9% accuracy within 5 minutes, infrastructure costs rise sharply to handle out-of-order events. Conversely, accepting a 24-hour reconciliation period allows for cost-effective batch repairs of historical partitions.
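To close, here is the retraction sketch referenced in the Kappa discussion above: when a late event lands in a window whose aggregate has already been published, the processor emits a negative record cancelling the stale total, followed by the corrected one. The tumbling-window size, record shape, and function names are illustrative assumptions rather than any particular streaming engine's API.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)  # assumed tumbling window size

# Revenue per window start, as already materialized and published downstream.
materialized: dict[datetime, float] = defaultdict(float)

def window_start(event_time: datetime) -> datetime:
    """Truncate an event timestamp to the start of its hourly window."""
    return event_time.replace(minute=0, second=0, microsecond=0)

def apply_late_event(event_time: datetime, amount: float) -> list[tuple[datetime, float]]:
    """Emit a retraction (negative delta) for the stale aggregate plus the
    corrected value, so downstream consumers can cancel the old result."""
    w = window_start(event_time)
    old_total = materialized[w]
    new_total = old_total + amount
    materialized[w] = new_total
    # Downstream sees: cancel the old total, then assert the corrected one.
    return [(w, -old_total), (w, new_total)]

# A window was closed with 100.0 of revenue; a late 7.5 transaction arrives.
materialized[datetime(2023, 10, 1, 23)] = 100.0
updates = apply_late_event(datetime(2023, 10, 1, 23, 55), 7.5)
# Emits a retraction of 100.0 and a corrected total of 107.5 for the 23:00 window.
print(updates)
```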