Distributed data ingestion rarely experiences time as a clean linear sequence. Network partitions, device connectivity issues, and batch processing delays often result in events arriving at the warehouse significantly later than when they occurred. A high-throughput pipeline must distinguish between two distinct temporal dimensions: Event Time (when the event actually happened) and Processing Time (when the system observed the event).
Relying solely on processing time for analytics introduces significant skew. If a transaction from 23:55 is ingested at 00:05 the following day, grouping by the processing timestamp assigns revenue to the wrong fiscal day. Conversely, strictly ordering by event time requires a strategy for handling data that arrives after a window of analysis has theoretically closed.
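To make this skew concrete, here is a small, self-contained sketch (the transaction amounts and timestamps are hypothetical) that groups the same records by event date and by processing date, showing how a record ingested just after midnight shifts revenue to the wrong fiscal day when processing time is used.

```python
from datetime import datetime

# Hypothetical transactions: (event_time, processing_time, amount)
transactions = [
    (datetime(2023, 10, 1, 23, 55), datetime(2023, 10, 2, 0, 5), 120.0),  # arrives ~10 min late
    (datetime(2023, 10, 2, 9, 30),  datetime(2023, 10, 2, 9, 31), 80.0),
]

def revenue_by_day(rows, time_index):
    """Sum amounts keyed by the date of the chosen timestamp column."""
    totals = {}
    for row in rows:
        day = row[time_index].date()
        totals[day] = totals.get(day, 0.0) + row[2]
    return totals

print("By event time:     ", revenue_by_day(transactions, 0))  # revenue split across Oct 1 and Oct 2
print("By processing time:", revenue_by_day(transactions, 1))  # all revenue attributed to Oct 2
```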
The difference between the processing time $t_p$ and the event time $t_e$ of an event $e$ defines its skew $S$:

$$S(e) = t_p(e) - t_e(e)$$
In an ideal system, $S(e)$ is near zero. In reality, $S(e)$ is a random variable with a long tail. To manage this, we employ watermarks. A watermark is a heuristic function $W(t)$ asserting that, at processing time $t$, the system does not expect to receive any further events with event timestamps older than $W(t)$.
When the watermark passes a specific timestamp, the ingestion engine can close the corresponding window and materialize its results. Any data arriving with $t_e < W(t_p)$ is classified as late-arriving data.
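As an illustration, a consumer might track a watermark with a simple heuristic such as "maximum observed event time minus a fixed allowed delay". The sketch below assumes that heuristic and a ten-minute delay; production streaming engines expose more sophisticated watermark generators.

```python
from datetime import datetime, timedelta

class WatermarkTracker:
    """Tracks a heuristic watermark: max observed event time minus an allowed delay."""

    def __init__(self, allowed_delay: timedelta):
        self.allowed_delay = allowed_delay
        self.max_event_time = None

    def watermark(self) -> datetime:
        """Current watermark W(t): no earlier events are expected after this point."""
        return self.max_event_time - self.allowed_delay

    def observe(self, event_time: datetime) -> bool:
        """Advance the watermark and return True if the event is late."""
        is_late = self.max_event_time is not None and event_time < self.watermark()
        if self.max_event_time is None or event_time > self.max_event_time:
            self.max_event_time = event_time
        return is_late

tracker = WatermarkTracker(allowed_delay=timedelta(minutes=10))
tracker.observe(datetime(2023, 10, 2, 0, 30))          # advances the watermark to 00:20
print(tracker.observe(datetime(2023, 10, 1, 23, 55)))  # True: event time is older than the watermark
```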
The following chart visualizes the relationship between event time and processing time. The "Ideal" line represents zero skew, and the "Watermark" indicates the system's tolerance for delay. Events arriving within the watermark's tolerance are processed normally, while events whose timestamps fall behind the watermark are considered late and trigger specific handling logic.
When an event violates the watermark constraint, the ingestion pipeline must execute one of three strategies based on business requirements and the cost of re-computation.
The simplest approach is to drop data that arrives after the window closes. This is common in monitoring systems where real-time operational health is more valuable than 100% historical accuracy. If a CPU metric from 10 minutes ago arrives now, it may no longer be actionable.
The second strategy diverts late data to a separate storage location, often a "cold" storage bucket or a dedicated late_arrivals table. This prevents the main pipeline from stalling or re-triggering expensive computations. Engineers can then run batch jobs periodically (e.g., nightly) to merge these late records into the primary warehouse tables. This balances the need for real-time stability with eventual consistency.
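A rough sketch of how the first two strategies combine in a router follows; the write helper and the main_table and late_arrivals destinations are hypothetical placeholders for an actual warehouse sink.

```python
from datetime import datetime, timedelta

MAX_USEFUL_AGE = timedelta(hours=24)  # beyond this lag, the record is dropped outright

def write(table: str, payload: dict) -> None:
    """Placeholder sink; a real pipeline would call the warehouse client here."""
    print(f"writing to {table}: {payload}")

def route_event(event_time: datetime, watermark: datetime, payload: dict) -> str:
    """Route an event to the main table, a late_arrivals table, or drop it."""
    if event_time >= watermark:
        write("main_table", payload)      # on time: normal ingestion path
        return "main"
    if watermark - event_time <= MAX_USEFUL_AGE:
        write("late_arrivals", payload)   # late, but worth a nightly reconciliation
        return "late"
    return "dropped"                      # too old to be actionable (drop strategy)
```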
The third strategy is to reprocess. In financial or regulatory contexts, data completeness is mandatory: when late data arrives, the system must update previously materialized aggregates. In an MPP data warehouse, this has specific implications for storage micro-partitions.
If you partition your data by date, a record for 2023-10-01 arriving on 2023-11-01 forces the warehouse to locate the October partition, write the new record, and rewrite the affected micro-partitions. This operation is I/O intensive. To mitigate the performance impact, engineers often use a Lookback Window.
A Lookback Window limits how far back the system will check for updates. For example, an ingestion job running at hour $H$ might query source data for the window $[H-1, H]$ but also check for updates to records in $[H-24, H]$.
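A minimal sketch of an hourly job using this pattern is shown below. The staging.events and warehouse.events tables, the column names, and the MERGE shape are assumptions for illustration; the exact SQL and client call depend on your warehouse.

```python
from datetime import datetime, timedelta

def build_merge_sql(run_time: datetime, lookback_hours: int = 24) -> str:
    """Build a MERGE covering the current hour plus a bounded lookback window."""
    window_start = run_time - timedelta(hours=1)                   # [H-1, H]: fresh data
    lookback_start = run_time - timedelta(hours=lookback_hours)    # [H-24, H]: accepted late updates
    return f"""
    MERGE INTO warehouse.events AS target
    USING (
        SELECT * FROM staging.events
        WHERE ingestion_time >= '{window_start.isoformat()}'
          AND event_time     >= '{lookback_start.isoformat()}'
    ) AS source
    ON target.event_id = source.event_id
    WHEN MATCHED THEN UPDATE SET target.payload = source.payload
    WHEN NOT MATCHED THEN INSERT (event_id, event_time, payload)
        VALUES (source.event_id, source.event_time, source.payload);
    """

print(build_merge_sql(datetime(2023, 10, 2, 12, 0)))
```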
The diagram below outlines a decision flow for routing ingestion data based on temporal constraints and partition impact.
To audit what we knew and when we knew it, advanced warehouses implement bitemporal modeling. This involves storing both the Event Time and the Ingestion Time for every record.
Consider a user updating their address. The move happened on Monday (Event Time), but the system received the update on Wednesday (Ingestion Time). By storing both, you can reconstruct the state of the database as it appeared on Tuesday (showing the old address) or query the current truth (showing the new address).
In Snowflake or BigQuery, this is implemented by adding _ingestion_timestamp columns to all tables and using them as secondary clustering keys. This allows for efficient queries that filter on ingestion windows for incremental loading, while analytical queries filter on event timestamps for business logic.
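To illustrate the bitemporal reconstruction described above, the following sketch replays the address example on an in-memory pandas frame; the user_id, event_time, and ingestion_time columns are hypothetical stand-ins for a warehouse table.

```python
import pandas as pd

# Bitemporal history of a user's address: one row per observed version.
history = pd.DataFrame({
    "user_id": [42, 42],
    "address": ["12 Old Street", "99 New Avenue"],
    "event_time": pd.to_datetime(["2023-01-15", "2023-10-02"]),      # when it became true (move on Monday)
    "ingestion_time": pd.to_datetime(["2023-01-15", "2023-10-04"]),  # when the system learned it (Wednesday)
})

def address_as_of(as_of_ingestion: str) -> str:
    """Reconstruct the address the database believed at a given ingestion time."""
    known = history[history["ingestion_time"] <= as_of_ingestion]
    return known.sort_values("event_time").iloc[-1]["address"]

print(address_as_of("2023-10-03"))  # Tuesday's view of the database: "12 Old Street"
print(address_as_of("2023-10-05"))  # after the late update lands: "99 New Avenue"
```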
The handling of late data often dictates the choice between Lambda and Kappa architectures.
In modern MPP warehouses, the distinction blurs. With features like Dynamic Tables (Snowflake) or Materialized Views (BigQuery), the warehouse effectively acts as a streaming engine that can handle micro-batch updates. The engine automatically handles the "retraction" or update logic when late data enters the underlying source tables, provided the refresh logic is defined correctly.
Effectively managing late-arriving data requires a precise definition of latency requirements. If business stakeholders require 99.9% accuracy within 5 minutes, the infrastructure costs will rise exponentially to handle out-of-order events. Conversely, accepting a 24-hour reconciliation period allows for cost-effective batch repairs of historical partitions.