Accumulating snapshot fact tables solve a specific analytical problem that transactional schemas handle poorly: tracking the lifespan of a business process. While a transactional table records individual events as separate rows, an accumulating snapshot creates a single row per business entity (such as an order, a claim, or a support ticket) and updates that row as the process evolves.
This structure is useful for analyzing workflows with a defined beginning and end. It allows analysts to calculate the time duration between stages, often called "lag", using simple column arithmetic rather than complex self-joins on a massive transaction log.
In this practical exercise, we will model an e-commerce order fulfillment pipeline. An order moves through four specific milestones:

1. Order Placed
2. Payment Confirmed
3. Shipped
4. Delivered
In a standard transactional model, these would be four distinct rows in an events table. In our accumulating snapshot, this will be one row with multiple date columns.
The schema requires a primary key identifying the process (the order) and foreign keys for the relevant dimensions (Customer, Product, Geography). To keep the example compact, only the customer key is shown below. The distinguishing feature is the set of date/time columns representing the milestones.
CREATE TABLE fct_order_processing_snapshot (
    order_key INT PRIMARY KEY,
    customer_key INT,
    order_id VARCHAR(50),
    -- Milestone Timestamps
    order_placed_at TIMESTAMP,
    payment_confirmed_at TIMESTAMP,
    shipped_at TIMESTAMP,
    delivered_at TIMESTAMP,
    -- Measure
    order_amount DECIMAL(10,2),
    -- Status
    current_status VARCHAR(20),
    last_updated_at TIMESTAMP
);
The transition from a transactional view to an accumulating snapshot involves pivoting event rows into column attributes. The following diagram illustrates how multiple event rows for "Order #101" consolidate into a single record in the snapshot table.
Transformation of discrete event rows into a single comprehensive workflow row.
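The same consolidation, written out for Order #101 with purely illustrative timestamps, looks like this:

-- Transactional view: four event rows for order '101'
--   ('101', 'ORDER_PLACED',    '2023-03-01 09:15')
--   ('101', 'PAYMENT_SUCCESS', '2023-03-01 09:20')
--   ('101', 'SHIPPED',         '2023-03-02 14:05')
--   ('101', 'DELIVERED',       '2023-03-04 11:30')
--
-- Accumulating snapshot view: one row, one column per milestone
--   order_id             = '101'
--   order_placed_at      = '2023-03-01 09:15'
--   payment_confirmed_at = '2023-03-01 09:20'
--   shipped_at           = '2023-03-02 14:05'
--   delivered_at         = '2023-03-04 11:30'
--   current_status       = 'DELIVERED'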
Populating this table requires a different loading strategy than a standard fact table does. Data warehouses typically favor append-only operations, but the accumulating snapshot inherently requires updates: when the "Shipped" event occurs for an existing order, we must modify the existing row rather than insert a new one.
In modern data warehouses like Snowflake, BigQuery, or Databricks, the MERGE statement is the standard method for handling this logic. The logic compares incoming event data against the existing snapshot. If the order exists, it updates the specific timestamp column corresponding to the new event. If it is a new order, it inserts a new row.
Consider a source stream stg_order_events containing today's changes. We apply these to our snapshot:
MERGE INTO fct_order_processing_snapshot AS target
USING (
    SELECT
        order_id,
        customer_key,
        event_type,
        event_timestamp,
        amount
    FROM stg_order_events
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN
    UPDATE SET
        target.payment_confirmed_at = CASE
            WHEN source.event_type = 'PAYMENT_SUCCESS' THEN source.event_timestamp
            ELSE target.payment_confirmed_at
        END,
        target.shipped_at = CASE
            WHEN source.event_type = 'SHIPPED' THEN source.event_timestamp
            ELSE target.shipped_at
        END,
        target.delivered_at = CASE
            WHEN source.event_type = 'DELIVERED' THEN source.event_timestamp
            ELSE target.delivered_at
        END,
        target.last_updated_at = CURRENT_TIMESTAMP(),
        target.current_status = source.event_type
WHEN NOT MATCHED AND source.event_type = 'ORDER_PLACED' THEN
    INSERT (
        order_key, customer_key, order_id,
        order_placed_at, order_amount,
        current_status, last_updated_at
    )
    VALUES (
        source.order_id, -- simplification: a production pipeline would generate an integer surrogate key here
        source.customer_key,
        source.order_id,
        source.event_timestamp,
        source.amount,
        'ORDER_PLACED',
        CURRENT_TIMESTAMP()
    );
This query handles the workflow logic. When a shipment event arrives, the query locates the row created during the "Order Placed" phase and populates the previously NULL shipped_at column.
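One practical caveat, not shown above: if a single batch in stg_order_events can contain more than one event for the same order, most warehouses will reject a MERGE whose source matches the same target row multiple times. A common workaround, sketched here under that assumption, is to collapse the batch to one row per order with conditional aggregation and use that as the MERGE source; the UPDATE branch then assigns each milestone with COALESCE(source.column, target.column) so columns absent from the batch are left untouched:

-- Sketch: one row per order per batch, used as the USING subquery of the MERGE.
SELECT
    order_id,
    MAX(customer_key) AS customer_key,
    MAX(CASE WHEN event_type = 'ORDER_PLACED'    THEN event_timestamp END) AS order_placed_at,
    MAX(CASE WHEN event_type = 'PAYMENT_SUCCESS' THEN event_timestamp END) AS payment_confirmed_at,
    MAX(CASE WHEN event_type = 'SHIPPED'         THEN event_timestamp END) AS shipped_at,
    MAX(CASE WHEN event_type = 'DELIVERED'       THEN event_timestamp END) AS delivered_at,
    MAX(amount) AS order_amount
FROM stg_order_events
GROUP BY order_id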
The primary reason for implementing this pattern is to measure efficiency. Business stakeholders often ask, "How long does it take for a paid order to ship?" In a transactional table, answering this requires grouping by order ID and calculating the delta between the MIN(time) of two different event types. In an accumulating snapshot, the query is a simple subtraction.
The lag between the payment timestamp ($t_{\text{payment}}$) and the shipping timestamp ($t_{\text{shipped}}$) is:

$$\text{lag}_{\text{ship}} = t_{\text{shipped}} - t_{\text{payment}}$$
In SQL, this translates to a straightforward aggregation:
SELECT
    AVG(DATEDIFF(hour, payment_confirmed_at, shipped_at)) AS avg_hours_to_ship,
    AVG(DATEDIFF(hour, shipped_at, delivered_at)) AS avg_hours_transit
FROM fct_order_processing_snapshot
WHERE order_placed_at >= '2023-01-01'
  AND delivered_at IS NOT NULL;
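For comparison, the same avg_hours_to_ship figure computed from a raw transactional events table needs a pivot-and-group step first. A sketch, assuming a hypothetical fct_order_events table that stores one row per event with the same event_type values used above:

-- Pivot each order's events, then average the per-order lag.
SELECT
    AVG(DATEDIFF(hour, paid_at, shipped_at)) AS avg_hours_to_ship
FROM (
    SELECT
        order_id,
        MIN(CASE WHEN event_type = 'PAYMENT_SUCCESS' THEN event_timestamp END) AS paid_at,
        MIN(CASE WHEN event_type = 'SHIPPED' THEN event_timestamp END) AS shipped_at
    FROM fct_order_events
    GROUP BY order_id
) AS per_order;

The extra grouping step, plus the joins back to dimensions in a real schema, is exactly what the accumulating snapshot removes.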
An accumulating snapshot row is rarely complete immediately. In the early stages of the workflow, downstream columns (like delivered_at) will be NULL. This is expected behavior. These nulls indicate that the process has not yet reached that milestone.
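These NULLs are also directly queryable. As one sketch (column names follow the schema above), you can list orders that are paid but not yet shipped, along with how long they have been waiting:

SELECT
    order_id,
    DATEDIFF(hour, payment_confirmed_at, CURRENT_TIMESTAMP()) AS hours_waiting_to_ship
FROM fct_order_processing_snapshot
WHERE payment_confirmed_at IS NOT NULL
  AND shipped_at IS NULL
ORDER BY hours_waiting_to_ship DESC;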
To monitor the health of the pipeline, you can visualize how many orders have reached each stage. The following chart shows the funnel effect: every order is placed, but the count at each subsequent milestone shrinks (or lags behind) as orders work through the process.
Volume of orders reaching each milestone in the accumulating snapshot.
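A chart like this can be produced straight from the snapshot table, since COUNT over a column ignores NULLs. A minimal sketch:

SELECT
    COUNT(*)                    AS orders_placed,
    COUNT(payment_confirmed_at) AS orders_paid,
    COUNT(shipped_at)           AS orders_shipped,
    COUNT(delivered_at)         AS orders_delivered
FROM fct_order_processing_snapshot
WHERE order_placed_at >= '2023-01-01';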
Real-world data pipelines are noisy. An accumulating snapshot must be resilient to late-arriving data. For example, a "Shipped" event might arrive in the data warehouse before the "Payment" event due to system latency, even if the events happened sequentially in reality.
The MERGE statement provided earlier handles this reasonably well. Because each milestone column is updated independently based on the event type, the order of arrival within the ETL batch matters less: if the "Shipped" timestamp is written first, the payment_confirmed_at column simply remains NULL until the payment record arrives, and the row is complete once both records have been processed, regardless of ingestion sequence. The one gap is an event that arrives before the order's "ORDER_PLACED" row exists; as written, the MERGE would discard it, so such events must be retried in a later batch or the insert condition relaxed.
However, you must also ensure your update logic does not overwrite valid data with older data when you reprocess history. A common safeguard is to populate a milestone column only when it is still NULL, or to check that the incoming timestamp is newer than the existing value. Because each column in an accumulating snapshot represents a distinct milestone, this is less of a risk than in a Type 1 dimension, but it is still worth guarding.
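As a sketch of that safeguard, the WHEN MATCHED branch of the earlier MERGE could be tightened along these lines (fragment shown for shipped_at only; the same pattern applies to the other milestone columns):

WHEN MATCHED THEN
    UPDATE SET
        -- Guard against reprocessed history: only write the milestone
        -- timestamp if it has not already been recorded.
        target.shipped_at = CASE
            WHEN source.event_type = 'SHIPPED' AND target.shipped_at IS NULL
                THEN source.event_timestamp
            ELSE target.shipped_at
        END,
        target.last_updated_at = CURRENT_TIMESTAMP()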