The theoretical necessity of preserving historical data is evident in many analytical scenarios. For example, when an analyst asks for sales performance by sales territory for the previous year, the query must reflect the territory assignments as they existed last year, not as they exist today. This requirement translates into a concrete SQL implementation.
This practical exercise guides you through building a Slowly Changing Dimension (SCD) Type 2 loader. We will focus on a dim_customer table where we track changes to the address field.
Before writing transformation logic, the physical table structure must support historical versioning. A standard Type 1 dimension table contains only the natural key and attributes. For Type 2, we add metadata columns to control the validity window of each record.
Your target dimension table requires the following structure:
customer_id).address, city).start_date): The point in time t when this version became true.end_date): The point in time when this version ceased to be true.is_current): A boolean indicator for efficient filtering of the most recent records.The following diagram illustrates the data flow and state transitions required when processing an incoming change.
Logic flow for determining whether to insert, update, or ignore incoming records based on state comparison.
First, we create the table structure. In a cloud data warehouse, you might use IDENTITY or AUTOINCREMENT for the surrogate key. We use a high-value date (e.g., '9999-12-31') to represent an active record's expiration date, ensuring it acts as "infinity" for range queries.
CREATE TABLE dim_customer (
customer_sk INT IDENTITY(1,1) PRIMARY KEY, -- Surrogate Key
customer_id VARCHAR(50), -- Natural Key
address VARCHAR(255), -- Tracking changes here
city VARCHAR(100),
start_date DATE,
end_date DATE,
is_current BOOLEAN
);
In a production pipeline, data arrives in a staging area. Let us assume we have a stg_customer_updates table containing the current state of customers from the operational system.
We need to identify three categories of records:
customer_id is in staging but not in the dimension.customer_id exists in both, but the address differs.customer_id exists in both, and the address is identical.Implementing SCD Type 2 generally involves two distinct operations within a transaction: identifying the records to "expire" and inserting the new versions.
First, we identify records that have changed. We look for rows where the natural key matches, the dimension record is currently active, and the attribute values differ. We update these rows to set the end_date to the current processing date (or the business effective date) and toggle is_current to false.
UPDATE dim_customer d
SET
end_date = CURRENT_DATE(),
is_current = FALSE
FROM stg_customer_updates s
WHERE d.customer_id = s.customer_id
AND d.is_current = TRUE
AND d.address <> s.address; -- Detects change
Next, we insert the new active records. This covers two scenarios: entirely new customers and the new versions of the customers we just expired in the previous step.
INSERT INTO dim_customer (
customer_id,
address,
city,
start_date,
end_date,
is_current
)
SELECT
s.customer_id,
s.address,
s.city,
CURRENT_DATE(), -- New version starts now
'9999-12-31', -- Future infinity date
TRUE
FROM stg_customer_updates s
LEFT JOIN dim_customer d
ON s.customer_id = d.customer_id
AND d.is_current = TRUE
WHERE
d.customer_id IS NULL -- It is a brand new customer
OR s.address <> d.address; -- OR it is a changed customer
Once the logic is applied over time, a single customer entity will span multiple rows. Each row represents a distinct time slice. Querying the table for a specific customer reveals their history.
Consider a customer who lived in New York, moved to Chicago in June, and moved to Seattle in November.
Timeline view of a single customer's active records. The different colors represent different active addresses during specific time intervals.
The power of SCD Type 2 lies in the ability to reconstruct the state at any specific moment. If you need to generate an invoice report based on where the customer lived on September 15th, you filter using the date ranges.
The SQL condition start_date≤t<end_date is the standard pattern for these lookups.
SELECT
k.customer_sk,
k.city,
f.order_amount
FROM fact_orders f
JOIN dim_customer k
ON f.customer_id = k.customer_id
WHERE f.order_date = '2023-09-15'
AND f.order_date >= k.start_date
AND f.order_date < k.end_date;
If you used the standard join on customer_id without the date logic, you would likely fan out the result set (joining one fact row to three dimension rows) or join to the wrong address, leading to incorrect tax or shipping calculations.
When implementing this in a high-volume environment, consider the following technical details:
<>) usually return NULL if one side of the comparison is NULL. If your source system allows null attributes, use null-safe comparisons (e.g., COALESCE(d.address, '') <> COALESCE(s.address, '')) to ensure changes are detected accurately.source_hash <> target_hash to detect changes.start_date and end_date align with the grain of your fact data. If your facts include timestamps, your dimension effective dates might need to be timestamps as well to avoid ambiguity for events happening on the boundary of a change.Was this section helpful?
© 2026 ApX Machine LearningEngineered with