Slowly Changing Dimensions Type 2

Strategies for managing data updates often fall short when history matters. While overwriting data (Type 1) ensures your dimension tables reflect the current state, it destroys the context required for accurate historical reporting. If a sales representative transfers from the North region to the South region, a Type 1 update retroactively moves all their past sales to the South. This distorts regional performance metrics. To maintain an accurate historical record, we employ the Slowly Changing Dimension Type 2 (SCD Type 2) strategy.

Preserving History with Row Versioning

SCD Type 2 is the most common technique for tracking historical data in a data warehouse. Instead of overwriting the existing record when an attribute changes, this method retains the old record and inserts a new record to represent the new state.

This approach relies on row versioning. A single entity from the source system, such as a specific customer or product, will span multiple rows in your dimension table. Each row represents a specific time slice in that entity's history. To manage these slices effectively, the schema requires three specific metadata columns:

Effective Start Date: The timestamp when the attribute values in this row became valid.
Effective End Date: The timestamp when the values ceased to be valid. For the current row, this is often set to NULL or a high value future date (e.g., 9999-12-31).
Current Flag: A boolean or integer indicator (1 or 0) to quickly identify the active record.

The introduction of versioning makes the surrogate key mandatory. The natural key (e.g., employee_id) is no longer unique in the dimension table because the same employee appears multiple times. The surrogate key acts as the primary key for the table and creates a unique identifier for every version of that employee.

Visualizing the Transition

For example, imagine a scenario where a customer moves locations. In an operational system, the address is simply updated. In a dimensional model using SCD Type 2, the process involves expiring the old record and creating a new active record.

The transition of a dimension row during an SCD Type 2 update. The original record (SK 101) is closed by updating its end date, and a new record (SK 102) is inserted with the new region.

Logic for Partitioning History

Implementing SCD Type 2 requires precise logic during the ETL (Extract, Transform, Load) process. When new data arrives, the system compares the incoming attributes against the current version in the dimension table.

If a change is detected, two distinct operations occur within the same transaction:

Expire the Old Record: The system locates the row where the natural key matches and the is_current flag is true. It updates the effective_end_date to the current timestamp and sets is_current to false.
Insert the New Record: A new row is inserted with a new surrogate key. The effective_start_date is set to the current timestamp, effective_end_date is set to high-value logic (e.g., 9999-12-31), and is_current is set to true.

Mathematically, for any query querying time $t$ , a dimension row is valid if:

$t_{start} \le t < t_{end}$

Where $t_{end}$ for the current active row is infinite. This logic ensures that at any specific microsecond in time, exactly one version of the dimension is valid.

Impact on Fact Tables

The strength of the SCD Type 2 pattern lies in how it interacts with Fact tables. When a transaction occurs, such as a sale or a login event, the ETL process looks up the dimension table to find the surrogate key active at that specific moment.

For example, if a sales transaction occurs on June 1st, it links to Surrogate Key 101 (Region: East). If a transaction occurs on July 1st, it links to Surrogate Key 102 (Region: West).

Once this link is established in the Fact table, the complexity of dates disappears for the data analyst. The analyst simply joins the Fact table to the Dimension table on the Surrogate Key.

SELECT 
    d.region,
    SUM(f.sales_amount) as total_revenue
FROM fact_sales f
JOIN dim_customer d ON f.customer_sk = d.customer_sk
GROUP BY d.region;

In this query, the database automatically attributes the revenue to the correct historical region. Sales made while the customer lived in the East fall under "East," and sales made after the move fall under "West." No complex WHERE clauses or date arithmetic are required in the reporting layer.

Storage and Performance

While SCD Type 2 provides the highest fidelity for historical analysis, it comes with storage and processing costs.

Table Growth: The dimension table grows with every change. Frequent changes to large dimensions (millions of customers changing profiles daily) can cause the table to balloon rapidly, degrading scan performance.
Attribute Selection: You must be selective about which attributes trigger a Type 2 change. If you track a rapidly changing attribute like last_login_time or age using Type 2, you will generate a new row every day for every user. These attributes are better suited for a Type 1 update (overwrite) or moving them to a separate fact table.

Schema Definition Example

A strong SCD Type 2 table definition typically looks like the following SQL structure. Note the use of indices to optimize the lookup of the current active record, which is the most frequent operation during ETL processing.

CREATE TABLE dim_product (
    product_sk INT PRIMARY KEY,          -- Surrogate Key
    product_id VARCHAR(50),              -- Natural Key
    product_name VARCHAR(255),
    category VARCHAR(100),
    price DECIMAL(10, 2),
    -- SCD Type 2 Metadata
    effective_from TIMESTAMP NOT NULL,
    effective_to TIMESTAMP DEFAULT '9999-12-31',
    is_current BOOLEAN DEFAULT TRUE
);

-- Index for ETL lookup performance
CREATE INDEX idx_product_lookup 
ON dim_product (product_id) 
WHERE is_current = TRUE;

This design creates a system where history is immutable. Once a record is closed, it serves as a permanent snapshot of the past, allowing for "as-was" reporting, analyzing data exactly as it appeared at a previous point in time.

Was this section helpful?

References

The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition, Ralph Kimball and Margy Ross, 2013 (John Wiley & Sons) - This seminal work establishes the principles of dimensional modeling and provides comprehensive guidance on Slowly Changing Dimensions, serving as the essential resource for data warehouse practitioners.