Restructuring raw transactional logs from an e-commerce platform into a star schema requires applying the principles of facts, dimensions, and defining the grain. This process transforms a write-optimized format into a read-optimized structure suitable for analytical reporting.
Imagine you are working with a dataset from a fictional retailer, CloudMart. The source system records transactions in a single, wide logging table or a set of normalized operational tables. For our purposes, consider a flat extract representing the raw state of a transaction:
| transaction_id | timestamp | customer_id | customer_name | product_id | product_category | quantity | unit_price |
|---|---|---|---|---|---|---|---|
| TXN-1001 | 2023-10-01 14:30:00 | C-502 | Alice Smith | P-99 | Electronics | 1 | 1200.00 |
| TXN-1001 | 2023-10-01 14:30:00 | C-502 | Alice Smith | P-12 | Accessories | 2 | 25.00 |
| TXN-1002 | 2023-10-01 15:15:00 | C-881 | Bob Jones | P-45 | Home | 1 | 55.00 |
This format is inefficient for analysis. If Alice Smith changes her name or the Electronics category is renamed, you might have to update millions of historical rows. Furthermore, summing revenue requires scanning text columns that play no part in the calculation.
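For instance, a single name correction forces a rewrite of every historical row for that customer. The sketch below is illustrative only; it assumes the flat extract has been loaded into a staging table named raw_transactions, and the new name is hypothetical:

```sql
-- In the flat log, one name change touches every historical transaction row.
UPDATE raw_transactions
SET customer_name = 'Alice Smith-Jones'
WHERE customer_id = 'C-502';
```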
Before drawing tables, we must define the grain. Looking at the data, a single transaction (TXN-1001) splits into multiple rows because Alice bought a laptop and a mouse.
The grain for our fact table will be one row per line item in a transaction.
If we defined the grain as "one row per order", we would struggle to store the distinct product IDs and quantities without resorting to complex arrays or losing detail. By choosing the line-item grain, we ensure that our metrics remain additive.
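One way to see the benefit: line-item rows always roll up cleanly to the order level with a simple aggregation, whereas going from an order-level table back down to line items is impossible. A minimal sketch, again assuming the hypothetical raw_transactions staging table:

```sql
-- Line-item rows roll up to one row per order; the reverse direction cannot be recovered.
SELECT
    transaction_id,
    SUM(quantity)              AS total_items,
    SUM(quantity * unit_price) AS order_total
FROM raw_transactions
GROUP BY transaction_id;
```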
We strip the context (the text and descriptive attributes) out of the transaction log and organize those attributes into dimension tables. Based on our raw data, we can identify three primary dimensions: Product, Customer, and Time.
- **dim_product**: We extract product_id, product_name (implied), and product_category. In a production environment, we would add internal SKU numbers and brand information here.
- **dim_customer**: We extract customer_id and customer_name. We would also add address, segment, and email attributes here.
- **dim_date**: While the raw data has a timestamp, analytics often require grouping by fiscal quarter, day of the week, or holiday status. A dedicated date dimension linked by a surrogate key (e.g., 20231001) is standard practice.
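As a rough illustration of how dimension rows come out of the log, the descriptive columns can simply be de-duplicated from the raw extract. This sketch assumes the same hypothetical raw_transactions staging table; assigning surrogate keys is deferred to the loading step shown later:

```sql
-- Illustrative only: de-duplicate the descriptive attributes of the raw extract.
SELECT DISTINCT customer_id, customer_name
FROM raw_transactions;   -- candidate rows for dim_customer

SELECT DISTINCT product_id, product_category
FROM raw_transactions;   -- candidate rows for dim_product
```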
The fact table, fact_sales, holds the foreign keys pointing to our dimensions and the numerical measures.
Foreign Keys:

- product_key
- customer_key
- date_key

Measures:

- quantity: The number of items purchased.
- extended_sales_amount: The total revenue for that line item.

We calculate the revenue metric at the time of ingestion using the formula:
$$\text{Revenue} = \text{Quantity} \times \text{UnitPrice}$$
Storing the calculated extended_sales_amount simplifies downstream queries, avoiding the need for multiplication during every aggregation.
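In SQL terms this is just a derived column computed while transforming the raw extract, for example (again assuming the hypothetical raw_transactions staging table):

```sql
SELECT
    transaction_id,
    quantity,
    unit_price,
    quantity * unit_price AS extended_sales_amount  -- computed once at ingestion, stored in the fact table
FROM raw_transactions;
```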
The following diagram represents our target Star Schema. Notice how the central fact table contains only keys and numbers, while the surrounding tables hold the descriptive text.
Figure: Logical representation of the CloudMart sales schema, where the central fact table references three surrounding dimension tables.
To implement this design, we create the tables using Data Definition Language (DDL). The following examples use generic SQL syntax compatible with most cloud data warehouses.
First, we create the dimensions. Note the use of a product_key. This is a surrogate key, an auto-incrementing integer or hash that serves as the primary key within our warehouse, distinct from the product_id used in the source system. We will discuss the critical role of surrogate keys in depth in the next chapter.
```sql
CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_id_natural VARCHAR(50),
    product_name VARCHAR(255),
    category VARCHAR(100),
    current_price DECIMAL(10, 2)
);

CREATE TABLE dim_customer (
    customer_key INT PRIMARY KEY,
    customer_id_natural VARCHAR(50),
    full_name VARCHAR(100),
    segment VARCHAR(50)
);
```
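For completeness, a possible dim_date definition is sketched below. The columns are illustrative and would be adjusted to your reporting calendar; exact type support (such as BOOLEAN) varies slightly across warehouses:

```sql
-- A possible dim_date sketch; columns are illustrative.
CREATE TABLE dim_date (
    date_key INT PRIMARY KEY,    -- e.g. 20231001 (YYYYMMDD convention)
    full_date DATE,
    day_of_week VARCHAR(10),
    fiscal_quarter VARCHAR(10),
    is_holiday BOOLEAN
);
```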
Next, we create the fact table. The columns here are strictly numeric keys or numeric measures, with the exception of transaction_id, which is often kept as a "degenerate dimension" for auditing purposes.
```sql
CREATE TABLE fact_sales (
    sales_key INT PRIMARY KEY,
    date_key INT NOT NULL,       -- Connects to dim_date
    customer_key INT NOT NULL,   -- Connects to dim_customer
    product_key INT NOT NULL,    -- Connects to dim_product
    transaction_id VARCHAR(50),
    quantity INT,
    unit_price DECIMAL(10, 2),
    extended_sales_amount DECIMAL(12, 2),
    FOREIGN KEY (date_key) REFERENCES dim_date(date_key),
    FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key),
    FOREIGN KEY (product_key) REFERENCES dim_product(product_key)
);
```
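Loading the fact table is then a transformation that looks up surrogate keys and computes the revenue measure. The following is a minimal sketch rather than a production pipeline: it assumes the hypothetical raw_transactions staging table, the dim_date sketch above, and already-populated dimension rows, and it generates sales_key with ROW_NUMBER where a real warehouse would typically use an identity column:

```sql
INSERT INTO fact_sales (
    sales_key, date_key, customer_key, product_key,
    transaction_id, quantity, unit_price, extended_sales_amount
)
SELECT
    ROW_NUMBER() OVER (ORDER BY r.transaction_id, r.product_id) AS sales_key,  -- stand-in for an identity column
    d.date_key,
    c.customer_key,
    p.product_key,
    r.transaction_id,
    r.quantity,
    r.unit_price,
    r.quantity * r.unit_price AS extended_sales_amount   -- revenue computed at ingestion
FROM raw_transactions r
JOIN dim_customer c ON c.customer_id_natural = r.customer_id
JOIN dim_product  p ON p.product_id_natural  = r.product_id
JOIN dim_date     d ON d.full_date = CAST(r."timestamp" AS DATE);  -- quoted: timestamp is reserved in some dialects
```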
Once the data is loaded into this structure, queries become significantly simpler and faster compared to the raw log format.
To analyze revenue by category, the database engine only needs to scan the small dim_product table and join it to the relevant columns in fact_sales. It does not need to read customer names or transaction timestamps if they are not part of the query.
```sql
SELECT
    p.category,
    SUM(f.extended_sales_amount) AS total_revenue
FROM
    fact_sales f
JOIN
    dim_product p ON f.product_key = p.product_key
GROUP BY
    p.category
ORDER BY
    total_revenue DESC;
```
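The same pattern extends to the date dimension. Assuming dim_date carries a day_of_week attribute, as in the earlier sketch, revenue by day of week is another small join and group-by:

```sql
SELECT
    d.day_of_week,
    SUM(f.extended_sales_amount) AS total_revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.day_of_week
ORDER BY total_revenue DESC;
```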
By organizing the data this way, we have decoupled the measurement (revenue) from the context (category). This separation also lets the database compress the data very efficiently, since the fact table's columns hold repetitive, predictable integer values.