Restructuring raw transactional logs from an e-commerce platform into a star schema requires applying the principles of facts, dimensions, and grain. This process transforms a write-optimized format into a read-optimized structure suitable for analytical reporting.

## The Scenario: Raw E-Commerce Data

Imagine you are working with a dataset from a fictional retailer, CloudMart. The source system records transactions in a single, wide logging table or a set of normalized operational tables. For our purposes, consider a flat extract representing the raw state of a transaction:

| transaction_id | timestamp | customer_id | customer_name | product_id | product_category | quantity | unit_price |
|---|---|---|---|---|---|---|---|
| TXN-1001 | 2023-10-01 14:30:00 | C-502 | Alice Smith | P-99 | Electronics | 1 | 1200.00 |
| TXN-1001 | 2023-10-01 14:30:00 | C-502 | Alice Smith | P-12 | Accessories | 2 | 25.00 |
| TXN-1002 | 2023-10-01 15:15:00 | C-881 | Bob Jones | P-45 | Home | 1 | 55.00 |

This format is inefficient for analysis. If Alice Smith changes her name or the Electronics category is renamed, you might have to update millions of historical rows. Furthermore, summing revenue requires scanning text columns that are unnecessary for the calculation.

## Step 1: Defining the Grain

Before drawing tables, we must define the grain. Looking at the data, a single transaction (TXN-1001) splits into multiple rows because Alice bought a laptop and a mouse.

The grain for our fact table will be **one row per line item in a transaction**.

If we defined the grain as "one row per order," we would struggle to store the distinct product IDs and quantities without resorting to complex arrays or losing detail. By choosing the line-item grain, we ensure that our metrics remain additive.

## Step 2: Modeling Dimensions

We strip the context (text and descriptive attributes) out of the transaction log and organize it into dimension tables. Based on our raw data, we can identify three primary dimensions: Product, Customer, and Date.

### Product Dimension (dim_product)

We extract `product_id`, `product_name` (implied), and `product_category`. In a production environment, we would add internal SKU numbers and brand information here.

### Customer Dimension (dim_customer)

We extract `customer_id` and `customer_name`. We would also add address, segment, and email attributes here.

### Date Dimension (dim_date)

While the raw data has a timestamp, analytics often require grouping by fiscal quarter, day of the week, or holiday status. A dedicated date dimension linked by a surrogate key (e.g., 20231001) is standard practice.

## Step 3: Designing the Fact Table

The fact table, `fact_sales`, holds the foreign keys pointing to our dimensions and the numerical measures.

Foreign keys:

- `product_key`
- `customer_key`
- `date_key`

Measures:

- `quantity`: The number of items purchased.
- `extended_sales_amount`: The total revenue for that line item.

We calculate the revenue metric at the time of ingestion using the formula:

$$Revenue = Quantity \times UnitPrice$$

Storing the calculated `extended_sales_amount` simplifies downstream queries, avoiding the need for multiplication during every aggregation.
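To make the ingestion step concrete, here is a minimal sketch of the load transformation, using the `dim_` and `fact_` tables defined in the implementation section below. The staging table `raw_transactions` is hypothetical, standing in for the flat extract above, and the exact date-formatting function varies by warehouse.

```sql
-- Sketch of ingestion into the fact table. Assumptions: a staging table
-- raw_transactions shaped like the flat extract, and sales_key generated
-- by the warehouse (e.g., an IDENTITY column), so it is omitted here.
INSERT INTO fact_sales (date_key, customer_key, product_key,
                        transaction_id, quantity, unit_price,
                        extended_sales_amount)
SELECT
    CAST(TO_CHAR(r.timestamp, 'YYYYMMDD') AS INT) AS date_key,  -- smart key, e.g., 20231001
    c.customer_key,                                             -- surrogate key lookups
    p.product_key,
    r.transaction_id,
    r.quantity,
    r.unit_price,
    r.quantity * r.unit_price AS extended_sales_amount          -- Revenue = Quantity x UnitPrice
FROM raw_transactions r
JOIN dim_customer c ON c.customer_id_natural = r.customer_id
JOIN dim_product  p ON p.product_id_natural  = r.product_id;
```

Because the multiplication happens once at load time, every downstream aggregation can simply `SUM` the stored column.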
## The Logical Schema

The following diagram represents our target star schema. Notice how the central fact table contains only keys and numbers, while the surrounding tables hold the descriptive text.

```dot
digraph StarSchema {
    rankdir=TB;
    nodesep=0.8;
    ranksep=0.6;
    node [shape=plaintext fontname="Arial" fontsize=12];

    // Fact Table
    fact_sales [label=<
        <table border="0" cellborder="1" cellspacing="0" cellpadding="4" bgcolor="#ffffff" color="#4dabf7">
            <tr><td bgcolor="#4dabf7" align="center"><b>fact_sales</b></td></tr>
            <tr><td align="left" port="pk">sales_key (PK)</td></tr>
            <tr><td align="left" port="date">date_key (FK)</td></tr>
            <tr><td align="left" port="cust">customer_key (FK)</td></tr>
            <tr><td align="left" port="prod">product_key (FK)</td></tr>
            <tr><td align="left">transaction_id (DD)</td></tr>
            <tr><td align="left">quantity</td></tr>
            <tr><td align="left">unit_price</td></tr>
            <tr><td align="left">extended_sales_amount</td></tr>
        </table>
    >];

    // Dimension Tables
    dim_date [label=<
        <table border="0" cellborder="1" cellspacing="0" cellpadding="4" bgcolor="#ffffff" color="#20c997">
            <tr><td bgcolor="#20c997" align="center"><b>dim_date</b></td></tr>
            <tr><td align="left" port="pk">date_key (PK)</td></tr>
            <tr><td align="left">full_date</td></tr>
            <tr><td align="left">day_of_week</td></tr>
            <tr><td align="left">month</td></tr>
            <tr><td align="left">quarter</td></tr>
            <tr><td align="left">year</td></tr>
        </table>
    >];

    dim_customer [label=<
        <table border="0" cellborder="1" cellspacing="0" cellpadding="4" bgcolor="#ffffff" color="#20c997">
            <tr><td bgcolor="#20c997" align="center"><b>dim_customer</b></td></tr>
            <tr><td align="left" port="pk">customer_key (PK)</td></tr>
            <tr><td align="left">customer_id_natural</td></tr>
            <tr><td align="left">first_name</td></tr>
            <tr><td align="left">last_name</td></tr>
            <tr><td align="left">email</td></tr>
            <tr><td align="left">region</td></tr>
        </table>
    >];

    dim_product [label=<
        <table border="0" cellborder="1" cellspacing="0" cellpadding="4" bgcolor="#ffffff" color="#20c997">
            <tr><td bgcolor="#20c997" align="center"><b>dim_product</b></td></tr>
            <tr><td align="left" port="pk">product_key (PK)</td></tr>
            <tr><td align="left">sku</td></tr>
            <tr><td align="left">product_name</td></tr>
            <tr><td align="left">category</td></tr>
            <tr><td align="left">brand</td></tr>
        </table>
    >];

    // Relationships
    fact_sales:date -> dim_date:pk [color="#adb5bd" penwidth=1.5];
    fact_sales:cust -> dim_customer:pk [color="#adb5bd" penwidth=1.5];
    fact_sales:prod -> dim_product:pk [color="#adb5bd" penwidth=1.5];
}
```

*Logical representation of the CloudMart sales schema, where the central fact table references three surrounding dimensions.*

## Implementation Using SQL

To implement this design, we create the tables using Data Definition Language (DDL). The following examples use generic SQL syntax compatible with most cloud data warehouses.

First, we create the dimensions. Note the use of `product_key`. This is a surrogate key, an auto-incrementing integer or hash that serves as the primary key within our warehouse, distinct from the `product_id` used in the source system. We will discuss the critical role of surrogate keys in depth in the next chapter.

```sql
CREATE TABLE dim_product (
    product_key        INT PRIMARY KEY,
    product_id_natural VARCHAR(50),
    product_name       VARCHAR(255),
    category           VARCHAR(100),
    current_price      DECIMAL(10, 2)
);

CREATE TABLE dim_customer (
    customer_key        INT PRIMARY KEY,
    customer_id_natural VARCHAR(50),
    full_name           VARCHAR(100),
    segment             VARCHAR(50)
);
```
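The fact table's foreign keys will also reference `dim_date`, which the examples above do not create. A minimal sketch of its DDL, with columns taken from the diagram and the YYYYMMDD smart-key convention described earlier:

```sql
-- Date dimension referenced by fact_sales below.
-- Assumes the YYYYMMDD smart-key convention (e.g., 20231001),
-- populated with one row per calendar day.
CREATE TABLE dim_date (
    date_key    INT PRIMARY KEY,  -- e.g., 20231001
    full_date   DATE,
    day_of_week VARCHAR(10),
    month       INT,
    quarter     INT,
    year        INT
);
```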
Next, we create the fact table. The columns here are strictly numeric keys or numeric measures, with the exception of `transaction_id`, which is often kept as a "degenerate dimension" for auditing purposes.

```sql
CREATE TABLE fact_sales (
    sales_key             INT PRIMARY KEY,
    date_key              INT NOT NULL,  -- Connects to dim_date
    customer_key          INT NOT NULL,  -- Connects to dim_customer
    product_key           INT NOT NULL,  -- Connects to dim_product
    transaction_id        VARCHAR(50),
    quantity              INT,
    unit_price            DECIMAL(10, 2),
    extended_sales_amount DECIMAL(12, 2),
    FOREIGN KEY (date_key) REFERENCES dim_date(date_key),
    FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key),
    FOREIGN KEY (product_key) REFERENCES dim_product(product_key)
);
```

## Analytical Benefits

Once the data is loaded into this structure, queries become significantly simpler and faster than their equivalents against the raw log format.

To analyze revenue by category, the database engine only needs to scan the small `dim_product` table and join it to the relevant columns in `fact_sales`. It does not need to read customer names or transaction timestamps if they are not part of the query.

```sql
SELECT
    p.category,
    SUM(f.extended_sales_amount) AS total_revenue
FROM fact_sales f
JOIN dim_product p ON f.product_key = p.product_key
GROUP BY p.category
ORDER BY total_revenue DESC;
```

By organizing the data this way, we have decoupled the measurement (revenue) from the context (category). This separation also allows the database to compress the data very efficiently, as the column values in the fact table (integers) are repetitive and predictable.
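The same pattern extends to time-based analysis. As an illustration of why the dedicated date dimension pays off, the following sketch rolls revenue up to the quarter without any date arithmetic in the query itself, assuming `dim_date` is populated as described above:

```sql
-- Quarterly revenue without parsing timestamps: the calendar logic lives
-- in dim_date, so the query reduces to a join and a group-by.
SELECT
    d.year,
    d.quarter,
    SUM(f.extended_sales_amount) AS total_revenue
FROM fact_sales f
JOIN dim_date d ON f.date_key = d.date_key
GROUP BY d.year, d.quarter
ORDER BY d.year, d.quarter;
```

Adding fiscal periods or holiday flags later requires only new columns in `dim_date`; the fact table and existing queries remain untouched.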