Kimball Fundamentals

Dimensional Modeling for Data Warehousing

Overview

Kimball dimensional modeling is the most widely used methodology for data warehouse design. Understanding star schemas, conformed dimensions, and fact tables is essential for building scalable data warehouses.

Core Concepts

Fact Tables and Dimension Tables

Fact Table: Contains quantitative measurements (foreign keys + measures).

Dimension Tables: Contains descriptive attributes (context for the facts).

Star Schema vs. Snowflake

Star Schema (Recommended)

Characteristics:

Denormalized dimensions
Simple joins (fact → dimensions)
Faster queries
More storage
Easier to understand

Snowflake Schema (Avoid When Possible)

Characteristics:

Normalized dimensions
Complex joins (fact → dimension → dimension)
Slower queries
Less storage
Harder to understand

Recommendation: Use star schema for 99% of use cases. Snowflake only when:

Extremely limited storage
Many-to-many relationships in dimensions
Shared dimensions across multiple data marts

Fact Table Types

Transactional Fact Tables

Definition: One row per transaction (most common).

CREATE TABLE fact_sales (
    sales_key BIGINT PRIMARY KEY,
    date_key INT,
    product_key INT,
    customer_key INT,
    store_key INT,
    promotion_key INT,

    -- Measures
    quantity_sold INT,
    sales_amount DECIMAL(18,2),
    cost_amount DECIMAL(18,2),
    profit_amount DECIMAL(18,2),

    -- Degenerate dimensions
    transaction_number VARCHAR(50),
    pos_number INT,

    FOREIGN KEY (date_key) REFERENCES dim_date(date_key),
    FOREIGN KEY (product_key) REFERENCES dim_product(product_key),
    FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key)
);

Periodic Snapshot Fact Tables

Definition: One row per period (day, week, month).

CREATE TABLE fact_inventory_snapshot (
    snapshot_key BIGINT PRIMARY KEY,
    date_key INT,
    product_key INT,
    store_key INT,

    -- Measures (snapshot values)
    quantity_on_hand INT,
    dollar_value_on_hand DECIMAL(18,2),

    FOREIGN KEY (date_key) REFERENCES dim_date(date_key),
    FOREIGN KEY (product_key) REFERENCES dim_product(product_key)
);

Use Cases:

Inventory levels
Account balances
Headcount

Accumulating Snapshot Fact Tables

Definition: One row per entity lifecycle (e.g., order).

CREATE TABLE fact_order_accumulating (
    order_key BIGINT PRIMARY KEY,
    customer_key INT,
    product_key INT,

    -- Milestone dates (degenerate dimensions)
    order_date TIMESTAMP,
    ship_date TIMESTAMP,
    deliver_date TIMESTAMP,
    payment_date TIMESTAMP,

    -- Measures
    order_amount DECIMAL(18,2),
    ship_amount DECIMAL(18,2),

    FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key)
);

Use Cases:

Order fulfillment
Claims processing
Loan applications

Dimension Types

Slowly Changing Dimensions (SCD)

SCD Type 1: Overwrite

-- Update existing record (no history)
UPDATE dim_customer
SET email = 'new.email@example.com'
WHERE customer_key = 123;

Use When: History not important, corrections only.

SCD Type 2: Add New Row (Most Common)

-- Add new row with effective dates
INSERT INTO dim_customer (
    customer_key,
    customer_id,
    email,
    effective_start_date,
    effective_end_date,
    is_current
) VALUES (
    DEFAULT,
    123,
    'new.email@example.com',
    '2025-01-27',
    '9999-12-31',
    true
);

-- Update old record
UPDATE dim_customer
SET effective_end_date = '2025-01-26',
    is_current = false
WHERE customer_key = 456 AND is_current = true;

Use When: Need history, audit trails, time travel.

SCD Type 3: Add New Attribute

-- Add columns for previous values
ALTER TABLE dim_customer
ADD COLUMN email_previous VARCHAR(255);

UPDATE dim_customer
SET email_previous = email,
    email = 'new.email@example.com'
WHERE customer_key = 123;

Use When: Only need current + one previous value.

Role-Playing Dimensions

Definition: Same dimension used multiple times.

CREATE TABLE fact_orders (
    order_key BIGINT PRIMARY KEY,
    order_date_key INT,
    ship_date_key INT,
    delivery_date_key INT,

    FOREIGN KEY (order_date_key) REFERENCES dim_date(date_key),
    FOREIGN KEY (ship_date_key) REFERENCES dim_date(date_key),
    FOREIGN KEY (delivery_date_key) REFERENCES dim_date(date_key)
);

Use Cases:

Multiple dates per fact
Same entity in different roles

Degenerate Dimensions

Definition: Dimension attributes stored in fact table (no dimension table).

CREATE TABLE fact_sales (
    sales_key BIGINT PRIMARY KEY,
    date_key INT,
    product_key INT,

    -- Degenerate dimensions (no dimension table)
    transaction_number VARCHAR(50),
    pos_number INT,
    customer_id VARCHAR(50),  -- Not a foreign key

    sales_amount DECIMAL(18,2)
);

Use When:

No additional attributes needed
Unique identifiers (transaction numbers)
High-cardinality, filter-heavy attributes

Conformed Dimensions

Definition

Dimensions shared across multiple fact tables or subject areas.

Benefits:

Consistent reporting across subject areas
Drill-across capability
Single version of truth

Dimensional Modeling Process

4-Step Process

Step 1: Select Business Process

Questions:

What business process are we modeling?
What decisions does this process support?
Who uses this data?

Examples:

Retail sales
Inventory management
Order fulfillment
Web analytics

Step 2: Declare Grain

Definition: What does one row in the fact table represent?

Grain Examples:

Fine grain: One row per line item of a transaction
Medium grain: One row per transaction
Coarse grain: One row per day per store

Rule: Grain should be atomic for the business process.

Step 3: Identify Dimensions

Question: What describes the business process?

Common Dimensions:

Who (customer, employee)
What (product, service)
Where (store, geography)
When (date, time)
Why (promotion, campaign)

Step 4: Identify Facts

Question: What are we measuring?

Fact Types:

Additive: Can be summed across all dimensions (sales amount)
Semi-additive: Can be summed across some dimensions (inventory level)
Non-additive: Cannot be summed (ratios, percentages)

Handling NULLs in Dimensions

Surrogate Keys

Best Practice: Always use surrogate keys (not natural keys).

CREATE TABLE dim_product (
    product_key INT PRIMARY KEY AUTO_INCREMENT,
    product_id INT,  -- Natural key
    product_name VARCHAR(255),
    category VARCHAR(100),
    brand VARCHAR(100)
);

-- Handle unknown products
INSERT INTO dim_product (product_key, product_id, product_name, category, brand)
VALUES (-1, -1, 'Unknown', 'Unknown', 'Unknown');

Foreign Key NULLs

-- Use -1 or 0 for unknown dimension keys
INSERT INTO fact_sales (date_key, product_key, customer_key, sales_amount)
VALUES (20250127, 123, -1, 100.00);  -- Unknown customer

Performance Optimization

Indexing Strategy

-- Primary key (automatic index)
PRIMARY KEY (fact_key)

-- Foreign key indexes (for joins)
CREATE INDEX idx_fact_sales_date_key ON fact_sales(date_key);
CREATE INDEX idx_fact_sales_product_key ON fact_sales(product_key);

-- Composite index for common filters
CREATE INDEX idx_fact_sales_date_product ON fact_sales(date_key, product_key);

Partitioning Strategy

-- Partition by date (most common)
CREATE TABLE fact_sales (
    ...
) PARTITION BY RANGE (date_key);

-- Create monthly partitions
CREATE TABLE fact_sales_202501 PARTITION OF fact_sales
    FOR VALUES FROM (20250101) TO (20250201);

Common Anti-Patterns

Anti-Pattern 1: Mixed Grain in Fact Table

-- Bad: Mixed grain (some rows are hourly, some daily)
date_key | product_key | sales_amount
20250127 | 123 | 100.00
2025012701 | 123 | 50.00  -- Hourly grain

-- Solution: Separate fact tables per grain

Anti-Pattern 2: Metrics in Dimensions

-- Bad: Sales total in dimension
CREATE TABLE dim_product (
    product_key INT PRIMARY KEY,
    product_name VARCHAR(255),
    total_sales DECIMAL(18,2),  -- Should be in fact table
    category VARCHAR(100)
);

-- Solution: Move to fact table or materialized view

Anti-Pattern 3: Over-Normalization

-- Bad: Snowflake dimension
dim_product → dim_brand → dim_manufacturer

-- Solution: Denormalize (star schema)
dim_product (includes brand and manufacturer)

Cost Considerations

Storage vs. Query Performance

Approach	Storage	Query Performance	Recommendation
Star schema	1.5-2x	Fast	Use for 95% of cases
Snowflake	1x	Slower	Only when storage critical

Materialized Views

-- Pre-aggregate for common queries
CREATE MATERIALIZED VIEW mv_monthly_sales AS
SELECT
    d.year,
    d.month,
    p.category,
    SUM(s.sales_amount) as total_sales,
    SUM(s.profit_amount) as total_profit
FROM fact_sales s
JOIN dim_date d ON s.date_key = d.date_key
JOIN dim_product p ON s.product_key = p.product_key
GROUP BY d.year, d.month, p.category;

Cost Impact: Additional storage but 10-100x query improvement for common aggregations.

Key Takeaways

Star schema: Default choice, denormalized dimensions
Snowflake: Only for extreme storage constraints
SCD Type 2: Most common for dimension history
Conformed dimensions: Enable cross-subject-area analysis
Grain: Define row representation, be atomic
Surrogate keys: Always use, never use natural keys
Additive facts: Design for summation across dimensions

Back to Module 4