Kimball Fundamentals
Dimensional Modeling for Data Warehousing
Overview
Kimball dimensional modeling is the most widely used methodology for data warehouse design. Understanding star schemas, conformed dimensions, and fact tables is essential for building scalable data warehouses.
Core Concepts
Fact Tables and Dimension Tables
Fact Table: Contains quantitative measurements (foreign keys + measures).
Dimension Tables: Contains descriptive attributes (context for the facts).
Star Schema vs. Snowflake
Star Schema (Recommended)
Characteristics:
- Denormalized dimensions
- Simple joins (fact → dimensions)
- Faster queries
- More storage
- Easier to understand
Snowflake Schema (Avoid When Possible)
Characteristics:
- Normalized dimensions
- Complex joins (fact → dimension → dimension)
- Slower queries
- Less storage
- Harder to understand
Recommendation: Use star schema for 99% of use cases. Snowflake only when:
- Extremely limited storage
- Many-to-many relationships in dimensions
- Shared dimensions across multiple data marts
Fact Table Types
Transactional Fact Tables
Definition: One row per transaction (most common).
CREATE TABLE fact_sales ( sales_key BIGINT PRIMARY KEY, date_key INT, product_key INT, customer_key INT, store_key INT, promotion_key INT,
-- Measures quantity_sold INT, sales_amount DECIMAL(18,2), cost_amount DECIMAL(18,2), profit_amount DECIMAL(18,2),
-- Degenerate dimensions transaction_number VARCHAR(50), pos_number INT,
FOREIGN KEY (date_key) REFERENCES dim_date(date_key), FOREIGN KEY (product_key) REFERENCES dim_product(product_key), FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key));Periodic Snapshot Fact Tables
Definition: One row per period (day, week, month).
CREATE TABLE fact_inventory_snapshot ( snapshot_key BIGINT PRIMARY KEY, date_key INT, product_key INT, store_key INT,
-- Measures (snapshot values) quantity_on_hand INT, dollar_value_on_hand DECIMAL(18,2),
FOREIGN KEY (date_key) REFERENCES dim_date(date_key), FOREIGN KEY (product_key) REFERENCES dim_product(product_key));Use Cases:
- Inventory levels
- Account balances
- Headcount
Accumulating Snapshot Fact Tables
Definition: One row per entity lifecycle (e.g., order).
CREATE TABLE fact_order_accumulating ( order_key BIGINT PRIMARY KEY, customer_key INT, product_key INT,
-- Milestone dates (degenerate dimensions) order_date TIMESTAMP, ship_date TIMESTAMP, deliver_date TIMESTAMP, payment_date TIMESTAMP,
-- Measures order_amount DECIMAL(18,2), ship_amount DECIMAL(18,2),
FOREIGN KEY (customer_key) REFERENCES dim_customer(customer_key));Use Cases:
- Order fulfillment
- Claims processing
- Loan applications
Dimension Types
Slowly Changing Dimensions (SCD)
SCD Type 1: Overwrite
-- Update existing record (no history)UPDATE dim_customerSET email = 'new.email@example.com'WHERE customer_key = 123;Use When: History not important, corrections only.
SCD Type 2: Add New Row (Most Common)
-- Add new row with effective datesINSERT INTO dim_customer ( customer_key, customer_id, email, effective_start_date, effective_end_date, is_current) VALUES ( DEFAULT, 123, 'new.email@example.com', '2025-01-27', '9999-12-31', true);
-- Update old recordUPDATE dim_customerSET effective_end_date = '2025-01-26', is_current = falseWHERE customer_key = 456 AND is_current = true;Use When: Need history, audit trails, time travel.
SCD Type 3: Add New Attribute
-- Add columns for previous valuesALTER TABLE dim_customerADD COLUMN email_previous VARCHAR(255);
UPDATE dim_customerSET email_previous = email, email = 'new.email@example.com'WHERE customer_key = 123;Use When: Only need current + one previous value.
Role-Playing Dimensions
Definition: Same dimension used multiple times.
CREATE TABLE fact_orders ( order_key BIGINT PRIMARY KEY, order_date_key INT, ship_date_key INT, delivery_date_key INT,
FOREIGN KEY (order_date_key) REFERENCES dim_date(date_key), FOREIGN KEY (ship_date_key) REFERENCES dim_date(date_key), FOREIGN KEY (delivery_date_key) REFERENCES dim_date(date_key));Use Cases:
- Multiple dates per fact
- Same entity in different roles
Degenerate Dimensions
Definition: Dimension attributes stored in fact table (no dimension table).
CREATE TABLE fact_sales ( sales_key BIGINT PRIMARY KEY, date_key INT, product_key INT,
-- Degenerate dimensions (no dimension table) transaction_number VARCHAR(50), pos_number INT, customer_id VARCHAR(50), -- Not a foreign key
sales_amount DECIMAL(18,2));Use When:
- No additional attributes needed
- Unique identifiers (transaction numbers)
- High-cardinality, filter-heavy attributes
Conformed Dimensions
Definition
Dimensions shared across multiple fact tables or subject areas.
Benefits:
- Consistent reporting across subject areas
- Drill-across capability
- Single version of truth
Dimensional Modeling Process
4-Step Process
Step 1: Select Business Process
Questions:
- What business process are we modeling?
- What decisions does this process support?
- Who uses this data?
Examples:
- Retail sales
- Inventory management
- Order fulfillment
- Web analytics
Step 2: Declare Grain
Definition: What does one row in the fact table represent?
Grain Examples:
- Fine grain: One row per line item of a transaction
- Medium grain: One row per transaction
- Coarse grain: One row per day per store
Rule: Grain should be atomic for the business process.
Step 3: Identify Dimensions
Question: What describes the business process?
Common Dimensions:
- Who (customer, employee)
- What (product, service)
- Where (store, geography)
- When (date, time)
- Why (promotion, campaign)
Step 4: Identify Facts
Question: What are we measuring?
Fact Types:
- Additive: Can be summed across all dimensions (sales amount)
- Semi-additive: Can be summed across some dimensions (inventory level)
- Non-additive: Cannot be summed (ratios, percentages)
Handling NULLs in Dimensions
Surrogate Keys
Best Practice: Always use surrogate keys (not natural keys).
CREATE TABLE dim_product ( product_key INT PRIMARY KEY AUTO_INCREMENT, product_id INT, -- Natural key product_name VARCHAR(255), category VARCHAR(100), brand VARCHAR(100));
-- Handle unknown productsINSERT INTO dim_product (product_key, product_id, product_name, category, brand)VALUES (-1, -1, 'Unknown', 'Unknown', 'Unknown');Foreign Key NULLs
-- Use -1 or 0 for unknown dimension keysINSERT INTO fact_sales (date_key, product_key, customer_key, sales_amount)VALUES (20250127, 123, -1, 100.00); -- Unknown customerPerformance Optimization
Indexing Strategy
-- Primary key (automatic index)PRIMARY KEY (fact_key)
-- Foreign key indexes (for joins)CREATE INDEX idx_fact_sales_date_key ON fact_sales(date_key);CREATE INDEX idx_fact_sales_product_key ON fact_sales(product_key);
-- Composite index for common filtersCREATE INDEX idx_fact_sales_date_product ON fact_sales(date_key, product_key);Partitioning Strategy
-- Partition by date (most common)CREATE TABLE fact_sales ( ...) PARTITION BY RANGE (date_key);
-- Create monthly partitionsCREATE TABLE fact_sales_202501 PARTITION OF fact_sales FOR VALUES FROM (20250101) TO (20250201);Common Anti-Patterns
Anti-Pattern 1: Mixed Grain in Fact Table
-- Bad: Mixed grain (some rows are hourly, some daily)date_key | product_key | sales_amount20250127 | 123 | 100.002025012701 | 123 | 50.00 -- Hourly grain
-- Solution: Separate fact tables per grainAnti-Pattern 2: Metrics in Dimensions
-- Bad: Sales total in dimensionCREATE TABLE dim_product ( product_key INT PRIMARY KEY, product_name VARCHAR(255), total_sales DECIMAL(18,2), -- Should be in fact table category VARCHAR(100));
-- Solution: Move to fact table or materialized viewAnti-Pattern 3: Over-Normalization
-- Bad: Snowflake dimensiondim_product → dim_brand → dim_manufacturer
-- Solution: Denormalize (star schema)dim_product (includes brand and manufacturer)Cost Considerations
Storage vs. Query Performance
| Approach | Storage | Query Performance | Recommendation |
|---|---|---|---|
| Star schema | 1.5-2x | Fast | Use for 95% of cases |
| Snowflake | 1x | Slower | Only when storage critical |
Materialized Views
-- Pre-aggregate for common queriesCREATE MATERIALIZED VIEW mv_monthly_sales ASSELECT d.year, d.month, p.category, SUM(s.sales_amount) as total_sales, SUM(s.profit_amount) as total_profitFROM fact_sales sJOIN dim_date d ON s.date_key = d.date_keyJOIN dim_product p ON s.product_key = p.product_keyGROUP BY d.year, d.month, p.category;Cost Impact: Additional storage but 10-100x query improvement for common aggregations.
Key Takeaways
- Star schema: Default choice, denormalized dimensions
- Snowflake: Only for extreme storage constraints
- SCD Type 2: Most common for dimension history
- Conformed dimensions: Enable cross-subject-area analysis
- Grain: Define row representation, be atomic
- Surrogate keys: Always use, never use natural keys
- Additive facts: Design for summation across dimensions
Back to Module 4