dbt Best Practices
Modern Transformation with dbt
Overview
dbt (data build tool) has become the standard for data transformation in modern data platforms. It enables SQL-based transformations with testing, documentation, and version control. This document covers dbt best practices for production-grade pipelines.
Core Concepts
dbt Workflow
Project Structure
dbt_project/├── models/│ ├── staging/│ │ ├── stg_users.sql│ │ ├── stg_orders.sql│ │ └── staging.yml # Source definitions│ ├── intermediate/│ │ ├── int_user_orders.sql│ │ └── int_order_metrics.sql│ └── marts/│ ├── dim_customer.sql│ ├── fact_sales.sql│ └── marts.yml # Model documentation├── tests/│ ├── schema_tests/│ └── data_tests/├── snapshots/│ └── snap_customer_history.sql├── macros/│ └── central_billing.sql├── seeds/│ └── reference_data.csv├── analyses/│ └── ad_hoc_analysis.sql├── dbt_project.yml└── profiles.ymlModeling Best Practices
Staging Layer
Purpose: Raw to clean, cast columns, rename.
-- models/staging/stg_orders.sqlWITH source AS ( SELECT * FROM {{ source('raw', 'orders') }}),
renamed AS ( SELECT id AS order_id, customer_id, order_date, status, total_amount FROM source),
typed AS ( SELECT order_id, customer_id, CAST(order_date AS TIMESTAMP) AS order_date, status, CAST(total_amount AS DECIMAL(18,2)) AS total_amount FROM renamed)
SELECT * FROM typedBest Practices:
- One model per source table
- Rename columns to be consistent
- Cast to correct types
- No business logic
- No joins (keep it simple)
Intermediate Layer
Purpose: Business logic, transformations, joins.
-- models/intermediate/int_user_orders.sqlWITH orders AS ( SELECT * FROM {{ ref('stg_orders') }}),
users AS ( SELECT * FROM {{ ref('stg_users') }}),
customer_orders AS ( SELECT o.order_id, u.customer_id, u.customer_name, o.order_date, o.total_amount, ROW_NUMBER() OVER ( PARTITION BY u.customer_id ORDER BY o.order_date DESC ) AS order_rank FROM orders o JOIN users u ON o.customer_id = u.customer_id)
SELECT * FROM customer_ordersBest Practices:
- Business logic here
- Join staging models
- Reusable transformations
- Document complex logic
Marts Layer
Purpose: Dimensional models for consumption.
-- models/marts/dim_customer.sqlWITH customer_orders AS ( SELECT * FROM {{ ref('int_user_orders') }}),
customer_metrics AS ( SELECT customer_id, customer_name, MIN(order_date) AS first_order_date, MAX(order_date) AS last_order_date, COUNT(*) AS total_orders, SUM(total_amount) AS lifetime_value FROM customer_orders GROUP BY customer_id, customer_name)
SELECT * FROM customer_metricsBest Practices:
- Star schema (fact + dimensions)
- BI-ready
- Aggregations appropriate for consumption
- Conformed dimensions across marts
Source Configuration
sources.yml
version: 2
sources: - name: raw database: raw_data schema: public tables: - name: orders description: "Raw orders from production database" columns: - name: id description: "Primary key" tests: - unique - not_null - name: customer_id description: "Foreign key to customers" tests: - not_null - name: order_date description: "When order was placed" - name: total_amount description: "Order total in USD" - name: status description: "Order status" tests: - accepted_values: values: ['placed', 'shipped', 'completed', 'cancelled']Testing Strategy
Schema Tests
version: 2
models: - name: stg_orders columns: - name: order_id tests: - unique - not_null
- name: order_date tests: - not_null
- name: total_amount tests: - not_null - dbt_utils.expression_is_true: expression: "total_amount >= 0"Data Tests
-- tests/data_tests/order_amounts.sql{{ config( tags = ['data_quality', 'business_logic'] )}}
-- Test: Average order amount should be reasonableSELECT CASE WHEN AVG(total_amount) BETWEEN 10 AND 1000 THEN 1 ELSE 0 END AS test_passedFROM {{ ref('stg_orders') }}
-- Test: Total orders should not decrease significantlySELECT CASE WHEN COUNT(*) >= LAG(COUNT(*)) OVER (ORDER BY order_date) * 0.9 -- Allow 10% decrease THEN 1 ELSE 0 END AS test_passedFROM {{ ref('stg_orders') }}GROUP BY DATE_TRUNC('day', order_date)Generic Tests
version: 2
models: - name: dim_customer tests: - dbt_utils.relationships_where: field_to_test: customer_id from: ref('dim_customer') to: ref('stg_users') from_condition: customer_id IS NOT NULL to_condition: id IS NOT NULL
- dbt_utils.recency: datepart: day field: last_order_date interval: 30Documentation
Model Documentation
version: 2
models: - name: dim_customer description: "Customer dimension for analytics" columns: - name: customer_id description: "Unique identifier for customer" - name: customer_name description: "Full customer name" - name: first_order_date description: "Date of first order" - name: last_order_date description: "Date of most recent order" - name: total_orders description: "Total number of orders placed" - name: lifetime_value description: "Total revenue from customer"Documentation Blocks
-- models/staging/stg_orders.sql{{ config( materialized = 'view', tags = ['staging', 'orders'] )}}
/* Staging model for orders
This model: - Renames columns to be consistent with data warehouse - Casts columns to appropriate types - Filters out test data
Input: raw.orders Output: stg_orders*/
WITH source AS ( SELECT * FROM {{ source('raw', 'orders') }}),
filtered AS ( SELECT * FROM source WHERE is_test = false -- Exclude test data),
renamed AS ( SELECT id AS order_id, customer_id, order_date, status, total_amount FROM filtered)
SELECT * FROM renamedMaterialization Strategies
View Materialization (Default)
{{ config( materialized = 'view' )}}Use When:
- Staging models (no need to persist)
- Rapid iteration (development)
- Real-time requirements
Table Materialization
{{ config( materialized = 'table', sort = 'order_date', dist = 'customer_id' )}}Use When:
- Large tables (improve query performance)
- Frequently queried
- Downstream dependencies
Incremental Materialization
{{ config( materialized = 'incremental', incremental_strategy = 'insert_overwrite', unique_key = 'order_id', partition_by = ['order_date'] )}}
-- SQLSELECT * FROM {{ source('stg_orders') }}
{% if is_incremental() %} -- Only insert new records WHERE order_date > (SELECT MAX(order_date) FROM {{ this }}){% endif %}Use When:
- Large tables where full rebuild is expensive
- Streaming data
- Time-series data
Snapshots
SCD Type 2 with Snapshots
-- snapshots/snap_customers.sql{% snapshot snap_customers %}
{{ config( target_schema = 'snapshots', unique_key = 'customer_id', strategy = 'timestamp', updated_at = 'updated_at', )}}
SELECT * FROM {{ source('raw', 'customers') }}
{% endsnapshot %}Use When:
- Need history of dimension changes
- SCD Type 2 requirements
- Audit trails
Macros
Reusable Macros
-- macros/central_billing.sql{% macro central_billing(schema) %}
{% set billing_tables = [ 'fact_sales', 'fact_inventory', 'fact_shipments' ] %}
{% for table in billing_tables %} GRANT SELECT ON {{ schema }}.{{ table }} TO billing_role; {% endfor %}
{% endmacro %}Utility Macros
-- utilities/get_column_values.sql{% macro get_column_values(table, column) %} {% set sql %} SELECT DISTINCT {{ column }} FROM {{ table }} ORDER BY {{ column }} {% endset %}
{% set results = run_query(sql) %} {{ return(results) }}{% endmacro %}CI/CD Integration
dbt Cloud vs. dbt Core
| Feature | dbt Cloud | dbt Core |
|---|---|---|
| Hosting | Managed | Self-hosted |
| CI/CD | Built-in | Requires setup |
| Integration | Native with data platforms | Custom |
| Cost | $100+/team/month | Free |
| Best For | Teams wanting managed service | Teams wanting full control |
dbt Core CI/CD
name: dbt CI
on: push: branches: [main] pull_request:
jobs: dbt: runs-on: ubuntu-latest
steps: - uses: actions/checkout@v2
- name: Setup Python uses: actions/setup-python@v2 with: python-version: '3.9'
- name: Install dbt run: pip install dbt-postgres
- name: Run dbt run: dbt run --profiles-dir profiles/ env: DBT_ENV_SECRET_DATABASE_URL: ${{ secrets.DATABASE_URL }}Performance Optimization
Partitioning
{{ config( materialized = 'table', partition_by = ['order_date'], )}}Sorting
{{ config( materialized = 'table', sort = ['order_date', 'customer_id'], dist = 'customer_id' )}}Incremental Strategies
{{ config( materialized = 'incremental', incremental_strategy = 'insert_overwrite', # or 'delete+insert' unique_key = 'order_id' )}}Cost Considerations
Materialization Cost
| Materialization | Storage Cost | Compute Cost | When to Use |
|---|---|---|---|
| View | None | Compute on query | Staging, dev |
| Table | High | Low on query | Production marts |
| Incremental | High | Low on rebuild | Large tables |
Optimizing Costs
# Use views for staging (no storage)staging/models: +materialized: view
# Use tables for marts (storage cost, query savings)marts/models: +materialized: table
# Use incremental for large tableslarge_tables: +materialized: incrementalSenior Level Considerations
Anti-Patterns
Anti-Pattern 1: Business logic in staging
-- Bad: Complex logic in stagingWITH complex AS ( -- 100 lines of complex logic)
-- Good: Keep staging simple, move logic to intermediateAnti-Pattern 2: No tests
# Bad: No testsmodels: - name: important_model
# Good: Comprehensive testsmodels: - name: important_model tests: - dbt_utils.recency: datepart: day field: updated_at - not_null: column_name: idAnti-Pattern 3: Hardcoding values
-- Bad: HardcodedSELECT * FROM orders WHERE status = 'completed'
-- Good: Use variables or dbt_utilsSELECT * FROM orders WHERE status = '{{ var("order_status") }}'Key Takeaways
- Layering: Staging → Intermediate → Marts
- Staging: Rename, cast, no business logic
- Intermediate: Business logic, joins, reusability
- Marts: BI-ready, star schema, aggregated
- Testing: Comprehensive tests at every layer
- Documentation: Document models, columns, sources
- Materialization: Choose based on use case (view vs. table vs. incremental)
- Cost: Views for staging, tables for production
Back to Module 4