dbt Best Practices

Modern Transformation with dbt

Overview

dbt (data build tool) has become the standard for data transformation in modern data platforms. It enables SQL-based transformations with testing, documentation, and version control. This document covers dbt best practices for production-grade pipelines.

Core Concepts

dbt Workflow

Project Structure

dbt_project/
├── models/
│   ├── staging/
│   │   ├── stg_users.sql
│   │   ├── stg_orders.sql
│   │   └── staging.yml  # Source definitions
│   ├── intermediate/
│   │   ├── int_user_orders.sql
│   │   └── int_order_metrics.sql
│   └── marts/
│       ├── dim_customer.sql
│       ├── fact_sales.sql
│       └── marts.yml  # Model documentation
├── tests/
│   ├── schema_tests/
│   └── data_tests/
├── snapshots/
│   └── snap_customer_history.sql
├── macros/
│   └── central_billing.sql
├── seeds/
│   └── reference_data.csv
├── analyses/
│   └── ad_hoc_analysis.sql
├── dbt_project.yml
└── profiles.yml

Modeling Best Practices

Staging Layer

Purpose: Raw to clean, cast columns, rename.

-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        order_date,
        status,
        total_amount
    FROM source
),

typed AS (
    SELECT
        order_id,
        customer_id,
        CAST(order_date AS TIMESTAMP) AS order_date,
        status,
        CAST(total_amount AS DECIMAL(18,2)) AS total_amount
    FROM renamed
)

SELECT * FROM typed

Best Practices:

One model per source table
Rename columns to be consistent
Cast to correct types
No business logic
No joins (keep it simple)

Intermediate Layer

Purpose: Business logic, transformations, joins.

-- models/intermediate/int_user_orders.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),

users AS (
    SELECT * FROM {{ ref('stg_users') }}
),

customer_orders AS (
    SELECT
        o.order_id,
        u.customer_id,
        u.customer_name,
        o.order_date,
        o.total_amount,
        ROW_NUMBER() OVER (
            PARTITION BY u.customer_id
            ORDER BY o.order_date DESC
        ) AS order_rank
    FROM orders o
    JOIN users u ON o.customer_id = u.customer_id
)

SELECT * FROM customer_orders

Best Practices:

Business logic here
Join staging models
Reusable transformations
Document complex logic

Marts Layer

Purpose: Dimensional models for consumption.

-- models/marts/dim_customer.sql
WITH customer_orders AS (
    SELECT * FROM {{ ref('int_user_orders') }}
),

customer_metrics AS (
    SELECT
        customer_id,
        customer_name,
        MIN(order_date) AS first_order_date,
        MAX(order_date) AS last_order_date,
        COUNT(*) AS total_orders,
        SUM(total_amount) AS lifetime_value
    FROM customer_orders
    GROUP BY customer_id, customer_name
)

SELECT * FROM customer_metrics

Best Practices:

Star schema (fact + dimensions)
BI-ready
Aggregations appropriate for consumption
Conformed dimensions across marts

Source Configuration

sources.yml

version: 2

sources:
  - name: raw
    database: raw_data
    schema: public
    tables:
      - name: orders
        description: "Raw orders from production database"
        columns:
          - name: id
            description: "Primary key"
            tests:
              - unique
              - not_null
          - name: customer_id
            description: "Foreign key to customers"
            tests:
              - not_null
          - name: order_date
            description: "When order was placed"
          - name: total_amount
            description: "Order total in USD"
          - name: status
            description: "Order status"
            tests:
              - accepted_values:
                  values: ['placed', 'shipped', 'completed', 'cancelled']

Testing Strategy

Schema Tests

version: 2

models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null

      - name: order_date
        tests:
          - not_null

      - name: total_amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "total_amount >= 0"

Data Tests

-- tests/data_tests/order_amounts.sql
{{
  config(
    tags = ['data_quality', 'business_logic']
  )
}}

-- Test: Average order amount should be reasonable
SELECT
    CASE
        WHEN AVG(total_amount) BETWEEN 10 AND 1000 THEN 1
        ELSE 0
    END AS test_passed
FROM {{ ref('stg_orders') }}

-- Test: Total orders should not decrease significantly
SELECT
    CASE
        WHEN COUNT(*) >= LAG(COUNT(*)) OVER (ORDER BY order_date)
             * 0.9  -- Allow 10% decrease
        THEN 1
        ELSE 0
    END AS test_passed
FROM {{ ref('stg_orders') }}
GROUP BY DATE_TRUNC('day', order_date)

Generic Tests

version: 2

models:
  - name: dim_customer
    tests:
      - dbt_utils.relationships_where:
          field_to_test: customer_id
          from: ref('dim_customer')
          to: ref('stg_users')
          from_condition: customer_id IS NOT NULL
          to_condition: id IS NOT NULL

      - dbt_utils.recency:
          datepart: day
          field: last_order_date
          interval: 30

Documentation

Model Documentation

version: 2

models:
  - name: dim_customer
    description: "Customer dimension for analytics"
    columns:
      - name: customer_id
        description: "Unique identifier for customer"
      - name: customer_name
        description: "Full customer name"
      - name: first_order_date
        description: "Date of first order"
      - name: last_order_date
        description: "Date of most recent order"
      - name: total_orders
        description: "Total number of orders placed"
      - name: lifetime_value
        description: "Total revenue from customer"

Documentation Blocks

-- models/staging/stg_orders.sql
{{
  config(
    materialized = 'view',
    tags = ['staging', 'orders']
  )
}}

/*
  Staging model for orders

  This model:
  - Renames columns to be consistent with data warehouse
  - Casts columns to appropriate types
  - Filters out test data

  Input: raw.orders
  Output: stg_orders
*/

WITH source AS (
    SELECT * FROM {{ source('raw', 'orders') }}
),

filtered AS (
    SELECT * FROM source
    WHERE is_test = false  -- Exclude test data
),

renamed AS (
    SELECT
        id AS order_id,
        customer_id,
        order_date,
        status,
        total_amount
    FROM filtered
)

SELECT * FROM renamed

Materialization Strategies

View Materialization (Default)

{{
  config(
    materialized = 'view'
  )
}}

Use When:

Staging models (no need to persist)
Rapid iteration (development)
Real-time requirements

Table Materialization

{{
  config(
    materialized = 'table',
    sort = 'order_date',
    dist = 'customer_id'
  )
}}

Use When:

Large tables (improve query performance)
Frequently queried
Downstream dependencies

Incremental Materialization

{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    unique_key = 'order_id',
    partition_by = ['order_date']
  )
}}

-- SQL
SELECT * FROM {{ source('stg_orders') }}

{% if is_incremental() %}
  -- Only insert new records
  WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}

Use When:

Large tables where full rebuild is expensive
Streaming data
Time-series data

Snapshots

SCD Type 2 with Snapshots

-- snapshots/snap_customers.sql
{% snapshot snap_customers %}

{{
  config(
    target_schema = 'snapshots',
    unique_key = 'customer_id',
    strategy = 'timestamp',
    updated_at = 'updated_at',
  )
}}

SELECT * FROM {{ source('raw', 'customers') }}

{% endsnapshot %}

Use When:

Need history of dimension changes
SCD Type 2 requirements
Audit trails

Macros

Reusable Macros

-- macros/central_billing.sql
{% macro central_billing(schema) %}

  {% set billing_tables = [
    'fact_sales',
    'fact_inventory',
    'fact_shipments'
  ] %}

  {% for table in billing_tables %}
    GRANT SELECT ON {{ schema }}.{{ table }} TO billing_role;
  {% endfor %}

{% endmacro %}

Utility Macros

-- utilities/get_column_values.sql
{% macro get_column_values(table, column) %}
  {% set sql %}
    SELECT DISTINCT {{ column }}
    FROM {{ table }}
    ORDER BY {{ column }}
  {% endset %}

  {% set results = run_query(sql) %}
  {{ return(results) }}
{% endmacro %}

CI/CD Integration

dbt Cloud vs. dbt Core

Feature	dbt Cloud	dbt Core
Hosting	Managed	Self-hosted
CI/CD	Built-in	Requires setup
Integration	Native with data platforms	Custom
Cost	$100+/team/month	Free
Best For	Teams wanting managed service	Teams wanting full control

dbt Core CI/CD

name: dbt CI

on:
  push:
    branches: [main]
  pull_request:

jobs:
  dbt:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2

      - name: Setup Python
        uses: actions/setup-python@v2
        with:
          python-version: '3.9'

      - name: Install dbt
        run: pip install dbt-postgres

      - name: Run dbt
        run: dbt run --profiles-dir profiles/
        env:
          DBT_ENV_SECRET_DATABASE_URL: ${{ secrets.DATABASE_URL }}

Performance Optimization

Partitioning

{{
  config(
    materialized = 'table',
    partition_by = ['order_date'],
  )
}}

Sorting

{{
  config(
    materialized = 'table',
    sort = ['order_date', 'customer_id'],
    dist = 'customer_id'
  )
}}

Incremental Strategies

{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',  # or 'delete+insert'
    unique_key = 'order_id'
  )
}}

Cost Considerations

Materialization Cost

Materialization	Storage Cost	Compute Cost	When to Use
View	None	Compute on query	Staging, dev
Table	High	Low on query	Production marts
Incremental	High	Low on rebuild	Large tables

Optimizing Costs

# Use views for staging (no storage)
staging/models:
  +materialized: view

# Use tables for marts (storage cost, query savings)
marts/models:
  +materialized: table

# Use incremental for large tables
large_tables:
  +materialized: incremental

Senior Level Considerations

Anti-Patterns

Anti-Pattern 1: Business logic in staging

-- Bad: Complex logic in staging
WITH complex AS (
    -- 100 lines of complex logic
)

-- Good: Keep staging simple, move logic to intermediate

Anti-Pattern 2: No tests

# Bad: No tests
models:
  - name: important_model

# Good: Comprehensive tests
models:
  - name: important_model
    tests:
      - dbt_utils.recency:
          datepart: day
          field: updated_at
      - not_null:
          column_name: id

Anti-Pattern 3: Hardcoding values

-- Bad: Hardcoded
SELECT * FROM orders WHERE status = 'completed'

-- Good: Use variables or dbt_utils
SELECT * FROM orders WHERE status = '{{ var("order_status") }}'

Key Takeaways

Layering: Staging → Intermediate → Marts
Staging: Rename, cast, no business logic
Intermediate: Business logic, joins, reusability
Marts: BI-ready, star schema, aggregated
Testing: Comprehensive tests at every layer
Documentation: Document models, columns, sources
Materialization: Choose based on use case (view vs. table vs. incremental)
Cost: Views for staging, tables for production

Back to Module 4