Skip to content

dbt Best Practices

Modern Transformation with dbt


Overview

dbt (data build tool) has become the standard for data transformation in modern data platforms. It enables SQL-based transformations with testing, documentation, and version control. This document covers dbt best practices for production-grade pipelines.


Core Concepts

dbt Workflow

Project Structure

dbt_project/
├── models/
│ ├── staging/
│ │ ├── stg_users.sql
│ │ ├── stg_orders.sql
│ │ └── staging.yml # Source definitions
│ ├── intermediate/
│ │ ├── int_user_orders.sql
│ │ └── int_order_metrics.sql
│ └── marts/
│ ├── dim_customer.sql
│ ├── fact_sales.sql
│ └── marts.yml # Model documentation
├── tests/
│ ├── schema_tests/
│ └── data_tests/
├── snapshots/
│ └── snap_customer_history.sql
├── macros/
│ └── central_billing.sql
├── seeds/
│ └── reference_data.csv
├── analyses/
│ └── ad_hoc_analysis.sql
├── dbt_project.yml
└── profiles.yml

Modeling Best Practices

Staging Layer

Purpose: Raw to clean, cast columns, rename.

-- models/staging/stg_orders.sql
WITH source AS (
SELECT * FROM {{ source('raw', 'orders') }}
),
renamed AS (
SELECT
id AS order_id,
customer_id,
order_date,
status,
total_amount
FROM source
),
typed AS (
SELECT
order_id,
customer_id,
CAST(order_date AS TIMESTAMP) AS order_date,
status,
CAST(total_amount AS DECIMAL(18,2)) AS total_amount
FROM renamed
)
SELECT * FROM typed

Best Practices:

  • One model per source table
  • Rename columns to be consistent
  • Cast to correct types
  • No business logic
  • No joins (keep it simple)

Intermediate Layer

Purpose: Business logic, transformations, joins.

-- models/intermediate/int_user_orders.sql
WITH orders AS (
SELECT * FROM {{ ref('stg_orders') }}
),
users AS (
SELECT * FROM {{ ref('stg_users') }}
),
customer_orders AS (
SELECT
o.order_id,
u.customer_id,
u.customer_name,
o.order_date,
o.total_amount,
ROW_NUMBER() OVER (
PARTITION BY u.customer_id
ORDER BY o.order_date DESC
) AS order_rank
FROM orders o
JOIN users u ON o.customer_id = u.customer_id
)
SELECT * FROM customer_orders

Best Practices:

  • Business logic here
  • Join staging models
  • Reusable transformations
  • Document complex logic

Marts Layer

Purpose: Dimensional models for consumption.

-- models/marts/dim_customer.sql
WITH customer_orders AS (
SELECT * FROM {{ ref('int_user_orders') }}
),
customer_metrics AS (
SELECT
customer_id,
customer_name,
MIN(order_date) AS first_order_date,
MAX(order_date) AS last_order_date,
COUNT(*) AS total_orders,
SUM(total_amount) AS lifetime_value
FROM customer_orders
GROUP BY customer_id, customer_name
)
SELECT * FROM customer_metrics

Best Practices:

  • Star schema (fact + dimensions)
  • BI-ready
  • Aggregations appropriate for consumption
  • Conformed dimensions across marts

Source Configuration

sources.yml

models/staging/staging.yml
version: 2
sources:
- name: raw
database: raw_data
schema: public
tables:
- name: orders
description: "Raw orders from production database"
columns:
- name: id
description: "Primary key"
tests:
- unique
- not_null
- name: customer_id
description: "Foreign key to customers"
tests:
- not_null
- name: order_date
description: "When order was placed"
- name: total_amount
description: "Order total in USD"
- name: status
description: "Order status"
tests:
- accepted_values:
values: ['placed', 'shipped', 'completed', 'cancelled']

Testing Strategy

Schema Tests

tests/schema_tests/orders.yml
version: 2
models:
- name: stg_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: order_date
tests:
- not_null
- name: total_amount
tests:
- not_null
- dbt_utils.expression_is_true:
expression: "total_amount >= 0"

Data Tests

-- tests/data_tests/order_amounts.sql
{{
config(
tags = ['data_quality', 'business_logic']
)
}}
-- Test: Average order amount should be reasonable
SELECT
CASE
WHEN AVG(total_amount) BETWEEN 10 AND 1000 THEN 1
ELSE 0
END AS test_passed
FROM {{ ref('stg_orders') }}
-- Test: Total orders should not decrease significantly
SELECT
CASE
WHEN COUNT(*) >= LAG(COUNT(*)) OVER (ORDER BY order_date)
* 0.9 -- Allow 10% decrease
THEN 1
ELSE 0
END AS test_passed
FROM {{ ref('stg_orders') }}
GROUP BY DATE_TRUNC('day', order_date)

Generic Tests

tests/generic/tests.yml
version: 2
models:
- name: dim_customer
tests:
- dbt_utils.relationships_where:
field_to_test: customer_id
from: ref('dim_customer')
to: ref('stg_users')
from_condition: customer_id IS NOT NULL
to_condition: id IS NOT NULL
- dbt_utils.recency:
datepart: day
field: last_order_date
interval: 30

Documentation

Model Documentation

models/marts/marts.yml
version: 2
models:
- name: dim_customer
description: "Customer dimension for analytics"
columns:
- name: customer_id
description: "Unique identifier for customer"
- name: customer_name
description: "Full customer name"
- name: first_order_date
description: "Date of first order"
- name: last_order_date
description: "Date of most recent order"
- name: total_orders
description: "Total number of orders placed"
- name: lifetime_value
description: "Total revenue from customer"

Documentation Blocks

-- models/staging/stg_orders.sql
{{
config(
materialized = 'view',
tags = ['staging', 'orders']
)
}}
/*
Staging model for orders
This model:
- Renames columns to be consistent with data warehouse
- Casts columns to appropriate types
- Filters out test data
Input: raw.orders
Output: stg_orders
*/
WITH source AS (
SELECT * FROM {{ source('raw', 'orders') }}
),
filtered AS (
SELECT * FROM source
WHERE is_test = false -- Exclude test data
),
renamed AS (
SELECT
id AS order_id,
customer_id,
order_date,
status,
total_amount
FROM filtered
)
SELECT * FROM renamed

Materialization Strategies

View Materialization (Default)

models/staging/stg_orders.yml
{{
config(
materialized = 'view'
)
}}

Use When:

  • Staging models (no need to persist)
  • Rapid iteration (development)
  • Real-time requirements

Table Materialization

models/marts/fact_sales.yml
{{
config(
materialized = 'table',
sort = 'order_date',
dist = 'customer_id'
)
}}

Use When:

  • Large tables (improve query performance)
  • Frequently queried
  • Downstream dependencies

Incremental Materialization

models/marts/fact_sales_incremental.yml
{{
config(
materialized = 'incremental',
incremental_strategy = 'insert_overwrite',
unique_key = 'order_id',
partition_by = ['order_date']
)
}}
-- SQL
SELECT * FROM {{ source('stg_orders') }}
{% if is_incremental() %}
-- Only insert new records
WHERE order_date > (SELECT MAX(order_date) FROM {{ this }})
{% endif %}

Use When:

  • Large tables where full rebuild is expensive
  • Streaming data
  • Time-series data

Snapshots

SCD Type 2 with Snapshots

-- snapshots/snap_customers.sql
{% snapshot snap_customers %}
{{
config(
target_schema = 'snapshots',
unique_key = 'customer_id',
strategy = 'timestamp',
updated_at = 'updated_at',
)
}}
SELECT * FROM {{ source('raw', 'customers') }}
{% endsnapshot %}

Use When:

  • Need history of dimension changes
  • SCD Type 2 requirements
  • Audit trails

Macros

Reusable Macros

-- macros/central_billing.sql
{% macro central_billing(schema) %}
{% set billing_tables = [
'fact_sales',
'fact_inventory',
'fact_shipments'
] %}
{% for table in billing_tables %}
GRANT SELECT ON {{ schema }}.{{ table }} TO billing_role;
{% endfor %}
{% endmacro %}

Utility Macros

-- utilities/get_column_values.sql
{% macro get_column_values(table, column) %}
{% set sql %}
SELECT DISTINCT {{ column }}
FROM {{ table }}
ORDER BY {{ column }}
{% endset %}
{% set results = run_query(sql) %}
{{ return(results) }}
{% endmacro %}

CI/CD Integration

dbt Cloud vs. dbt Core

Featuredbt Clouddbt Core
HostingManagedSelf-hosted
CI/CDBuilt-inRequires setup
IntegrationNative with data platformsCustom
Cost$100+/team/monthFree
Best ForTeams wanting managed serviceTeams wanting full control

dbt Core CI/CD

.github/workflows/dbt.yml
name: dbt CI
on:
push:
branches: [main]
pull_request:
jobs:
dbt:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Setup Python
uses: actions/setup-python@v2
with:
python-version: '3.9'
- name: Install dbt
run: pip install dbt-postgres
- name: Run dbt
run: dbt run --profiles-dir profiles/
env:
DBT_ENV_SECRET_DATABASE_URL: ${{ secrets.DATABASE_URL }}

Performance Optimization

Partitioning

models/marts/fact_sales.yml
{{
config(
materialized = 'table',
partition_by = ['order_date'],
)
}}

Sorting

{{
config(
materialized = 'table',
sort = ['order_date', 'customer_id'],
dist = 'customer_id'
)
}}

Incremental Strategies

{{
config(
materialized = 'incremental',
incremental_strategy = 'insert_overwrite', # or 'delete+insert'
unique_key = 'order_id'
)
}}

Cost Considerations

Materialization Cost

MaterializationStorage CostCompute CostWhen to Use
ViewNoneCompute on queryStaging, dev
TableHighLow on queryProduction marts
IncrementalHighLow on rebuildLarge tables

Optimizing Costs

# Use views for staging (no storage)
staging/models:
+materialized: view
# Use tables for marts (storage cost, query savings)
marts/models:
+materialized: table
# Use incremental for large tables
large_tables:
+materialized: incremental

Senior Level Considerations

Anti-Patterns

Anti-Pattern 1: Business logic in staging

-- Bad: Complex logic in staging
WITH complex AS (
-- 100 lines of complex logic
)
-- Good: Keep staging simple, move logic to intermediate

Anti-Pattern 2: No tests

# Bad: No tests
models:
- name: important_model
# Good: Comprehensive tests
models:
- name: important_model
tests:
- dbt_utils.recency:
datepart: day
field: updated_at
- not_null:
column_name: id

Anti-Pattern 3: Hardcoding values

-- Bad: Hardcoded
SELECT * FROM orders WHERE status = 'completed'
-- Good: Use variables or dbt_utils
SELECT * FROM orders WHERE status = '{{ var("order_status") }}'

Key Takeaways

  1. Layering: Staging → Intermediate → Marts
  2. Staging: Rename, cast, no business logic
  3. Intermediate: Business logic, joins, reusability
  4. Marts: BI-ready, star schema, aggregated
  5. Testing: Comprehensive tests at every layer
  6. Documentation: Document models, columns, sources
  7. Materialization: Choose based on use case (view vs. table vs. incremental)
  8. Cost: Views for staging, tables for production

Back to Module 4