Skip to content

Soda Data

Data Quality Monitoring for Data Platforms


Overview

Soda (Soda Data Observability) is an open-source data quality monitoring tool that integrates with modern data platforms. Soda provides checks, monitors, and alerts to ensure data quality throughout your pipeline.


Soda Architecture


Soda Installation

Installing Soda CLI

Terminal window
# Install Soda CLI
pip install soda-core
# Install Soda for specific warehouse
pip install "soda-postgres"
pip install "soda-snowflake"
pip install "soda-bigquery"
pip install "soda-spark"
# Verify installation
soda --version

Soda Cloud Setup

Terminal window
# Login to Soda Cloud
soda login
# Create Soda Cloud account
# https://cloud.soda.io/signup

Soda Configuration

Configuration Files

configuration.yaml
data_source name: data_warehouse:
type: postgres
connection:
host: localhost
port: 5432
username: data_user
password: ${DBT_PROD_PASSWORD}
database: data_prod
schema: analytics
schema: public
data_source name: snowflake_edw:
type: snowflake
connection:
account: my-account
username: data_user
password: ${SNOWFLAKE_PASSWORD}
warehouse: compute_wh
database: data_prod
schema: analytics
role: data_engineer_role
data_source name: bigquery_lakehouse:
type: bigquery
connection:
project_id: my-project
dataset_id: analytics
credentials_path: /path/to/service-account.json
checks.yml
# Check definitions for data quality
checks for dim_customers:
- rows_count:
name: Customer count should not be zero
warn: when < 1000
fail: when < 100
- schema:
name: Schema should match expected schema
warn: when schema changes
fail: when schema violation
- freshness:
name: Data should be fresh
warn: when > 24h after {{ latest_load() }}
fail: when > 48h after {{ latest_load() }}
- row_count_comparison:
name: Customer count should match source
check: customers
compute:
method: absolute
diff_percent: 5
warn: when diff_percent > 5
fail: when diff_percent > 10
- numeric_metrics:
name: Numeric values in range
columns:
age:
- name: Age should be 18-100
valid_min: 18
valid_max: 100
credit_score:
- name: Credit score should be 300-850
valid_min: 300
valid_max: 850
- missing_values:
name: Missing values check
columns:
customer_email:
- name: Email should not be null
warn: when > 1% missing
fail: when > 5% missing
customer_phone:
- name: Phone can be 10% missing
warn: when > 10% missing
fail: when > 20% missing
checks for fact_orders:
- freshness:
name: Orders should be fresh
warn: when > 1h after {{ latest_load() }}
fail: when > 6h after {{ latest_load() }}
- duplicate_count:
name: Duplicate orders should be minimal
warn: when duplicate_count > 10
fail: when duplicate_count > 100
- numeric_metrics:
name: Order amounts should be positive
columns:
order_amount:
- name: Amount should be positive
valid_min: 0
warn: when 5% invalid
fail: when 10% invalid
- anomaly_detection:
name: Detect anomalies in order amounts
check: order_amount
compute:
method: is_outside_bounds
standard_deviations: 3
warn: when > 100
fail: when > 500
- volume_checks:
name: Daily order volume should be consistent
check: order_count
compute:
method: relative_diff
other_entity: average_daily_orders
warn: when diff_percent > 20
fail: when diff_percent > 40

Soda CLI

Running Checks

Terminal window
# Run all checks
soda scan
# Run specific checks
soda scan -c checks.yml
# Run checks with verbose output
soda scan -c checks.yml -v
# Run checks and upload to Soda Cloud
soda scan -c checks.yml -d soda_cloud
# Run checks for specific table
soda scan -c checks.yml -c dim_customers
# Dry run (show checks without running)
soda scan -c checks.yml --dry-run

Soda CLI Output

Terminal window
$ soda scan -c checks.yml
Soda scan time: 2025-01-27 10:00:00
Scan summary:
checks_for_dim_customers: 8 checks evaluated
checks_for_fact_orders: 6 checks evaluated
Total: 14 checks evaluated
Results by table:
dim_customers:
Customer count should not be zero (rows_count)
Schema should match expected schema (schema)
Data should be fresh (freshness)
Customer count should match source (row_count_comparison)
Age should be 18-100 (numeric_metrics)
Credit score should be 300-850 (numeric_metrics)
Email should not be null (missing_values)
Phone can be 10% missing (missing_values)
fact_orders:
Orders should be fresh (freshness)
Duplicate orders should be minimal (duplicate_count)
Amount should be positive (numeric_metrics)
Detect anomalies in order amounts (anomaly_detection)
Daily order volume should be consistent (volume_checks)
Summary:
Passing: 9
Failing: 3
Warning: 2
Skipped: 0

Soda Cloud

Monitoring Dashboard

# Cloud configuration
soda_cloud:
api_key_id: ${SODA_CLOUD_API_KEY}
api_key_secret: ${SODA_CLOUD_API_SECRET}
# Deploy checks to Soda Cloud
soda scan -c checks.yml -d soda_cloud

Soda Cloud Features:

  • Real-time monitoring: Automated checks run continuously
  • Alerts: Slack, email, PagerDuty integration
  • Dashboards: Visualize data quality metrics
  • Incident tracking: Track and investigate data quality incidents
  • Historical analysis: Track data quality trends over time

Soda Checks

Freshness Checks

checks for orders:
- freshness:
name: Orders should be fresh
warn: when > 1h after {{ latest_load() }}
fail: when > 6h after {{ latest_load() }}
filter: "order_date >= CURRENT_DATE - INTERVAL '7 days'"
- freshness:
name: Recent orders should be very fresh
warn: when > 5m after {{ latest_load() }}
fail: when > 15m after {{ latest_load() }}
filter: "order_date >= CURRENT_DATE"

Schema Checks

checks for customers:
- schema:
name: Schema should match
warn: when schema drift
fail: when schema violation
good_columns:
- customer_id
- customer_name
- customer_email
- customer_phone
- customer_segment
good_data_types:
customer_id: BIGINT
customer_name: VARCHAR
customer_email: VARCHAR

Row Count Checks

checks for customers:
- row_count:
name: Customer count should be stable
warn: when < 100000
fail: when < 50000
- row_count:
name: Customer count should match expectations
warn: when < 0.95 * 1000000 # 5% variance
fail: when < 0.90 * 1000000 # 10% variance

Numeric Value Checks

checks for products:
- numeric_metrics:
name: Product prices should be positive
columns:
unit_price:
- name: Price should be positive
valid_min: 0
warn: when 100 invalid
fail: when 1000 invalid
- numeric_metrics:
name: Product quantities should be reasonable
columns:
quantity_on_hand:
- name: Quantity should be 0-10000
valid_min: 0
valid_max: 10000
- numeric_metrics:
name: Revenue should be within expected range
columns:
total_revenue:
- name: Revenue should be reasonable
valid_min: 0
valid_max: 1000000
fail: when > 10 invalid

Missing Value Checks

checks for customers:
- missing_values:
name: Missing values check
columns:
customer_email:
- name: Email should never be null
fail: when > 0% missing
customer_phone:
- name: Phone can be 10% missing
warn: when > 10% missing
fail: when > 20% missing
customer_segment:
- name: Segment can be 5% missing
warn: when > 5% missing
fail: when > 15% missing

Duplicate Checks

checks for customers:
- duplicate_count:
name: Duplicate customers should be minimal
warn: when duplicate_count > 100
fail: when duplicate_count > 500
checks for orders:
- uniqueness:
name: Orders should be unique
column: order_id
warn: when duplicates > 0
fail: when duplicates > 10

Soda Alerts

Alert Configuration

alerts.yaml
api_key_id: ${SODA_CLOUD_API_KEY}
api_key_secret: ${SODA_CLOUD_API_SECRET}
alerts:
- name: slack_alert
type: slack
send_on: blocking
channel: "#data-quality"
- name: email_alert
type: email
send_on: blocking
recipients:
- data-engineering@my-company.com
- name: pagerduty_alert
type: pagerduty
send_on: blocking
api_key: ${PAGERDUTY_API_KEY}

Alert Conditions

checks.yml
checks for customers:
- rows_count:
name: Critical customer count
fail: when < 10000
alerts:
- slack_alert
- email_alert
- pagerduty_alert

Soda Best Practices

DO

# 1. Use meaningful check names
name: Customer count should not be zero # Good
name: Test 1 # Bad
# 2. Use appropriate thresholds
warn: when < 1000 # Reasonable
fail: when < 100 # Critical
# 3. Group checks by table/severity
checks_for_critical_tables:
- rows_count:
fail: when < 1000
# 4. Use filters for performance
filter: "created_at >= CURRENT_DATE - INTERVAL '7 days'"
# 5. Monitor Soda Cloud alerts
# Integrate with incident management

DON’T

# 1. Don't ignore failing checks
# Always investigate
# 2. Don't set too strict thresholds
# Avoid alert fatigue
# 3. Don't check everything every minute
# Performance impact
# 4. Don't ignore documentation
# Document check purpose
# 5. Don't forget to test Soda checks
# Ensure checks work as expected

Key Takeaways

  1. CLI: Run checks locally with soda scan
  2. Cloud: Monitor with Soda Cloud (dashboards, alerts)
  3. Checks: Rows count, freshness, schema, numeric, missing values, duplicates
  4. Alerts: Slack, email, PagerDuty integration
  5. CI/CD: Integrate Soda checks into CI/CD pipeline
  6. Documentation: Document check purpose and thresholds
  7. Monitoring: Track data quality trends over time
  8. Use When: Continuous data quality monitoring

Back to Module 4