Soda Data
Data Quality Monitoring for Data Platforms
Overview
Soda is an open-source data quality monitoring tool that integrates with modern data platforms. Soda provides checks, scans, and alerts to help ensure data quality throughout your pipeline.
Soda Architecture
Soda Installation
Installing Soda CLI
```bash
# Install Soda Core
pip install soda-core

# Install the package for your data source
pip install soda-core-postgres
pip install soda-core-snowflake
pip install soda-core-bigquery
pip install soda-core-spark-df

# Verify the installation
soda --version
```

Soda Cloud Setup

```bash
# Create a Soda Cloud account at https://cloud.soda.io/signup,
# then generate API keys in your Soda Cloud profile and add them
# to the soda_cloud section of configuration.yml
```

Soda Configuration
Configuration Files
```yaml
# configuration.yml
data_source data_warehouse:
  type: postgres
  host: localhost
  port: 5432
  username: data_user
  password: ${POSTGRES_PASSWORD}
  database: data_prod
  schema: analytics

data_source snowflake_edw:
  type: snowflake
  account: my-account
  username: data_user
  password: ${SNOWFLAKE_PASSWORD}
  warehouse: compute_wh
  database: data_prod
  schema: analytics
  role: data_engineer_role

data_source bigquery_lakehouse:
  type: bigquery
  project_id: my-project
  dataset: analytics
  account_info_json_path: /path/to/service-account.json
```
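The `${...}` placeholders above are resolved from environment variables at scan time, keeping credentials out of version control. The snippet below is only an illustration of that substitution pattern using Python's standard library (the variable name and value are examples, not Soda internals):

```python
import os
import os.path

# Pretend the secret is provided by the environment (e.g. CI secrets)
os.environ["SNOWFLAKE_PASSWORD"] = "s3cret"  # example value only

raw = "password: ${SNOWFLAKE_PASSWORD}"
resolved = os.path.expandvars(raw)  # expands ${VAR} from the environment
print(resolved)  # password: s3cret
```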
```yaml
# checks.yml — check definitions for data quality
checks for dim_customers:
  # Row count thresholds
  - row_count:
      name: Customer count should not be zero
      warn: when < 1000
      fail: when < 100

  # Schema checks
  - schema:
      name: Schema should match expected schema
      warn:
        when schema changes: any
      fail:
        when required column missing: [customer_id, customer_email]

  # Freshness (assumes an updated_at timestamp column)
  - freshness(updated_at):
      name: Data should be fresh
      warn: when > 24h
      fail: when > 48h

  # Cross check against the source data source
  - row_count same as customers in source_db:
      name: Customer count should match source

  # Validity checks
  - invalid_percent(age):
      name: Age should be 18-100
      valid min: 18
      valid max: 100
      fail: when > 0%
  - invalid_percent(credit_score):
      name: Credit score should be 300-850
      valid min: 300
      valid max: 850
      fail: when > 0%

  # Missing values
  - missing_percent(customer_email):
      name: Email should not be null
      warn: when > 1%
      fail: when > 5%
  - missing_percent(customer_phone):
      name: Phone can be 10% missing
      warn: when > 10%
      fail: when > 20%
```
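A threshold check like the one on customer_email reduces to simple arithmetic: compute the metric, then compare it against the warn and fail bounds. A minimal Python sketch of that evaluation (the function and default thresholds are illustrative, not Soda internals):

```python
def evaluate_missing(values, warn_pct=1.0, fail_pct=5.0):
    """Classify a column's missing-value rate as pass/warn/fail."""
    missing = sum(1 for v in values if v is None)
    pct = 100.0 * missing / len(values)
    if pct > fail_pct:
        return "fail", pct
    if pct > warn_pct:
        return "warn", pct
    return "pass", pct

emails = ["a@x.com", None, "b@x.com", "c@x.com"]  # 1 of 4 missing
print(evaluate_missing(emails))  # ('fail', 25.0)
```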
```yaml
checks for fact_orders:
  # Freshness (assumes a loaded_at timestamp column)
  - freshness(loaded_at):
      name: Orders should be fresh
      warn: when > 1h
      fail: when > 6h

  # Duplicates
  - duplicate_count(order_id):
      name: Duplicate orders should be minimal
      warn: when > 10
      fail: when > 100

  # Validity
  - invalid_percent(order_amount):
      name: Amount should be positive
      valid min: 0
      warn: when > 5%
      fail: when > 10%

  # Anomaly detection (requires a Soda Cloud connection)
  - anomaly score for avg(order_amount) < default:
      name: Detect anomalies in order amounts

  # Volume change over time (requires a Soda Cloud connection)
  - change percent for row_count:
      name: Daily order volume should be consistent
      warn: when > 20
      fail: when > 40
```

Soda CLI
Running Checks
```bash
# Run a scan: -d names the data source, -c points to the
# configuration file, and the last argument is the checks file
soda scan -d data_warehouse -c configuration.yml checks.yml

# Verbose output
soda scan -d data_warehouse -c configuration.yml -V checks.yml

# Pass variables into checks at scan time
soda scan -d data_warehouse -c configuration.yml -v date=2025-01-27 checks.yml

# Run only the checks for a specific table (one checks file per table)
soda scan -d data_warehouse -c configuration.yml checks/dim_customers.yml

# Test the data source connection without running any checks
soda test-connection -d data_warehouse -c configuration.yml
```

Results are pushed to Soda Cloud automatically when a soda_cloud section is present in configuration.yml.

Soda CLI Output
```
$ soda scan -d data_warehouse -c configuration.yml checks.yml

Soda scan time: 2025-01-27 10:00:00

Scan summary:
  checks for dim_customers: 8 checks evaluated
  checks for fact_orders:   5 checks evaluated
  Total:                   13 checks evaluated

Results by table:
  dim_customers:
    ✓ Customer count should not be zero (row_count)
    ✓ Schema should match expected schema (schema)
    ✓ Data should be fresh (freshness)
    ✗ Customer count should match source (cross check)
    ✓ Age should be 18-100 (invalid_percent)
    ✓ Credit score should be 300-850 (invalid_percent)
    ✗ Email should not be null (missing_percent)
    ✓ Phone can be 10% missing (missing_percent)

  fact_orders:
    ✓ Orders should be fresh (freshness)
    ✓ Duplicate orders should be minimal (duplicate_count)
    ✗ Amount should be positive (invalid_percent)
    ✓ Detect anomalies in order amounts (anomaly score)
    ✗ Daily order volume should be consistent (change percent)

Summary:
  Passing: 9
  Failing: 4
  Skipped: 0
```

Soda Cloud
Monitoring Dashboard
```yaml
# configuration.yml — Soda Cloud connection
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_CLOUD_API_KEY}
  api_key_secret: ${SODA_CLOUD_API_SECRET}
```

```bash
# With soda_cloud configured, every scan pushes its results to Soda Cloud
soda scan -d data_warehouse -c configuration.yml checks.yml
```

Soda Cloud Features:
- Real-time monitoring: Automated checks run continuously
- Alerts: Slack, email, PagerDuty integration
- Dashboards: Visualize data quality metrics
- Incident tracking: Track and investigate data quality incidents
- Historical analysis: Track data quality trends over time
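Scan results can also be inspected programmatically, for example to gate a pipeline in CI before downstream jobs run. The dictionary below is a simplified stand-in for a scan-results payload, used only to show the gating pattern (the keys are illustrative, not Soda's exact schema):

```python
def summarize(scan_results):
    """Count outcomes in a (simplified) scan-results payload."""
    counts = {"pass": 0, "warn": 0, "fail": 0}
    for check in scan_results["checks"]:
        counts[check["outcome"]] += 1
    return counts

def should_block_pipeline(scan_results):
    """Block downstream jobs if any check failed."""
    return summarize(scan_results)["fail"] > 0

results = {
    "checks": [
        {"name": "Customer count should not be zero", "outcome": "pass"},
        {"name": "Email should not be null", "outcome": "fail"},
        {"name": "Phone can be 10% missing", "outcome": "warn"},
    ]
}
print(summarize(results))              # {'pass': 1, 'warn': 1, 'fail': 1}
print(should_block_pipeline(results))  # True
```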
Soda Checks
Freshness Checks
```yaml
checks for orders:
  # Assumes a loaded_at timestamp column
  - freshness(loaded_at):
      name: Orders should be fresh
      warn: when > 1h
      fail: when > 6h
      filter: order_date >= CURRENT_DATE - INTERVAL '7 days'

  - freshness(loaded_at):
      name: Recent orders should be very fresh
      warn: when > 5m
      fail: when > 15m
      filter: order_date >= CURRENT_DATE
```

Schema Checks
```yaml
checks for customers:
  - schema:
      name: Schema should match
      warn:
        when schema changes: any
      fail:
        when required column missing:
          - customer_id
          - customer_name
          - customer_email
          - customer_phone
          - customer_segment
        when wrong column type:
          customer_id: bigint
          customer_name: varchar
          customer_email: varchar
```

Row Count Checks
```yaml
checks for customers:
  - row_count:
      name: Customer count should be stable
      warn: when < 100000
      fail: when < 50000

  - row_count:
      name: Customer count should match expectations
      warn: when < 950000   # 5% below the expected 1,000,000
      fail: when < 900000   # 10% below
```

Numeric Value Checks
```yaml
checks for products:
  - invalid_count(unit_price):
      name: Price should be positive
      valid min: 0
      warn: when > 100
      fail: when > 1000

  - invalid_count(quantity_on_hand):
      name: Quantity should be 0-10000
      valid min: 0
      valid max: 10000
      fail: when > 0

  - invalid_count(total_revenue):
      name: Revenue should be within expected range
      valid min: 0
      valid max: 1000000
      fail: when > 10
```

Missing Value Checks
```yaml
checks for customers:
  - missing_count(customer_email) = 0:
      name: Email should never be null

  - missing_percent(customer_phone):
      name: Phone can be 10% missing
      warn: when > 10%
      fail: when > 20%

  - missing_percent(customer_segment):
      name: Segment can be 5% missing
      warn: when > 5%
      fail: when > 15%
```

Duplicate Checks
```yaml
checks for customers:
  - duplicate_count(customer_id):
      name: Duplicate customers should be minimal
      warn: when > 100
      fail: when > 500
```
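What a duplicate-count threshold measures can be sketched in a few lines of Python. Note this is one common way to count duplicates (rows carrying a value that also appears elsewhere); Soda's generated SQL may define it slightly differently:

```python
from collections import Counter

def duplicate_count(values):
    """Number of rows whose value appears more than once in the column."""
    counts = Counter(values)
    return sum(c for c in counts.values() if c > 1)

ids = [1, 2, 2, 3, 3, 3, 4]
print(duplicate_count(ids))  # 5: rows holding 2, 2, 3, 3, 3
```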
```yaml
checks for orders:
  - duplicate_count(order_id):
      name: Orders should be unique
      warn: when > 0
      fail: when > 10
```

Soda Alerts
Alert Configuration
Notification routing lives in Soda Cloud, not in the checks YAML: connect the CLI with API keys, then configure integrations and notification rules in the Soda Cloud UI.

```yaml
# configuration.yml — connect scans to Soda Cloud
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_CLOUD_API_KEY}
  api_key_secret: ${SODA_CLOUD_API_SECRET}
```

A notification rule selects checks (for example, by dataset or severity) and routes their results to one or more destinations:

- Slack: post failures to a channel such as #data-quality
- Email: notify data-engineering@my-company.com
- PagerDuty: page the on-call engineer via a webhook integration

Alert Conditions
```yaml
checks for customers:
  - row_count:
      name: Critical customer count
      fail: when < 10000
```

Route this check's failures to Slack, email, and PagerDuty by targeting it in a Soda Cloud notification rule.

Soda Best Practices
DO
```yaml
# 1. Use meaningful check names
name: Customer count should not be zero   # good
name: Test 1                              # bad

# 2. Use appropriate thresholds
warn: when < 1000   # reasonable
fail: when < 100    # critical

# 3. Group checks by table and severity
checks for critical_tables:
  - row_count:
      fail: when < 1000

# 4. Use filters to limit scan cost
filter: created_at >= CURRENT_DATE - INTERVAL '7 days'

# 5. Monitor Soda Cloud alerts
# Integrate them with incident management
```

DON'T
```yaml
# 1. Don't ignore failing checks; always investigate
# 2. Don't set overly strict thresholds; avoid alert fatigue
# 3. Don't run every check every minute; scans query the warehouse
# 4. Don't skip documentation; record each check's purpose
# 5. Don't forget to test checks; make sure they fail when they should
```

Key Takeaways
- CLI: Run checks locally with soda scan
- Cloud: Monitor with Soda Cloud (dashboards, alerts)
- Checks: Row count, freshness, schema, numeric validity, missing values, duplicates
- Alerts: Slack, email, PagerDuty integration
- CI/CD: Integrate Soda checks into CI/CD pipeline
- Documentation: Document check purpose and thresholds
- Monitoring: Track data quality trends over time
- Use When: Continuous data quality monitoring
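For the CI/CD takeaway above, the integration can be as small as running the scan and gating on its exit status. Exit codes vary by Soda version, so this sketch conservatively treats any nonzero status (check failures or scan errors) as blocking; the file names follow the examples on this page:

```python
import subprocess

def run_gate(cmd):
    """Run a command and report whether downstream jobs may proceed.
    Treats any nonzero exit status as a blocking failure."""
    result = subprocess.run(cmd)
    return result.returncode == 0

# In a CI step, gate on the scan itself:
# run_gate(["soda", "scan", "-d", "data_warehouse",
#           "-c", "configuration.yml", "checks.yml"])
```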
Back to Module 4