Soda Data
Data Quality Monitoring for Data Platforms
Overview
Soda is an open-source data quality monitoring tool that integrates with modern data platforms. Soda provides checks, scans, and alerts to help ensure data quality throughout your pipeline.
Soda Architecture
Soda Installation
Installing Soda CLI
```bash
# Install Soda Core
pip install soda-core

# Install the package for your data source
pip install soda-core-postgres
pip install soda-core-snowflake
pip install soda-core-bigquery
pip install soda-core-spark-df

# Verify the installation
soda --version
```

Soda Cloud Setup

```bash
# Create a Soda Cloud account at https://cloud.soda.io/signup,
# then generate API keys in your Soda Cloud profile and add them
# to the soda_cloud section of configuration.yml
```

Soda Configuration
Configuration Files
```yaml
# configuration.yml
data_source data_warehouse:
  type: postgres
  host: localhost
  port: 5432
  username: data_user
  password: ${POSTGRES_PASSWORD}
  database: data_prod
  schema: analytics

data_source snowflake_edw:
  type: snowflake
  account: my-account
  username: data_user
  password: ${SNOWFLAKE_PASSWORD}
  warehouse: compute_wh
  database: data_prod
  schema: analytics
  role: data_engineer_role

data_source bigquery_lakehouse:
  type: bigquery
  project_id: my-project
  dataset: analytics
  account_info_json_path: /path/to/service-account.json
```
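The `${...}` placeholders above are resolved from environment variables at scan time, keeping credentials out of version control. The snippet below is only an illustration of that substitution pattern using Python's standard library (the variable name and value are examples, not Soda internals):

```python
import os
import os.path

# Pretend the secret is provided by the environment (e.g. CI secrets)
os.environ["SNOWFLAKE_PASSWORD"] = "s3cret"  # example value only

raw = "password: ${SNOWFLAKE_PASSWORD}"
resolved = os.path.expandvars(raw)  # expands ${VAR} from the environment
print(resolved)  # password: s3cret
```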
```yaml
# checks.yml — check definitions for data quality
checks for dim_customers:
  # Row count thresholds
  - row_count:
      name: Customer count should not be zero
      warn: when < 1000
      fail: when < 100

  # Schema checks
  - schema:
      name: Schema should match expected schema
      warn:
        when schema changes: any
      fail:
        when required column missing: [customer_id, customer_email]

  # Freshness (assumes an updated_at timestamp column)
  - freshness(updated_at):
      name: Data should be fresh
      warn: when > 24h
      fail: when > 48h

  # Cross check against the source data source
  - row_count same as customers in source_db:
      name: Customer count should match source

  # Validity checks
  - invalid_percent(age):
      name: Age should be 18-100
      valid min: 18
      valid max: 100
      fail: when > 0%
  - invalid_percent(credit_score):
      name: Credit score should be 300-850
      valid min: 300
      valid max: 850
      fail: when > 0%

  # Missing values
  - missing_percent(customer_email):
      name: Email should not be null
      warn: when > 1%
      fail: when > 5%
  - missing_percent(customer_phone):
      name: Phone can be 10% missing
      warn: when > 10%
      fail: when > 20%
```
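A threshold check like the one on customer_email reduces to simple arithmetic: compute the metric, then compare it against the warn and fail bounds. A minimal Python sketch of that evaluation (the function and default thresholds are illustrative, not Soda internals):

```python
def evaluate_missing(values, warn_pct=1.0, fail_pct=5.0):
    """Classify a column's missing-value rate as pass/warn/fail."""
    missing = sum(1 for v in values if v is None)
    pct = 100.0 * missing / len(values)
    if pct > fail_pct:
        return "fail", pct
    if pct > warn_pct:
        return "warn", pct
    return "pass", pct

emails = ["a@x.com", None, "b@x.com", "c@x.com"]  # 1 of 4 missing
print(evaluate_missing(emails))  # ('fail', 25.0)
```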
```yaml
checks for fact_orders:
  # Freshness (assumes a loaded_at timestamp column)
  - freshness(loaded_at):
      name: Orders should be fresh
      warn: when > 1h
      fail: when > 6h

  # Duplicates
  - duplicate_count(order_id):
      name: Duplicate orders should be minimal
      warn: when > 10
      fail: when > 100

  # Validity
  - invalid_percent(order_amount):
      name: Amount should be positive
      valid min: 0
      warn: when > 5%
      fail: when > 10%

  # Anomaly detection (requires a Soda Cloud connection)
  - anomaly score for avg(order_amount) < default:
      name: Detect anomalies in order amounts

  # Volume change over time (requires a Soda Cloud connection)
  - change percent for row_count:
      name: Daily order volume should be consistent
      warn: when > 20
      fail: when > 40
```

Soda CLI
Running Checks
```bash
# Run a scan: -d names the data source, -c points to the
# configuration file, and the last argument is the checks file
soda scan -d data_warehouse -c configuration.yml checks.yml

# Verbose output
soda scan -d data_warehouse -c configuration.yml -V checks.yml

# Pass variables into checks at scan time
soda scan -d data_warehouse -c configuration.yml -v date=2025-01-27 checks.yml

# Run only the checks for a specific table (one checks file per table)
soda scan -d data_warehouse -c configuration.yml checks/dim_customers.yml

# Test the data source connection without running any checks
soda test-connection -d data_warehouse -c configuration.yml
```

Results are pushed to Soda Cloud automatically when a soda_cloud section is present in configuration.yml.

Soda CLI Output
```
$ soda scan -d data_warehouse -c configuration.yml checks.yml

Soda scan time: 2025-01-27 10:00:00

Scan summary:
  checks for dim_customers: 8 checks evaluated
  checks for fact_orders:   5 checks evaluated
  Total:                   13 checks evaluated

Results by table:
  dim_customers:
    ✓ Customer count should not be zero (row_count)
    ✓ Schema should match expected schema (schema)
    ✓ Data should be fresh (freshness)
    ✗ Customer count should match source (cross check)
    ✓ Age should be 18-100 (invalid_percent)
    ✓ Credit score should be 300-850 (invalid_percent)
    ✗ Email should not be null (missing_percent)
    ✓ Phone can be 10% missing (missing_percent)

  fact_orders:
    ✓ Orders should be fresh (freshness)
    ✓ Duplicate orders should be minimal (duplicate_count)
    ✗ Amount should be positive (invalid_percent)
    ✓ Detect anomalies in order amounts (anomaly score)
    ✗ Daily order volume should be consistent (change percent)

Summary:
  Passing: 9
  Failing: 4
  Skipped: 0
```

Soda Cloud
Monitoring Dashboard
```yaml
# configuration.yml — Soda Cloud connection
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_CLOUD_API_KEY}
  api_key_secret: ${SODA_CLOUD_API_SECRET}
```

```bash
# With soda_cloud configured, every scan pushes its results to Soda Cloud
soda scan -d data_warehouse -c configuration.yml checks.yml
```

Soda Cloud Features:
- Real-time monitoring: Automated checks run continuously
- Alerts: Slack, email, PagerDuty integration
- Dashboards: Visualize data quality metrics
- Incident tracking: Track and investigate data quality incidents
- Historical analysis: Track data quality trends over time
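Scan results can also be inspected programmatically, for example to gate a pipeline in CI before downstream jobs run. The dictionary below is a simplified stand-in for a scan-results payload, used only to show the gating pattern (the keys are illustrative, not Soda's exact schema):

```python
def summarize(scan_results):
    """Count outcomes in a (simplified) scan-results payload."""
    counts = {"pass": 0, "warn": 0, "fail": 0}
    for check in scan_results["checks"]:
        counts[check["outcome"]] += 1
    return counts

def should_block_pipeline(scan_results):
    """Block downstream jobs if any check failed."""
    return summarize(scan_results)["fail"] > 0

results = {
    "checks": [
        {"name": "Customer count should not be zero", "outcome": "pass"},
        {"name": "Email should not be null", "outcome": "fail"},
        {"name": "Phone can be 10% missing", "outcome": "warn"},
    ]
}
print(summarize(results))              # {'pass': 1, 'warn': 1, 'fail': 1}
print(should_block_pipeline(results))  # True
```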
Soda Checks
Freshness Checks
```yaml
checks for orders:
  # Assumes a loaded_at timestamp column
  - freshness(loaded_at):
      name: Orders should be fresh
      warn: when > 1h
      fail: when > 6h
      filter: order_date >= CURRENT_DATE - INTERVAL '7 days'

  - freshness(loaded_at):
      name: Recent orders should be very fresh
      warn: when > 5m
      fail: when > 15m
      filter: order_date >= CURRENT_DATE
```

Schema Checks
```yaml
checks for customers:
  - schema:
      name: Schema should match
      warn:
        when schema changes: any
      fail:
        when required column missing:
          - customer_id
          - customer_name
          - customer_email
          - customer_phone
          - customer_segment
        when wrong column type:
          customer_id: bigint
          customer_name: varchar
          customer_email: varchar
```

Row Count Checks
```yaml
checks for customers:
  - row_count:
      name: Customer count should be stable
      warn: when < 100000
      fail: when < 50000

  - row_count:
      name: Customer count should match expectations
      warn: when < 950000   # 5% below the expected 1,000,000
      fail: when < 900000   # 10% below
```

Numeric Value Checks
```yaml
checks for products:
  - invalid_count(unit_price):
      name: Price should be positive
      valid min: 0
      warn: when > 100
      fail: when > 1000

  - invalid_count(quantity_on_hand):
      name: Quantity should be 0-10000
      valid min: 0
      valid max: 10000
      fail: when > 0

  - invalid_count(total_revenue):
      name: Revenue should be within expected range
      valid min: 0
      valid max: 1000000
      fail: when > 10
```

Missing Value Checks
```yaml
checks for customers:
  - missing_count(customer_email) = 0:
      name: Email should never be null

  - missing_percent(customer_phone):
      name: Phone can be 10% missing
      warn: when > 10%
      fail: when > 20%

  - missing_percent(customer_segment):
      name: Segment can be 5% missing
      warn: when > 5%
      fail: when > 15%
```

Duplicate Checks
```yaml
checks for customers:
  - duplicate_count(customer_id):
      name: Duplicate customers should be minimal
      warn: when > 100
      fail: when > 500
```
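What a duplicate-count threshold measures can be sketched in a few lines of Python. Note this is one common way to count duplicates (rows carrying a value that also appears elsewhere); Soda's generated SQL may define it slightly differently:

```python
from collections import Counter

def duplicate_count(values):
    """Number of rows whose value appears more than once in the column."""
    counts = Counter(values)
    return sum(c for c in counts.values() if c > 1)

ids = [1, 2, 2, 3, 3, 3, 4]
print(duplicate_count(ids))  # 5: rows holding 2, 2, 3, 3, 3
```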
```yaml
checks for orders:
  - duplicate_count(order_id):
      name: Orders should be unique
      warn: when > 0
      fail: when > 10
```

Soda Alerts
Alert Configuration
Notification routing lives in Soda Cloud, not in the checks YAML: connect the CLI with API keys, then configure integrations and notification rules in the Soda Cloud UI.

```yaml
# configuration.yml — connect scans to Soda Cloud
soda_cloud:
  host: cloud.soda.io
  api_key_id: ${SODA_CLOUD_API_KEY}
  api_key_secret: ${SODA_CLOUD_API_SECRET}
```

A notification rule selects checks (for example, by dataset or severity) and routes their results to one or more destinations:

- Slack: post failures to a channel such as #data-quality
- Email: notify data-engineering@my-company.com
- PagerDuty: page the on-call engineer via a webhook integration

Alert Conditions
```yaml
checks for customers:
  - row_count:
      name: Critical customer count
      fail: when < 10000
```

Route this check's failures to Slack, email, and PagerDuty by targeting it in a Soda Cloud notification rule.

Soda Best Practices
DO
```yaml
# 1. Use meaningful check names
name: Customer count should not be zero   # good
name: Test 1                              # bad

# 2. Use appropriate thresholds
warn: when < 1000   # reasonable
fail: when < 100    # critical

# 3. Group checks by table and severity
checks for critical_tables:
  - row_count:
      fail: when < 1000

# 4. Use filters to limit scan cost
filter: created_at >= CURRENT_DATE - INTERVAL '7 days'

# 5. Monitor Soda Cloud alerts
# Integrate them with incident management
```

DON'T
```yaml
# 1. Don't ignore failing checks; always investigate
# 2. Don't set overly strict thresholds; avoid alert fatigue
# 3. Don't run every check every minute; scans query the warehouse
# 4. Don't skip documentation; record each check's purpose
# 5. Don't forget to test checks; make sure they fail when they should
```

Key Takeaways
- CLI: Run checks locally with soda scan
- Cloud: Monitor with Soda Cloud (dashboards, alerts)
- Checks: Row count, freshness, schema, numeric validity, missing values, duplicates
- Alerts: Slack, email, PagerDuty integration
- CI/CD: Integrate Soda checks into CI/CD pipeline
- Documentation: Document check purpose and thresholds
- Monitoring: Track data quality trends over time
- Use When: Continuous data quality monitoring
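For the CI/CD takeaway above, the integration can be as small as running the scan and gating on its exit status. Exit codes vary by Soda version, so this sketch conservatively treats any nonzero status (check failures or scan errors) as blocking; the file names follow the examples on this page:

```python
import subprocess

def run_gate(cmd):
    """Run a command and report whether downstream jobs may proceed.
    Treats any nonzero exit status as a blocking failure."""
    result = subprocess.run(cmd)
    return result.returncode == 0

# In a CI step, gate on the scan itself:
# run_gate(["soda", "scan", "-d", "data_warehouse",
#           "-c", "configuration.yml", "checks.yml"])
```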
Back to Module 4