# Data Mesh

*Decentralized Data Architecture*

## Overview
Data Mesh is a decentralized data architecture approach that treats data as a product, with domain-oriented ownership and self-serve infrastructure. It addresses the bottlenecks of centralized data platforms by distributing ownership across domain teams, which lets data delivery scale with the organization.
## Core Principles

### Principle 1: Domain Ownership

#### Centralized vs. Decentralized

In a centralized model, a single data team owns every pipeline and becomes a bottleneck as the organization grows. In a data mesh, each domain team owns its data end-to-end, as sketched below.
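A minimal illustration of the ownership difference; the domain and team names are taken from the examples later in this section:

```yaml
# Centralized: one team owns every domain's pipelines (illustrative)
central_data_team:
  owns: ["orders", "customers", "inventory", "marketing", "finance"]

# Decentralized: each domain team owns its own data products (illustrative)
domains:
  orders:    { owner: "Orders Team" }
  customers: { owner: "Customers Team" }
  inventory: { owner: "Inventory Team" }
```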
#### Domain Boundaries
**Example domains:**
- Orders Domain: Orders, order items, returns
- Customers Domain: Customer profiles, preferences
- Inventory Domain: Products, stock, warehouses
- Marketing Domain: Campaigns, attribution, leads
- Finance Domain: Transactions, billing, accounting
**Domain ownership criteria** (the sketch below shows these applied):

- Aligns with a bounded context (domain-driven design)
- The team that produces the data owns it
- Maps to a distinct business function
- Matches the existing team structure
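A hypothetical domain manifest applying these criteria to the orders domain; the field names are illustrative, not from any specific tool:

```yaml
# Illustrative domain definition for the orders domain
domain:
  name: "orders"
  bounded_context: "Order fulfillment"    # DDD bounded context
  owner_team: "Orders Team"               # the team that produces the data owns it
  business_function: "Sales operations"   # assumed mapping, for illustration
  data_products:
    - "orders"
    - "order_items"
    - "returns"
```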
### Principle 2: Data as a Product

#### Data Product Definition
```yaml
data_product:
  name: "orders"
  version: "1.0.0"
  owner: "Orders Team"
  description: "Complete order data for analytics"

  # Product metadata
  schema: "orders_schema.yml"
  quality_sla: "99% complete, 95% accurate"
  freshness_sla: "Available within 1 hour"
  retention: "3 years"

  # Portals
  documentation: "https://data.company.com/orders"
  support: "orders-data@company.com"
  slack: "#orders-data"

  # Access
  access_model: "public"  # public, restricted, confidential
  consumers:
    - "Analytics Team"
    - "Finance Team"
    - "Data Science Team"

  # Contracts
  data_contract: "orders_contract_v1.yml"
```

#### Data Product Portals
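A sketch of what a portal entry for this product might surface, assuming the portal is generated from the manifest above; the entry format is hypothetical:

```yaml
# Hypothetical portal entry, rendered from the product manifest
portal_entry:
  product: "orders"
  owner: "Orders Team"
  docs: "https://data.company.com/orders"
  schema_preview: true       # rendered from orders_schema.yml
  lineage_view: true         # upstream sources and downstream consumers
  sla_dashboard: true        # freshness and quality tracked against the stated SLAs
  request_access: "workflow" # requests routed to the owning team per the access_model
```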
### Principle 3: Self-Serve Infrastructure

#### Platform Team Responsibilities
```yaml
infrastructure_platform:
  provided_by: "Data Platform Team"

  services:
    - name: "Storage"
      description: "Object storage with an open table format"
      technology: "S3 + Delta Lake"
    - name: "Ingestion"
      description: "Streaming and batch ingestion"
      technology: "Kafka + Spark Streaming"
    - name: "Transformation"
      description: "SQL transformation compute"
      technology: "dbt + Spark"
    - name: "Quality"
      description: "Data testing framework"
      technology: "Great Expectations"
    - name: "Discovery"
      description: "Data catalog"
      technology: "DataHub / Glue Catalog"
    - name: "Observability"
      description: "Monitoring and alerting"
      technology: "Prometheus + Grafana"

  self_service:
    - "Spin up a data product"
    - "Run dbt transformations"
    - "Run data quality tests"
    - "Monitor product health"
    - "Manage access"
```

#### Self-Serve Portal
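A sketch of a self-serve request for the first capability on the list, spinning up a data product; the manifest format is hypothetical, assuming the portal accepts declarative requests:

```yaml
# Hypothetical self-serve request: provision a new data product
request:
  action: "create_data_product"
  domain: "orders"
  product: "returns"
  storage: "delta"                       # backed by the platform's S3 + Delta Lake service
  ingestion:
    mode: "streaming"
    source: "kafka://orders.returns.v1"  # illustrative topic name
  quality: "great_expectations_defaults" # platform-provided test suite
  monitoring: "enabled"                  # Prometheus metrics + Grafana dashboard
```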
### Principle 4: Federated Governance

#### Automated Governance
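One way to make governance automated is policy as code, checked by the platform when a product is published; a sketch with illustrative policy names:

```yaml
# Hypothetical global policies, enforced automatically at publish time
global_policies:
  - id: "pii-encryption"
    rule: "Columns tagged 'pii' must be encrypted at rest"
    enforcement: "block_publish"
  - id: "contract-required"
    rule: "Every data product must reference a data contract"
    enforcement: "block_publish"
  - id: "naming-standard"
    rule: "Tables follow the {domain}_{entity} convention"
    enforcement: "warn"
  - id: "catalog-registration"
    rule: "Every product must be registered in the shared catalog"
    enforcement: "block_publish"
```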
#### Governance Council

**Composition:**
- Representatives from each domain
- Data platform team
- Security & compliance
- Legal (privacy)
**Responsibilities:**
- Define global standards
- Approve tools and technologies
- Resolve cross-domain issues
- Maintain the shared data catalog
## Data Mesh vs. Data Warehouse vs. Data Lake
| Dimension | Data Warehouse | Data Lake | Data Mesh |
|---|---|---|---|
| Architecture | Centralized | Centralized | Decentralized |
| Ownership | Central team | Central team | Domain teams |
| Structure | Schema-on-write | Schema-on-read | Product-based |
| Quality | Centralized | Ad-hoc | Product-level |
| Scalability | Limited | High (but chaotic) | High |
| Time to value | Slow | Fast (but low quality) | Fast |
| Governance | Centralized | Minimal | Federated |
| Best for | Predictable use cases | Exploratory analysis | Scaling organizations |
## Implementation Patterns

### Pattern 1: Domain Isolation
```sql
-- Namespace by domain
orders_db.orders
orders_db.order_items
customers_db.customers
customers_db.customer_preferences
inventory_db.products
inventory_db.stock_levels

-- Catalog by domain
orders.catalog.orders
customers.catalog.customers
inventory.catalog.products
```

### Pattern 2: Product Standardization
```yaml
# All data products must have
standard_fields:
  - metadata:
      - ingested_at
      - source_system
      - source_row_count
      - processed_at
  - quality:
      - completeness_score
      - validity_score
      - last_test_run
  - documentation:
      - description
      - owner
      - sla
      - schema_url
  - monitoring:
      - freshness_metric
      - volume_metric
      - quality_metric
```

### Pattern 3: Cross-Domain Access
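A sketch of contract-backed access between domains; the grant format is hypothetical, but it reuses the contract file and catalog naming from earlier in this section:

```yaml
# Hypothetical cross-domain access grant: Finance reads the orders product
access_grant:
  product: "orders"                  # owned by the Orders domain
  consumer_domain: "finance"
  contract: "orders_contract_v1.yml" # consumers depend on the contract, not internals
  access: "read-only"
  interface: "orders.catalog.orders" # the published catalog name, never orders_db tables
  review: "quarterly"                # illustrative re-approval cadence
```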
## Data Product Lifecycle
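A minimal sketch of a typical data product lifecycle, assuming a publish-and-deprecate flow; the stage names are illustrative:

```yaml
# Illustrative lifecycle stages for a data product
lifecycle:
  - stage: "draft"       # schema and contract under development, no consumers yet
  - stage: "published"   # SLAs active, discoverable in the catalog
  - stage: "deprecated"  # consumers notified, migration window open
  - stage: "retired"     # access revoked, data archived per the retention policy
```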
## Cost Considerations

### Platform Costs
| Component | Cost Model | Optimization |
|---|---|---|
| Storage | Per domain | Chargeback to domains |
| Compute | Platform team | Showback to domains |
| Ingestion | Per volume | Usage-based pricing |
| Quality | Fixed platform cost | Included in platform |
| Discovery | Fixed platform cost | Included in platform |
### Chargeback Model
```yaml
chargeback:
  storage:
    model: "usage-based"
    rate: "$0.023 per GB/month"
    attribution: "domain"

  compute:
    model: "usage-based"
    rate: "$0.50 per TB processed"
    attribution: "domain"

  platform:
    model: "fixed"
    rate: "$10K/month"
    split: "proportional to data volume"
```

At these rates, a domain storing 10 TB (10,240 GB) pays about $236/month for storage, processing 20 TB costs it $10, and each domain also carries a share of the $10K platform fee proportional to its data volume.

## Migration Strategy
### Phase 1: Pilot (3 months)

### Phase 2: Expansion (6 months)

### Phase 3: Full Migration (6 months)
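A sketch of how the three phases might be scoped, assuming a pilot-and-expand rollout; domain counts and goals are assumptions, not prescriptions:

```yaml
# Illustrative scoping for the three migration phases
migration:
  phase_1_pilot:           # 3 months
    domains: "1-2 high-value domains"
    goal: "Prove the platform and the data-product model end to end"
  phase_2_expansion:       # 6 months
    domains: "next 3-5 domains"
    goal: "Onboard more teams, harden self-serve tooling and standards"
  phase_3_full_migration:  # 6 months
    domains: "all remaining domains"
    goal: "Migrate remaining pipelines and retire centrally owned ones"
```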
## Senior Level Considerations

### When to Use Data Mesh

**Good fit:**
- Large organizations (100+ data engineers)
- Multiple domains with independent teams
- Scaling challenges with centralized model
- Need for faster time-to-value
**Not a good fit:**
- Small organizations (< 20 data engineers)
- Limited data complexity
- Centralized model working well
- Limited resources for platform team
### Common Pitfalls

#### Pitfall 1: Data mesh without a platform team
```yaml
# Bad: just distribute ownership, no support
domains:
  - name: "orders"
    owner: "Orders Team"
    # No platform support
```

```yaml
# Good: platform team provides self-serve services
domains:
  - name: "orders"
    owner: "Orders Team"
    platform:
      - "Self-serve ingestion"
      - "Automated quality"
      - "Monitoring"
```

#### Pitfall 2: Data products without consumers
```yaml
# Bad: build products with no consumers
data_product:
  name: "orders"
  consumers: []  # No one is using it
```

```yaml
# Good: validate consumer need before building
data_product:
  name: "orders"
  consumers:
    - "Analytics: Confirmed"
    - "Finance: Confirmed"
  validation: "Consumer sign-off required"
```

#### Pitfall 3: No standardization
```yaml
# Bad: every domain does it differently
orders_domain:
  naming: "orders_order_items"
  format: "JSON"

customers_domain:
  naming: "cust_profile"
  format: "Parquet"
```

```yaml
# Good: standardized across domains
standard:
  naming: "{domain}_{entity}"
  format: "Delta Lake + Parquet"
  contracts: "Required"
```

## Key Takeaways
- Decentralized ownership: Domain teams own data products
- Data as product: SLAs, quality, documentation for each product
- Self-serve platform: Platform team provides infrastructure
- Federated governance: Automated policies, standards
- Not for everyone: Large organizations only (100+ data engineers)
- Investment required: Platform team needed (5-10 people)
- Standardization: Essential across domains
- Cultural shift: Biggest challenge, not technical
Back to Module 4