
Data Mesh

Decentralized Data Architecture


Overview

Data Mesh is a decentralized approach to data architecture that treats data as a product, with domain-oriented ownership and self-serve infrastructure. It addresses the bottlenecks of centralized data platforms by distributing ownership across domain teams, so data work scales with the organization instead of queuing behind a single central team.


Core Principles


Principle 1: Domain Ownership

Centralized vs. Decentralized

In a centralized model, one platform team owns ingestion, modeling, and serving for every domain, and becomes the bottleneck as the organization grows. In a decentralized model, each domain team owns its data end to end and publishes it as a product for others to consume.

Domain Boundaries

Boundaries follow the business, not the technology: each domain maps to a bounded context with a clearly identified owning team.

Example Domains:

  • Orders Domain: Orders, order items, returns
  • Customers Domain: Customer profiles, preferences
  • Inventory Domain: Products, stock, warehouses
  • Marketing Domain: Campaigns, attribution, leads
  • Finance Domain: Transactions, billing, accounting

Domain Ownership Criteria:

  • Bounded context (DDD)
  • Data ownership
  • Business function
  • Team structure
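As a sketch of how these criteria might be captured in practice, some organizations keep a central domain registry that records each domain's boundary, owning team, and products. The file name and keys below are illustrative assumptions, not part of any specific tool:

```yaml
# domains/registry.yaml -- hypothetical registry file, for illustration only
domains:
  - name: "orders"
    bounded_context: "Order lifecycle, from placement to return"
    owning_team: "Orders Team"
    data_products: ["orders", "order_items", "returns"]
  - name: "customers"
    bounded_context: "Customer identity and preferences"
    owning_team: "Customers Team"
    data_products: ["customers", "customer_preferences"]
```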

Principle 2: Data as a Product

Data Product Definition

data_products/orders.yaml

```yaml
data_product:
  name: "orders"
  version: "1.0.0"
  owner: "Orders Team"
  description: "Complete order data for analytics"

  # Product metadata
  schema: "orders_schema.yml"
  quality_sla: "99% complete, 95% accurate"
  freshness_sla: "Available within 1 hour"
  retention: "3 years"

  # Portals
  documentation: "https://data.company.com/orders"
  support: "orders-data@company.com"
  slack: "#orders-data"

  # Access
  access_model: "public"  # public, restricted, confidential
  consumers:
    - "Analytics Team"
    - "Finance Team"
    - "Data Science Team"

  # Contracts
  data_contract: "orders_contract_v1.yml"
```
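The product definition above references orders_contract_v1.yml. A data contract typically pins down the schema and the guarantees consumers can rely on; the sketch below is one plausible shape for that file, with every key assumed for illustration rather than taken from a specific framework:

```yaml
# orders_contract_v1.yml -- illustrative sketch of the referenced contract
contract:
  product: "orders"
  version: "1.0.0"
  schema:
    - name: order_id
      type: string
      nullable: false
    - name: customer_id
      type: string
      nullable: false
    - name: order_total
      type: "decimal(10,2)"
      nullable: false
  guarantees:
    completeness: ">= 99%"
    freshness: "<= 1 hour"
  breaking_changes: "New major version plus a deprecation window for consumers"
```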

Data Product Portals

Each product exposes the portals declared above: documentation, a support contact, and a chat channel, all discoverable through the data catalog.


Principle 3: Self-Serve Infrastructure

Platform Team Responsibilities

```yaml
infrastructure_platform:
  provided_by: "Data Platform Team"
  services:
    - name: "Storage"
      description: "Object storage with an open table format"
      technology: "S3 + Delta Lake"
    - name: "Ingestion"
      description: "Streaming and batch ingestion"
      technology: "Kafka + Spark Streaming"
    - name: "Transformation"
      description: "Managed transformation compute"
      technology: "dbt + Spark"
    - name: "Quality"
      description: "Data testing framework"
      technology: "Great Expectations"
    - name: "Discovery"
      description: "Data catalog"
      technology: "DataHub / Glue Catalog"
    - name: "Observability"
      description: "Monitoring and alerting"
      technology: "Prometheus + Grafana"
  self_service:
    - "Spin up a data product"
    - "Run dbt transformations"
    - "Run data quality tests"
    - "Monitor product health"
    - "Manage access"
```

Self-Serve Portal

Domain teams consume these services through a portal (or an equivalent CLI/API) so that creating, testing, and monitoring a product requires no ticket to the platform team.


Principle 4: Federated Governance

Automated Governance

Federated governance means a small set of global policies enforced automatically by the platform, while each domain decides everything else locally.
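A sketch of what such policy-as-code might look like; the file name and fields are assumed for illustration:

```yaml
# governance/global_policies.yaml -- illustrative policy-as-code sketch
policies:
  - id: "pii-encryption"
    rule: "Columns tagged pii must be encrypted at rest"
    enforcement: "automated"   # checked by the platform at deploy time
  - id: "product-metadata"
    rule: "Every data product must declare owner, SLA, and contract"
    enforcement: "automated"
  - id: "catalog-registration"
    rule: "Every product must be registered in the data catalog"
    enforcement: "automated"
```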

Governance Council

Composition:

  • Representatives from each domain
  • Data platform team
  • Security & compliance
  • Legal (privacy)

Responsibilities:

  • Define global standards
  • Approve tools and technologies
  • Resolve cross-domain issues
  • Maintain catalog

Data Mesh vs. Data Warehouse vs. Data Lake

| Dimension | Data Warehouse | Data Lake | Data Mesh |
|---|---|---|---|
| Architecture | Centralized | Centralized | Decentralized |
| Ownership | Central team | Central team | Domain teams |
| Structure | Schema-on-write | Schema-on-read | Product-based |
| Quality | Centralized | Ad hoc | Product-level |
| Scalability | Limited | High (but chaotic) | High |
| Time to value | Slow | Fast (but low quality) | Fast |
| Governance | Centralized | Minimal | Federated |
| Best for | Predictable use cases | Exploratory work | Scaling organizations |

Implementation Patterns

Pattern 1: Domain Isolation

```sql
-- Namespace by domain
orders_db.orders
orders_db.order_items
customers_db.customers
customers_db.customer_preferences
inventory_db.products
inventory_db.stock_levels

-- Catalog by domain
orders.catalog.orders
customers.catalog.customers
inventory.catalog.products
```
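A design note on this pattern: giving each domain its own database and catalog namespace means access grants, storage costs, and lineage all attribute cleanly to a single owning team, which is what makes the chargeback model later in this section practical to operate.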

Pattern 2: Product Standardization

```yaml
# All data products must expose these standard fields
standard_fields:
  metadata:
    - ingested_at
    - source_system
    - source_row_count
    - processed_at
  quality:
    - completeness_score
    - validity_score
    - last_test_run
  documentation:
    - description
    - owner
    - sla
    - schema_url
  monitoring:
    - freshness_metric
    - volume_metric
    - quality_metric
```

Pattern 3: Cross-Domain Access

Consumers in other domains access a product only through its published interface and data contract, never by reading the owning domain's internal tables.
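One way to express such a grant declaratively; all keys here are illustrative assumptions:

```yaml
# access/finance_to_orders.yaml -- hypothetical cross-domain grant
access_grant:
  data_product: "orders"               # owned by the orders domain
  consumer: "finance"
  interface: "orders.catalog.orders"   # published output port only
  contract: "orders_contract_v1.yml"   # consumer depends on this version
  expires: "2025-12-31"
  approved_by: "Orders Team"
```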


Data Product Lifecycle

A data product moves through distinct stages, each with an owner and an exit criterion.
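A minimal sketch of those stages; the state names are assumptions, not a standard:

```yaml
# Illustrative lifecycle states for a data product
lifecycle:
  - stage: "proposed"     # consumer need validated, design reviewed
  - stage: "development"  # built on the self-serve platform
  - stage: "published"    # registered in the catalog, SLAs active
  - stage: "evolving"     # new versions released under the contract
  - stage: "deprecated"   # consumers migrated, retention clock running
```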


Cost Considerations

Platform Costs

| Component | Cost Model | Optimization |
|---|---|---|
| Storage | Per domain | Chargeback to domains |
| Compute | Platform team | Showback to domains |
| Ingestion | Per volume | Usage-based pricing |
| Quality | Fixed platform cost | Included in platform fee |
| Discovery | Fixed platform cost | Included in platform fee |

Chargeback Model

```yaml
chargeback:
  storage:
    model: "usage-based"
    rate: "$0.023 per GB/month"
    attribution: "domain"
  compute:
    model: "usage-based"
    rate: "$0.50 per TB processed"
    attribution: "domain"
  platform:
    model: "fixed"
    rate: "$10K/month"
    split: "proportional to data volume"
```

Migration Strategy

Phase 1: Pilot (3 months)

Start with one or two well-bounded domains and motivated teams. Stand up the minimum platform capabilities and ship their first data products end to end.

Phase 2: Expansion (6 months)

Onboard additional domains using the patterns the pilot proved out. Harden the self-serve platform and formalize the standards and governance council.

Phase 3: Full Migration (6 months)

Migrate the remaining domains, decommission the centralized pipelines, and shift the platform team fully into a product-support role.


Senior Level Considerations

When to Use Data Mesh

Good fit:

  • Large organizations (100+ data engineers)
  • Multiple domains with independent teams
  • Scaling challenges with centralized model
  • Need for faster time-to-value

Not a good fit:

  • Small organizations (< 20 data engineers)
  • Limited data complexity
  • Centralized model working well
  • Limited resources for platform team

Common Pitfalls

Pitfall 1: Data mesh without platform team

```yaml
# Bad: just distribute ownership, with no support
domains:
  - name: "orders"
    owner: "Orders Team"
    # No platform support

# Good: platform team provides self-serve infrastructure
domains:
  - name: "orders"
    owner: "Orders Team"
    platform:
      - "Self-serve ingestion"
      - "Automated quality"
      - "Monitoring"
```

Pitfall 2: Data products without consumers

```yaml
# Bad: build products that have no consumers
data_product:
  name: "orders"
  consumers: []  # no one is using it

# Good: validate consumer need before building
data_product:
  name: "orders"
  consumers:
    - "Analytics: Confirmed"
    - "Finance: Confirmed"
  validation: "Consumer sign-off required"
```

Pitfall 3: No standardization

```yaml
# Bad: every domain does it differently
orders_domain:
  naming: "orders_order_items"
  format: "JSON"
customers_domain:
  naming: "cust_profile"
  format: "Parquet"

# Good: standards applied across all domains
standard:
  naming: "{domain}_{entity}"
  format: "Delta Lake + Parquet"
  contracts: "Required"
```

Key Takeaways

  1. Decentralized ownership: Domain teams own data products
  2. Data as product: SLAs, quality, documentation for each product
  3. Self-serve platform: Platform team provides infrastructure
  4. Federated governance: Automated policies, standards
  5. Not for everyone: Large organizations only (100+ data engineers)
  6. Investment required: Platform team needed (5-10 people)
  7. Standardization: Essential across domains
  8. Cultural shift: Biggest challenge, not technical
