Data Fabric
Unified Data Architecture
Overview
Data Fabric is a unified architecture that provides seamless data access, integration, and management across distributed data environments. It combines metadata, integration, and governance to create a “fabric” that connects all data assets.
Data Fabric Architecture
Key Components:
- Metadata Engine: Data catalog, lineage, glossary, semantic modeling
- Integration Layer: Virtualization, replication, synchronization
- Data Services: Data access, preparation, orchestration
- Security & Governance: Access control, audit logging, policy enforcement
Data Fabric Patterns
Data Virtualization
Virtualization Benefits:
- Query without moving data: Federated queries across sources
- Real-time access: No ETL latency
- Cost optimization: No duplication
- Flexibility: Add new sources easily
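The "query without moving data" idea can be sketched in miniature with SQLite's `ATTACH`, which lets one SQL statement join across two separate database files. This is an illustrative stand-in only: the file names, schemas, and data below are hypothetical, and a real fabric would federate Postgres, S3, Salesforce, etc. through an engine such as Trino or Denodo.

```python
import os
import sqlite3
import tempfile

# Two independent "sources": a CRM store and an orders store.
workdir = tempfile.mkdtemp()
crm_path = os.path.join(workdir, "crm.db")
orders_path = os.path.join(workdir, "orders.db")

crm = sqlite3.connect(crm_path)
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])
crm.commit()
crm.close()

conn = sqlite3.connect(orders_path)
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (1, 5.0), (2, 7.5)])

# Federated query: ATTACH makes the second source visible to a single
# SQL statement, so the join runs without copying data into one store.
conn.execute(f"ATTACH DATABASE '{crm_path}' AS crm")
rows = conn.execute("""
    SELECT c.name, SUM(o.amount) AS total_spend
    FROM crm.customers c
    JOIN orders o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(rows)  # → [('Ada', 15.0), ('Lin', 7.5)]
```

The same shape appears in the federation engines below: each source keeps its data, and the engine resolves the join at query time.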
Data Fabric Technologies
Denodo Virtualization
-- Denodo: Data virtualization platform
-- Virtual layer: Union multiple sourcesCREATE VIEW v_customer_360 ASSELECT c.customer_id, c.customer_name, c.customer_email, o.order_count, o.total_spend, s.support_ticketsFROM ( -- Warehouse: Customer data SELECT customer_id, customer_name, customer_email FROM postgres_dwh.public.customers
UNION ALL
-- Data Lake: Customer interactions SELECT customer_id, customer_name, customer_email FROM s3_data_lake.customers
UNION ALL
-- CRM: Customer accounts SELECT account_id AS customer_id, account_name AS customer_name, email AS customer_email FROM salesforce.accounts) cLEFT JOIN ( -- Orders summary from warehouse SELECT customer_id, COUNT(*) AS order_count, SUM(total_amount) AS total_spend FROM postgres_dwh.public.orders GROUP BY customer_id) o ON c.customer_id = o.customer_idLEFT JOIN ( -- Support tickets from CRM SELECT customer_id, COUNT(*) AS support_tickets FROM salesforce.support_tickets GROUP BY customer_id) s ON c.customer_id = s.customer_id;Trino/Presto Federation
```sql
-- Trino: distributed SQL engine for federation
-- Dynamic catalog configuration (Trino syntax uses WITH, not PROPERTIES;
-- connector names and properties vary by version and deployment)
CREATE CATALOG postgres USING postgresql
WITH ("connection-url" = 'jdbc:postgresql://postgres:5432/dwh');

CREATE CATALOG hive USING hive
WITH ("hive.metastore.uri" = 'thrift://hive-metastore:9083');

CREATE CATALOG delta USING delta_lake
WITH ("hive.metastore.uri" = 'thrift://hive-metastore:9083');

-- Federated query across catalogs
SELECT c.customer_id, c.customer_name, o.order_id, o.order_date
FROM postgres.public.customers c
JOIN hive.analytics.orders o ON c.customer_id = o.customer_id
WHERE c.created_at >= CURRENT_DATE - INTERVAL '7' DAY;
```

Data Fabric Features
Data Catalog
Data Catalog Features:
- Asset discovery: Automatic metadata extraction
- Schema metadata: Tables, columns, data types
- Business metadata: Owner, steward, description
- Tags and labels: PII, sensitive, tier classifications
- Data lineage: End-to-end data flow tracking
- Glossary: Business definitions and terms
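The first feature above, automatic asset discovery, boils down to crawling a source and emitting schema metadata into the catalog. A minimal sketch, using SQLite's introspection as a stand-in for the source-specific crawlers that tools like DataHub or Alation ship (the `discover_assets` helper and the table schema are hypothetical):

```python
import sqlite3

def discover_assets(conn):
    """Crawl a database and emit catalog entries: table -> {column: type}.

    Sketch of automated asset discovery; a real catalog crawler would
    also capture owners, tags, and row counts per asset.
    """
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'").fetchall()
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
        catalog[table] = {c[1]: c[2] for c in cols}
    return catalog

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, email TEXT)")
print(discover_assets(conn))
# → {'customers': {'customer_id': 'INTEGER', 'email': 'TEXT'}}
```

Running the crawler on a schedule keeps the catalog in sync as new tables appear in the sources.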
Data Lineage
Lineage Features:
- Column-level: Track column transformations
- End-to-end: Source to consumption
- Impact analysis: What breaks if a table changes?
- Data dependencies: Upstream and downstream
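Impact analysis over lineage is just a reachability query on the dependency graph: everything downstream of a table is at risk when that table changes. A minimal sketch with a hypothetical table-level lineage graph (real tools such as OpenLineage or Marquez store the same edges, often at column granularity):

```python
from collections import deque

# Hypothetical lineage graph: edges point downstream (source -> consumers).
lineage = {
    "raw.orders":        ["staging.orders"],
    "staging.orders":    ["marts.order_summary", "marts.customer_360"],
    "raw.customers":     ["marts.customer_360"],
    "marts.customer_360": ["dashboards.revenue"],
}

def impact(table):
    """Return every downstream asset that breaks if `table` changes (BFS)."""
    seen, queue = set(), deque([table])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(impact("raw.orders"))
# → ['dashboards.revenue', 'marts.customer_360', 'marts.order_summary', 'staging.orders']
```

Reversing the edge direction answers the upstream question instead: which sources feed a given dashboard.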
Data Synchronization
Sync Patterns:
- CDC (Change Data Capture): Real-time sync from databases
- Batch sync: Scheduled sync (hourly, daily)
- Event-driven: Sync on event triggers
- Bidirectional: Sync between multiple targets
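The batch-sync pattern above is usually implemented incrementally: keep a watermark (e.g. the highest `updated_at` already copied) and pull only rows changed since then, which is the core of how replication tools like Fivetran and Airbyte avoid full reloads. A minimal sketch with hypothetical in-memory tables and a hypothetical `sync` helper:

```python
import sqlite3

# Source and target stand-ins for illustration.
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")
src.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Ada", "2024-01-01"),
    (2, "Lin", "2024-03-15"),
])

dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT)")

def sync(watermark):
    """Copy rows updated after `watermark`, upsert into the target,
    and return the new watermark for the next run."""
    rows = src.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (watermark,)).fetchall()
    dst.executemany(
        "INSERT INTO customers VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET name = excluded.name, "
        "updated_at = excluded.updated_at",
        rows)
    return max((r[2] for r in rows), default=watermark)

wm = sync("2024-01-31")  # only Lin's row is newer than the watermark
```

True CDC replaces the watermark query with the database's change log (e.g. Postgres logical replication), which also captures deletes that a timestamp scan misses.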
Data Fabric Implementation
Technology Stack
| Component | Technology | Use Case |
|---|---|---|
| Virtualization | Denodo, Redshift Spectrum, Trino | Federated queries |
| Catalog | Alation, Collibra, DataHub | Metadata management |
| Lineage | OpenLineage, Marquez, DataHub | End-to-end tracking |
| Sync | Fivetran, Airbyte, Matillion | Data replication |
| Governance | Collibra, Alation, Purview | Access control, policies |
Data Fabric vs. Data Mesh
| Dimension | Data Fabric | Data Mesh |
|---|---|---|
| Architecture | Unified, centralized | Decentralized, domain-oriented |
| Ownership | Centralized data team | Domain ownership |
| Access | Virtualization (no copying) | Data products (copied) |
| Governance | Centralized | Federated |
| Best for | Enterprise integration | Domain autonomy |
| Complexity | High | Medium |
Data Fabric Best Practices
DO
- Start with metadata: Catalog all data assets before building the fabric
- Use virtualization for prototyping: Test federated queries before investing in ETL
- Implement data lineage: Track data flow end to end
- Use sync for performance: Replicate hot data where virtualized access is too slow
- Centralize governance: A single pane of glass for all data
DON'T
- Don't virtualize everything: Federated queries add latency and load on source systems
- Don't ignore data ownership: Assign clear owners and stewards
- Don't skip lineage: It is essential for impact analysis
- Don't forget security: Apply access policies consistently across all sources
- Don't build a monolith: Use a modular, scalable architecture
Key Takeaways
- Unified architecture: Seamless access across all data
- Virtualization: Query without moving data
- Metadata: Data catalog, lineage, glossary
- Integration: Virtualization, replication, sync
- Governance: Centralized security and policies
- Performance: Sync hot data for performance
- Complexity: High complexity, high value
- Use When: Enterprise integration, unified access, governance