Data Fabric

Unified Data Architecture


Overview

Data Fabric is a unified architecture that provides seamless data access, integration, and management across distributed data environments. It combines metadata, integration, and governance to create a “fabric” that connects all data assets.


Data Fabric Architecture

Key Components:

  • Metadata Engine: Data catalog, lineage, glossary, semantic modeling
  • Integration Layer: Virtualization, replication, synchronization
  • Data Services: Data access, preparation, orchestration
  • Security & Governance: Access control, audit logging, policy enforcement

Data Fabric Patterns

Data Virtualization

Virtualization Benefits:

  • Query without moving data: Federated queries across sources
  • Real-time access: No ETL latency
  • Cost optimization: No duplication
  • Flexibility: Add new sources easily

Data Fabric Technologies

Denodo Virtualization

-- Denodo: Data virtualization platform
-- Virtual layer: Union multiple sources
-- Note: UNION ALL keeps one row per source, so a customer present in several
-- sources appears several times. The joins also assume that all systems
-- share the same customer identifiers.
CREATE VIEW v_customer_360 AS
SELECT
    c.customer_id,
    c.customer_name,
    c.customer_email,
    o.order_count,
    o.total_spend,
    s.support_tickets
FROM (
    -- Warehouse: Customer data
    SELECT
        customer_id,
        customer_name,
        customer_email
    FROM postgres_dwh.public.customers
    UNION ALL
    -- Data Lake: Customer interactions
    SELECT
        customer_id,
        customer_name,
        customer_email
    FROM s3_data_lake.customers
    UNION ALL
    -- CRM: Customer accounts
    SELECT
        account_id AS customer_id,
        account_name AS customer_name,
        email AS customer_email
    FROM salesforce.accounts
) c
LEFT JOIN (
    -- Orders summary from warehouse
    SELECT
        customer_id,
        COUNT(*) AS order_count,
        SUM(total_amount) AS total_spend
    FROM postgres_dwh.public.orders
    GROUP BY customer_id
) o ON c.customer_id = o.customer_id
LEFT JOIN (
    -- Support tickets from CRM
    SELECT
        customer_id,
        COUNT(*) AS support_tickets
    FROM salesforce.support_tickets
    GROUP BY customer_id
) s ON c.customer_id = s.customer_id;
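
A UNION ALL over several sources returns one row per source for any customer that exists in more than one of them. If a single record per customer is wanted, the rows need to be merged. The following is a minimal sketch of one possible merge rule (sources listed in priority order, first non-empty value wins); the function name, the `seen_in` field, and the sample data are hypothetical and not part of Denodo.

```python
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def merge_customers(sources: List[Tuple[str, List[Record]]]) -> Dict[str, Record]:
    """Collapse per-source customer rows into one record per customer_id.

    Sources are given in priority order: for each field, the first source
    that supplies a non-empty value wins; later sources only fill gaps.
    """
    merged: Dict[str, Record] = {}
    for source_name, records in sources:
        for record in records:
            target = merged.setdefault(record["customer_id"], {"seen_in": []})
            target["seen_in"].append(source_name)
            for field_name, value in record.items():
                if value and not target.get(field_name):
                    target[field_name] = value
    return merged


# Hypothetical sample rows: c1 exists in both systems, c2 only in the CRM.
warehouse = [
    {"customer_id": "c1", "customer_name": "Ada Lovelace", "customer_email": ""},
]
crm = [
    {"customer_id": "c1", "customer_name": "A. Lovelace", "customer_email": "ada@example.com"},
    {"customer_id": "c2", "customer_name": "Grace Hopper", "customer_email": "grace@example.com"},
]

customers = merge_customers([("warehouse", warehouse), ("crm", crm)])
print(customers["c1"])
```

Here the warehouse name is kept for c1 because the warehouse is listed first, while the missing email is filled in from the CRM.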

Trino/Presto Federation

-- Trino: Distributed SQL engine for federation
-- Catalog configuration. CREATE CATALOG requires catalog.management=dynamic;
-- otherwise define catalogs in etc/catalog/*.properties files.
-- The PostgreSQL connector also needs "connection-user" and "connection-password".
CREATE CATALOG postgres USING postgresql
WITH (
    "connection-url" = 'jdbc:postgresql://postgres:5432/dwh'
);
CREATE CATALOG hive USING hive
WITH (
    "hive.metastore.uri" = 'thrift://hive-metastore:9083'
);
-- Delta table locations (e.g. s3a://my-bucket/delta/) are registered in the metastore
CREATE CATALOG delta USING delta_lake
WITH (
    "hive.metastore.uri" = 'thrift://hive-metastore:9083'
);
-- Federated query
SELECT
    c.customer_id,
    c.customer_name,
    o.order_id,
    o.order_date
FROM postgres.public.customers c
JOIN hive.analytics.orders o ON c.customer_id = o.customer_id
WHERE c.created_at >= CURRENT_DATE - INTERVAL '7' DAY;

Data Fabric Features

Data Catalog

Data Catalog Features:

  • Asset discovery: Automatic metadata extraction
  • Schema metadata: Tables, columns, data types
  • Business metadata: Owner, steward, description
  • Tags and labels: PII, sensitive, tier classifications
  • Data lineage: End-to-end data flow tracking
  • Glossary: Business definitions and terms
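
The catalog features above can be illustrated with a small in-memory model. This is only a sketch under assumed names (`CatalogEntry`, `DataCatalog`, and the sample assets are hypothetical); products such as those listed later in this page add automated metadata extraction, persistence, and a user interface on top of the same idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CatalogEntry:
    name: str                    # fully qualified asset name
    owner: str                   # business metadata
    description: str             # business metadata
    columns: Dict[str, str]      # schema metadata: column name -> data type
    tags: Set[str] = field(default_factory=set)  # e.g. PII, tier classifications


class DataCatalog:
    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        """Add or replace an asset in the catalog."""
        self._entries[entry.name] = entry

    def find_by_tag(self, tag: str) -> List[str]:
        """Return the names of all assets carrying the given tag."""
        return sorted(n for n, e in self._entries.items() if tag in e.tags)

    def search(self, keyword: str) -> List[str]:
        """Case-insensitive search over asset names, descriptions, and column names."""
        needle = keyword.lower()
        return sorted(
            n for n, e in self._entries.items()
            if needle in n.lower()
            or needle in e.description.lower()
            or any(needle in column.lower() for column in e.columns)
        )


catalog = DataCatalog()
catalog.register(CatalogEntry(
    name="dwh.public.customers",
    owner="crm-team",
    description="Master customer records",
    columns={"customer_id": "bigint", "customer_email": "varchar"},
    tags={"PII", "tier-1"},
))
catalog.register(CatalogEntry(
    name="dwh.public.orders",
    owner="sales-team",
    description="One row per purchase",
    columns={"order_id": "bigint", "customer_id": "bigint"},
    tags={"tier-1"},
))

print(catalog.find_by_tag("PII"))
print(catalog.search("email"))
```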

Data Lineage

Lineage Features:

  • Column-level: Track column transformations
  • End-to-end: Source to consumption
  • Impact analysis: What breaks if a table changes?
  • Data dependencies: Upstream and downstream
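
Impact analysis amounts to a graph traversal over lineage edges. The sketch below assumes lineage is available as (upstream, downstream) pairs; the asset names are hypothetical. It walks the graph breadth-first to collect everything transitively downstream of a changed asset.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each asset to the set of assets that read directly from it."""
    downstream: Dict[str, Set[str]] = defaultdict(set)
    for source, target in edges:
        downstream[source].add(target)
    return downstream


def impacted_by(graph: Dict[str, Set[str]], changed: str) -> Set[str]:
    """Return every asset transitively downstream of `changed` (breadth-first)."""
    seen = {changed}
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen - {changed}


# Hypothetical lineage edges: (upstream, downstream)
lineage = build_graph([
    ("raw.orders", "staging.orders"),
    ("staging.orders", "marts.revenue"),
    ("staging.orders", "marts.customer_360"),
    ("raw.customers", "marts.customer_360"),
    ("marts.revenue", "dashboard.sales"),
])

print(sorted(impacted_by(lineage, "raw.orders")))
```

Reversing the edge direction in `build_graph` gives the upstream view with the same traversal.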

Data Synchronization

Sync Patterns:

  • CDC (Change Data Capture): Near-real-time sync of row-level changes from databases
  • Batch sync: Scheduled sync (hourly, daily)
  • Event-driven: Sync on event triggers
  • Bidirectional: Changes flow in both directions between systems
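
To make the CDC pattern concrete, the sketch below applies an ordered stream of change events to a replica held in a dictionary. The event shape (`op`, `key`, `row`) is an assumption for illustration and does not follow the format of any particular CDC tool.

```python
from typing import Any, Dict, Iterable

ChangeEvent = Dict[str, Any]   # assumed shape: {"op": ..., "key": ..., "row": {...}}
Replica = Dict[Any, Dict[str, Any]]


def apply_changes(replica: Replica, events: Iterable[ChangeEvent]) -> Replica:
    """Apply change events to the replica in the order they were captured."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: merge the changed columns into the existing row, if any.
            replica[key] = {**replica.get(key, {}), **event["row"]}
        elif op == "delete":
            replica.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    return replica


# Hypothetical change stream captured from a source table.
events = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "email": "ada@old.example"}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "email": "grace@example.com"}},
    {"op": "update", "key": 1, "row": {"email": "ada@example.com"}},
    {"op": "delete", "key": 2},
]

replica = apply_changes({}, events)
print(replica)
```

Ordering matters: replaying the same events out of order could resurrect a deleted row, which is why CDC pipelines usually preserve the source commit order per key.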

Data Fabric Implementation

Technology Stack

Component      | Technology                       | Use Case
Virtualization | Denodo, Redshift Spectrum, Trino | Federated queries
Catalog        | Alation, Collibra, DataHub       | Metadata management
Lineage        | OpenLineage, Marquez, DataHub    | End-to-end tracking
Sync           | Fivetran, Airbyte, Matillion     | Data replication
Governance     | Collibra, Alation, Purview       | Access control, policies

Data Fabric vs. Data Mesh

Dimension    | Data Fabric                 | Data Mesh
Architecture | Unified, centralized        | Decentralized, domain-oriented
Ownership    | Centralized data team       | Domain ownership
Access       | Virtualization (no copying) | Data products (copied)
Governance   | Centralized                 | Federated
Best for     | Enterprise integration      | Domain autonomy
Complexity   | High                        | Medium

Data Fabric Best Practices

DO

  1. Start with metadata: Catalog all data assets before building the fabric
  2. Use virtualization for prototyping: Test queries before building ETL
  3. Implement data lineage: Track data flow end to end
  4. Use sync for performance: Replicate frequently queried (hot) data instead of federating every query
  5. Centralize governance: Manage policies for all data from a single place

DON’T

  1. Don't virtualize everything: Federated queries add latency and load on source systems
  2. Don't ignore data ownership: Assign clear owners and stewards
  3. Don't skip lineage: It is essential for impact analysis
  4. Don't forget security: Apply policies consistently across all sources
  5. Don't build a monolith: Use a modular, scalable architecture
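
Applying policies consistently usually means deriving access decisions from catalog tags in one place rather than in each source. The following is a minimal sketch of tag-based column masking; the tag names, roles, and the `enforce` function are hypothetical, and columns without a classification are masked by default.

```python
from typing import Any, Dict, Set

# Hypothetical policy inputs: column classifications and per-role clearances.
COLUMN_TAGS: Dict[str, Set[str]] = {
    "customer_name": {"PII"},
    "customer_email": {"PII"},
    "total_spend": set(),
}
ROLE_CLEARANCE: Dict[str, Set[str]] = {
    "analyst": set(),
    "steward": {"PII"},
}
MASK = "***"


def enforce(row: Dict[str, Any], role: str) -> Dict[str, Any]:
    """Return a copy of `row` with every column the role is not cleared for masked."""
    allowed = ROLE_CLEARANCE.get(role, set())   # unknown roles get no clearance
    result = {}
    for column, value in row.items():
        # Columns missing from the policy are treated as unclassified and masked.
        tags = COLUMN_TAGS.get(column, {"UNCLASSIFIED"})
        result[column] = value if tags <= allowed else MASK
    return result


row = {"customer_name": "Ada", "customer_email": "ada@example.com", "total_spend": 120.5}
print(enforce(row, "analyst"))
print(enforce(row, "steward"))
```

Because the same function sits in front of every source, a change to `COLUMN_TAGS` or `ROLE_CLEARANCE` takes effect everywhere at once.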

Key Takeaways

  1. Unified architecture: Seamless access across all data
  2. Virtualization: Query without moving data
  3. Metadata: Data catalog, lineage, glossary
  4. Integration: Virtualization, replication, sync
  5. Governance: Centralized security and policies
  6. Performance: Sync hot data instead of federating every query
  7. Complexity: High complexity, high value
  8. Use when: Enterprise integration, unified access, governance
