Principal Data Engineering Glossary
Essential Terminology for Architecture Discussions
Using This Glossary
- Definitions are concise, Principal-level explanations
- Context indicates when/why the term matters
- Related Terms link to associated concepts
- Cost Implications highlight financial considerations
A
ACID Transactions
Definition: Atomicity, Consistency, Isolation, Durability - guarantees for reliable database transactions.
Context: Critical for financial systems, data warehousing, and any scenario requiring exactly-once semantics. Open Table Formats (Delta, Iceberg, Hudi) bring ACID to data lakes.
Trade-offs: Strong consistency comes with performance overhead. At PB scale, isolation levels become expensive.
Cost Impact: Higher compute overhead, potential for hotspotting in distributed systems.
Related: BASE, Open Table Formats
Apache Arrow
Definition: In-memory columnar format enabling zero-copy data sharing between systems.
Context: Eliminates serialization overhead between Pandas, Polars, DuckDB, Spark. Foundation for modern data interoperability.
Cost Impact: Reduced memory pressure, faster query startup times.
Related: Parquet, Vectorization
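A minimal sketch of what zero-copy interoperability looks like in practice, assuming pandas, pyarrow, and duckdb are installed; the column names are illustrative:

```python
import duckdb
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"user_id": [1, 2, 3], "spend": [9.5, 3.2, 7.8]})
arrow_table = pa.Table.from_pandas(df)  # columnar, Arrow-backed

# DuckDB scans the Arrow table in place via replacement scans -- no serialization hop.
result = duckdb.sql("SELECT sum(spend) AS total FROM arrow_table").df()
print(result)
```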
Autoloader / Auto-ingestion
Definition: Incremental data ingestion that automatically detects new files without manual schema management.
Context: Essential for streaming ingestion from cloud storage (S3, GCS, ADLS). Reduces operational overhead.
Senior Gotcha: File listing operations at scale can become expensive. Use directory listing optimizations or queue-based notification (SQS, Pub/Sub).
Cost Impact: Reduced API calls vs. full scans, but requires queue infrastructure.
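A hedged sketch of a queue-notified Auto Loader stream on Databricks (the spark session is provided by the runtime; bucket, checkpoint, and table names are placeholders):

```python
stream = (
    spark.readStream
    .format("cloudFiles")                           # Databricks Auto Loader source
    .option("cloudFiles.format", "json")            # format of the landing files
    .option("cloudFiles.useNotifications", "true")  # SQS/Pub/Sub instead of listing
    .load("s3://my-bucket/landing/")
)

(stream.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/landing")
    .trigger(availableNow=True)                     # drain new files, then stop
    .toTable("bronze.events"))
```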
B
BASE Transactions
Definition: Basically Available, Soft state, Eventually consistent - alternative to ACID for distributed systems.
Context: Acceptable for analytics, clickstream, and scenarios where eventual consistency is tolerable.
Trade-offs: Higher availability and partition tolerance at the cost of immediate consistency.
Cost Impact: Lower coordination overhead, better horizontal scalability.
Related: ACID, CAP Theorem
Bronze/Silver/Gold Tables
Definition: Multi-stage data quality pattern where data progresses through quality tiers.
| Tier | Purpose | Quality | Use Case |
|---|---|---|---|
| Bronze | Raw ingestion | As-landed | Audit, replay capability |
| Silver | Cleaned/typed | Validated | Analytics, ML features |
| Gold | Aggregated/business | Certified | BI, dashboards, products |
Context: Medallion architecture popularized by Databricks. Enables gradual improvement of data quality as data moves through tiers.
Cost Impact: Multiple storage copies (2-3x storage cost). Optimize with views where appropriate.
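A minimal PySpark sketch of a bronze-to-silver promotion, assuming an active spark session; the table and column names are illustrative:

```python
from pyspark.sql import functions as F

bronze = spark.read.table("bronze.orders_raw")           # as-landed, audit/replay tier

silver = (
    bronze
    .withColumn("order_ts", F.to_timestamp("order_ts"))  # enforce types
    .filter(F.col("order_id").isNotNull())               # basic quality gate
    .dropDuplicates(["order_id"])                        # safe to re-run
)

silver.write.mode("overwrite").saveAsTable("silver.orders")
```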
C
CAP Theorem
Definition: In a distributed system, you can only guarantee 2 of 3: Consistency, Availability, Partition tolerance.
Context: Every distributed data system makes CAP trade-offs. Understanding them is critical for architecture decisions.
Principal Insight: In cloud systems, partition tolerance is non-negotiable (networks will fail). The real choice is C vs. A.
Cloud-Native Architecture
Definition: System design leveraging cloud-specific capabilities (object storage, managed services, auto-scaling).
Context: NOT just “running in the cloud.” True cloud-native uses storage disaggregation, serverless compute, and managed services.
Cost Impact: OpEx over CapEx, pay-for-use models, reduced operational overhead.
Columnar Storage
Definition: Data stored by column rather than row, enabling efficient compression and predicate pushdown.
Context: Foundation of modern analytics (Parquet, ORC, Kudu). Essential for query performance.
Cost Impact: Better compression = lower storage costs. Faster queries = lower compute costs.
Related: Parquet, Predicate Pushdown
Compaction
Definition: Merging small files into larger ones to address the small files problem.
Context: Continuous background process in streaming systems. Critical for query performance and metadata management.
Senior Gotcha: Aggressive compaction increases write amplification. Too passive = thousands of small files = slow queries and NameNode overload.
Cost Impact: Compute trade-off for query performance. Spot instances ideal for compaction workloads.
Related: Small Files Problem, Z-Ordering
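An illustrative compaction pass in PySpark: read a partition's many small files and rewrite them as a few large ones. Paths and the output file count are assumptions; table formats like Delta offer this as a built-in command (OPTIMIZE) instead:

```python
df = spark.read.parquet("s3://lake/events/dt=2025-01-27/")

# coalesce() merges partitions without a shuffle; pick the count from
# data volume / target file size (e.g. ~2 GB at a 256 MB target -> 8 files).
(df.coalesce(8)
   .write.mode("overwrite")
   .parquet("s3://lake/events_compacted/dt=2025-01-27/"))  # write to a new path, then swap
```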
Compute-Storage Separation
Definition: Architecture where compute and storage scale independently.
Context: Fundamental to cloud data platforms. Enables storage in S3/GCS with compute from Spark/Trino/Presto.
Cost Impact: Optimal for variable workloads. Can provision compute for query spikes without over-provisioning storage.
Related: Lakehouse
Containerization
Definition: Packaging applications with dependencies into containers (Docker) for consistent execution.
Context: Enables reproducible data pipelines, easier local development, and Kubernetes orchestration.
Cost Impact: Better resource utilization, easier rightsizing.
D
Data Contract
Definition: Formal agreement between data producer and consumer defining schema, quality SLAs, and update frequency.
Context: Evolution of data governance. Shifts quality enforcement from downstream to upstream.
Components:
- Schema definition
- Data quality rules
- Update SLA (latency, frequency)
- Notification on failure
Cost Impact: Reduced downstream reprocessing costs, earlier failure detection.
Related: Data Quality
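One way to make the schema and quality rules executable is a validation model on the producer side. A hedged sketch using pydantic (v2 API); the fields and rules are illustrative, not from the source:

```python
from datetime import datetime
from pydantic import BaseModel, Field

class OrderEvent(BaseModel):
    order_id: str = Field(min_length=1)    # schema + non-empty rule
    amount_cents: int = Field(ge=0)        # quality rule: non-negative
    created_at: datetime                   # type enforcement

# Producer-side gate: bad records fail here, not in a downstream pipeline.
record = {"order_id": "o-123", "amount_cents": 4200, "created_at": "2025-01-27T10:00:00"}
OrderEvent.model_validate(record)          # raises ValidationError on contract breach
```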
Data Fabric
Definition: Unified architecture connecting data across on-prem and multi-cloud environments.
Context: Enterprise data integration pattern for hybrid scenarios. More centralized than Data Mesh.
Cost Impact: Reduced data movement, better utilization of existing infrastructure.
Data Lake
Definition: Centralized repository storing raw data in native format (object storage).
Context: Evolution from data warehouse. Enables schema-on-read, stores all data types.
Problems: Lack of transactions, schema enforcement, quality controls → led to Data Swamp.
Evolution: Lakehouse architecture adds warehouse capabilities to lakes.
Data Lakehouse
Definition: Architecture combining data lake flexibility with warehouse ACID transactions and schema enforcement.
Context: Modern data architecture standard. Enabled by Open Table Formats.
Benefits: Single copy of data, warehouse reliability at lake cost.
Cost Impact: Eliminates duplicate storage (lake + warehouse). Primary cost saving of modern architecture.
Related: Delta Lake, Iceberg, Hudi
Data Mesh
Definition: Decentralized data architecture treating data as a product with domain-oriented ownership.
Context: Alternative to centralized data platform. Empowers domains to own their data products.
Principles:
- Domain ownership
- Data as a product
- Self-serve infrastructure platform
- Federated governance
Trade-offs: Better agility/scale vs. governance complexity. Not ideal for small organizations.
Cost Impact: Can reduce central team costs but increases cross-domain coordination overhead.
Data Vault 2.0
Definition: Data modeling methodology with Hub, Link, and Satellite structures for auditability and flexibility.
Context: Alternative to Kimball for enterprise warehousing with strong audit trails.
Cost Impact: More storage overhead than star schema, but better adaptability to change.
Databricks
Definition: Multi-cloud platform for data engineering, analytics, and ML.
Context: Unified platform for Spark, Delta Lake, MLflow, and SQL warehouses.
Cost Impact: Premium pricing over open-source equivalents. Justified by the unified experience and reduced operational overhead.
Delta Lake
Definition: Open table format bringing ACID transactions, schema enforcement, and time travel to data lakes.
Context: Most popular OTF (Databricks native, open source). Optimized for Databricks but works anywhere.
Key Features:
- ACID transactions (optimized for concurrency)
- Schema enforcement and evolution
- Time travel (data versioning)
- Deletion vectors (for merge/update performance)
Cost Impact: Small file management reduces compute costs. Deletion vectors drastically reduce merge overhead.
Related: Apache Iceberg, Apache Hudi
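A sketch of a transactional upsert with the delta-spark Python API; updates_df and the table name are assumed to exist:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.customers")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()       # update existing rows
    .whenNotMatchedInsertAll()    # insert new rows
    .execute())                   # single ACID commit
```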
E
Event Sourcing
Definition: Storing state as a sequence of events rather than current state.
Context: Essential for event-driven architectures, streaming systems, and audit trails.
Benefits: Complete audit history, replay capability, temporal queries.
Cost Impact: Higher storage (event log + projected state). Enables efficient compression with columnar formats.
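A tiny illustration of the core idea: current state is a projection computed by replaying the log (the event shapes are illustrative):

```python
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
]

def project_balance(log):
    balance = 0
    for e in log:  # replay = fold over the immutable event log
        balance += e["amount"] if e["type"] == "deposited" else -e["amount"]
    return balance

assert project_balance(events) == 70  # state is derived, never stored as the source of truth
```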
Exactly-Once Semantics
Definition: Guarantee that each event is processed exactly once (no duplicates, no losses).
Context: Holy grail of streaming. Achieved with idempotent sinks + checkpointing.
Trade-offs: Requires coordination, checkpointing overhead.
Cost Impact: Checkpoint storage, coordination overhead. Worth it for financial/transactional systems.
F
Feature Store
Definition: Centralized repository for ML features enabling feature reuse across training and serving.
Context: Critical for MLOps at scale. Prevents training-serving skew.
Types:
- Offline store (historical features for training)
- Online store (low-latency serving)
Cost Impact: Duplicate storage for offline/online. Reduces feature engineering overhead.
Related: Feast
H
Handler-Based Architecture
Definition: Streaming pattern where handlers process state changes per entity.
Context: Common in real-time enrichment, personalization.
Cost Impact: State management overhead at scale. Consider stateful vs stateless processing.
Horizontal Partitioning (Sharding)
Definition: Splitting data across multiple nodes by partition key.
Context: Enables linear scalability. Critical for high-write throughput.
Challenges: Choosing partition key, hotspots, rebalancing.
Cost Impact: Enables cheaper horizontal scale vs. vertical scale.
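A minimal sketch of a stable key-to-shard mapping (shard count and hash choice are illustrative; md5 is used because Python's built-in hash() is not stable across processes):

```python
import hashlib

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()  # deterministic across machines
    return int(digest, 16) % NUM_SHARDS

assert shard_for("user-42") == shard_for("user-42")  # same key always lands together
```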
I
Idempotency
Definition: Property where an operation can be applied multiple times with same result as single application.
Context: Essential for fault tolerance, exactly-once semantics, and retry logic.
Examples: Upserts, conditional writes, deduplication.
Cost Impact: Enables safe use of cheaper (unreliable) infrastructure like spot instances.
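The property in miniature: a keyed upsert applied twice leaves the same state as applying it once, which is what makes retries safe:

```python
state = {}

def upsert(store, key, value):
    store[key] = value  # keyed write: replaying it changes nothing
    return store

upsert(state, "order-1", {"status": "paid"})
upsert(state, "order-1", {"status": "paid"})  # duplicate delivery / retry
assert state == {"order-1": {"status": "paid"}}
```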
Apache Iceberg
Definition: Open table format for large analytic datasets with metadata evolution.
Context: Broad ecosystem adoption (Snowflake, Databricks, and BigQuery all support it). Focus on metadata scalability.
Key Differentiator: No lock-based metadata reads. Scales to billions of partitions.
Cost Impact: Reduced metadata query costs. Better for high-partition-count tables.
Related: Delta Lake, Apache Hudi
L
Late Data
Definition: Data arriving after the expected window (common in mobile/web events).
Context: Streaming challenge requiring handling strategies: drop, buffer, or backfill.
Solutions:
- Watermarks with allowed lateness
- Buffering windows
- Re-processing with correction
Cost Impact: Buffering increases storage and compute. Re-processing can be expensive.
LSM Tree (Log-Structured Merge Tree)
Definition: Data structure optimized for write-heavy workloads with background compaction.
Context: Foundation of Kafka, HBase, Cassandra, and many modern databases.
Trade-offs: Excellent write performance, read amplification from multiple levels.
Cost Impact: Compaction overhead required. Ideal for append-heavy workloads.
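A toy LSM sketch showing the write path (memtable, flush to sorted runs) and the read amplification of checking newest data first; real systems add a write-ahead log, bloom filters, and compaction:

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value                 # fast in-memory write
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))  # flush: immutable sorted run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                   # newest level first
            return self.memtable[key]
        for run in reversed(self.runs):            # read amplification: scan older runs
            for k, v in run:
                if k == key:
                    return v
        return None
```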
M
Materialized View
Definition: Pre-computed query result stored as a table, refreshed on schedule or trigger.
Context: Essential for query performance optimization. Frequently run queries return precomputed results instead of recomputing them on every read.
Trade-offs: Storage cost vs. compute cost. Refresh overhead vs. query speed.
Cost Impact: High-value optimization for frequently accessed aggregations. Reduces expensive compute.
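On engines without native materialized views, the same effect is often hand-rolled: precompute the aggregate into a table and refresh it from a scheduled job. A hedged Spark SQL sketch (table names illustrative; CREATE OR REPLACE TABLE assumes a v2/Delta-backed catalog):

```python
spark.sql("""
    CREATE OR REPLACE TABLE gold.daily_revenue AS
    SELECT dt, SUM(amount_cents) AS revenue_cents
    FROM silver.orders
    GROUP BY dt
""")  # rerun on a schedule; readers pay only a small table scan
```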
Metadata Catalog
Definition: Central repository of data assets, schemas, lineage, and ownership.
Context: Essential for data discovery, governance, and democratization.
Examples: Glue Data Catalog, Hive Metastore, Unity Catalog, Atlas.
Cost Impact: Small cost relative to value. Enables chargeback and usage-based allocation.
O
Open Table Formats (OTF)
Definition: Table abstraction layers bringing database-like capabilities to object storage.
The Big Three:
| Format | Best For | Owner |
|---|---|---|
| Delta Lake | Databricks, high concurrency | Linux Foundation (originated at Databricks) |
| Iceberg | Multi-cloud, high partition count | Apache |
| Hudi | Streaming upserts, CDC | Apache |
Context: Enabler of Lakehouse architecture. Replaces HDFS for analytics.
Cost Impact: Enables single-copy storage (eliminates warehouse duplication).
P
Parquet
Definition: Columnar storage format with compression and encoding optimizations.
Context: De facto standard for analytics. Efficient compression (often 5-10x smaller than CSV).
Advantages:
- Columnar projection (read only needed columns)
- Predicate pushdown (skip row groups)
- Compression (Snappy, Zstd, Gzip)
- Schema evolution (add columns)
Cost Impact: 80-90% storage cost reduction vs. CSV/JSON. Critical for cloud storage economics.
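A quick demonstration of compressed storage and column projection with pandas and pyarrow (the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": range(1000),
    "country": ["US"] * 1000,
    "spend": [1.0] * 1000,
})
df.to_parquet("events.parquet", compression="zstd")  # encoded + compressed columns

# Reading one column touches only that column's pages on disk.
spend_only = pd.read_parquet("events.parquet", columns=["spend"])
```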
Partitioning
Definition: Splitting data into subdirectories based on column values.
Common Strategies:
- Date-based: dt=2025-01-27/ (most common)
- Hash-based: Consistent distribution
- Range-based: Sequential data
- Bucketing: Fixed number of buckets
Context: Query performance optimization. Enables partition pruning (skip irrelevant data).
Senior Gotcha: Over-partitioning creates small files. Under-partitioning causes full scans.
Cost Impact: Proper partitioning reduces query costs by 10-100x. Wrong strategy increases costs.
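What a date-based layout looks like from PySpark; events_df and the paths are assumptions:

```python
# Produces s3://lake/events/dt=2025-01-27/part-*.parquet, one directory per day.
(events_df
    .write.mode("overwrite")
    .partitionBy("dt")
    .parquet("s3://lake/events/"))

# Partition pruning: only the matching dt= directories are listed and read.
jan_27 = spark.read.parquet("s3://lake/events/").where("dt = '2025-01-27'")
```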
Predicate Pushdown
Definition: Pushing filter operations to storage layer to minimize data read.
Context: Parquet/Iceberg/Delta support. Minimizes I/O for analytical queries.
Example: SELECT * FROM table WHERE date > '2025-01-01' reads only matching files.
Cost Impact: Can reduce query costs by 90%+ for selective queries.
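A sketch of pushdown with pyarrow.dataset, assuming a string-typed date column and an accessible path:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("s3://lake/events/", format="parquet")

# The filter is checked against Parquet row-group statistics, so
# non-matching row groups (and whole files) are never read.
recent = dataset.to_table(filter=ds.field("date") > "2025-01-01")
```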
S
Schema Evolution
Definition: Ability to modify table schema without breaking existing queries.
Types:
- Add column: Safe (backward compatible)
- Drop column: Risky (breaking change)
- Rename: Complex (requires metadata)
- Type change: Risky (may break)
Context: Critical for long-lived data products. OTFs support safe evolution.
Cost Impact: Enables gradual migration vs. expensive full rewrites.
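Additive evolution on a Delta table, sketched with the delta-spark writer (new_orders_df and the table name are assumed); mergeSchema admits the new column instead of failing the write:

```python
(new_orders_df                       # contains a new 'coupon_code' column
    .write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # additive change: safe, backward compatible
    .saveAsTable("silver.orders"))
```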
Schema-on-Read vs Schema-on-Write
| Approach | Definition | Best For |
|---|---|---|
| Schema-on-Read | Schema applied when querying | Data lakes, flexibility |
| Schema-on-Write | Schema enforced at write | Warehouses, quality |
Context: Trade-off between flexibility (lake) and quality (warehouse).
Serverless Data Processing
Definition: Compute that automatically scales to zero when not in use.
Context: AWS Lambda, Glue Serverless, Databricks Serverless. Ideal for intermittent workloads.
Cost Impact: Pay-per-query vs. provisioned clusters. Optimal for variable/low-concurrency workloads.
Small Files Problem
Definition: Performance degradation from having too many small files.
Impacts:
- Metadata overhead (NameNode, Hive Metastore)
- Query planning overhead
- Inefficient I/O (many small reads)
Solutions:
- Compaction (merge small files)
- Proper partitioning
- Optimal file size (128 MB to 1 GB for Parquet)
Cost Impact: Can increase query costs 10-100x. Critical for cost optimization.
Related: Compaction, Module 7
Snowflake
Definition: Cloud-native data warehouse separating compute and storage.
Context: Popular for simplicity and performance. Premium pricing.
Architecture:
- Multi-cluster warehouses (compute)
- Micro-partitions (storage)
- Automatic clustering (optimization)
Cost Impact: Premium vs. open-source stack. Value proposition in reduced ops.
Structured Streaming
Definition: Streaming API treating streams as unbounded tables.
Context: Spark Structured Streaming, Flink SQL. Unifies batch and streaming.
Benefits: Single codebase for batch and streaming. Easier mental model.
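The unbounded-table model in a minimal PySpark example (the classic socket word-count shape; host and port are placeholders):

```python
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost").option("port", 9999)
         .load())

# Same DataFrame API as batch: 'counts' is a continuously updated table.
counts = lines.groupBy("value").count()

query = (counts.writeStream
         .outputMode("complete")   # emit the full updated result table
         .format("console")
         .start())
```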
Streaming vs Batch
| Dimension | Batch | Streaming |
|---|---|---|
| Latency | Hours/Minutes | Seconds/Milliseconds |
| Complexity | Lower | Higher |
| Cost | Generally cheaper | Generally more expensive |
| Use Case | Historical analysis, ETL | Real-time analytics, CDC |
Context: Choice depends on business requirements, not technology preference.
Cost Impact: Streaming has constant compute overhead. Batch can use spot instances aggressively.
T
Time Travel
Definition: Ability to query previous versions of data.
Context: Feature of Delta Lake, Iceberg, Hudi. Critical for audit, debugging, ML reproducibility.
Retention: Typically 30 days (configurable). Long retention increases storage.
Cost Impact: Storage cost for retained versions. Often worth it for debugging/audit savings.
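Reading earlier snapshots of a Delta table by version or timestamp (the path is illustrative):

```python
v3 = (spark.read.format("delta")
      .option("versionAsOf", 3)                     # a specific commit
      .load("s3://lake/silver/orders"))

yesterday = (spark.read.format("delta")
             .option("timestampAsOf", "2025-01-26")  # nearest earlier snapshot
             .load("s3://lake/silver/orders"))
```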
Transformations
Definition: Data processing logic converting raw to curated data.
Patterns:
- ETL: Extract, Transform, Load (transform before loading)
- ELT: Extract, Load, Transform (load raw, then transform using warehouse/lakehouse compute)
Context: ELT dominates in modern lakehouse (dbt, Spark SQL).
V
Vectorization
Definition: Processing data in batches (CPU cache lines) rather than row-by-row.
Context: CPU-level optimization in Parquet, Arrow, DuckDB, Spark.
Benefits: 10-100x performance improvement for analytical queries.
Cost Impact: Better hardware utilization = lower compute cost per query.
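The row-at-a-time vs. batched contrast in miniature with NumPy:

```python
import numpy as np

prices = np.random.rand(1_000_000)

total_loop = 0.0
for p in prices:          # row-by-row: Python interpreter overhead per element
    total_loop += p * 1.08

total_vec = (prices * 1.08).sum()  # vectorized: one batched kernel over contiguous memory
```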
View vs Materialized View
| Type | Storage | Refresh | Use Case |
|---|---|---|---|
| View | None (logical) | Real-time | Simplification, security |
| Materialized View | Yes (physical) | On schedule | Performance optimization |
Context: Views cost nothing to store but recompute on every query; materialized views trade storage and refresh compute for read speed.
W
Watermark
Definition: Timestamp threshold in streaming defining when to stop waiting for late data.
Context: Balances completeness vs. latency. Required for windowed aggregations.
Trade-offs:
- Tight watermark: Faster results, more late data
- Loose watermark: More complete, slower results
Cost Impact: Buffering for late data increases state size and compute.
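How the threshold is expressed in Spark Structured Streaming; events_stream and the column names are assumptions:

```python
from pyspark.sql import functions as F

counts = (events_stream
    .withWatermark("event_time", "10 minutes")      # wait up to 10 min for late events
    .groupBy(F.window("event_time", "5 minutes"))   # then finalize each window
    .count())
```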
Windowing Strategies
Types:
- Tumbling: Fixed-size, non-overlapping
- Sliding: Fixed-size, overlapping
- Session: Activity-based, variable size
Context: Streaming aggregation patterns. Choice affects latency and resource usage.
Cost Impact: Sliding windows = 2-10x state vs. tumbling.
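The three shapes side by side in PySpark (events and the column names are illustrative):

```python
from pyspark.sql import functions as F

tumbling = events.groupBy(F.window("ts", "10 minutes")).count()              # no overlap
sliding = events.groupBy(F.window("ts", "10 minutes", "5 minutes")).count()  # 50% overlap
sessions = events.groupBy("user_id", F.session_window("ts", "30 minutes")).count()  # gap-based
```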
Z
Z-Ordering
Definition: Multi-dimensional clustering technique optimizing for common query patterns.
Context: Alternative to Hive-style partitioning. Enables efficient multi-column queries.
How it Works:
- Interleaves bits from multiple columns
- Sorts data by Z-order value
- Enables data skipping on multiple columns
Example: Z-order on (country, date) enables efficient queries filtering on either or both.
Cost Impact: Write overhead (roughly 1.5-2x) in exchange for 5-10x or greater speedups on filtered queries. Worth it for read-heavy workloads.
Related: Partitioning, Module 7
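On Delta tables this is a one-line maintenance command (OSS Delta 2.0+ or Databricks; table and column names illustrative):

```python
spark.sql("OPTIMIZE events ZORDER BY (country, date)")  # rewrite files clustered on both columns
```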
Acronym Quick Reference
| Acronym | Full Term | Context |
|---|---|---|
| ACID | Atomicity, Consistency, Isolation, Durability | Transactional guarantees |
| BASE | Basically Available, Soft state, Eventually consistent | Distributed system trade-offs |
| CAP | Consistency, Availability, Partition tolerance | Distributed system theorem |
| CDC | Change Data Capture | Real-time data replication |
| CI/CD | Continuous Integration/Deployment | Automation for data pipelines |
| COW | Copy-on-Write | Storage strategy (immutable updates) |
| MOR | Merge-on-Read | Storage strategy (deferred merge) |
| OTF | Open Table Format | Lakehouse enabler |
| RAG | Retrieval Augmented Generation | LLM architecture |
| SLA/SLO/SLI | Service Level Agreement/Objective/Indicator | Reliability metrics |
| WAP | Write-Audit-Publish | Data quality pattern |
Cost Optimization Keywords
Terms that trigger cost considerations:
| Term | Cost Dimension |
|---|---|
| Hot/Warm/Cold | Storage tiering |
| Spot/Preemptible | Compute discount (60-90%) |
| Compaction | Write amplification |
| Partitioning | Query cost multiplier |
| Z-Ordering | Write overhead, query savings |
| Caching | Storage vs compute trade-off |
| Serverless | OpEx model, variable cost |
| Concurrency | Cluster sizing |
| Chaos Engineering | Failure cost vs. prevention |
| FinOps | Cost governance |
Last Updated: 2025 | Terms evolve with technology