Principal Data Engineering Glossary
Essential Terminology for Architecture Discussions
Using This Glossary
- Definitions are concise, Principal-level explanations
- Context indicates when/why the term matters
- Related Terms link to associated concepts
- Cost Implications highlight financial considerations
A
ACID Transactions
Definition: Atomicity, Consistency, Isolation, Durability - guarantees for reliable database transactions.
Context: Critical for financial systems, data warehousing, and any scenario requiring exactly-once semantics. Open Table Formats (Delta, Iceberg, Hudi) bring ACID to data lakes.
Trade-offs: Strong consistency comes with performance overhead. At PB scale, isolation levels become expensive.
Cost Impact: Higher compute overhead, potential for hotspotting in distributed systems.
Related: BASE, Open Table Formats
Apache Arrow
Definition: In-memory columnar format enabling zero-copy data sharing between systems.
Context: Eliminates serialization overhead between Pandas, Polars, DuckDB, Spark. Foundation for modern data interoperability.
Cost Impact: Reduced memory pressure, faster query startup times.
Related: Parquet, Vectorization
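A minimal sketch of what zero-copy interoperability looks like in practice, assuming pandas, pyarrow, and duckdb are installed; the column names are illustrative:

```python
import duckdb
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"user_id": [1, 2, 3], "spend": [9.5, 3.2, 7.8]})
arrow_table = pa.Table.from_pandas(df)  # columnar, Arrow-backed

# DuckDB scans the Arrow table in place via replacement scans -- no serialization hop.
result = duckdb.sql("SELECT sum(spend) AS total FROM arrow_table").df()
print(result)
```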
Autoloader / Auto-ingestion
Definition: Incremental data ingestion that automatically detects new files without manual schema management.
Context: Essential for streaming ingestion from cloud storage (S3, GCS, ADLS). Reduces operational overhead.
Senior Gotcha: File listing operations at scale can become expensive. Use directory listing optimizations or queue-based notification (SQS, Pub/Sub).
Cost Impact: Reduced API calls vs. full scans, but requires queue infrastructure.
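A hedged sketch of a queue-notified Auto Loader stream on Databricks (the spark session is provided by the runtime; bucket, checkpoint, and table names are placeholders):

```python
stream = (
    spark.readStream
    .format("cloudFiles")                           # Databricks Auto Loader source
    .option("cloudFiles.format", "json")            # format of the landing files
    .option("cloudFiles.useNotifications", "true")  # SQS/Pub/Sub instead of listing
    .load("s3://my-bucket/landing/")
)

(stream.writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/landing")
    .trigger(availableNow=True)                     # drain new files, then stop
    .toTable("bronze.events"))
```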
B
BASE Transactions
Definition: Basically Available, Soft state, Eventually consistent - alternative to ACID for distributed systems.
Context: Acceptable for analytics, clickstream, and scenarios where eventual consistency is tolerable.
Trade-offs: Higher availability and partition tolerance at the cost of immediate consistency.
Cost Impact: Lower coordination overhead, better horizontal scalability.
Related: ACID, CAP Theorem
Bronze/Silver/Gold Tables
Definition: Multi-stage data quality pattern where data progresses through quality tiers.
| Tier | Purpose | Quality | Use Case |
|---|---|---|---|
| Bronze | Raw ingestion | As-landed | Audit, replay capability |
| Silver | Cleaned/typed | Validated | Analytics, ML features |
| Gold | Aggregated/business | Certified | BI, dashboards, products |
Context: Medallion architecture popularized by Databricks. Enables gradual improvement of data quality as data moves through tiers.
Cost Impact: Multiple storage copies (2-3x storage cost). Optimize with views where appropriate.
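A minimal PySpark sketch of a bronze-to-silver promotion, assuming an active spark session; the table and column names are illustrative:

```python
from pyspark.sql import functions as F

bronze = spark.read.table("bronze.orders_raw")           # as-landed, audit/replay tier

silver = (
    bronze
    .withColumn("order_ts", F.to_timestamp("order_ts"))  # enforce types
    .filter(F.col("order_id").isNotNull())               # basic quality gate
    .dropDuplicates(["order_id"])                        # safe to re-run
)

silver.write.mode("overwrite").saveAsTable("silver.orders")
```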
C
CAP Theorem
Definition: In a distributed system, you can only guarantee 2 of 3: Consistency, Availability, Partition tolerance.
Context: Every distributed data system makes CAP trade-offs. Understanding them is critical for architecture decisions.
Principal Insight: In cloud systems, partition tolerance is non-negotiable (networks will fail). The real choice is C vs. A.
Cloud-Native Architecture
Definition: System design leveraging cloud-specific capabilities (object storage, managed services, auto-scaling).
Context: NOT just “running in the cloud.” True cloud-native uses storage disaggregation, serverless compute, and managed services.
Cost Impact: OpEx over CapEx, pay-for-use models, reduced operational overhead.
Columnar Storage
Definition: Data stored by column rather than row, enabling efficient compression and predicate pushdown.
Context: Foundation of modern analytics (Parquet, ORC, Kudu). Essential for query performance.
Cost Impact: Better compression = lower storage costs. Faster queries = lower compute costs.
Related: Parquet, Predicate Pushdown
Compaction
Definition: Merging small files into larger ones to address the small files problem.
Context: Continuous background process in streaming systems. Critical for query performance and metadata management.
Senior Gotcha: Aggressive compaction increases write amplification. Too passive = thousands of small files = slow queries and NameNode overload.
Cost Impact: Compute trade-off for query performance. Spot instances ideal for compaction workloads.
Related: Small Files Problem, Z-Ordering
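An illustrative compaction pass in PySpark: read a partition's many small files and rewrite them as a few large ones. Paths and the output file count are assumptions; table formats like Delta offer this as a built-in command (OPTIMIZE) instead:

```python
df = spark.read.parquet("s3://lake/events/dt=2025-01-27/")

# coalesce() merges partitions without a shuffle; pick the count from
# data volume / target file size (e.g. ~2 GB at a 256 MB target -> 8 files).
(df.coalesce(8)
   .write.mode("overwrite")
   .parquet("s3://lake/events_compacted/dt=2025-01-27/"))  # write to a new path, then swap
```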
Compute-Storage Separation
Definition: Architecture where compute and storage scale independently.
Context: Fundamental to cloud data platforms. Enables storage in S3/GCS with compute from Spark/Trino/Presto.
Cost Impact: Optimal for variable workloads. Can provision compute for query spikes without over-provisioning storage.
Related: Lakehouse
Containerization
Definition: Packaging applications with dependencies into containers (Docker) for consistent execution.
Context: Enables reproducible data pipelines, easier local development, and Kubernetes orchestration.
Cost Impact: Better resource utilization, easier rightsizing.
D
Data Contract
Definition: Formal agreement between data producer and consumer defining schema, quality SLAs, and update frequency.
Context: Evolution of data governance. Shifts quality enforcement from downstream to upstream.
Components:
- Schema definition
- Data quality rules
- Update SLA (latency, frequency)
- Notification on failure
Cost Impact: Reduced downstream reprocessing costs, earlier failure detection.
Related: Data Quality
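One way to make the schema and quality rules executable is a validation model on the producer side. A hedged sketch using pydantic (v2 API); the fields and rules are illustrative, not from the source:

```python
from datetime import datetime
from pydantic import BaseModel, Field

class OrderEvent(BaseModel):
    order_id: str = Field(min_length=1)    # schema + non-empty rule
    amount_cents: int = Field(ge=0)        # quality rule: non-negative
    created_at: datetime                   # type enforcement

# Producer-side gate: bad records fail here, not in a downstream pipeline.
record = {"order_id": "o-123", "amount_cents": 4200, "created_at": "2025-01-27T10:00:00"}
OrderEvent.model_validate(record)          # raises ValidationError on contract breach
```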
Data Fabric
Definition: Unified architecture connecting data across on-prem and multi-cloud environments.
Context: Enterprise data integration pattern for hybrid scenarios. More centralized than Data Mesh.
Cost Impact: Reduced data movement, better utilization of existing infrastructure.
Data Lake
Definition: Centralized repository storing raw data in native format (object storage).
Context: Evolution from data warehouse. Enables schema-on-read, stores all data types.
Problems: Lack of transactions, schema enforcement, quality controls → led to Data Swamp.
Evolution: Lakehouse architecture adds warehouse capabilities to lakes.
Data Lakehouse
Definition: Architecture combining data lake flexibility with warehouse ACID transactions and schema enforcement.
Context: Modern data architecture standard. Enabled by Open Table Formats.
Benefits: Single copy of data, warehouse reliability at lake cost.
Cost Impact: Eliminates duplicate storage (lake + warehouse). Primary cost saving of modern architecture.
Related: Delta Lake, Iceberg, Hudi
Data Mesh
Definition: Decentralized data architecture treating data as a product with domain-oriented ownership.
Context: Alternative to centralized data platform. Empowers domains to own their data products.
Principles:
- Domain ownership
- Data as a product
- Self-serve infrastructure platform
- Federated governance
Trade-offs: Better agility/scale vs. governance complexity. Not ideal for small organizations.
Cost Impact: Can reduce central team costs but increases cross-domain coordination overhead.
Data Vault 2.0
Definition: Data modeling methodology with Hub, Link, and Satellite structures for auditability and flexibility.
Context: Alternative to Kimball for enterprise warehousing with strong audit trails.
Cost Impact: More storage overhead than star schema, but better adaptability to change.
Databricks
Definition: Multi-cloud platform for data engineering, analytics, and ML.
Context: Unified platform for Spark, Delta Lake, MLflow, and SQL warehouses.
Cost Impact: Premium pricing over open-source equivalents. Justified by the unified experience and reduced operational overhead.
Delta Lake
Definition: Open table format bringing ACID transactions, schema enforcement, and time travel to data lakes.
Context: Most popular OTF (Databricks native, open source). Optimized for Databricks but works anywhere.
Key Features:
- ACID transactions (optimized for concurrency)
- Schema enforcement and evolution
- Time travel (data versioning)
- Deletion vectors (for merge/update performance)
Cost Impact: Small file management reduces compute costs. Deletion vectors drastically reduce merge overhead.
Related: Apache Iceberg, Apache Hudi
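A sketch of a transactional upsert with the delta-spark Python API; updates_df and the table name are assumed to exist:

```python
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "silver.customers")

(target.alias("t")
    .merge(updates_df.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()       # update existing rows
    .whenNotMatchedInsertAll()    # insert new rows
    .execute())                   # single ACID commit
```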
E
Event Sourcing
Definition: Storing state as a sequence of events rather than current state.
Context: Essential for event-driven architectures, streaming systems, and audit trails.
Benefits: Complete audit history, replay capability, temporal queries.
Cost Impact: Higher storage (event log + projected state). Enables efficient compression with columnar formats.
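A tiny illustration of the core idea: current state is a projection computed by replaying the log (the event shapes are illustrative):

```python
events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
]

def project_balance(log):
    balance = 0
    for e in log:  # replay = fold over the immutable event log
        balance += e["amount"] if e["type"] == "deposited" else -e["amount"]
    return balance

assert project_balance(events) == 70  # state is derived, never stored as the source of truth
```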
Exactly-Once Semantics
Definition: Guarantee that each event is processed exactly once (no duplicates, no losses).
Context: Holy grail of streaming. Achieved with idempotent sinks + checkpointing.
Trade-offs: Requires coordination, checkpointing overhead.
Cost Impact: Checkpoint storage, coordination overhead. Worth it for financial/transactional systems.
F
Feature Store
Definition: Centralized repository for ML features enabling feature reuse across training and serving.
Context: Critical for MLOps at scale. Prevents training-serving skew.
Types:
- Offline store (historical features for training)
- Online store (low-latency serving)
Cost Impact: Duplicate storage for offline/online. Reduces feature engineering overhead.
Related: Feast
H
Handler-Based Architecture
Definition: Streaming pattern where handlers process state changes per entity.
Context: Common in real-time enrichment, personalization.
Cost Impact: State management overhead at scale. Consider stateful vs stateless processing.
Horizontal Partitioning (Sharding)
Definition: Splitting data across multiple nodes by partition key.
Context: Enables linear scalability. Critical for high-write throughput.
Challenges: Choosing partition key, hotspots, rebalancing.
Cost Impact: Enables cheaper horizontal scale vs. vertical scale.
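A minimal sketch of a stable key-to-shard mapping (shard count and hash choice are illustrative; md5 is used because Python's built-in hash() is not stable across processes):

```python
import hashlib

NUM_SHARDS = 8

def shard_for(key: str) -> int:
    digest = hashlib.md5(key.encode()).hexdigest()  # deterministic across machines
    return int(digest, 16) % NUM_SHARDS

assert shard_for("user-42") == shard_for("user-42")  # same key always lands together
```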
I
Idempotency
Definition: Property where an operation can be applied multiple times with same result as single application.
Context: Essential for fault tolerance, exactly-once semantics, and retry logic.
Examples: Upserts, conditional writes, deduplication.
Cost Impact: Enables safe use of cheaper (unreliable) infrastructure like spot instances.
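The property in miniature: a keyed upsert applied twice leaves the same state as applying it once, which is what makes retries safe:

```python
state = {}

def upsert(store, key, value):
    store[key] = value  # keyed write: replaying it changes nothing
    return store

upsert(state, "order-1", {"status": "paid"})
upsert(state, "order-1", {"status": "paid"})  # duplicate delivery / retry
assert state == {"order-1": {"status": "paid"}}
```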
Apache Iceberg
Definition: Open table format for large analytic datasets with metadata evolution.
Context: Broad ecosystem adoption (Snowflake, Databricks, and BigQuery all support it). Focus on metadata scalability.
Key Differentiator: No lock-based metadata reads. Scales to billions of partitions.
Cost Impact: Reduced metadata query costs. Better for high-partition-count tables.
Related: Delta Lake, Apache Hudi
L
Late Data
Definition: Data arriving after the expected window (common in mobile/web events).
Context: Streaming challenge requiring handling strategies: drop, buffer, or backfill.
Solutions:
- Watermarks with allowed lateness
- Buffering windows
- Re-processing with correction
Cost Impact: Buffering increases storage and compute. Re-processing can be expensive.
LSM Tree (Log-Structured Merge Tree)
Definition: Data structure optimized for write-heavy workloads with background compaction.
Context: Foundation of Kafka, HBase, Cassandra, and many modern databases.
Trade-offs: Excellent write performance, read amplification from multiple levels.
Cost Impact: Compaction overhead required. Ideal for append-heavy workloads.
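A toy LSM sketch showing the write path (memtable, flush to sorted runs) and the read amplification of checking newest data first; real systems add a write-ahead log, bloom filters, and compaction:

```python
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable, self.runs, self.limit = {}, [], memtable_limit

    def put(self, key, value):
        self.memtable[key] = value                 # fast in-memory write
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))  # flush: immutable sorted run
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:                   # newest level first
            return self.memtable[key]
        for run in reversed(self.runs):            # read amplification: scan older runs
            for k, v in run:
                if k == key:
                    return v
        return None
```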
M
Materialized View
Definition: Pre-computed query result stored as a table, refreshed on schedule or trigger.
Context: Essential for query performance optimization. Frequently run queries return precomputed results instead of recomputing them on every read.
Trade-offs: Storage cost vs. compute cost. Refresh overhead vs. query speed.
Cost Impact: High-value optimization for frequently accessed aggregations. Reduces expensive compute.
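On engines without native materialized views, the same effect is often hand-rolled: precompute the aggregate into a table and refresh it from a scheduled job. A hedged Spark SQL sketch (table names illustrative; CREATE OR REPLACE TABLE assumes a v2/Delta-backed catalog):

```python
spark.sql("""
    CREATE OR REPLACE TABLE gold.daily_revenue AS
    SELECT dt, SUM(amount_cents) AS revenue_cents
    FROM silver.orders
    GROUP BY dt
""")  # rerun on a schedule; readers pay only a small table scan
```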
Metadata Catalog
Definition: Central repository of data assets, schemas, lineage, and ownership.
Context: Essential for data discovery, governance, and democratization.
Examples: Glue Data Catalog, Hive Metastore, Unity Catalog, Atlas.
Cost Impact: Small cost relative to value. Enables chargeback and usage-based allocation.
O
Open Table Formats (OTF)
Definition: Table abstraction layers bringing database-like capabilities to object storage.
The Big Three:
| Format | Best For | Owner |
|---|---|---|
| Delta Lake | Databricks, high concurrency | Linux Foundation (originated at Databricks) |
| Iceberg | Multi-cloud, high partition count | Apache |
| Hudi | Streaming upserts, CDC | Apache |
Context: Enabler of Lakehouse architecture. Replaces HDFS for analytics.
Cost Impact: Enables single-copy storage (eliminates warehouse duplication).
P
Parquet
Definition: Columnar storage format with compression and encoding optimizations.
Context: De facto standard for analytics. Efficient compression (often 5-10x smaller than CSV).
Advantages:
- Columnar projection (read only needed columns)
- Predicate pushdown (skip row groups)
- Compression (Snappy, Zstd, Gzip)
- Schema evolution (add columns)
Cost Impact: 80-90% storage cost reduction vs. CSV/JSON. Critical for cloud storage economics.
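A quick demonstration of compressed storage and column projection with pandas and pyarrow (the file name is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": range(1000),
    "country": ["US"] * 1000,
    "spend": [1.0] * 1000,
})
df.to_parquet("events.parquet", compression="zstd")  # encoded + compressed columns

# Reading one column touches only that column's pages on disk.
spend_only = pd.read_parquet("events.parquet", columns=["spend"])
```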
Partitioning
Definition: Splitting data into subdirectories based on column values.
Common Strategies:
- Date-based: dt=2025-01-27/ (most common)
- Hash-based: Consistent distribution
- Range-based: Sequential data
- Bucketing: Fixed number of buckets
Context: Query performance optimization. Enables partition pruning (skip irrelevant data).
Senior Gotcha: Over-partitioning creates small files. Under-partitioning causes full scans.
Cost Impact: Proper partitioning reduces query costs by 10-100x. Wrong strategy increases costs.
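What a date-based layout looks like from PySpark; events_df and the paths are assumptions:

```python
# Produces s3://lake/events/dt=2025-01-27/part-*.parquet, one directory per day.
(events_df
    .write.mode("overwrite")
    .partitionBy("dt")
    .parquet("s3://lake/events/"))

# Partition pruning: only the matching dt= directories are listed and read.
jan_27 = spark.read.parquet("s3://lake/events/").where("dt = '2025-01-27'")
```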
Predicate Pushdown
Definition: Pushing filter operations to storage layer to minimize data read.
Context: Parquet/Iceberg/Delta support. Minimizes I/O for analytical queries.
Example: SELECT * FROM table WHERE date > '2025-01-01' reads only matching files.
Cost Impact: Can reduce query costs by 90%+ for selective queries.
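A sketch of pushdown with pyarrow.dataset, assuming a string-typed date column and an accessible path:

```python
import pyarrow.dataset as ds

dataset = ds.dataset("s3://lake/events/", format="parquet")

# The filter is checked against Parquet row-group statistics, so
# non-matching row groups (and whole files) are never read.
recent = dataset.to_table(filter=ds.field("date") > "2025-01-01")
```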
S
Schema Evolution
Definition: Ability to modify table schema without breaking existing queries.
Types:
- Add column: Safe (backward compatible)
- Drop column: Risky (breaking change)
- Rename: Complex (requires metadata)
- Type change: Risky (may break)
Context: Critical for long-lived data products. OTFs support safe evolution.
Cost Impact: Enables gradual migration vs. expensive full rewrites.
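Additive evolution on a Delta table, sketched with the delta-spark writer (new_orders_df and the table name are assumed); mergeSchema admits the new column instead of failing the write:

```python
(new_orders_df                       # contains a new 'coupon_code' column
    .write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # additive change: safe, backward compatible
    .saveAsTable("silver.orders"))
```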
Schema-on-Read vs Schema-on-Write
| Approach | Definition | Best For |
|---|---|---|
| Schema-on-Read | Schema applied when querying | Data lakes, flexibility |
| Schema-on-Write | Schema enforced at write | Warehouses, quality |
Context: Trade-off between flexibility (lake) and quality (warehouse).
Serverless Data Processing
Definition: Compute that automatically scales to zero when not in use.
Context: AWS Lambda, Glue Serverless, Databricks Serverless. Ideal for intermittent workloads.
Cost Impact: Pay-per-query vs. provisioned clusters. Optimal for variable/low-concurrency workloads.
Small Files Problem
Definition: Performance degradation from having too many small files.
Impacts:
- Metadata overhead (NameNode, Hive Metastore)
- Query planning overhead
- Inefficient I/O (many small reads)
Solutions:
- Compaction (merge small files)
- Proper partitioning
- Optimal file size (128 MB to 1 GB for Parquet)
Cost Impact: Can increase query costs 10-100x. Critical for cost optimization.
Related: Compaction, Module 7
Snowflake
Definition: Cloud-native data warehouse separating compute and storage.
Context: Popular for simplicity and performance. Premium pricing.
Architecture:
- Multi-cluster warehouses (compute)
- Micro-partitions (storage)
- Automatic clustering (optimization)
Cost Impact: Premium vs. open-source stack. Value proposition in reduced ops.
Structured Streaming
Definition: Streaming API treating streams as unbounded tables.
Context: Spark Structured Streaming, Flink SQL. Unifies batch and streaming.
Benefits: Single codebase for batch and streaming. Easier mental model.
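The unbounded-table model in a minimal PySpark example (the classic socket word-count shape; host and port are placeholders):

```python
lines = (spark.readStream
         .format("socket")
         .option("host", "localhost").option("port", 9999)
         .load())

# Same DataFrame API as batch: 'counts' is a continuously updated table.
counts = lines.groupBy("value").count()

query = (counts.writeStream
         .outputMode("complete")   # emit the full updated result table
         .format("console")
         .start())
```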
Streaming vs Batch
| Dimension | Batch | Streaming |
|---|---|---|
| Latency | Hours/Minutes | Seconds/Milliseconds |
| Complexity | Lower | Higher |
| Cost | Generally cheaper | Generally more expensive |
| Use Case | Historical analysis, ETL | Real-time analytics, CDC |
Context: Choice depends on business requirements, not technology preference.
Cost Impact: Streaming has constant compute overhead. Batch can use spot instances aggressively.
T
Time Travel
Definition: Ability to query previous versions of data.
Context: Feature of Delta Lake, Iceberg, Hudi. Critical for audit, debugging, ML reproducibility.
Retention: Typically 30 days (configurable). Long retention increases storage.
Cost Impact: Storage cost for retained versions. Often worth it for debugging/audit savings.
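Reading earlier snapshots of a Delta table by version or timestamp (the path is illustrative):

```python
v3 = (spark.read.format("delta")
      .option("versionAsOf", 3)                     # a specific commit
      .load("s3://lake/silver/orders"))

yesterday = (spark.read.format("delta")
             .option("timestampAsOf", "2025-01-26")  # nearest earlier snapshot
             .load("s3://lake/silver/orders"))
```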
Transformations
Definition: Data processing logic converting raw to curated data.
Patterns:
- ETL: Extract, Transform, Load (transform before loading)
- ELT: Extract, Load, Transform (load raw, then transform using warehouse/lakehouse compute)
Context: ELT dominates in modern lakehouse (dbt, Spark SQL).
V
Vectorization
Definition: Processing data in batches (CPU cache lines) rather than row-by-row.
Context: CPU-level optimization in Parquet, Arrow, DuckDB, Spark.
Benefits: 10-100x performance improvement for analytical queries.
Cost Impact: Better hardware utilization = lower compute cost per query.
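The row-at-a-time vs. batched contrast in miniature with NumPy:

```python
import numpy as np

prices = np.random.rand(1_000_000)

total_loop = 0.0
for p in prices:          # row-by-row: Python interpreter overhead per element
    total_loop += p * 1.08

total_vec = (prices * 1.08).sum()  # vectorized: one batched kernel over contiguous memory
```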
View vs Materialized View
| Type | Storage | Refresh | Use Case |
|---|---|---|---|
| View | None (logical) | Real-time | Simplification, security |
| Materialized View | Yes (physical) | On schedule | Performance optimization |
Context: Views cost nothing to store but recompute on every query; materialized views trade storage and refresh compute for read speed.
W
Watermark
Definition: Timestamp threshold in streaming defining when to stop waiting for late data.
Context: Balances completeness vs. latency. Required for windowed aggregations.
Trade-offs:
- Tight watermark: Faster results, more late data
- Loose watermark: More complete, slower results
Cost Impact: Buffering for late data increases state size and compute.
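How the threshold is expressed in Spark Structured Streaming; events_stream and the column names are assumptions:

```python
from pyspark.sql import functions as F

counts = (events_stream
    .withWatermark("event_time", "10 minutes")      # wait up to 10 min for late events
    .groupBy(F.window("event_time", "5 minutes"))   # then finalize each window
    .count())
```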
Windowing Strategies
Types:
- Tumbling: Fixed-size, non-overlapping
- Sliding: Fixed-size, overlapping
- Session: Activity-based, variable size
Context: Streaming aggregation patterns. Choice affects latency and resource usage.
Cost Impact: Sliding windows = 2-10x state vs. tumbling.
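The three shapes side by side in PySpark (events and the column names are illustrative):

```python
from pyspark.sql import functions as F

tumbling = events.groupBy(F.window("ts", "10 minutes")).count()              # no overlap
sliding = events.groupBy(F.window("ts", "10 minutes", "5 minutes")).count()  # 50% overlap
sessions = events.groupBy("user_id", F.session_window("ts", "30 minutes")).count()  # gap-based
```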
Z
Z-Ordering
Definition: Multi-dimensional clustering technique optimizing for common query patterns.
Context: Alternative to Hive-style partitioning. Enables efficient multi-column queries.
How it Works:
- Interleaves bits from multiple columns
- Sorts data by Z-order value
- Enables data skipping on multiple columns
Example: Z-order on (country, date) enables efficient queries filtering on either or both.
Cost Impact: Write overhead (roughly 1.5-2x) in exchange for 5-10x or greater speedups on filtered queries. Worth it for read-heavy workloads.
Related: Partitioning, Module 7
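On Delta tables this is a one-line maintenance command (OSS Delta 2.0+ or Databricks; table and column names illustrative):

```python
spark.sql("OPTIMIZE events ZORDER BY (country, date)")  # rewrite files clustered on both columns
```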
Acronym Quick Reference
| Acronym | Full Term | Context |
|---|---|---|
| ACID | Atomicity, Consistency, Isolation, Durability | Transactional guarantees |
| BASE | Basically Available, Soft state, Eventually consistent | Distributed system trade-offs |
| CAP | Consistency, Availability, Partition tolerance | Distributed system theorem |
| CDC | Change Data Capture | Real-time data replication |
| CI/CD | Continuous Integration/Deployment | Automation for data pipelines |
| COW | Copy-on-Write | Storage strategy (immutable updates) |
| MOR | Merge-on-Read | Storage strategy (deferred merge) |
| OTF | Open Table Format | Lakehouse enabler |
| RAG | Retrieval Augmented Generation | LLM architecture |
| SLA/SLO/SLI | Service Level Agreement/Objective/Indicator | Reliability metrics |
| WAP | Write-Audit-Publish | Data quality pattern |
Cost Optimization Keywords
Terms that trigger cost considerations:
| Term | Cost Dimension |
|---|---|
| Hot/Warm/Cold | Storage tiering |
| Spot/Preemptible | Compute discount (60-90%) |
| Compaction | Write amplification |
| Partitioning | Query cost multiplier |
| Z-Ordering | Write overhead, query savings |
| Caching | Storage vs compute trade-off |
| Serverless | OpEx model, variable cost |
| Concurrency | Cluster sizing |
| Chaos Engineering | Failure cost vs. prevention |
| FinOps | Cost governance |
Last Updated: 2025 | Terms evolve with technology