Principal Data Engineering Glossary

Essential Terminology for Architecture Discussions


Using This Glossary

  • Definitions are concise, Principal-level explanations
  • Context indicates when/why the term matters
  • Related Terms link to associated concepts
  • Cost Implications highlight financial considerations

A

ACID Transactions

Definition: Atomicity, Consistency, Isolation, Durability - guarantees for reliable database transactions.

Context: Critical for financial systems, data warehousing, and any scenario requiring exactly-once semantics. Open Table Formats (Delta, Iceberg, Hudi) bring ACID to data lakes.

Trade-offs: Strong consistency comes with performance overhead. At PB scale, isolation levels become expensive.

Cost Impact: Higher compute overhead, potential for hotspotting in distributed systems.

Related: BASE, Open Table Formats

Apache Arrow

Definition: In-memory columnar format enabling zero-copy data sharing between systems.

Context: Eliminates serialization overhead between Pandas, Polars, DuckDB, Spark. Foundation for modern data interoperability.

Cost Impact: Reduced memory pressure, faster query startup times.

Related: Parquet, Vectorization
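
Sketch: a minimal example of Arrow-mediated interchange, assuming pyarrow, pandas, and polars are installed; column names are illustrative.

```python
import pandas as pd
import polars as pl
import pyarrow as pa

# Convert once into Arrow's columnar memory layout.
pdf = pd.DataFrame({"user_id": [1, 2, 3], "spend": [9.5, 3.2, 7.1]})
table = pa.Table.from_pandas(pdf)

# Polars wraps the same Arrow buffers -- no serialize/deserialize step.
pldf = pl.from_arrow(table)
print(pldf.select(pl.col("spend").sum()))
```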

Autoloader / Auto-ingestion

Definition: Incremental data ingestion that automatically detects new files without manual schema management.

Context: Essential for streaming ingestion from cloud storage (S3, GCS, ADLS). Reduces operational overhead.

Senior Gotcha: File listing operations at scale can become expensive. Use directory listing optimizations or queue-based notification (SQS, Pub/Sub).

Cost Impact: Reduced API calls vs. full scans, but requires queue infrastructure.
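
Sketch: a Databricks Auto Loader stream, assuming a Databricks runtime where spark is preconfigured; the bucket paths are placeholders.

```python
# Auto Loader reads only files it has not seen before; schemaLocation stores
# the inferred schema and tracks its evolution between runs.
stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Queue-based discovery (SQS/Pub/Sub) instead of costly directory listing:
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_schemas/events")
    .load("s3://my-bucket/raw/events/")
)
```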


B

BASE Transactions

Definition: Basically Available, Soft state, Eventually consistent - alternative to ACID for distributed systems.

Context: Acceptable for analytics, clickstream, and scenarios where eventual consistency is tolerable.

Trade-offs: Higher availability and partition tolerance at the cost of immediate consistency.

Cost Impact: Lower coordination overhead, better horizontal scalability.

Related: ACID, CAP Theorem

Bronze/Silver/Gold Tables

Definition: Multi-stage data quality pattern where data progresses through quality tiers.

| Tier   | Purpose             | Quality   | Use Case                 |
|--------|---------------------|-----------|--------------------------|
| Bronze | Raw ingestion       | As-landed | Audit, replay capability |
| Silver | Cleaned/typed       | Validated | Analytics, ML features   |
| Gold   | Aggregated/business | Certified | BI, dashboards, products |

Context: Medallion architecture popularized by Databricks. Enables gradual improvement of data quality as data moves through the tiers.

Cost Impact: Multiple storage copies (2-3x storage cost). Optimize with views where appropriate.
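
Sketch: a minimal PySpark walk through the tiers, assuming an active SparkSession (spark) with Delta Lake available; paths and columns are illustrative.

```python
from pyspark.sql import functions as F

# Bronze: land raw data as-is for audit and replay.
bronze = spark.read.json("s3://my-bucket/raw/orders/")
bronze.write.format("delta").mode("append").save("s3://my-bucket/bronze/orders")

# Silver: enforce types and basic validity.
silver = (
    spark.read.format("delta").load("s3://my-bucket/bronze/orders")
    .withColumn("amount", F.col("amount").cast("decimal(12,2)"))
    .filter(F.col("order_id").isNotNull())
)
silver.write.format("delta").mode("overwrite").save("s3://my-bucket/silver/orders")

# Gold: certified business aggregate for BI.
gold = silver.groupBy("country").agg(F.sum("amount").alias("revenue"))
gold.write.format("delta").mode("overwrite").save("s3://my-bucket/gold/revenue_by_country")
```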


C

CAP Theorem

Definition: In a distributed system, you can only guarantee 2 of 3: Consistency, Availability, Partition tolerance.

Context: Every distributed data system makes CAP trade-offs. Understanding them is critical for architecture decisions.

Principal Insight: In cloud systems, partitions are inevitable (networks fail), so P is not optional. The real choice is C vs. A.

Related: ACID, BASE

Cloud-Native Architecture

Definition: System design leveraging cloud-specific capabilities (object storage, managed services, auto-scaling).

Context: NOT just “running in the cloud.” True cloud-native uses storage disaggregation, serverless compute, and managed services.

Cost Impact: OpEx over CapEx, pay-for-use models, reduced operational overhead.

Columnar Storage

Definition: Data stored by column rather than row, enabling efficient compression and predicate pushdown.

Context: Foundation of modern analytics (Parquet, ORC, Kudu). Essential for query performance.

Cost Impact: Better compression = lower storage costs. Faster queries = lower compute costs.

Related: Parquet, Predicate Pushdown

Compaction

Definition: Merging small files into larger ones to address the small files problem.

Context: Continuous background process in streaming systems. Critical for query performance and metadata management.

Senior Gotcha: Aggressive compaction increases write amplification. Too passive = thousands of small files = slow queries and NameNode overload.

Cost Impact: Compute trade-off for query performance. Spot instances ideal for compaction workloads.

Related: Small Files Problem, Z-Ordering
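
Sketch: one common compaction pattern in PySpark, assuming an active SparkSession; on Delta tables this job is usually replaced by the built-in OPTIMIZE command. Paths and the file count are illustrative.

```python
# Rewrite many small files into fewer large ones (target ~128 MB-1 GB each).
df = spark.read.parquet("s3://my-bucket/silver/events/")
num_files = 32  # illustrative: roughly total_bytes / target_file_size
(
    df.repartition(num_files)
    .write.mode("overwrite")
    .parquet("s3://my-bucket/silver/events_compacted/")
)
```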

Compute-Storage Separation

Definition: Architecture where compute and storage scale independently.

Context: Fundamental to cloud data platforms. Enables storage in S3/GCS with compute from Spark/Trino/Presto.

Cost Impact: Optimal for variable workloads. Can provision compute for query spikes without over-provisioning storage.

Related: Lakehouse

Containerization

Definition: Packaging applications with dependencies into containers (Docker) for consistent execution.

Context: Enables reproducible data pipelines, easier local development, and Kubernetes orchestration.

Cost Impact: Better resource utilization, easier rightsizing.


D

Data Contract

Definition: Formal agreement between data producer and consumer defining schema, quality SLAs, and update frequency.

Context: Evolution of data governance. Shifts quality enforcement from downstream to upstream.

Components:

  • Schema definition
  • Data quality rules
  • Update SLA (latency, frequency)
  • Notification on failure

Cost Impact: Reduced downstream reprocessing costs, earlier failure detection.

Related: Data Quality
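
Sketch: a contract expressed as code, using a plain Python structure rather than any specific contract framework; field names and SLA values are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataContract:
    """Producer/consumer agreement: schema, quality rules, delivery SLA."""
    schema: dict             # column -> type
    quality_rules: list      # checks enforced at the producer side
    max_latency_minutes: int # delivery SLA
    on_failure_notify: str   # escalation channel

orders_contract = DataContract(
    schema={"order_id": "string", "amount": "decimal(12,2)", "ts": "timestamp"},
    quality_rules=["order_id is not null", "amount >= 0"],
    max_latency_minutes=60,
    on_failure_notify="#data-orders-alerts",
)
```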

Data Fabric

Definition: Unified architecture connecting data across on-prem and multi-cloud environments.

Context: Enterprise data integration pattern for hybrid scenarios. More centralized than Data Mesh.

Cost Impact: Reduced data movement, better utilization of existing infrastructure.

Data Lake

Definition: Centralized repository storing raw data in native format (object storage).

Context: Evolution from data warehouse. Enables schema-on-read, stores all data types.

Problems: Lack of transactions, schema enforcement, quality controls → led to Data Swamp.

Evolution: Lakehouse architecture adds warehouse capabilities to lakes.

Data Lakehouse

Definition: Architecture combining data lake flexibility with warehouse ACID transactions and schema enforcement.

Context: Modern data architecture standard. Enabled by Open Table Formats.

Benefits: Single copy of data, warehouse reliability at lake cost.

Cost Impact: Eliminates duplicate storage (lake + warehouse). Primary cost saving of modern architecture.

Related: Delta Lake, Iceberg, Hudi

Data Mesh

Definition: Decentralized data architecture treating data as a product with domain-oriented ownership.

Context: Alternative to centralized data platform. Empowers domains to own their data products.

Principles:

  1. Domain ownership
  2. Data as a product
  3. Self-serve infrastructure platform
  4. Federated governance

Trade-offs: Better agility/scale vs. governance complexity. Not ideal for small organizations.

Cost Impact: Can reduce central team costs but increases cross-domain coordination overhead.

Data Vault 2.0

Definition: Data modeling methodology with Hub, Link, and Satellite structures for auditability and flexibility.

Context: Alternative to Kimball for enterprise warehousing with strong audit trails.

Cost Impact: More storage overhead than star schema, but better adaptability to change.

Databricks

Definition: Multi-cloud platform for data engineering, analytics, and ML.

Context: Unified platform for Spark, Delta Lake, MLflow, and SQL warehouses.

Cost Impact: Premium pricing over open-source equivalents. Justified for unified experience and reduced operational overhead.

Delta Lake

Definition: Open table format bringing ACID transactions, schema enforcement, and time travel to data lakes.

Context: Most popular OTF (Databricks native, open source). Optimized for Databricks but works anywhere.

Key Features:

  • ACID transactions (optimized for concurrency)
  • Schema enforcement and evolution
  • Time travel (data versioning)
  • Deletion vectors (for merge/update performance)

Cost Impact: Small file management reduces compute costs. Deletion vectors drastically reduce merge overhead.

Related: Apache Iceberg, Apache Hudi
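
Sketch: Delta time travel in PySpark, assuming an active SparkSession with Delta Lake; the table path is a placeholder.

```python
# Read older snapshots of a Delta table (versionAsOf / timestampAsOf are
# standard Delta Lake reader options).
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tables/orders")
jan26 = (
    spark.read.format("delta")
    .option("timestampAsOf", "2025-01-26")
    .load("/tables/orders")
)
```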


E

Event Sourcing

Definition: Storing state as a sequence of events rather than current state.

Context: Essential for event-driven architectures, streaming systems, and audit trails.

Benefits: Complete audit history, replay capability, temporal queries.

Cost Impact: Higher storage (event log + projected state). Enables efficient compression with columnar formats.
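
Sketch: the core idea in plain Python; the account events are illustrative.

```python
from functools import reduce

# The event log is the source of truth; state is a fold over the events.
events = [
    {"type": "deposit", "amount": 100},
    {"type": "withdraw", "amount": 30},
    {"type": "deposit", "amount": 5},
]

def apply_event(balance: int, event: dict) -> int:
    """Pure transition function: state + event -> next state."""
    if event["type"] == "deposit":
        return balance + event["amount"]
    if event["type"] == "withdraw":
        return balance - event["amount"]
    return balance

balance = reduce(apply_event, events, 0)  # replay from the start -> 75
```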

Exactly-Once Semantics

Definition: Guarantee that each event is processed exactly once (no duplicates, no losses).

Context: Holy grail of streaming. Achieved with idempotent sinks + checkpointing.

Trade-offs: Requires coordination, checkpointing overhead.

Cost Impact: Checkpoint storage, coordination overhead. Worth it for financial/transactional systems.


F

Feature Store

Definition: Centralized repository for ML features enabling feature reuse across training and serving.

Context: Critical for MLOps at scale. Prevents training-serving skew.

Types:

  • Offline store (historical features for training)
  • Online store (low-latency serving)

Cost Impact: Duplicate storage for offline/online. Reduces feature engineering overhead.

Related: Feast


H

Handler-Based Architecture

Definition: Streaming pattern where handlers process state changes per entity.

Context: Common in real-time enrichment, personalization.

Cost Impact: State management overhead at scale. Consider stateful vs stateless processing.

Horizontal Partitioning (Sharding)

Definition: Splitting data across multiple nodes by partition key.

Context: Enables linear scalability. Critical for high-write throughput.

Challenges: Choosing partition key, hotspots, rebalancing.

Cost Impact: Enables cheaper horizontal scale vs. vertical scale.


I

Idempotency

Definition: Property where an operation can be applied multiple times with the same result as a single application.

Context: Essential for fault tolerance, exactly-once semantics, and retry logic.

Examples: Upserts, conditional writes, deduplication.

Cost Impact: Enables safe use of cheaper (unreliable) infrastructure like spot instances.
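
Sketch: a toy idempotent sink in plain Python; names are illustrative. The same principle underlies upserts and conditional writes.

```python
# Idempotent sink: writes are keyed by a stable event_id, so a redelivered
# event overwrites identical data instead of creating a duplicate.
store: dict = {}

def write_event(event: dict) -> None:
    store[event["event_id"]] = event

evt = {"event_id": "e-42", "user": "ana", "amount": 10}
write_event(evt)
write_event(evt)  # retry after a timeout: same final state
assert len(store) == 1
```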

Iceberg Specification

Definition: Open table format for large analytic datasets with metadata evolution.

Context: Strong community adoption (Snowflake, Databricks, BigQuery all support). Focus on metadata scalability.

Key Differentiator: No lock-based metadata reads. Scales to billions of partitions.

Cost Impact: Reduced metadata query costs. Better for high-partition-count tables.

Related: Delta Lake, Apache Hudi


L

Late Data

Definition: Data arriving after the expected window (common in mobile/web events).

Context: Streaming challenge requiring handling strategies: drop, buffer, or backfill.

Solutions:

  • Watermarks with allowed lateness
  • Buffering windows
  • Re-processing with correction

Cost Impact: Buffering increases storage and compute. Re-processing can be expensive.

LSM Tree (Log-Structured Merge Tree)

Definition: Data structure optimized for write-heavy workloads with background compaction.

Context: Foundation of Kafka, HBase, Cassandra, and many modern databases.

Trade-offs: Excellent write performance; read amplification from multiple levels.

Cost Impact: Compaction overhead required. Ideal for append-heavy workloads.


M

Materialized View

Definition: Pre-computed query result stored as a table, refreshed on schedule or trigger.

Context: Essential for query performance optimization. Enables complex queries to execute instantly.

Trade-offs: Storage cost vs. compute cost. Refresh overhead vs. query speed.

Cost Impact: High-value optimization for frequently accessed aggregations. Reduces expensive compute.

Metadata Catalog

Definition: Central repository of data assets, schemas, lineage, and ownership.

Context: Essential for data discovery, governance, and democratization.

Examples: Glue Data Catalog, Hive Metastore, Unity Catalog, Atlas.

Cost Impact: Small cost relative to value. Enables chargeback and usage-based allocation.


O

Open Table Formats (OTF)

Definition: Table abstraction layers bringing database-like capabilities to object storage.

The Big Three:

| Format     | Best For                          | Owner            |
|------------|-----------------------------------|------------------|
| Delta Lake | Databricks, high concurrency      | Linux Foundation |
| Iceberg    | Multi-cloud, high partition count | Apache           |
| Hudi       | Streaming upserts, CDC            | Apache           |

Context: Enabler of Lakehouse architecture. Replaces HDFS for analytics.

Cost Impact: Enables single-copy storage (eliminates warehouse duplication).

Related: Lakehouse, Module 1


P

Parquet

Definition: Columnar storage format with compression and encoding optimizations.

Context: De facto standard for analytics. Efficient compression (10-100x vs CSV).

Advantages:

  • Columnar projection (read only needed columns)
  • Predicate pushdown (skip row groups)
  • Compression (Snappy, Zstd, Gzip)
  • Schema evolution (add columns)

Cost Impact: 80-90% storage cost reduction vs. CSV/JSON. Critical for cloud storage economics.

Related: ORC, Avro
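
Sketch: projection and pushdown with pyarrow; the file and columns are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"dt": ["2025-01-26", "2025-01-27"], "clicks": [10, 12]})
pq.write_table(table, "events.parquet", compression="zstd")

# Columnar projection reads only the requested column; the filter lets the
# reader skip row groups whose min/max statistics rule out a match.
subset = pq.read_table(
    "events.parquet",
    columns=["clicks"],
    filters=[("dt", ">", "2025-01-26")],
)
```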

Partitioning

Definition: Splitting data into subdirectories based on column values.

Common Strategies:

  • Date-based: dt=2025-01-27/ (most common)
  • Hash-based: Consistent distribution
  • Range-based: Sequential data
  • Bucket: Fixed number of partitions

Context: Query performance optimization. Enables partition pruning (skip irrelevant data).

Senior Gotcha: Over-partitioning creates small files. Under-partitioning causes full scans.

Cost Impact: Proper partitioning reduces query costs by 10-100x. Wrong strategy increases costs.
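
Sketch: date-based partitioning in PySpark, assuming a DataFrame df with a dt column; the path is a placeholder.

```python
# Hive-style, date-based layout: .../dt=2025-01-27/part-*.parquet
# Queries filtering on dt prune whole directories before any data is read.
(
    df.write
    .partitionBy("dt")
    .mode("overwrite")
    .parquet("s3://my-bucket/silver/events/")
)
```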

Predicate Pushdown

Definition: Pushing filter operations to storage layer to minimize data read.

Context: Parquet/Iceberg/Delta support. Minimizes I/O for analytical queries.

Example: SELECT * FROM table WHERE date > '2025-01-01' reads only matching files.

Cost Impact: Can reduce query costs by 90%+ for selective queries.


S

Schema Evolution

Definition: Ability to modify table schema without breaking existing queries.

Types:

  • Add column: Safe (backward compatible)
  • Drop column: Risky (breaking change)
  • Rename: Complex (requires metadata)
  • Type change: Risky (may break)

Context: Critical for long-lived data products. OTFs support safe evolution.

Cost Impact: Enables gradual migration vs. expensive full rewrites.
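
Sketch: additive evolution with Delta Lake (ALTER TABLE and mergeSchema are standard Delta features), assuming an active SparkSession; the table name, path, and new_batch DataFrame are placeholders.

```python
# Additive change on a Delta table (the safe case):
spark.sql("ALTER TABLE orders ADD COLUMNS (device_type STRING)")

# Or let a write widen the schema explicitly instead of failing:
(
    new_batch.write.format("delta")
    .option("mergeSchema", "true")
    .mode("append")
    .save("/tables/orders")
)
```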

Schema-on-Read vs Schema-on-Write

| Approach        | Definition                   | Best For                |
|-----------------|------------------------------|-------------------------|
| Schema-on-Read  | Schema applied when querying | Data lakes, flexibility |
| Schema-on-Write | Schema enforced at write     | Warehouses, quality     |

Context: Trade-off between flexibility (lake) and quality (warehouse).

Serverless Data Processing

Definition: Compute that automatically scales to zero when not in use.

Context: AWS Lambda, AWS Glue, Databricks Serverless. Ideal for intermittent workloads.

Cost Impact: Pay-per-query vs. provisioned clusters. Optimal for variable/low-concurrency workloads.

Small Files Problem

Definition: Performance degradation from having too many small files.

Impacts:

  • Metadata overhead (NameNode, Hive Metastore)
  • Query planning overhead
  • Inefficient I/O (many small reads)

Solutions:

  • Compaction (merge small files)
  • Proper partitioning
  • Optimal file size (128MB-1GB Parquet)

Cost Impact: Can increase query costs 10-100x. Critical for cost optimization.

Related: Compaction, Module 7

Snowflake

Definition: Cloud-native data warehouse separating compute and storage.

Context: Popular for simplicity and performance. Premium pricing.

Architecture:

  • Multi-cluster warehouses (compute)
  • Micro-partitions (storage)
  • Automatic clustering (optimization)

Cost Impact: Premium vs. open-source stack. Value proposition in reduced ops.

Structured Streaming

Definition: Streaming API treating streams as unbounded tables.

Context: Spark Structured Streaming, Flink SQL. Unifies batch and streaming.

Benefits: Single codebase for batch and streaming. Easier mental model.

Streaming vs Batch

| Dimension  | Batch                    | Streaming                |
|------------|--------------------------|--------------------------|
| Latency    | Hours/minutes            | Seconds/milliseconds     |
| Complexity | Lower                    | Higher                   |
| Cost       | Generally cheaper        | Generally more expensive |
| Use Case   | Historical analysis, ETL | Real-time analytics, CDC |

Context: Choice depends on business requirements, not technology preference.

Cost Impact: Streaming has constant compute overhead. Batch can use spot instances aggressively.


T

Time Travel

Definition: Ability to query previous versions of data.

Context: Feature of Delta Lake, Iceberg, Hudi. Critical for audit, debugging, ML reproducibility.

Retention: Typically 30 days (configurable). Long retention increases storage.

Cost Impact: Storage cost for retained versions. Often worth it for debugging/audit savings.

Transformations

Definition: Data processing logic converting raw to curated data.

Patterns:

  • ELT: Extract, Load, Transform (warehouse-first; the modern approach, leveraging warehouse/lakehouse compute)
  • ETL: Extract, Transform, Load (transform-before-load; the traditional approach)

Context: ELT dominates in modern lakehouse (dbt, Spark SQL).


V

Vectorization

Definition: Processing data in batches (CPU cache lines) rather than row-by-row.

Context: CPU-level optimization in Parquet, Arrow, DuckDB, Spark.

Benefits: 10-100x performance improvement for analytical queries.

Cost Impact: Better hardware utilization = lower compute cost per query.

View vs Materialized View

| Type              | Storage        | Refresh     | Use Case                 |
|-------------------|----------------|-------------|--------------------------|
| View              | None (logical) | Real-time   | Simplification, security |
| Materialized View | Yes (physical) | On schedule | Performance optimization |

Context: Views are free. Materialized views have storage cost.


W

Watermark

Definition: Timestamp threshold in streaming defining when to stop waiting for late data.

Context: Balances completeness vs. latency. Required for windowed aggregations.

Trade-offs:

  • Tight watermark: Faster results, more late data
  • Loose watermark: More complete, slower results

Cost Impact: Buffering for late data increases state size and compute.
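
Sketch: a watermark in Spark Structured Streaming, assuming a streaming DataFrame events with an event_time timestamp column.

```python
from pyspark.sql import functions as F

# Accept events up to 10 minutes late; state for older windows is dropped.
counts = (
    events.withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "page")
    .count()
)
```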

Windowing Strategies

Types:

  • Tumbling: Fixed-size, non-overlapping
  • Sliding: Fixed-size, overlapping
  • Session: Activity-based, variable size

Context: Streaming aggregation patterns. Choice affects latency and resource usage.

Cost Impact: Sliding windows = 2-10x state vs. tumbling.
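
Sketch: the three strategies as PySpark window expressions (session_window requires Spark 3.2+); the column name and durations are illustrative. Each would be used inside a groupBy on a streaming DataFrame.

```python
from pyspark.sql import functions as F

# Tumbling: fixed 10-minute buckets, each event lands in exactly one window.
tumbling = F.window("event_time", "10 minutes")

# Sliding: 10-minute windows starting every 2 minutes, so each event lands in
# ~5 windows -- the source of the extra state cost.
sliding = F.window("event_time", "10 minutes", "2 minutes")

# Session: a window per key that closes after 30 minutes of inactivity.
session = F.session_window("event_time", "30 minutes")
```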


Z

Z-Ordering

Definition: Multi-dimensional clustering technique optimizing for common query patterns.

Context: Alternative to Hive-style partitioning. Enables efficient multi-column queries.

How it Works:

  1. Interleaves bits from multiple columns
  2. Sorts data by Z-order value
  3. Enables data skipping on multiple columns

Example: Z-order on (country, date) enables efficient queries filtering on either or both.
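
Sketch: a toy bit-interleaving function in plain Python illustrating steps 1-2 above; engines such as Delta implement this internally (OPTIMIZE ... ZORDER BY).

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values into one Z-order key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
    return z

# Sorting by z_value keeps rows that are close in BOTH dimensions close on
# disk, so file-level min/max statistics can skip data for either filter.
rows = [(3, 7), (3, 8), (250, 1)]
rows.sort(key=lambda r: z_value(*r))
```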

Cost Impact: Write overhead for query performance. 10-100x query speedup for filtered queries.

Cost Consideration: 1.5-2x write cost for 5-10x query improvement. Worth it for read-heavy workloads.

Related: Partitioning, Module 7


Acronym Quick Reference

| Acronym     | Full Term                                              | Context                              |
|-------------|--------------------------------------------------------|--------------------------------------|
| ACID        | Atomicity, Consistency, Isolation, Durability          | Transactional guarantees             |
| BASE        | Basically Available, Soft state, Eventually consistent | Distributed system trade-offs        |
| CAP         | Consistency, Availability, Partition tolerance         | Distributed system theorem           |
| CDC         | Change Data Capture                                    | Real-time data replication           |
| CI/CD       | Continuous Integration/Deployment                      | Automation for data pipelines        |
| COW         | Copy-on-Write                                          | Storage strategy (immutable updates) |
| MOR         | Merge-on-Read                                          | Storage strategy (deferred merge)    |
| OTF         | Open Table Format                                      | Lakehouse enabler                    |
| RAG         | Retrieval Augmented Generation                         | LLM architecture                     |
| SLA/SLO/SLI | Service Level Agreement/Objective/Indicator            | Reliability metrics                  |
| WAP         | Write-Audit-Publish                                    | Data quality pattern                 |

Cost Optimization Keywords

Terms that trigger cost considerations:

| Term              | Cost Dimension                 |
|-------------------|--------------------------------|
| Hot/Warm/Cold     | Storage tiering                |
| Spot/Preemptible  | Compute discount (60-90%)      |
| Compaction        | Write amplification            |
| Partitioning      | Query cost multiplier          |
| Z-Ordering        | Write overhead, query savings  |
| Caching           | Storage vs. compute trade-off  |
| Serverless        | OpEx model, variable cost      |
| Concurrency       | Cluster sizing                 |
| Chaos Engineering | Failure cost vs. prevention    |
| FinOps            | Cost governance                |

Last Updated: 2025 | Terms evolve with technology