
Z-Ordering Clustering

Multi-Dimensional Data Skipping


Overview

Z-Ordering clusters data by multiple columns to maximize data skipping for multi-dimensional queries. It’s especially effective for point queries and range queries across multiple columns.


Z-Ordering Architecture

How Z-Ordering Works
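
Z-Ordering maps the values of several columns onto a single space-filling curve (the Z-order, or Morton, curve) by interleaving the bits of each column's value and laying rows out in curve order. Rows that are close in any clustered dimension land near each other on the curve, so they end up in the same files and the per-file min/max statistics stay tight for every Z-Order column, which is what enables data skipping. The sketch below illustrates the bit-interleaving idea for two small integer keys; it is a simplified illustration of the concept, not the engine's internal implementation.

# Z-order (Morton) bit interleaving, illustrative only
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y into a single Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
    return z

# Sorting rows by the interleaved value keeps rows that are close in
# either dimension close together on disk.
rows = [(3, 7), (3, 8), (120, 7), (121, 8)]
for x, y in sorted(rows, key=lambda r: interleave_bits(r[0], r[1])):
    print(x, y, bin(interleave_bits(x, y)))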

Z-Order Benefits:

  • Multi-dimensional: Cluster by multiple columns simultaneously
  • Data skipping: typically 50-80% I/O reduction on selective multi-column filters
  • Point queries: Fast exact match lookups
  • Range queries: Efficient range scans

Z-Order vs. Sorting

Comparison

Trade-offs:

| Aspect | Single-Column Sort | Z-Order |
|---|---|---|
| Primary column | Optimal | Good |
| Secondary columns | Poor | Good |
| Complexity | Low | Medium |
| Maintenance | Simple | Requires re-Z-Order |
| Use case | Single-column queries | Multi-column queries |

Z-Order Implementation

Delta Lake Z-Ordering

# Z-Ordering with Delta Lake
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder \
    .appName("ZOrdering") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()


def zorder_delta_table(
    table_path: str,
    zorder_columns: list
):
    """Z-Order a Delta Lake table by the given columns."""
    delta_table = DeltaTable.forPath(spark, table_path)

    # Z-Order by multiple columns
    delta_table.optimize() \
        .executeZOrderBy(*zorder_columns)

    print(f"Z-Ordering complete for {table_path}")
    print(f"Z-Order columns: {zorder_columns}")


# Example usage
zorder_delta_table(
    table_path="s3://bucket/delta/sales/",
    zorder_columns=["order_date", "customer_id", "product_id"]
)

# Query benefits
# Fast queries filtering on any Z-Order column combination:
# - WHERE order_date = '2025-01-27'
# - WHERE customer_id = 12345
# - WHERE order_date = '2025-01-27' AND customer_id = 12345
# - WHERE order_date >= '2025-01-01' AND customer_id IN (...)
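
The same operation is available through Spark SQL, assuming a Delta Lake version that supports the OPTIMIZE ... ZORDER BY statement and a table registered in the catalog under the (hypothetical) name sales:

# Equivalent Spark SQL form
spark.sql("""
    OPTIMIZE sales
    ZORDER BY (order_date, customer_id, product_id)
""")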

Iceberg Z-Ordering

# Z-Ordering with Apache Iceberg
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("IcebergZOrdering") \
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.my_catalog.type", "hadoop") \
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://bucket/warehouse") \
    .getOrCreate()


def zorder_iceberg_table(
    table: str,
    zorder_columns: list
):
    """Rewrite an Iceberg table sorted by the given columns.

    Note: sortWithinPartitions applies a hierarchical (lexicographic)
    multi-column sort, which approximates Z-Ordering but is not a true
    Z-order curve; columns after the first benefit less from skipping.
    """
    spark.read \
        .format("iceberg") \
        .load(table) \
        .sortWithinPartitions(*zorder_columns) \
        .write \
        .format("iceberg") \
        .mode("overwrite") \
        .save(table)

    print(f"Rewrite complete for {table}")
    print(f"Sort columns: {zorder_columns}")


# Example usage
zorder_iceberg_table(
    table="my_catalog.analytics.sales",
    zorder_columns=["order_date", "customer_id"]
)
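
Recent Iceberg releases also expose Z-ordering natively through the rewrite_data_files maintenance procedure with a zorder sort order, which produces a true Z-order layout rather than a hierarchical sort. A minimal sketch, assuming Iceberg 1.x with Spark procedures enabled and the catalog/table names used above:

# True Z-order rewrite via Iceberg's Spark procedure
spark.sql("""
    CALL my_catalog.system.rewrite_data_files(
        table => 'analytics.sales',
        strategy => 'sort',
        sort_order => 'zorder(order_date, customer_id)'
    )
""")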

Z-Order Column Selection

Column Selection Strategy

Selection Guidelines

| Query Pattern | Z-Order Columns | Example |
|---|---|---|
| Time-series | timestamp, id | event_time, user_id |
| Point queries | id1, id2, id3 | customer_id, product_id, region |
| Range queries | date, category | order_date, product_category |
| Multi-filter | top 3 filter cols | col_a, col_b, col_c |

Column Priority

# Z-Order column priority
# 1. Most frequently filtered columns
# - WHERE clauses
# - JOIN keys
# - GROUP BY columns
# 2. High-cardinality columns
# - More selective = better skipping
# 3. Query column combinations
# - Which columns are queried together?
# Example query analysis
"""
Frequent queries:
1. SELECT * FROM sales WHERE order_date = ? AND customer_id = ?
2. SELECT * FROM sales WHERE product_id = ? AND region = ?
3. SELECT * FROM sales WHERE order_date BETWEEN ? AND ?
Z-Order columns: order_date, customer_id, product_id, region
"""

Z-Order Maintenance

Continuous Z-Ordering

# Continuous Z-Ordering strategy
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from delta.tables import DeltaTable

spark = SparkSession.builder \
    .appName("ContinuousZOrdering") \
    .getOrCreate()


def schedule_zordering(
    table_path: str,
    zorder_columns: list,
    threshold_pct: float = 0.3  # re-cluster once 30% of rows are new
):
    """Run Z-Ordering when the share of data written since the last
    OPTIMIZE exceeds the threshold."""
    delta_table = DeltaTable.forPath(spark, table_path)

    # Find the most recent OPTIMIZE operation in the table history
    history = delta_table.history()
    last_optimize = history.filter("operation = 'OPTIMIZE'")

    if last_optimize.count() == 0:
        # Never optimized, run now
        should_optimize = True
    else:
        # Check whether data written since the last OPTIMIZE exceeds the threshold.
        # NOTE: this assumes the table carries a commit-timestamp column
        # (here '__commit_timestamp') populated by the ingestion pipeline;
        # adjust to whatever write-time metadata your table tracks.
        last_optimize_time = last_optimize.select("timestamp").first()[0]
        new_data_count = delta_table.toDF().filter(
            col("__commit_timestamp") > last_optimize_time
        ).count()
        total_count = delta_table.toDF().count()
        new_data_pct = new_data_count / total_count
        should_optimize = new_data_pct >= threshold_pct

    if should_optimize:
        print(f"Running Z-Order for {table_path}")
        delta_table.optimize().executeZOrderBy(*zorder_columns)
    else:
        print(f"Z-Order not needed for {table_path}")


# Example usage
schedule_zordering(
    table_path="s3://bucket/delta/sales/",
    zorder_columns=["order_date", "customer_id"],
    threshold_pct=0.3
)
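
In practice, a check like this usually runs as a scheduled maintenance job (for example a nightly orchestration task per table), so the cost of rewriting files is paid only when enough new data has arrived to degrade the clustering.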

Z-Order Performance

Performance Analysis

# Z-Order performance benchmark
from pyspark.sql import SparkSession
from delta.tables import DeltaTable
import time

spark = SparkSession.builder \
    .appName("ZOrderBenchmark") \
    .getOrCreate()


def benchmark_zorder(
    table_path: str,
    query: str,
    zorder_columns: list
) -> dict:
    """Benchmark query performance before and after Z-Ordering.

    Run on a warm cluster and repeat the measurements several times;
    caching and cluster warm-up can otherwise skew the comparison.
    """
    # Query without Z-Order
    start = time.time()
    spark.sql(query).count()
    time_before = time.time() - start

    # Apply Z-Ordering
    delta_table = DeltaTable.forPath(spark, table_path)
    delta_table.optimize().executeZOrderBy(*zorder_columns)

    # Query with Z-Order
    start = time.time()
    spark.sql(query).count()
    time_after = time.time() - start

    return {
        'time_before': time_before,
        'time_after': time_after,
        'improvement_pct': (time_before - time_after) / time_before * 100,
        'zorder_columns': zorder_columns
    }


# Example results (typical)
# time_before: 120 seconds
# time_after: 30 seconds
# improvement_pct: 75%
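
Wall-clock time is a coarse signal; it also helps to look at file-level effects, since Z-Ordering rewrites data into fewer, better-clustered files. A quick check with Delta's DESCRIBE DETAIL, run before and after the optimize:

# Inspect file-level effects of Z-Ordering (Delta Lake)
detail = spark.sql("DESCRIBE DETAIL delta.`s3://bucket/delta/sales/`")
detail.select("numFiles", "sizeInBytes").show()
# Fewer, larger files with tighter per-file min/max statistics on the
# Z-Order columns are what make data skipping effective.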

Z-Order Best Practices

DO

# 1. Z-Order by frequently filtered columns
# Maximizes skipping
# 2. Limit to 3-5 columns
# Diminishing returns after 5
# 3. Re-Z-Order after significant writes
# Maintain clustering
# 4. Use with partitioning
# Partition + Z-Order for best results
# 5. Monitor effectiveness
# Track query improvement

DON’T

# 1. Don't Z-Order all columns
# Too many = diminishing returns
# 2. Don't Z-Order high-cardinality random columns
# No clustering benefit
# 3. Don't forget to re-Z-Order
# Data becomes fragmented
# 4. Don't Z-Order rarely filtered columns
# No query benefit
# 5. Don't ignore cost
# Z-Ordering is expensive

Key Takeaways

  1. Multi-dimensional: Cluster by multiple columns simultaneously
  2. Data skipping: typically 50-80% I/O reduction on selective multi-column filters
  3. Delta Lake: OPTIMIZE ZORDER BY command
  4. Iceberg: rewrite_data_files with a zorder sort order (sortWithinPartitions only approximates it)
  5. Column selection: 3-5 frequently filtered columns
  6. Maintenance: Re-Z-Order after significant writes
  7. Combination: Use with partitioning for best results
  8. Use When: Multi-column queries, point lookups, range queries
