Feature Stores
Centralized Feature Management for ML Systems
Overview
Feature stores provide a centralized repository for storing, managing, and serving features for machine learning systems. They enable feature reuse, ensure consistency between training and inference, and provide governance for feature engineering.
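One way to picture the training/inference consistency mentioned above is a single transformation that is defined once and called from both the batch (training) path and the request-time (serving) path. The sketch below is illustrative only: `trip_features`, the raw field names, and the sample records are hypothetical, and a real feature store would register the transformation rather than import it directly.

```python
from typing import Dict, List


def trip_features(raw: Dict[str, float]) -> Dict[str, float]:
    """Single source of truth for the feature logic (hypothetical example)."""
    trips = raw["trips"]
    days = raw["days_active"]
    return {
        "conv_rate": raw["completed_trips"] / trips if trips else 0.0,
        "avg_daily_trips": trips / days if days else 0.0,
    }


def build_training_rows(history: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Offline path: apply the shared transformation to every historical record."""
    return [trip_features(record) for record in history]


def serve_features(request: Dict[str, float]) -> Dict[str, float]:
    """Online path: apply the same transformation to one incoming request."""
    return trip_features(request)


history = [
    {"trips": 40, "completed_trips": 30, "days_active": 10},
    {"trips": 0, "completed_trips": 0, "days_active": 0},
]
training_rows = build_training_rows(history)
online_row = serve_features(history[0])
```

Because both paths call the same function, the same raw record yields identical feature values in training and at serving time.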
Feature Store Architecture
Core Components
Key Components:
- Feature Definitions: Schema, metadata, documentation
- Data Ingestion: Batch, streaming, request-time features
- Storage: Online (low-latency) and offline (historical)
- Feature Registry: Version control, lineage, discovery
- Transformation: Feature transformation and validation
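The components listed above can be sketched as a toy in-memory store. This is a simplified illustration, not how any particular product is built: `FeatureDefinition` and `ToyFeatureStore` are hypothetical names, the offline store is a plain list holding the full history, and the online store is a dictionary holding the latest values per entity.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class FeatureDefinition:
    """Schema and metadata for one feature (a registry entry)."""
    name: str
    dtype: type
    description: str
    version: int = 1


@dataclass
class ToyFeatureStore:
    registry: Dict[Tuple[str, int], FeatureDefinition] = field(default_factory=dict)
    offline: List[Dict[str, Any]] = field(default_factory=list)      # full history
    online: Dict[Any, Dict[str, Any]] = field(default_factory=dict)  # latest per entity

    def register(self, definition: FeatureDefinition) -> None:
        self.registry[(definition.name, definition.version)] = definition

    def _latest(self, name: str) -> FeatureDefinition:
        versions = [d for (n, _), d in self.registry.items() if n == name]
        if not versions:
            raise KeyError(f"feature {name!r} is not registered")
        return max(versions, key=lambda d: d.version)

    def ingest(self, entity_id: Any, event_time: datetime, values: Dict[str, Any]) -> None:
        # Validation: every value must match a registered definition and its type.
        for name, value in values.items():
            definition = self._latest(name)
            if not isinstance(value, definition.dtype):
                raise TypeError(f"{name} expects {definition.dtype.__name__}")
        row = {"entity_id": entity_id, "event_time": event_time, **values}
        self.offline.append(row)
        current = self.online.get(entity_id)
        # Only newer (or equally recent) events update the online view.
        if current is None or event_time >= current["event_time"]:
            self.online[entity_id] = {**(current or {}), **row}

    def get_online(self, entity_id: Any, names: List[str]) -> Dict[str, Any]:
        row = self.online[entity_id]
        return {name: row[name] for name in names}


store = ToyFeatureStore()
store.register(FeatureDefinition("conv_rate", float, "Share of completed trips"))
store.register(FeatureDefinition("avg_daily_trips", int, "Mean trips per day"))
store.ingest(1, datetime(2024, 1, 1), {"conv_rate": 0.5, "avg_daily_trips": 10})
store.ingest(1, datetime(2024, 1, 2), {"conv_rate": 0.6, "avg_daily_trips": 12})
latest = store.get_online(1, ["conv_rate", "avg_daily_trips"])
```

The split mirrors the storage bullet: the offline list keeps every row for building training sets, while the online dictionary keeps only what a low-latency lookup needs.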
Feature Store Benefits
Why Use a Feature Store?
Key Benefits:
- Consistency: Same features in training and inference
- Reusability: Share features across models
- Version Control: Track feature changes over time
- Point-in-Time Correctness: Avoid data leakage
- Governance: Feature ownership, documentation, lineage
- Discovery: Search and browse features
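Point-in-time correctness, from the list above, can be illustrated without any feature store at all. The sketch below uses pandas `merge_asof` as a stand-in for the join a feature store performs. The data frames and their values are made up for illustration; the column names follow the driver example used elsewhere in this guide.

```python
import pandas as pd

# Hypothetical feature history: one row per driver per feature update.
features = pd.DataFrame({
    "driver_id": [1, 1, 1, 2],
    "event_timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-10", "2024-01-03"]
    ),
    "conv_rate": [0.50, 0.60, 0.90, 0.40],
})

# Hypothetical training examples: the label was observed at event_timestamp.
entity_df = pd.DataFrame({
    "driver_id": [1, 2, 1],
    "event_timestamp": pd.to_datetime(["2024-01-06", "2024-01-04", "2024-01-02"]),
    "target": [1, 0, 0],
})

# merge_asof requires both frames to be sorted by the "on" column.
# direction="backward" picks, per driver, the most recent feature row
# at or before each example's timestamp, so no future value leaks in.
training_df = pd.merge_asof(
    entity_df.sort_values("event_timestamp"),
    features.sort_values("event_timestamp"),
    on="event_timestamp",
    by="driver_id",
    direction="backward",
)
```

Driver 1's example from 2024-01-06 receives the value written on 2024-01-05, not the later 2024-01-10 update that a naive "latest value" join would have used.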
Feature Store Patterns
Feature Groups
# Feature group pattern
# Schematic example in the style of the Hopsworks hsfs client.
# `fs` (feature store handle) and `data_df` are assumed to exist;
# exact method names and arguments vary by library and version.
from hsfs.feature_group import FeatureGroup

# Define feature group
driver_stats_fg = fs.create_feature_group(
    name="driver_stats",
    version=1,
    description="Driver statistics",
    primary_key=["driver_id"],
    event_time="event_timestamp",
    online_enabled=True,
    schema=[
        {"name": "driver_id", "type": "int"},
        {"name": "conv_rate", "type": "float"},
        {"name": "avg_daily_trips", "type": "int"},
        {"name": "event_timestamp", "type": "timestamp"},
    ],
)

# Insert data
driver_stats_fg.insert(data_df)

# Serve features (keys are ints to match the driver_id type in the schema)
features = driver_stats_fg.get_online_features(
    entity_keys=[1, 2, 3],
    feature_names=["conv_rate", "avg_daily_trips"],
)
Training Datasets
# Training dataset pattern
from hsfs.training_dataset import TrainingDataset

# Create training dataset
td = fs.create_training_dataset(
    name="driver_training_dataset",
    version=1,
    description="Training dataset for driver prediction",
    label=["target"],
    features=[
        "driver_stats:conv_rate",
        "driver_stats:avg_daily_trips",
        "driver_demographics:age",
    ],
)

# Save training data
td.save(training_df)

# Retrieve training dataset
td = fs.get_training_dataset("driver_training_dataset", 1)
training_df = td.read()

# Use in training
X = training_df[["conv_rate", "avg_daily_trips", "age"]]
y = training_df["target"]
Point-in-Time Joins
# Point-in-time correctness

# Problem: data leakage and training-inference skew
# Without a feature store: each training example is joined to the latest
# feature values, which may postdate its label (incorrect)
# With a feature store: each training example is joined to the feature
# values as of its own timestamp

training_df = fs.get_historical_features(
    features=[
        "driver_stats:conv_rate",
        "driver_stats:avg_daily_trips",
    ],
    entity_df=entity_df,  # One row per training example: entity ID and timestamp
)

# Result: Features as of each timestamp (correct)
Feature Store Comparison
Feature Comparison
| Capability | Feast | Hopsworks | Tecton | Vertex AI |
|---|---|---|---|---|
| Open Source | Yes | Yes (limited) | No | No |
| Managed | No | Yes | Yes | Yes |
| Online Store | Redis, DynamoDB | RonDB (MySQL NDB Cluster) | Redis, DynamoDB | Bigtable |
| Offline Store | Parquet, BigQuery | HDFS, S3, GCS | Snowflake, BigQuery | BigQuery |
| Streaming | Kafka, Kinesis | Kafka | Kafka | Pub/Sub |
| Feature Registry | Yes | Yes | Yes | Yes |
| Pricing | Free (self-hosted) | Usage-based | Custom | Usage-based |
| Best For | Open-source, custom | Enterprise | Enterprise | GCP users |
Selection Criteria
Feature Store Implementation
Data Flow
Feature Lifecycle
Feature Store Best Practices
DO
# 1. Use version control
# Version all feature groups and training datasets

# 2. Use point-in-time joins
# Ensure correctness, avoid data leakage

# 3. Document features
# Add descriptions, owners, tags

# 4. Monitor feature drift
# Track distribution changes over time

# 5. Use online serving wisely
# Only enable for frequently accessed features
DON’T
# 1. Don't skip versioning
# Essential for reproducibility

# 2. Don't ignore schema
# Define all features with types

# 3. Don't forget event time
# Required for point-in-time correctness

# 4. Don't serve stale features
# Monitor feature freshness

# 5. Don't duplicate features
# Reuse existing features
Feature Store Metrics
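Freshness, completeness, and drift, three of the health signals covered in this section, reduce to small computations over the feature data. The sketch below is one possible way to compute them with pandas and NumPy. The function names are hypothetical, and the Population Stability Index (PSI) is used here as one common drift measure, chosen for illustration rather than prescribed by this guide.

```python
import numpy as np
import pandas as pd


def freshness(df: pd.DataFrame, now: pd.Timestamp,
              time_col: str = "event_timestamp") -> pd.Timedelta:
    """Age of the most recent feature row."""
    return now - df[time_col].max()


def completeness(df: pd.DataFrame, column: str) -> float:
    """Fraction of non-missing values in a feature column."""
    return float(df[column].notna().mean())


def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one feature."""
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    # Clip so values outside the reference range fall into the outer bins.
    cur_counts, _ = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)
    eps = 1e-6  # avoids log(0) for empty bins
    ref_pct = np.maximum(ref_counts / ref_counts.sum(), eps)
    cur_pct = np.maximum(cur_counts / cur_counts.sum(), eps)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))


# Made-up sample data for illustration.
df = pd.DataFrame({
    "event_timestamp": pd.to_datetime(["2024-01-01", "2024-01-03"]),
    "conv_rate": [0.5, None],
})
age = freshness(df, now=pd.Timestamp("2024-01-04"))
filled = completeness(df, "conv_rate")

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5_000)
psi_same = psi(baseline, baseline)
psi_shifted = psi(baseline, baseline + 1.0)
```

A PSI of zero means the two samples fall into the reference bins in identical proportions; larger values indicate a larger shift. Alert thresholds depend on the feature and the model and are left to the reader.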
Key Metrics
# Monitor feature store health
# 1. Feature freshness
# How recent is the feature data?

# 2. Feature completeness
# Are there missing values?

# 3. Feature drift
# Has the distribution changed?

# 4. Serving latency
# How fast are features served?

# 5. Feature popularity
# Which features are used most?
Feature Store Guides
- Feast Guide - Open-source feature store
- Hopsworks Guide - Enterprise feature store platform
Key Takeaways
- Centralized storage: Unified repository for features
- Consistency: Same features in training and inference
- Version control: Track feature changes over time
- Point-in-time correctness: Avoid data leakage
- Governance: Feature ownership, documentation, lineage
- Online serving: Low-latency inference features
- Offline storage: Historical training data
- When to use: ML systems that need feature reuse across models and governance over features