
Feature Stores

Centralized Feature Management for ML Systems


Overview

Feature stores provide a centralized repository for storing, managing, and serving features for machine learning systems. They enable feature reuse, ensure consistency between training and inference, and provide governance for feature engineering.


Feature Store Architecture

Core Components

Key Components:

  • Feature Definitions: Schema, metadata, documentation
  • Data Ingestion: Batch, streaming, request-time features
  • Storage: Online (low-latency) and offline (historical)
  • Feature Registry: Version control, lineage, discovery
  • Transformation: Feature transformation and validation
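The split between online and offline storage can be illustrated with a toy in-memory store. This is only a sketch under simplifying assumptions: the `TinyFeatureStore` class and its method names are hypothetical and do not correspond to any real feature store API. The offline side keeps every row for training, while the online side keeps only the latest row per key for low-latency lookups.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class TinyFeatureStore:
    """Hypothetical toy store: offline keeps full history, online keeps the latest row per key."""
    offline: list = field(default_factory=list)  # every ingested row, used for training
    online: dict = field(default_factory=dict)   # key -> most recent row, used for serving

    def ingest(self, key: int, event_time: datetime, values: dict) -> None:
        row = {"key": key, "event_time": event_time, **values}
        self.offline.append(row)
        current = self.online.get(key)
        # Overwrite the online row only if the new event is at least as recent
        if current is None or event_time >= current["event_time"]:
            self.online[key] = row

    def get_online(self, key: int) -> dict:
        return self.online[key]

    def get_history(self, key: int) -> list:
        rows = [r for r in self.offline if r["key"] == key]
        return sorted(rows, key=lambda r: r["event_time"])


store = TinyFeatureStore()
store.ingest(1, datetime(2024, 1, 1), {"conv_rate": 0.10})
store.ingest(1, datetime(2024, 1, 2), {"conv_rate": 0.15})
store.ingest(2, datetime(2024, 1, 1), {"conv_rate": 0.30})

latest = store.get_online(1)    # single most recent row for driver 1
history = store.get_history(1)  # all rows for driver 1, oldest first
```

Real systems back the two sides with different storage engines, but the access patterns are the same: point lookups by key online, full history scans offline.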

Feature Store Benefits

Why Use a Feature Store?

Key Benefits:

  • Consistency: Same features in training and inference
  • Reusability: Share features across models
  • Version Control: Track feature changes over time
  • Point-in-Time Correctness: Avoid data leakage
  • Governance: Feature ownership, documentation, lineage
  • Discovery: Search and browse features
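The consistency benefit comes down to having one definition of each feature that both the training path and the serving path call. A minimal sketch, assuming a hypothetical `conv_rate` feature computed from `conversions` and `trips` fields (these field and function names are illustrative, not taken from any library):

```python
def conv_rate(conversions: int, trips: int) -> float:
    """Single definition of the feature; zero trips yields 0.0 instead of dividing by zero."""
    return conversions / trips if trips else 0.0


def training_features(history_rows):
    """Offline path: compute the feature for every historical row."""
    return [conv_rate(r["conversions"], r["trips"]) for r in history_rows]


def serving_features(request):
    """Online path: compute the feature for a single live request."""
    return conv_rate(request["conversions"], request["trips"])


history_rows = [{"conversions": 3, "trips": 12}, {"conversions": 0, "trips": 0}]
offline_values = training_features(history_rows)
online_value = serving_features({"conversions": 3, "trips": 12})
```

Because both paths share `conv_rate`, an edge case such as zero trips is handled identically in training and at inference.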

Feature Store Patterns

Feature Groups

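# Feature group pattern (Hopsworks-style API; parameter names can differ between versions)
import hopsworks
from hsfs.feature import Feature

# Connect to the project and get a feature store handle
project = hopsworks.login()
fs = project.get_feature_store()

# Define feature group
driver_stats_fg = fs.create_feature_group(
    name="driver_stats",
    version=1,
    description="Driver statistics",
    primary_key=["driver_id"],
    event_time="event_timestamp",
    online_enabled=True,
    features=[
        Feature(name="driver_id", type="int"),
        Feature(name="conv_rate", type="float"),
        Feature(name="avg_daily_trips", type="int"),
        Feature(name="event_timestamp", type="timestamp"),
    ],
)

# Insert data (data_df is a DataFrame whose columns match the schema above)
driver_stats_fg.insert(data_df)

# Serve features online through a feature view keyed by driver_id
# (the view name is illustrative)
driver_stats_fv = fs.get_or_create_feature_view(
    name="driver_stats_view",
    version=1,
    query=driver_stats_fg.select(["conv_rate", "avg_daily_trips"]),
)
features = driver_stats_fv.get_feature_vectors(
    entry=[{"driver_id": 1}, {"driver_id": 2}, {"driver_id": 3}]
)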
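Before inserting rows into a feature group, it can help to check them against the declared schema. The following is a minimal sketch in plain Python, assuming rows arrive as dictionaries; `EXPECTED_SCHEMA` and `validate_rows` are hypothetical helpers, not part of any feature store library.

```python
from datetime import datetime

# Expected column types, mirroring the driver_stats schema above
EXPECTED_SCHEMA = {
    "driver_id": int,
    "conv_rate": float,
    "avg_daily_trips": int,
    "event_timestamp": datetime,
}


def validate_rows(rows, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable problems; an empty list means all rows conform."""
    problems = []
    for i, row in enumerate(rows):
        for name in sorted(schema.keys() - row.keys()):
            problems.append(f"row {i}: missing column {name!r}")
        for name in sorted(row.keys() - schema.keys()):
            problems.append(f"row {i}: unexpected column {name!r}")
        for name, expected_type in schema.items():
            if name in row and not isinstance(row[name], expected_type):
                actual = type(row[name]).__name__
                problems.append(
                    f"row {i}: {name!r} should be {expected_type.__name__}, got {actual}"
                )
    return problems


good_row = {
    "driver_id": 1,
    "conv_rate": 0.5,
    "avg_daily_trips": 12,
    "event_timestamp": datetime(2024, 1, 1),
}
bad_row = {**good_row, "conv_rate": "0.5"}  # string instead of float

problems = validate_rows([good_row, bad_row])
```

Managed feature stores usually offer their own validation hooks; this sketch only shows the kind of check that belongs in front of an insert.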

Training Datasets

# Training dataset pattern (assumes `fs` is a connected feature store handle)
# training_df must hold the label plus the features to train on:
#   driver_stats:        conv_rate, avg_daily_trips
#   driver_demographics: age
feature_columns = ["conv_rate", "avg_daily_trips", "age"]

# Create training dataset
td = fs.create_training_dataset(
    name="driver_training_dataset",
    version=1,
    description="Training dataset for driver prediction",
    label=["target"],
)

# Insert training data
td.save(training_df)

# Retrieve training dataset
td = fs.get_training_dataset("driver_training_dataset", 1)
training_df = td.read()

# Use in training (assumes read() returns a pandas DataFrame)
X = training_df[feature_columns]
y = training_df["target"]

Point-in-Time Joins

# Point-in-time correctness (Feast-style API)
# Problem: data leakage from feature values recorded after the label's event time
# Without point-in-time joins: every training row gets the latest feature values (incorrect)
# With point-in-time joins: every training row gets the feature values as of its own timestamp
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes a Feast repo in the current directory

training_df = store.get_historical_features(
    entity_df=entity_df,  # Contains driver_id and event_timestamp columns
    features=[
        "driver_stats:conv_rate",
        "driver_stats:avg_daily_trips",
    ],
).to_df()
# Result: Features as of each timestamp (correct)
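To make the join semantics concrete, here is a minimal pure-Python sketch of a point-in-time lookup. It is not how any particular feature store implements it; `point_in_time_join` is a hypothetical helper, and the `driver_id`, `event_timestamp`, and `conv_rate` field names are assumed for illustration. For each label row it picks the most recent feature value whose timestamp is at or before the label's timestamp, and never a later one.

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import datetime


def point_in_time_join(entity_rows, feature_rows):
    """Attach to each entity row the latest conv_rate recorded at or before its timestamp."""
    # Group feature rows by entity and sort each group by time
    history = defaultdict(list)
    for row in feature_rows:
        history[row["driver_id"]].append(row)
    for rows in history.values():
        rows.sort(key=lambda r: r["event_timestamp"])

    joined = []
    for entity in entity_rows:
        rows = history.get(entity["driver_id"], [])
        times = [r["event_timestamp"] for r in rows]
        # idx is the position of the first feature row strictly after the label time
        idx = bisect_right(times, entity["event_timestamp"])
        conv_rate = rows[idx - 1]["conv_rate"] if idx > 0 else None
        joined.append({**entity, "conv_rate": conv_rate})
    return joined


feature_rows = [
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 1), "conv_rate": 0.10},
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 10), "conv_rate": 0.50},
]
entity_rows = [
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 5)},    # sees the Jan 1 value
    {"driver_id": 1, "event_timestamp": datetime(2024, 1, 12)},   # sees the Jan 10 value
    {"driver_id": 1, "event_timestamp": datetime(2023, 12, 31)},  # no value existed yet
]

result = point_in_time_join(entity_rows, feature_rows)
```

With pandas, `pandas.merge_asof` with `by="driver_id"` and `direction="backward"` performs the same kind of backward-looking join on sorted frames.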

Feature Store Comparison

Feature Comparison

Feature          | Feast               | Hopsworks     | Tecton              | Vertex AI
-----------------|---------------------|---------------|---------------------|------------
Open Source      | Yes                 | Yes (limited) | No                  | No
Managed          | No                  | Yes           | Yes                 | Yes
Online Store     | Redis, DynamoDB     | MySQL, Redis  | Redis, DynamoDB     | Redis
Offline Store    | Parquet, BigQuery   | HDFS, S3, GCS | Snowflake, BigQuery | BigQuery
Streaming        | Kafka, Kinesis      | Kafka         | Kafka               | Pub/Sub
Feature Registry | Yes                 | Yes           | Yes                 | Yes
Pricing          | Free (self-hosted)  | Usage-based   | Custom              | Usage-based
Best For         | Open-source, custom | Enterprise    | Enterprise          | GCP users

Selection Criteria


Feature Store Implementation

Data Flow

Feature Lifecycle


Feature Store Best Practices

DO

  1. Use version control: Version all feature groups and training datasets
  2. Use point-in-time joins: Ensure correctness and avoid data leakage
  3. Document features: Add descriptions, owners, and tags
  4. Monitor feature drift: Track distribution changes over time
  5. Use online serving selectively: Enable it only for features that are read at inference time
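Versioning and documentation can be enforced at registration time. The sketch below is a hypothetical in-memory registry (`FeatureMeta` and `FeatureRegistry` are illustrative names, not a real library API) that refuses to overwrite an existing version and supports lookup by tag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FeatureMeta:
    name: str
    version: int
    description: str
    owner: str
    tags: tuple = ()


class FeatureRegistry:
    """Hypothetical registry: versions are immutable once registered."""

    def __init__(self):
        self._entries = {}  # (name, version) -> FeatureMeta

    def register(self, meta: FeatureMeta) -> None:
        key = (meta.name, meta.version)
        if key in self._entries:
            raise ValueError(f"{meta.name} v{meta.version} already registered; bump the version")
        self._entries[key] = meta

    def latest(self, name: str) -> FeatureMeta:
        versions = [v for (n, v) in self._entries if n == name]
        return self._entries[(name, max(versions))]

    def search(self, tag: str) -> list:
        # Distinct feature names carrying the tag, in alphabetical order
        return sorted({m.name for m in self._entries.values() if tag in m.tags})


registry = FeatureRegistry()
registry.register(FeatureMeta("conv_rate", 1, "Conversion rate", "ml-team", ("driver",)))
registry.register(FeatureMeta("conv_rate", 2, "Conversion rate, 7-day window", "ml-team", ("driver",)))

newest = registry.latest("conv_rate")
driver_features = registry.search("driver")
```

Rejecting re-registration of an existing version is what makes a training run reproducible: a given name and version always refers to the same definition.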

DON’T

  1. Don't skip versioning: It is essential for reproducibility
  2. Don't ignore schema: Define all features with types
  3. Don't forget event time: It is required for point-in-time correctness
  4. Don't serve stale features: Monitor feature freshness
  5. Don't duplicate features: Reuse existing features

Feature Store Metrics

Key Metrics

Monitor feature store health:

  1. Feature freshness: How recent is the feature data?
  2. Feature completeness: Are there missing values?
  3. Feature drift: Has the distribution changed?
  4. Serving latency: How fast are features served?
  5. Feature popularity: Which features are used most?
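Three of these metrics are straightforward to compute from raw values. The following is a rough sketch using only the standard library; the function names are hypothetical, and the drift measure shown is the Population Stability Index over equal-width bins, which is one common choice among several.

```python
import math
from datetime import datetime, timedelta


def freshness(event_times, now):
    """Age of the most recent feature value."""
    return now - max(event_times)


def completeness(values):
    """Fraction of values that are not None."""
    return sum(v is not None for v in values) / len(values)


def psi(expected, actual, bins=4, eps=1e-6):
    """Population Stability Index between a baseline sample and a current sample,
    using equal-width bins over the baseline's range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the baseline is constant

    def proportions(sample):
        counts = [0] * bins
        for x in sample:
            # Clamp out-of-range values into the first or last bin
            i = min(max(int((x - lo) / width), 0), bins - 1)
            counts[i] += 1
        # Floor each proportion at eps so the logarithm is always defined
        return [max(c / len(sample), eps) for c in counts]

    p, q = proportions(expected), proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


baseline = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
shifted = [x + 0.3 for x in baseline]

age = freshness([datetime(2024, 1, 1), datetime(2024, 1, 9)], now=datetime(2024, 1, 10))
filled = completeness([0.1, None, 0.3, 0.4])
no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

A PSI of zero means the binned distributions match; larger values indicate a bigger shift. Alert thresholds are a judgment call and depend on the feature.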


Key Takeaways

  1. Centralized storage: Unified repository for features
  2. Consistency: Same features in training and inference
  3. Version control: Track feature changes over time
  4. Point-in-time correctness: Avoid data leakage
  5. Governance: Feature ownership, documentation, lineage
  6. Online serving: Low-latency inference features
  7. Offline storage: Historical training data
  8. When to use: ML systems that need feature reuse and governance
