
Feast Guide

Feature Store for Machine Learning


Overview

Feast (Feature Store) is an open-source feature store for machine learning. It provides a unified interface for defining, storing, and serving features to ML models, enabling feature reuse, consistency between training and serving, and versioned feature definitions.


Feast Architecture

[Diagram: Feature Store Architecture]

Key Components (a minimal flow sketch follows this list):

  • Feature Views: Feature definitions and schema
  • Data Sources: Batch (data warehouse), streaming (Kafka), request (inference time)
  • Offline Store: Historical feature data (e.g., Parquet files, BigQuery) for training data generation
  • Online Store: Redis/DynamoDB for low-latency lookups at inference time
  • Feature Registry: Metadata and versioning for registered feature definitions
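
These components map onto a small number of SDK calls: the offline store serves point-in-time training reads, materialization copies recent values into the online store, and inference reads the online store. A minimal sketch of that flow, assuming the repository layout and feature names used later in this guide:

from datetime import datetime
import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo")

# Offline store: point-in-time joins build training data
entity_df = pd.DataFrame({
    "driver_id": [1, 2],
    "event_timestamp": [datetime(2025, 1, 27, 12, 0)] * 2,
})
training_df = store.get_historical_features(
    features=["driver_stats_fv:conv_rate"],
    entity_df=entity_df,
).to_df()

# Materialization copies the latest values from the offline to the online store
store.materialize_incremental(end_date=datetime.utcnow())

# Online store: low-latency lookup at inference time
online = store.get_online_features(
    features=["driver_stats_fv:conv_rate"],
    entity_rows=[{"driver_id": 1}],
).to_dict()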

Feast Installation

Installation

Terminal window
# Install Feast
pip install feast
# Install with specific dependencies
pip install 'feast[redis]' # Redis online store
pip install 'feast[dynamodb]' # DynamoDB online store
pip install 'feast[gcp]' # GCP dependencies
pip install 'feast[aws]' # AWS dependencies
# Verify installation
feast version

Initialize Feature Store

Terminal window
# Initialize feature store
feast init my_feature_repo
cd my_feature_repo
# Directory structure:
# my_feature_repo/
# ├── feature_repo/
# │   ├── __init__.py
# │   ├── example.py           # Feature definitions
# │   └── feature_store.yaml   # Configuration
# └── data/                    # Offline store (Parquet)

Configuration

feature_repo/feature_store.yaml
project: my_feature_repo
registry:
  path: data/registry.db
provider: local
online_store:
  type: redis
  connection_string: "localhost:6379"
offline_store:
  type: file
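
The same configuration drives the Python SDK. A minimal sketch, assuming the repository layout above, that loads the store and lists what the registry currently knows about:

from feast import FeatureStore

# Load the store using feature_repo/feature_store.yaml
store = FeatureStore(repo_path="feature_repo")

# Inspect what is currently registered (empty until `feast apply` runs)
print(store.project)
for fv in store.list_feature_views():
    print(fv.name)
for entity in store.list_entities():
    print(entity.name)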

Feast Operations

Feature Definition

feature_repo/features.py
from datetime import timedelta

import pandas as pd

from feast import (
    Entity,
    FeatureView,
    Field,
    FileSource,
    KafkaSource,
    RequestSource,
    on_demand_feature_view,
)
from feast.data_format import JsonFormat
from feast.types import Float32, Int64

# 1. Define entity (primary key)
driver = Entity(
    name="driver_id",
    join_keys=["driver_id"],
    description="Driver ID",
)

# 2. Define batch feature view (data warehouse / Parquet source)
driver_stats_fv = FeatureView(
    name="driver_stats_fv",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=FileSource(
        path="data/driver_stats.parquet",
        timestamp_field="created_timestamp",
    ),
)

# 3. Define streaming feature view (Kafka)
# Note: the streaming API varies across Feast versions; recent releases
# attach Kafka sources to a StreamFeatureView instead of a plain FeatureView.
driver_stats_stream = FeatureView(
    name="driver_stats_stream",
    entities=[driver],
    ttl=timedelta(hours=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="avg_daily_trips", dtype=Int64),
    ],
    source=KafkaSource(
        name="driver_stats_kafka",
        kafka_bootstrap_servers="localhost:9092",
        topic="driver_stats",
        timestamp_field="event_timestamp",
        message_format=JsonFormat(
            schema_json="driver_id integer, conv_rate double, "
                        "avg_daily_trips integer, event_timestamp timestamp"
        ),
        batch_source=FileSource(
            path="data/driver_stats.parquet",
            timestamp_field="event_timestamp",
        ),
    ),
)

# 4. Define request source (features supplied at inference time)
request_features = RequestSource(
    name="request_features",
    schema=[
        Field(name="trip_distance", dtype=Float32),
        Field(name="trip_duration", dtype=Float32),
    ],
)

# 5. Define on-demand feature view (row-level transformation at request time)
driver_age_input = RequestSource(
    name="driver_age_input",
    schema=[
        Field(name="val_to_add", dtype=Int64),
    ],
)

@on_demand_feature_view(
    sources=[driver_stats_fv, driver_age_input],
    schema=[
        Field(name="conv_rate_plus_val", dtype=Float32),
    ],
)
def driver_age_fv(inputs: pd.DataFrame) -> pd.DataFrame:
    """Add the request-time value to the stored conversion rate."""
    output = pd.DataFrame()
    output["conv_rate_plus_val"] = inputs["conv_rate"] + inputs["val_to_add"]
    return output
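
The batch FileSource above expects a Parquet file at data/driver_stats.parquet. A minimal sketch that generates a compatible file (the column values are illustrative):

import pandas as pd

# Build a small driver_stats dataset matching the FileSource schema above
driver_stats = pd.DataFrame({
    "driver_id": [1, 2, 3, 4, 5],
    "conv_rate": [0.7, 0.6, 0.8, 0.5, 0.9],
    "avg_daily_trips": [100, 150, 200, 80, 250],
    "created_timestamp": pd.to_datetime("2025-01-27"),
    "event_timestamp": pd.to_datetime("2025-01-27"),
})

# Write it where the batch FileSource expects it
driver_stats.to_parquet("data/driver_stats.parquet", index=False)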

Feature Registration

Terminal window
# Register features
cd feature_repo
feast apply
# Output (abridged):
# Created entity driver_id
# Created feature view driver_stats_fv
# Created feature view driver_stats_stream
# Created on demand feature view driver_age_fv
# Deploying infrastructure for driver_stats_fv
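
Registration can also be done from Python rather than the CLI. A minimal sketch using FeatureStore.apply, assuming it is run from inside feature_repo/ so the definitions in features.py are importable:

from feast import FeatureStore
from features import driver, driver_stats_fv, driver_age_fv

# Run from inside feature_repo/ (where feature_store.yaml lives)
store = FeatureStore(repo_path=".")

# Equivalent to `feast apply` for the listed objects
store.apply([driver, driver_stats_fv, driver_age_fv])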

Data Ingestion

# Ingest data into the feature store
from feast import FeatureStore
import pandas as pd

# Initialize feature store
store = FeatureStore(repo_path="feature_repo")

# 1. Write batch data directly to the online store
driver_stats_df = pd.DataFrame({
    "driver_id": [1, 2, 3, 4, 5],
    "conv_rate": [0.7, 0.6, 0.8, 0.5, 0.9],
    "avg_daily_trips": [100, 150, 200, 80, 250],
    "created_timestamp": pd.Timestamp("2025-01-27"),
})
store.write_to_online_store(
    feature_view_name="driver_stats_fv",
    df=driver_stats_df,
)

# 2. Ingest streaming data
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Push an event to Kafka
event = {
    "driver_id": 1,
    "conv_rate": 0.75,
    "avg_daily_trips": 110,
    "event_timestamp": int(pd.Timestamp("2025-01-27").timestamp()),
}
producer.send("driver_stats", value=event)
producer.flush()

# 3. Ingest from the batch source
# Data in the batch source (Parquet / data warehouse) is loaded into the
# online store by materialization (see the sketch below).
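
Data in a batch source only reaches the online store once it is materialized. A minimal sketch of materialization from the SDK (the date range is illustrative):

from datetime import datetime
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo")

# Load everything between the last materialization and now into the online
# store (CLI equivalent: `feast materialize-incremental <end_timestamp>`)
store.materialize_incremental(end_date=datetime.utcnow())

# Or materialize an explicit time window
store.materialize(
    start_date=datetime(2025, 1, 20),
    end_date=datetime(2025, 1, 27),
)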

Feast Feature Retrieval

Online Retrieval (Inference)

# Retrieve features for online inference
from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path="feature_repo")

# 1. Get latest feature values (online store)
feature_vector = store.get_online_features(
    features=[
        "driver_stats_fv:conv_rate",
        "driver_stats_fv:avg_daily_trips",
    ],
    entity_rows=[
        {"driver_id": 1},
        {"driver_id": 2},
        {"driver_id": 3},
    ],
)

# Convert to DataFrame
df = feature_vector.to_df()
print(df)
# Example output:
#    driver_id  conv_rate  avg_daily_trips
# 0          1       0.70              100
# 1          2       0.60              150
# 2          3       0.80              200

# 2. Get point-in-time features (offline store)
from datetime import datetime

entity_df = pd.DataFrame({
    "driver_id": [1, 2, 3],
    "event_timestamp": [datetime(2025, 1, 27, 12, 0)] * 3,
})
training_data = store.get_historical_features(
    features=[
        "driver_stats_fv:conv_rate",
        "driver_stats_fv:avg_daily_trips",
    ],
    entity_df=entity_df,
).to_df()
print(training_data)
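
In a serving path, online retrieval typically feeds straight into model inference. A minimal sketch of a prediction helper; the model argument is a placeholder for whatever estimator you trained:

from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo")
FEATURES = [
    "driver_stats_fv:conv_rate",
    "driver_stats_fv:avg_daily_trips",
]

def predict(driver_id: int, model) -> float:
    """Fetch the latest features for one driver and score them."""
    row = store.get_online_features(
        features=FEATURES,
        entity_rows=[{"driver_id": driver_id}],
    ).to_df()
    X = row[["conv_rate", "avg_daily_trips"]]
    return float(model.predict(X)[0])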

Historical Retrieval (Training)

# Retrieve historical features for training
from feast import FeatureStore
import pandas as pd

store = FeatureStore(repo_path="feature_repo")

# 1. Define entity DataFrame (entities, event timestamps, and labels)
entity_df = pd.DataFrame({
    "driver_id": [1, 2, 3, 4, 5],
    "event_timestamp": [pd.Timestamp("2025-01-27 10:00:00")] * 5,
    # Extra columns in entity_df (e.g., labels) are passed through to the result
    "target": [0.8, 0.5, 0.9, 0.4, 1.0],  # illustrative labels
})

# 2. Get historical features (point-in-time correct join)
training_df = store.get_historical_features(
    features=[
        "driver_stats_fv:conv_rate",
        "driver_stats_fv:avg_daily_trips",
    ],
    entity_df=entity_df,
).to_df()

# 3. Save training data
training_df.to_csv("data/training_data.csv", index=False)

# 4. Use in training
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Prepare features and target
X = training_df[["conv_rate", "avg_daily_trips"]]
y = training_df["target"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = RandomForestRegressor()
model.fit(X_train, y_train)

# Evaluate
score = model.score(X_test, y_test)
print(f"R² score: {score:.4f}")

Feast Advanced Features

Feature Transformations

# On-demand feature transformations
import pandas as pd
from feast import Field, RequestSource, on_demand_feature_view
from feast.types import Float32

# 1. Define request source (value supplied at request time)
val_to_add_request = RequestSource(
    name="val_to_add",
    schema=[
        Field(name="val_to_add", dtype=Float32),
    ],
)

# 2. Define on-demand feature view via the decorator
@on_demand_feature_view(
    sources=[driver_stats_fv, val_to_add_request],
    schema=[
        Field(name="conv_rate_plus_val", dtype=Float32),
    ],
)
def conv_rate_plus_odfv(inputs: pd.DataFrame) -> pd.DataFrame:
    """Add the request-time value to the stored conversion rate."""
    output = pd.DataFrame()
    output["conv_rate_plus_val"] = inputs["conv_rate"] + inputs["val_to_add"]
    return output

# 3. Use the on-demand feature (request fields go in entity_rows)
feature_vector = store.get_online_features(
    features=[
        "conv_rate_plus_odfv:conv_rate_plus_val",
    ],
    entity_rows=[
        {
            "driver_id": 1,
            "val_to_add": 0.1,
        }
    ],
)

Feature Services

# Feature services for serving
from feast import FeatureService

# Define feature services (named groups of features)
driver_activity_v1 = FeatureService(
    name="driver_activity_v1",
    features=[
        driver_stats_fv[["conv_rate", "avg_daily_trips"]],
    ],
)
driver_activity_v2 = FeatureService(
    name="driver_activity_v2",
    features=[
        driver_stats_fv[["conv_rate", "avg_daily_trips"]],
        conv_rate_plus_odfv,  # exposes conv_rate_plus_val
    ],
)

# Use a feature service for retrieval
feature_vector = store.get_online_features(
    features=driver_activity_v1,
    entity_rows=[
        {"driver_id": 1},
        {"driver_id": 2},
    ],
)
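
Because feature services live in the registry, callers can resolve them by name instead of importing the Python objects. A minimal sketch that reuses the service defined above for both online and historical retrieval (the entity DataFrame is illustrative):

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path="feature_repo")

# Resolve the service by name after `feast apply`
driver_activity = store.get_feature_service("driver_activity_v1")

# Works for online retrieval...
online = store.get_online_features(
    features=driver_activity,
    entity_rows=[{"driver_id": 1}],
).to_df()

# ...and for historical retrieval with the same service definition
entity_df = pd.DataFrame({
    "driver_id": [1, 2],
    "event_timestamp": [pd.Timestamp("2025-01-27 10:00:00")] * 2,
})
historical = store.get_historical_features(
    features=driver_activity,
    entity_df=entity_df,
).to_df()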

Feast Performance

Optimization Strategies

# Performance optimization

# 1. Use an appropriate TTL
driver_stats_fv = FeatureView(
    name="driver_stats_fv",
    entities=[driver],
    ttl=timedelta(hours=1),  # TTL bounds how far back Feast looks for a value
    # ...
)

# 2. Batch entity rows into a single retrieval call
feature_vectors = store.get_online_features(
    features=[
        "driver_stats_fv:conv_rate",
        "driver_stats_fv:avg_daily_trips",
    ],
    entity_rows=[
        {"driver_id": i} for i in range(1, 1000)  # batch of ~1,000 rows per call
    ],
)

# 3. Use an online store that fits your latency budget
#    (e.g., Redis for single-digit-millisecond lookups)

# 4. Materialize regularly so online lookups read pre-computed values
#    CLI: feast materialize-incremental 2025-01-27T00:00:00

# 5. Use feature services to group the features each model needs

Feast Cost Optimization

Cost Strategies

# Cost optimization

# 1. Choose an online store that matches your budget
#    Redis: self-hosted (infrastructure cost only)
#    DynamoDB: pay-per-request pricing
# 2. Use an appropriate TTL
#    Shorter TTL = less data kept online = lower storage cost
# 3. Use a file-based offline store where possible
#    Parquet on object storage (S3/GCS) vs. per-query warehouse pricing (e.g., BigQuery)
# 4. Tear down unused deployments
#    CLI: feast teardown   # removes infrastructure created by `feast apply`
# 5. Prefer incremental materialization (only new data is processed)
#    CLI: feast materialize-incremental 2025-01-27T00:00:00

Feast Monitoring

Metrics

# Monitor the feature store
from feast import FeatureStore
import time

store = FeatureStore(repo_path="feature_repo")

# 1. Get feature statistics from a batch of online lookups
stats = store.get_online_features(
    features=["driver_stats_fv:conv_rate"],
    entity_rows=[{"driver_id": i} for i in range(1, 1000)],
).to_df()
print(stats.describe())

# 2. Monitor write latency
start = time.time()
store.write_to_online_store(
    feature_view_name="driver_stats_fv",
    df=driver_stats_df,
)
print(f"Write latency: {time.time() - start:.2f}s")

# 3. Monitor retrieval latency
start = time.time()
feature_vector = store.get_online_features(
    features=["driver_stats_fv:conv_rate"],
    entity_rows=[{"driver_id": i} for i in range(1, 1000)],
)
print(f"Retrieval latency: {time.time() - start:.2f}s")

Feast Best Practices

DO

# 1. Use version control
# Git for feature definitions
# 2. Use meaningful names
# driver_stats_fv (not fv1)
# 3. Use appropriate TTL
# Balance freshness and cost
# 4. Use feature services
# Group features for serving
# 5. Monitor latency
# Essential for inference

DON’T

# 1. Don't ignore schema
# Define all fields
# 2. Don't skip TTL
# Essential for data freshness
# 3. Don't forget entity keys
# Primary keys are required
# 4. Don't ignore monitoring
# Essential for operations
# 5. Don't hardcode values
# Use feature store

Feast vs. Alternatives

Feature        | Feast               | Hopsworks                    | Tecton              | Vertex AI
Open Source    | Yes                 | Yes (limited)                | No                  | No
Online Store   | Redis, DynamoDB     | Redis, MySQL                 | Redis, DynamoDB     | Redis
Offline Store  | Parquet, BigQuery   | Hive, BigQuery               | Snowflake, BigQuery | BigQuery
Streaming      | Kafka, Kinesis      | Kafka                        | Kafka               | Pub/Sub
Pricing        | Free (self-hosted)  | Free tier                    | Custom              | Usage-based
Best For       | Open-source, custom | Enterprise, feature registry | Enterprise, managed | GCP, Vertex AI

Key Takeaways

  1. Open source: Free, self-hosted feature store
  2. Feature views: Define features with schema
  3. Data sources: Batch, streaming, request
  4. Online store: Low-latency inference (Redis)
  5. Offline store: Historical training data (Parquet)
  6. Feature registry: Version control and metadata
  7. Transformations: On-demand feature transformations
  8. Use when: You need an open-source, customizable feature store in your MLOps stack

Back to Module 5