
Case Study - FinTech Fraud Detection

Real-Time ML for Financial Services


Business Context

Company: Online payment processor (10M transactions/day)

Challenge: Detect fraudulent transactions in real-time while minimizing false positives that block legitimate customers.

Scale: 10B transactions/year, 10TB/day
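As a back-of-envelope sizing check, the 10M transactions/day figure translates into the sustained and peak throughput the streaming tier must handle. The 10x peak factor below is an assumption for illustration, not a figure from the case study:

```python
# Throughput sizing from the stated 10M transactions/day.
# The 10x diurnal peak factor is an assumption, not from the case study.
TRANSACTIONS_PER_DAY = 10_000_000
SECONDS_PER_DAY = 86_400

avg_tps = TRANSACTIONS_PER_DAY / SECONDS_PER_DAY   # sustained transactions/sec
peak_tps = avg_tps * 10                            # assumed 10x peak

print(f"average: {avg_tps:.0f} TPS, assumed peak: {peak_tps:.0f} TPS")
```

At roughly a hundred TPS sustained, the hard part is not raw throughput but the sub-100ms latency budget per transaction.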


Requirements


Architecture Design


Technology Selection

Component Selection Rationale

Component             Technology            Why
Streaming             Kafka                 Proven at scale, exactly-once semantics
Processing            Flink                 Millisecond latency, state management
Feature Store         Feast                 Real-time serving, offline/online parity
Real-Time Analytics   ClickHouse            Fast ingest, excellent compression
Historical Analytics  S3 + Delta + Spark    Cost optimization, time travel
ML Platform           MLflow + KFServing    Open source, flexible
Orchestration         Airflow               Fault tolerance, retries
Quality               Great Expectations    Data validation, contracts

Data Pipeline

Real-Time Pipeline

# Flink streaming job
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment
from pyflink.table.expressions import col

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Read from Kafka. "timestamp" is reserved in Flink SQL, so the event-time
# column is named ts; the metadata schema is assumed here.
t_env.execute_sql("""
    CREATE TABLE transactions (
        transaction_id BIGINT,
        user_id BIGINT,
        merchant_id BIGINT,
        amount DECIMAL(18, 2),
        ts TIMESTAMP(3),
        metadata MAP<STRING, STRING>,
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'transactions',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

# Enrich with features (temporal joins against Feast-backed lookup tables)
with_features = t_env.sql_query("""
    SELECT
        t.transaction_id,
        t.user_id,
        t.merchant_id,
        t.amount,
        f.user_age,
        f.user_account_age_days,
        m.merchant_fraud_rate,
        m.merchant_category
    FROM transactions AS t
    JOIN feature_store.user_features FOR SYSTEM_TIME AS OF t.ts AS f
        ON t.user_id = f.user_id
    JOIN feature_store.merchant_features FOR SYSTEM_TIME AS OF t.ts AS m
        ON t.merchant_id = m.merchant_id
""")

# ML inference (XGBoost model wrapped as a user-defined function,
# defined elsewhere)
with_predictions = with_features.map(fraud_detection_model)

# Keep only the routing columns
predictions = with_predictions.select(
    col("transaction_id"), col("prediction"), col("probability")
)

# Split streams: block high-confidence fraud, forward the rest
fraud = predictions.filter((col("prediction") == 1) & (col("probability") > 0.8))
legitimate = predictions.filter((col("prediction") == 0) | (col("probability") <= 0.8))

# Sink: block fraud
fraud.execute_insert("blocked_transactions")
# Sink: forward legitimate
legitimate.execute_insert("approved_transactions")

Batch Pipeline (Model Training)

# Daily model training
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.appName("FraudModelTraining").getOrCreate()

# Read historical data
transactions = spark.read.format("delta") \
    .load("s3://lakehouse/bronze/transactions")

# Join with features
features = spark.read.format("delta") \
    .load("s3://lakehouse/gold/user_features")
training_data = transactions.join(features, "user_id")

# Train model (xgboost.spark uses snake_case parameters)
xgb = SparkXGBClassifier(
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
    label_col="is_fraud",
)
model = xgb.fit(training_data)

# Save model
model.write().overwrite().save("s3://models/fraud_detection_v2")
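The training job above saves the model without showing how it is evaluated. The two accuracy metrics that matter downstream (in the SLAs later in this document) are the fraud detection rate, i.e. recall on fraud, and the false positive rate. A minimal pure-Python sketch, with illustrative names not taken from the case study:

```python
# Sketch of the two accuracy metrics tracked in the SLAs: fraud detection
# rate (recall on fraud) and false positive rate. Names are illustrative.
def fraud_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    detection_rate = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    return detection_rate, false_positive_rate

# Toy example: 4 fraud cases, 3 caught; 1 legitimate transaction blocked
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
dr, fpr = fraud_metrics(y_true, y_pred)  # dr = 0.75, fpr = 1/6
```

In production these would be computed on a time-based holdout (fraud labels arrive late), not a random split.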

Cost Optimization

Storage Costs

Layer               Format    Size      Monthly Cost
Raw (Bronze)        Delta     300 TB    $6,900
Features (Silver)   Delta     150 TB    $3,450
Training (Gold)     Parquet   50 TB     $1,150
Total                         500 TB    $11,500

Optimization:

  • Lifecycle: Bronze (7 days) → Archive
  • Compression: ZSTD-19 for cold data
  • Partitioning: Date + hour for efficient pruning
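The table's monthly figures are consistent with roughly $23/TB-month (about S3 standard's $0.023/GB-month); that rate is an inference, not a number stated in the case study. A quick sketch reproducing the table:

```python
# Reproduce the storage-cost table assuming ~$0.023/GB-month
# (approximately S3 standard pricing; the rate is inferred, not stated).
RATE_PER_TB_MONTH = 23.0  # $0.023/GB * 1000 GB/TB

layers_tb = {"bronze": 300, "silver": 150, "gold": 50}
costs = {name: tb * RATE_PER_TB_MONTH for name, tb in layers_tb.items()}
total = sum(costs.values())
# bronze $6,900, silver $3,450, gold $1,150; total $11,500
```

The lifecycle rule above (Bronze to archive after 7 days) attacks the largest line item directly, since Bronze is 60% of the footprint.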

Compute Costs

Component        Compute                 Cost              Optimization
Flink Cluster    100 nodes (spot)        $43,200/month     70% spot discount
Spark Training   200 nodes (spot)        $57,600/month     70% spot discount
Feature Store    50 nodes (on-demand)    $21,600/month     Right-sized
Total                                    $122,400/month

Optimization:

  • Spot instances for fault-tolerant workloads (70% savings)
  • Auto-scaling based on transaction volume
  • Scheduled operations (train during off-peak)
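The compute figures above can be reproduced with a simple cost model. The hourly rates and the assumption that training runs about 16 hours/day (the "off-peak scheduling" bullet) are chosen here to match the table; actual instance pricing is not given in the case study:

```python
def monthly_cost(nodes, hourly_rate, hours, discount=0.0):
    """Monthly cost for a node pool, optionally spot-discounted."""
    return nodes * hourly_rate * hours * (1 - discount)

# Illustrative rates chosen to reproduce the table above; actual
# instance pricing is not stated in the case study.
flink = monthly_cost(100, 2.00, 720, discount=0.70)   # 24/7, 70% spot discount
spark = monthly_cost(200, 2.00, 480, discount=0.70)   # ~16 h/day, off-peak
feature_store = monthly_cost(50, 0.60, 720)           # on-demand, 24/7

total = flink + spark + feature_store  # 43,200 + 57,600 + 21,600 = 122,400
```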

Total Monthly Cost

Category   Cost              Optimization
Storage    $11,500           Lifecycle + compression
Compute    $122,400          Spot + auto-scaling
Network    $5,000            Colocation, minimize egress
Total      $138,900/month
Annual     $1,666,800

Failure Modes

Mode 1: ML Model Drift

Mitigation:

  • Continuous monitoring (precision, recall, F1)
  • Automated retraining on drift detection
  • Shadow mode testing before deployment
  • Model versioning with rollback capability
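One common drift signal that could drive the "automated retraining on drift detection" bullet is the Population Stability Index (PSI) over binned feature distributions. A pure-Python sketch; the 0.1/0.25 thresholds are the conventional reading of PSI, not figures from this case study:

```python
import math

def psi(expected, actual):
    """Population Stability Index between two binned distributions
    (lists of bin proportions each summing to 1). Conventional reading:
    < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 significant drift."""
    eps = 1e-6  # guard against empty bins
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

baseline = [0.25, 0.25, 0.25, 0.25]  # training-time feature distribution
current = [0.10, 0.20, 0.30, 0.40]   # live distribution
drift = psi(baseline, current)
retrain = drift > 0.25               # trigger automated retraining
```

PSI catches input drift without waiting for labels; precision/recall monitoring (the first bullet) catches label drift once fraud outcomes are known.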

Mode 2: Streaming Pipeline Failure

Mitigation:

  • Kafka buffer: 7 days of retention
  • Flink checkpointing: Save state to S3
  • Horizontal pod autoscaling: Scale on backlog
  • Alert on processing lag > 5 minutes
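The lag alert in the last bullet reduces to a simple rule. How lag is measured (e.g. Kafka consumer-group offsets or Flink's event-time lag metrics) is an assumption here; the sketch only shows the threshold logic:

```python
# "Alert on processing lag > 5 minutes" from the mitigation list.
# The lag source (Kafka offsets, Flink event-time lag) is assumed.
LAG_ALERT_THRESHOLD_S = 5 * 60

def should_alert(newest_event_ts: float, last_processed_ts: float) -> bool:
    """Alert when the pipeline has fallen more than 5 minutes behind."""
    lag_seconds = newest_event_ts - last_processed_ts
    return lag_seconds > LAG_ALERT_THRESHOLD_S
```

With 7 days of Kafka retention, a 5-minute alert leaves ample headroom to recover from checkpoints before data is lost.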

Mode 3: Feature Store Unavailability

Mitigation:

  • Cache features in Redis (24-hour TTL)
  • Fallback to rule-based system
  • Feature store high availability (multi-region)
  • Graceful degradation with monitoring
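The bullets above imply a degradation order: online feature store first, then the Redis cache, then the rule-based system with no ML features at all. A plain-Python sketch of that order; all names and the $10,000 review threshold are illustrative placeholders:

```python
# Degradation order from the mitigation list: feature store -> Redis
# cache -> rule-based decision. All names here are illustrative.
def get_features(user_id, store, cache):
    try:
        return store[user_id]      # primary: online feature store
    except KeyError:
        pass
    if user_id in cache:
        return cache[user_id]      # fallback 1: Redis cache (24h TTL)
    return None                    # fallback 2: signal rule-based path

def decide(user_id, amount, store, cache):
    features = get_features(user_id, store, cache)
    if features is None:
        # Rule-based fallback: flag only clearly risky amounts.
        return "review" if amount > 10_000 else "approve"
    return "ml_score"  # normal path: hand off to the model

store, cache = {}, {"u1": {"user_age": 40}}
decide("u1", 50, store, cache)       # cache hit -> "ml_score"
decide("u2", 50_000, store, cache)   # no features -> "review"
```

The key design choice is that degradation never blocks the payment path outright: the system trades detection quality for availability.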

SLA/SLO Definitions

slas:
  fraud_detection:
    latency:
      p50: "50ms"
      p95: "100ms"
      p99: "200ms"
      target: "99% of transactions"
    accuracy:
      fraud_detection_rate: "> 99%"
      false_positive_rate: "< 0.1%"
    availability:
      uptime: "99.99%"
      planned_maintenance: "4 hours/month"
  data_pipeline:
    freshness:
      features: "< 5 minutes stale"
      model_performance: "< 1 hour old"
    completeness:
      transaction_capture: "> 99.9%"
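Checking observed latencies against these SLOs is a percentile computation. A minimal sketch using a nearest-rank percentile over a synthetic sample (a real check would read from the metrics system):

```python
import math

# Check a latency sample against the SLOs: p50 <= 50ms, p95 <= 100ms,
# p99 <= 200ms. Nearest-rank percentile; the sample below is synthetic.
def percentile(samples, p):
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

latencies_ms = [20] * 90 + [90] * 8 + [150] * 2
slo_ok = (
    percentile(latencies_ms, 50) <= 50
    and percentile(latencies_ms, 95) <= 100
    and percentile(latencies_ms, 99) <= 200
)
```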

Migration Strategy

Phase 1: Pilot (3 months)

Phase 2: Gradual Rollout (6 months)


Key Takeaways

  1. Latency is critical: < 100ms for real-time decisions
  2. State management: Flink state backend for ML features
  3. Feature store: Feast for real-time feature serving
  4. Model drift: Continuous monitoring and retraining
  5. Fault tolerance: Fallback strategies for all components
  6. Cost optimization: Spot instances for 70% compute savings
  7. Compliance: PCI-DSS and SOX requirements drive architecture
  8. Monitoring: Comprehensive alerting on performance and cost
