Case Study - FinTech Fraud Detection
Real-Time ML for Financial Services
Business Context
Company: Online payment processor (10M transactions/day)
Challenge: Detect fraudulent transactions in real-time while minimizing false positives that block legitimate customers.
Scale: 10B transactions/year, 10TB/day
Requirements
Architecture Design
Technology Selection
Component Selection Rationale
| Component | Technology | Why |
|---|---|---|
| Streaming | Kafka | Proven at scale, exactly-once |
| Processing | Flink | Millisecond latency, state management |
| Feature Store | Feast | Real-time serving, offline/online |
| Real-Time Analytics | ClickHouse | Fast ingest, excellent compression |
| Historical Analytics | S3 + Delta + Spark | Cost optimization, time travel |
| ML Platform | MLflow + KFServing | Open source, flexible |
| Orchestration | Airflow | Fault tolerance, retries |
| Quality | Great Expectations | Data validation, contracts |
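The Quality row calls for data validation and contracts. In this stack those would normally be expressed as Great Expectations suites; the sketch below is a deliberately minimal, hand-rolled stand-in that shows the kind of contract checks involved. The Delta path and the specific rules are illustrative assumptions, not part of the case study.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("TransactionContracts").getOrCreate()

# Illustrative input: in production this would be the bronze Delta table
batch = spark.read.format("delta").load("s3://lakehouse/bronze/transactions")

# Minimal contract: required keys present, amounts strictly positive
checks = {
    "transaction_id_not_null": batch.filter(F.col("transaction_id").isNull()).count() == 0,
    "user_id_not_null":        batch.filter(F.col("user_id").isNull()).count() == 0,
    "amount_positive":         batch.filter(F.col("amount") <= 0).count() == 0,
}

failed = [name for name, passed in checks.items() if not passed]
if failed:
    raise ValueError(f"Data contract violated: {failed}")
```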
Data Pipeline
Real-Time Pipeline
```python
# Flink streaming job
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Register the Kafka source table
t_env.execute_sql("""
    CREATE TABLE transactions (
        transaction_id BIGINT,
        user_id BIGINT,
        merchant_id BIGINT,
        amount DECIMAL(18, 2),
        ts TIMESTAMP(3),              -- event time (TIMESTAMP is a reserved keyword)
        metadata MAP<STRING, STRING>, -- free-form transaction metadata
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'transactions',
        'properties.bootstrap.servers' = 'kafka:9092',
        'format' = 'json'
    )
""")

# Join with features served via Feast (temporal lookup joins)
with_features = t_env.sql_query("""
    SELECT
        t.transaction_id,
        t.user_id,
        t.merchant_id,
        t.amount,
        f.user_age,
        f.user_account_age_days,
        m.merchant_fraud_rate,
        m.merchant_category
    FROM transactions AS t
    JOIN feature_store.user_features
        FOR SYSTEM_TIME AS OF t.ts AS f ON t.user_id = f.user_id
    JOIN feature_store.merchant_features
        FOR SYSTEM_TIME AS OF t.ts AS m ON t.merchant_id = m.merchant_id
""")

# ML inference (XGBoost model wrapped as a scalar function)
with_predictions = with_features.map(fraud_detection_model)

# Route based on prediction
t_env.create_temporary_view("predictions", with_predictions)

# Split streams
fraud = t_env.sql_query("""
    SELECT transaction_id, prediction, probability
    FROM predictions
    WHERE prediction = 1 AND probability > 0.8
""")
legitimate = t_env.sql_query("""
    SELECT transaction_id, prediction, probability
    FROM predictions
    WHERE prediction = 0 OR probability <= 0.8
""")

# Sink: block likely fraud
fraud.execute_insert("blocked_transactions")

# Sink: forward everything else
legitimate.execute_insert("approved_transactions")
```
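The streaming job maps each enriched row through `fraud_detection_model` without defining it. Below is a minimal sketch of what that scalar function could look like as a PyFlink UDF around an XGBoost classifier; the model file name, the feature subset, and the 0.5 decision threshold are illustrative assumptions, not part of the case study.

```python
import numpy as np
import xgboost as xgb
from pyflink.common import Row
from pyflink.table import DataTypes
from pyflink.table.udf import udf

# Hypothetical: model artifact shipped alongside the Flink job
_model = xgb.XGBClassifier()
_model.load_model("fraud_model.json")

@udf(result_type=DataTypes.ROW([
    DataTypes.FIELD("transaction_id", DataTypes.BIGINT()),
    DataTypes.FIELD("prediction", DataTypes.INT()),
    DataTypes.FIELD("probability", DataTypes.DOUBLE())]))
def fraud_detection_model(t: Row) -> Row:
    # Field positions follow the SELECT in with_features:
    # 0=transaction_id, 3=amount, 4=user_age,
    # 5=user_account_age_days, 6=merchant_fraud_rate
    features = np.array(
        [[float(t[3]), float(t[4]), float(t[5]), float(t[6])]])
    probability = float(_model.predict_proba(features)[0, 1])
    return Row(t[0], int(probability > 0.5), probability)
```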
Batch Pipeline (Model Training)
```python
# Daily model training
from pyspark.sql import SparkSession
from xgboost.spark import SparkXGBClassifier

spark = SparkSession.builder.appName("FraudModelTraining").getOrCreate()

# Read historical transactions (bronze layer)
transactions = spark.read.format("delta") \
    .load("s3://lakehouse/bronze/transactions")

# Join with curated user features (gold layer)
features = spark.read.format("delta") \
    .load("s3://lakehouse/gold/user_features")

training_data = transactions.join(features, "user_id")

# Train model
# (assumes the numeric feature columns are assembled into the default
#  "features" vector column, e.g. with pyspark.ml.feature.VectorAssembler)
xgb = SparkXGBClassifier(
    max_depth=6,
    learning_rate=0.1,
    n_estimators=100,
    label_col="is_fraud"
)

model = xgb.fit(training_data)

# Save model
model.write().overwrite().save("s3://models/fraud_detection_v2")
```
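Continuing from the training job above, the ML platform row pairs MLflow with KFServing. Here is a minimal sketch of how the fitted model might be evaluated on a holdout and versioned in MLflow so serving can pin a specific version and roll back if needed; the experiment name, split ratio, and metric choice are assumptions.

```python
import mlflow
import mlflow.spark
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Holdout evaluation (reuses training_data and xgb from the job above)
train_df, test_df = training_data.randomSplit([0.8, 0.2], seed=42)
candidate = xgb.fit(train_df)

evaluator = BinaryClassificationEvaluator(
    labelCol="is_fraud",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderPR")
au_pr = evaluator.evaluate(candidate.transform(test_df))

mlflow.set_experiment("fraud_detection")
with mlflow.start_run():
    mlflow.log_params({"max_depth": 6, "learning_rate": 0.1,
                       "n_estimators": 100})
    mlflow.log_metric("area_under_pr", au_pr)
    # Registering the model gives the serving layer a concrete version
    # to deploy, and something concrete to roll back to.
    mlflow.spark.log_model(candidate, artifact_path="model",
                           registered_model_name="fraud_detection")
```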
Cost Optimization
Storage Costs
| Layer | Format | Size | Monthly Cost |
|---|---|---|---|
| Raw (Bronze) | Delta | 300TB | $6,900 |
| Features (Silver) | Delta | 150TB | $3,450 |
| Training (Gold) | Parquet | 50TB | $1,150 |
| Total | | 500TB | $11,500 |
Optimization:
- Lifecycle: Bronze (7 days) → Archive
- Compression: ZSTD-19 for cold data
- Partitioning: Date + hour for efficient pruning (see the write sketch after this list)
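A sketch of how the bronze write could apply these settings with PySpark and Delta Lake. The sample rows are synthetic and only there to make the snippet runnable, and the exact compression level (e.g. ZSTD-19) would typically be tuned through the Parquet writer's `parquet.compression.codec.zstd.level` property rather than shown here.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, hour

spark = SparkSession.builder.appName("BronzeWriter").getOrCreate()

# ZSTD compression for the underlying Parquet files of cold Delta data
spark.conf.set("spark.sql.parquet.compression.codec", "zstd")

# Two synthetic rows purely to make the sketch runnable
raw = spark.createDataFrame(
    [(1, 101, 9001, 42.50, "2024-01-15 10:03:00"),
     (2, 102, 9002, 13.99, "2024-01-15 11:47:00")],
    ["transaction_id", "user_id", "merchant_id", "amount", "ts"])

# Partition by date + hour so readers prune to the slices they touch
(raw.withColumn("ts", raw["ts"].cast("timestamp"))
    .withColumn("date", to_date("ts"))
    .withColumn("hour", hour("ts"))
    .write.format("delta")
    .partitionBy("date", "hour")
    .mode("append")
    .save("s3://lakehouse/bronze/transactions"))
```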
Compute Costs
| Component | Compute | Cost | Optimization |
|---|---|---|---|
| Flink Cluster | 100 nodes (spot) | $43,200/month | 70% spot discount |
| Spark Training | 200 nodes (spot) | $57,600/month | 70% spot discount |
| Feature Store | 50 nodes (on-demand) | $21,600/month | Right-sized |
| Total | | $122,400/month | |
Optimization:
- Spot instances for fault-tolerant workloads (70% savings; see the cost sketch after this list)
- Auto-scaling based on transaction volume
- Scheduled operations (train during off-peak)
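A back-of-the-envelope check of the compute table above. The per-node hourly rates are assumptions reverse-engineered from the monthly figures (using 720 hours per month); real instance pricing will differ.

```python
HOURS_PER_MONTH = 720

clusters = {
    # name: (nodes, assumed effective $/node-hour actually paid)
    "flink_streaming": (100, 0.60),  # spot, ~70% below on-demand
    "spark_training":  (200, 0.40),  # spot, ~70% below on-demand
    "feature_store":   (50,  0.60),  # on-demand, right-sized
}

total = 0.0
for name, (nodes, rate) in clusters.items():
    monthly = nodes * rate * HOURS_PER_MONTH
    total += monthly
    print(f"{name:16s} {nodes:4d} nodes  ${monthly:>10,.0f}/month")

print(f"{'total':16s}             ${total:>10,.0f}/month")  # ~= $122,400
```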
Total Monthly Cost
| Category | Cost | Optimization |
|---|---|---|
| Storage | $11,500 | Lifecycle + compression |
| Compute | $122,400 | Spot + auto-scaling |
| Network | $5,000 | Colocation, minimize egress |
| Total | $138,900/month | |
| Annual | $1,666,800 | |
Failure Modes
Mode 1: ML Model Drift
Mitigation:
- Continuous monitoring (precision, recall, F1)
- Automated retraining on drift detection (see the drift-check sketch after this list)
- Shadow mode testing before deployment
- Model versioning with rollback capability
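The case study does not prescribe a specific drift test; one common choice is a two-sample Kolmogorov-Smirnov test per numeric feature, sketched below with synthetic data. The feature name, sample sizes, and p-value threshold are assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(training_sample: np.ndarray,
                 live_sample: np.ndarray,
                 p_threshold: float = 0.01) -> bool:
    """Two-sample KS test on one numeric feature.

    Returns True when the live distribution differs from the training
    distribution enough to warrant a retraining run.
    """
    statistic, p_value = ks_2samp(training_sample, live_sample)
    return p_value < p_threshold

# Example: compare the 'amount' feature between training data and the
# last hour of scored transactions (both arrays here are synthetic).
rng = np.random.default_rng(42)
train_amounts = rng.lognormal(mean=3.0, sigma=1.0, size=10_000)
live_amounts = rng.lognormal(mean=3.4, sigma=1.0, size=10_000)  # shifted

if detect_drift(train_amounts, live_amounts):
    print("Drift detected on 'amount' -> trigger retraining DAG")
```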
Mode 2: Streaming Pipeline Failure
Mitigation:
- Kafka buffer: 7 days of retention
- Flink checkpointing: Save state to S3 (sketched after this list)
- Horizontal pod autoscaling: Scale on backlog
- Alert on processing lag > 5 minutes
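A minimal sketch of the checkpointing side of this mitigation in PyFlink. The 60-second interval, pause, and timeout values are assumptions; the durable checkpoint location (an S3 bucket here) is normally set on the cluster via the Flink option `state.checkpoints.dir`.

```python
from pyflink.datastream import StreamExecutionEnvironment, CheckpointingMode

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint every 60s with exactly-once semantics
env.enable_checkpointing(60_000, CheckpointingMode.EXACTLY_ONCE)

cfg = env.get_checkpoint_config()
cfg.set_min_pause_between_checkpoints(30_000)  # breathing room between checkpoints
cfg.set_checkpoint_timeout(120_000)            # abort checkpoints that hang
# Durable checkpoint storage (e.g. s3://.../checkpoints) is configured
# cluster-side via `state.checkpoints.dir`.
```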
Mode 3: Feature Store Unavailability
Mitigation:
- Cache features in Redis (24-hour TTL; sketched after this list)
- Fallback to rule-based system
- Feature store high availability (multi-region)
- Graceful degradation with monitoring
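A sketch of the cache-plus-fallback path using redis-py. `fs_client.get_online_features` stands in for whatever Feast client wrapper the service uses (hypothetical), and the fallback values are illustrative.

```python
import json
import redis

r = redis.Redis(host="redis", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 24 * 60 * 60  # 24-hour TTL, as in the mitigation list

def get_user_features(user_id: int, fs_client) -> dict:
    """Serve features with Redis in front of the feature store."""
    key = f"user_features:{user_id}"
    try:
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        feats = fs_client.get_online_features(user_id)  # hypothetical Feast wrapper
        r.setex(key, CACHE_TTL_SECONDS, json.dumps(feats))
        return feats
    except Exception:
        # Rule-based fallback when both cache and feature store are down:
        # treat the user as unknown and let downstream rules decide.
        return {"user_age": None, "user_account_age_days": 0,
                "merchant_fraud_rate": None}
```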
SLA/SLO Definitions
```yaml
slas:
  fraud_detection:
    latency:
      p50: "50ms"
      p95: "100ms"
      p99: "200ms"
      target: "99% of transactions"
    accuracy:
      fraud_detection_rate: "> 99%"
      false_positive_rate: "< 0.1%"
    availability:
      uptime: "99.99%"
      planned_maintenance: "4 hours/month"
  data_pipeline:
    freshness:
      features: "< 5 minutes stale"
      model_performance: "< 1 hour old"
    completeness:
      transaction_capture: "> 99.9%"
```
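A small sketch of how the latency targets above could be checked over a window of observed latencies; the gamma-distributed sample is synthetic and only there to make the example runnable.

```python
import numpy as np

# Latency SLO targets from the definition above, in milliseconds
SLO_MS = {"p50": 50, "p95": 100, "p99": 200}

def check_latency_slo(latencies_ms: np.ndarray) -> dict:
    """Return pass/fail per percentile target for a window of latencies."""
    observed = {
        "p50": np.percentile(latencies_ms, 50),
        "p95": np.percentile(latencies_ms, 95),
        "p99": np.percentile(latencies_ms, 99),
    }
    return {p: observed[p] <= target for p, target in SLO_MS.items()}

# Synthetic example window of per-transaction latencies
rng = np.random.default_rng(0)
window = rng.gamma(shape=2.0, scale=20.0, size=100_000)  # milliseconds
print(check_latency_slo(window))
```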
Migration Strategy
Phase 1: Pilot (3 months)
Phase 2: Gradual Rollout (6 months)
Key Takeaways
- Latency is critical: < 100ms for real-time decisions
- State management: Flink state backend for ML features
- Feature store: Feast for real-time feature serving
- Model drift: Continuous monitoring and retraining
- Fault tolerance: Fallback strategies for all components
- Cost optimization: Spot instances for 70% compute savings
- Compliance: PCI-DSS and SOX requirements drive architecture
- Monitoring: Comprehensive alerting on performance and cost
Back to Module 8