Spot and Preemptible Instances
60-80% Compute Cost Savings
Overview
Spot Instances (AWS) and preemptible VMs (GCP) sell unused cloud capacity at discounts of roughly 60-80%. The provider can reclaim them on short notice (a two-minute warning on AWS, 30 seconds on GCP), which makes them ideal for fault-tolerant batch workloads and a poor fit for latency-sensitive ones. This document covers patterns for using spot instances effectively in data platforms.
The Spot Instance Model
Spot instance: unused cloud capacity offered at a steep discount.
Trade-off: 60-80% savings in exchange for the risk of interruption at any time.
Pricing Comparison
On-Demand vs. Spot
| Instance Type | On-Demand | Spot (60-80% off) | Savings (per instance, 720h month) |
|---|---|---|---|
| r5.8xlarge (32 vCPU, 256GB) | $2.00/hour | $0.40-0.80/hour | $864-1,152/month |
| r5.16xlarge (64 vCPU, 512GB) | $4.00/hour | $0.80-1.60/hour | $1,728-2,304/month |
Example: 100-node cluster, r5.8xlarge
- On-demand: $200/hour = $144,000/month
- Spot: $40-80/hour = $28,800-57,600/month
- Savings: $86,400-115,200/month (60-80%)
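The figures above can be reproduced with a few lines of Python. Working in cents keeps the arithmetic exact; the prices are the illustrative rates from the table, not live AWS quotes.

```python
NODES = 100
HOURS = 720                  # 24 hours x 30 days
ON_DEMAND_CENTS = 200        # $2.00/hour, in cents
SPOT_CENTS = (40, 80)        # $0.40-0.80/hour, in cents

on_demand_monthly = NODES * ON_DEMAND_CENTS * HOURS // 100           # dollars
spot_monthly = tuple(NODES * c * HOURS // 100 for c in SPOT_CENTS)   # dollars
savings = tuple(on_demand_monthly - s for s in reversed(spot_monthly))

print(f"on-demand: ${on_demand_monthly:,}/month")                     # $144,000/month
print(f"spot:      ${spot_monthly[0]:,}-{spot_monthly[1]:,}/month")   # $28,800-57,600/month
print(f"savings:   ${savings[0]:,}-{savings[1]:,}/month")             # $86,400-115,200/month
```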
Use Cases
Ideal for Spot
- Batch ETL (retryable, checkpointed)
- ML training (checkpointed between epochs)
- Ad-hoc analytics and backfills
- Dev/test clusters
Not Ideal for Spot
- Interactive queries (user-facing, latency-sensitive)
- Streaming with strict SLA (can’t afford interruption)
- Low-latency serving (databases, APIs)
- Short critical jobs (interruption unacceptable)
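The two lists above boil down to a pair of questions: can the workload survive interruption, and can it resume cheaply? A toy helper (illustrative only, not from any cloud SDK) encodes the rule of thumb:

```python
def suitable_for_spot(fault_tolerant: bool, latency_sensitive: bool,
                      checkpointed: bool) -> bool:
    """Rule of thumb: spot fits workloads that survive interruption and
    can resume from a checkpoint; never latency-sensitive serving."""
    return fault_tolerant and checkpointed and not latency_sensitive

# Batch ETL with checkpoints: good spot candidate
print(suitable_for_spot(True, False, True))   # True
# User-facing interactive queries: keep on on-demand
print(suitable_for_spot(False, True, False))  # False
```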
Spot Instance Strategies
Strategy 1: Spot + On-Demand Mix
Configuration:
- 70-80% spot (for cost savings)
- 20-30% on-demand (for fallback)
- On-demand handles critical workloads
- Spot handles fault-tolerant workloads
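A small sizing helper (illustrative, not tied to any provider API) makes the split concrete:

```python
def split_capacity(total_nodes: int, spot_fraction: float = 0.8) -> tuple[int, int]:
    """Split a target cluster size into (spot, on_demand) node counts."""
    spot = round(total_nodes * spot_fraction)
    on_demand = total_nodes - spot   # on-demand remainder acts as the fallback
    return spot, on_demand

print(split_capacity(100))        # (80, 20)
print(split_capacity(100, 0.7))   # (70, 30)
```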
Strategy 2: Spot for Specific Node Pools
```yaml
# Kubernetes node pools: spot for fault-tolerant work, on-demand for the rest
node_pools:
  spot_pool:
    instance_type: "r5.8xlarge"
    spot: true
    min_nodes: 10
    max_nodes: 100
    workloads:
      - "batch-etl"
      - "ml-training"
  on_demand_pool:
    instance_type: "r5.8xlarge"
    spot: false
    min_nodes: 5
    max_nodes: 20
    workloads:
      - "streaming-jobs"
      - "interactive-queries"
```
Strategy 3: Spot with Checkpointing
```python
# Configure dynamic allocation so the job can scale as spot capacity changes
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.dynamicAllocation.minExecutors", "10")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "100")

# checkpointLocation is a Structured Streaming option: the stream records its
# progress there and resumes from it after an interruption. (Batch writes do
# not take this option; they rely on idempotent overwrites instead.)
df = spark.readStream.format("delta").load("s3://bucket/input")

query = (df.writeStream
         .format("delta")
         .option("checkpointLocation", "s3://bucket/checkpoint/")
         .start("s3://bucket/output"))

# If interrupted, restarting the job resumes from the checkpoint
```
Fault Tolerance Patterns
Pattern 1: Idempotent Operations
```python
# Idempotent writes (safe to retry)
def write_to_delta(df, path):
    """Write to Delta Lake idempotently: overwrite replaces the target,
    so rerunning after an interruption yields the same result."""
    df.write.format("delta") \
        .mode("overwrite") \
        .save(path)

# If interrupted, just retry; the result will be the same
```
Pattern 2: Checkpoint and Resume
```python
from pyspark.sql.functions import col, max as max_, sum as sum_

# checkpoint_exists / read_checkpoint / write_checkpoint are placeholder
# helpers for whatever checkpoint store you use (e.g. a small table or S3 key)
def process_with_checkpoint(input_path, output_path, checkpoint_path):
    """Process data with checkpoint/resume capability."""
    # Resume from the last checkpoint if one exists
    if checkpoint_exists(checkpoint_path):
        last_processed = read_checkpoint(checkpoint_path)
        df = spark.read.parquet(input_path) \
            .filter(col("timestamp") > last_processed)
    else:
        df = spark.read.parquet(input_path)

    # Process
    result = df.groupBy("key").agg(sum_("value").alias("total"))

    # Write results first, then record progress: an interruption between the
    # two steps only causes reprocessing, never data loss
    result.write.format("delta") \
        .mode("append") \
        .save(output_path)

    high_water_mark = df.agg(max_("timestamp")).collect()[0][0]
    write_checkpoint(checkpoint_path, high_water_mark)
```
Pattern 3: Shuffle Partitioning
```python
from pyspark.sql.functions import sum as sum_

# Run an external shuffle service so shuffle files outlive their executors
spark.conf.set("spark.shuffle.service.enabled", "true")
spark.conf.set("spark.shuffle.service.port", "7337")
spark.conf.set("spark.dynamicAllocation.enabled", "true")

# With the external shuffle service, losing a spot executor does not force
# upstream stages to be recomputed
df = df.repartition(200).cache()  # cache the repartitioned data
df.count()                        # materialize the cache
result = df.groupBy("key").agg(sum_("value"))
```
Cloud Provider Implementation
AWS EMR Spot Instances
import boto3
```python
# Create an EMR cluster with an on-demand master and spot core nodes
emr = boto3.client('emr')

response = emr.run_job_flow(
    Name='Spot-EMR-Cluster',
    Instances={
        'InstanceGroups': [
            {
                'Name': 'Master',
                'InstanceCount': 1,
                'InstanceType': 'r5.8xlarge',
                'Market': 'ON_DEMAND'
            },
            {
                'Name': 'Core-Spot',
                'InstanceCount': 10,
                'InstanceType': 'r5.8xlarge',
                'Market': 'SPOT',
                'BidPrice': '0.40',  # Max hourly price; optional, defaults to the on-demand price
                'EbsConfiguration': {
                    'EbsBlockDeviceConfigs': [
                        {
                            'VolumeSpecification': {
                                'VolumeSize': 100,
                                'VolumeType': 'gp2'
                            }
                        }
                    ]
                }
            }
        ]
    }
)
```
GCP Dataproc Preemptible VMs
```python
# Create a Dataproc cluster with preemptible secondary workers
from google.cloud import dataproc_v1 as dataproc

# Primary workers stay standard; preemptible capacity goes in the
# secondary worker group (primary workers cannot be preemptible)
cluster_config = {
    'project_id': 'my-project',
    'cluster_name': 'spot-cluster',
    'config': {
        'worker_config': {
            'num_instances': 10,
            'machine_type_uri': 'n1-standard-8'
        },
        'secondary_worker_config': {
            'num_instances': 20,
            'machine_type_uri': 'n1-standard-8',
            'preemptibility': 'PREEMPTIBLE'  # GCP's spot equivalent
        }
    }
}
```
Databricks Spot Instances
```python
# Databricks cluster spec using spot with on-demand fallback
cluster_spec = {
    "cluster_name": "spot-cluster",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "r5.8xlarge",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 100
    },
    "aws_attributes": {
        "first_on_demand": 1,                  # keep the driver on on-demand
        "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand if spot is unavailable
        "spot_bid_price_percent": 100          # max bid as a percent of the on-demand price
    }
}
```
Spot Instance Diversification
Use Multiple Instance Types
```python
# Diversify across instance types to improve spot availability
instance_types = [
    "r5.8xlarge",   # Memory optimized
    "m5.8xlarge",   # General purpose
    "c5.9xlarge",   # Compute optimized
    "r5a.8xlarge"   # AMD-based (slightly cheaper)
]

# EMR configuration: one spot instance group per type
instance_groups = [
    {
        'Name': f'Spot-{instance_type}',
        'InstanceCount': 5,
        'InstanceType': instance_type,
        'Market': 'SPOT'
    }
    for instance_type in instance_types
]
```
Benefit: if one instance type becomes unavailable, the others keep running.
Monitoring and Alerting
Spot Interruption Monitoring
On AWS, each spot instance receives a two-minute interruption notice through the instance metadata service; polling that endpoint is the standard way to detect it from on-instance code.

```python
import requests

# The spot/instance-action metadata path returns 200 with a JSON body once an
# interruption is scheduled, and 404 otherwise
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def spot_interruption_imminent() -> bool:
    """Return True if AWS has scheduled this spot instance for interruption."""
    try:
        response = requests.get(METADATA_URL, timeout=1)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

def monitor_spot_interruptions():
    """Poll for interruption notices and trigger a graceful shutdown."""
    if spot_interruption_imminent():
        logger.warning("Spot interruption imminent; starting graceful shutdown")
        graceful_shutdown()

def graceful_shutdown():
    """Gracefully shut down: checkpoint and save state before reclaim."""
    # Cancel in-flight work so partial output is not committed
    spark.sparkContext.cancelJobGroup("batch-etl")
    # Save application state (placeholder helper)
    save_application_state()
```
Dashboard Metrics
```yaml
# Metrics to track
spot_dashboard:
  metrics:
    - name: "Spot instance count"
      query: "SELECT COUNT(*) FROM instances WHERE type = 'spot'"
    - name: "Spot interruption rate"
      query: "SELECT COUNT(*) FROM spot_interruptions WHERE timestamp > NOW() - INTERVAL '1 hour'"
    - name: "Cost savings"
      query: "SELECT SUM(on_demand_cost - spot_cost) FROM cluster_metrics"
    - name: "Job failure rate"
      query: "SELECT COUNT(*) FROM jobs WHERE status = 'failed' AND instance_type = 'spot'"
```
Cost Optimization
Spot Usage Calculator
```python
def calculate_spot_savings(instance_count: int,
                           instance_type: str,
                           hours_per_month: int) -> dict:
    """Calculate spot savings versus on-demand."""
    # Pricing lookups (placeholder helpers backed by current price data)
    on_demand_price = get_on_demand_price(instance_type)
    spot_price = get_spot_price(instance_type)

    on_demand_cost = instance_count * hours_per_month * on_demand_price
    spot_cost = instance_count * hours_per_month * spot_price

    savings = on_demand_cost - spot_cost
    savings_percent = (savings / on_demand_cost) * 100

    return {
        "on_demand_cost": on_demand_cost,
        "spot_cost": spot_cost,
        "savings": savings,
        "savings_percent": savings_percent
    }

# Example
result = calculate_spot_savings(
    instance_count=100,
    instance_type="r5.8xlarge",
    hours_per_month=720  # 24 x 30
)
# At 60-80% spot discounts: roughly $86,400-115,200/month in savings
```
Senior-Level Considerations
Anti-Patterns
Anti-Pattern 1: No fallback mechanism
```python
# Bad: all spot, no on-demand fallback
cluster = [spot_node] * 100

# Good: mixed spot + on-demand
cluster = [spot_node] * 80 + [on_demand_node] * 20
```
Anti-Pattern 2: No checkpointing
```python
# Bad: streaming write without a checkpoint; progress is lost on interruption
df.writeStream.format("delta").start(path)

# Good: checkpointed streaming write; a restart resumes where it left off
df.writeStream.format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .start(path)
```
Anti-Pattern 3: Spot for everything
```python
# Bad: spot for critical streaming
streaming_job = spot_cluster

# Good: spot for batch, on-demand for streaming
batch_job = spot_cluster
streaming_job = on_demand_cluster
```
Key Takeaways
- 60-80% savings: Spot instances offer massive cost reduction
- Fault tolerance required: Workloads must handle interruption
- Checkpoints: Enable restart from interruption
- Idempotency: Safe to retry operations
- Mixed strategy: Spot + on-demand for reliability
- Diversification: Use multiple instance types
- Monitoring: Track interruption rates, job failures
- Not for everything: Avoid for latency-sensitive workloads
Next: Cluster Rightsizing