
Spot / Preemptible Instances

60-80% Compute Cost Savings


Overview

Spot (AWS) and preemptible (GCP) instances offer unused cloud capacity at 60-80% discounts. They can be terminated by the cloud provider with short notice, making them ideal for fault-tolerant batch workloads. This document covers patterns for using spot instances effectively in data platforms.


The Spot Instance Model

Spot instance: Unused cloud capacity offered at discounted prices.

Trade-off: 60-80% discount vs. can be interrupted anytime.


Pricing Comparison

On-Demand vs. Spot

Instance Type                    On-Demand     Spot (60-80% off)    Savings (720 hrs/month)
r5.8xlarge (32 vCPU, 256GB)      $2.00/hour    $0.40-0.80/hour      $864-1,152/month
r5.16xlarge (64 vCPU, 512GB)     $4.00/hour    $0.80-1.60/hour      $1,728-2,304/month

Example: 100-node cluster, r5.8xlarge

  • On-demand: $200/hour = $144,000/month
  • Spot: $40-80/hour = $28,800-57,600/month
  • Savings: $86,400-115,200/month (60-80%)

Use Cases

Ideal for Spot

  • Batch ETL (fault-tolerant, restartable)
  • ML training (checkpoint and resume)
  • Ad-hoc analytics and backfills (no strict SLA)
  • Any workload that can checkpoint and safely retry

Not Ideal for Spot

  • Interactive queries (user-facing, latency-sensitive)
  • Streaming with strict SLA (can’t afford interruption)
  • Low-latency serving (databases, APIs)
  • Short critical jobs (interruption unacceptable)

Spot Instance Strategies

Strategy 1: Spot + On-Demand Mix

Configuration:

  • 70-80% spot (for cost savings)
  • 20-30% on-demand (for fallback)
  • On-demand handles critical workloads
  • Spot handles fault-tolerant workloads
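
One way to express this mix on AWS is with EMR instance fleets, which accept separate spot and on-demand capacity targets. A minimal sketch; the `build_fleet_config` helper and its parameters are illustrative, while the field names follow the `InstanceFleets` parameter of `run_job_flow`:

```python
def build_fleet_config(total_capacity: int, spot_fraction: float) -> dict:
    """Split an EMR fleet's target capacity between spot and on-demand units."""
    spot_units = int(total_capacity * spot_fraction)
    return {
        'Name': 'Core-Fleet',
        'InstanceFleetType': 'CORE',
        # EMR provisions spot units first; on-demand units are guaranteed
        'TargetSpotCapacity': spot_units,
        'TargetOnDemandCapacity': total_capacity - spot_units,
        'InstanceTypeConfigs': [
            {'InstanceType': 'r5.8xlarge', 'WeightedCapacity': 1},
        ],
    }

# 70/30 split: 70 capacity units on spot, 30 on on-demand
fleet = build_fleet_config(total_capacity=100, spot_fraction=0.7)
```

The on-demand units keep the cluster alive through spot shortages, so critical stages always have somewhere to run.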

Strategy 2: Spot for Specific Node Pools

# Kubernetes node pools
node_pools:
  spot_pool:
    instance_type: "r5.8xlarge"
    spot: true
    min_nodes: 10
    max_nodes: 100
    workloads:
      - "batch-etl"
      - "ml-training"
  on_demand_pool:
    instance_type: "r5.8xlarge"
    spot: false
    min_nodes: 5
    max_nodes: 20
    workloads:
      - "streaming-jobs"
      - "interactive-queries"
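
To steer workloads onto the right pool, pods select spot nodes and tolerate their taint. A sketch, assuming the spot pool is labeled `node-pool: spot_pool` and tainted `spot=true:NoSchedule` (the label and taint names are illustrative and vary by cloud provider):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: batch-etl-job
spec:
  nodeSelector:
    node-pool: spot_pool        # assumed label on the spot pool
  tolerations:
    - key: "spot"               # assumed taint on spot nodes
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"
  containers:
    - name: etl
      image: my-etl:latest
```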

Strategy 3: Spot with Checkpointing

# Spark configured for spot interruptions
df = spark.read.parquet("s3://bucket/input")

# Dynamic allocation replaces executors lost to interruptions
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "100")
spark.conf.set("spark.dynamicAllocation.minExecutors", "10")

# Idempotent overwrite: if the job is interrupted, simply rerun it.
# (Note: "checkpointLocation" is a Structured Streaming option; for
# batch writes, idempotency is what makes the restart safe.)
df.write.format("delta") \
    .mode("overwrite") \
    .save("s3://bucket/output")

Fault Tolerance Patterns

Pattern 1: Idempotent Operations

# Idempotent writes (safe to retry)
def write_to_delta(df, path):
    """Write to Delta Lake idempotently."""
    df.write.format("delta") \
        .mode("overwrite") \
        .save(path)
    # If interrupted, just retry: overwrite mode
    # produces the same result either way
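
The same property can be seen without Spark: an overwrite-style write converges to the same state no matter how many times it is retried, while an append-style write does not. A minimal pure-Python sketch (the dict stands in for a table):

```python
def overwrite(store: dict, key: str, rows: list) -> None:
    """Idempotent: final state is the same after any number of retries."""
    store[key] = list(rows)

def append(store: dict, key: str, rows: list) -> None:
    """Not idempotent: each retry duplicates the rows."""
    store.setdefault(key, []).extend(rows)

# Simulate a job that is interrupted and retried three times
store = {}
for _ in range(3):
    overwrite(store, "out", [1, 2, 3])
assert store["out"] == [1, 2, 3]      # retries are harmless

store = {}
for _ in range(3):
    append(store, "out", [1, 2, 3])
assert store["out"] != [1, 2, 3]      # 9 rows: retries duplicated data
```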

Pattern 2: Checkpoint and Resume

# Checkpoint progress
from pyspark.sql.functions import col, max as max_, sum as sum_

def process_with_checkpoint(input_path, output_path, checkpoint_path):
    """Process data with checkpoint/resume capability."""
    # Resume from the last checkpoint if one exists
    if checkpoint_exists(checkpoint_path):
        last_processed = read_checkpoint(checkpoint_path)
        df = spark.read.parquet(input_path) \
            .filter(col("timestamp") > last_processed)
    else:
        df = spark.read.parquet(input_path)

    # Process
    result = df.groupBy("key").agg(sum_("value").alias("total"))

    # Write results first, then record progress: an interruption
    # between the two steps just means the batch is reprocessed
    result.write.format("delta") \
        .mode("overwrite") \
        .save(output_path)
    high_water_mark = df.agg(max_("timestamp")).collect()[0][0]
    write_checkpoint(checkpoint_path, high_water_mark)
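
The `checkpoint_exists` / `read_checkpoint` / `write_checkpoint` helpers above are left undefined; against local or mounted storage they can be as simple as the following sketch, which uses a plain text file as the checkpoint store:

```python
import os

def checkpoint_exists(path: str) -> bool:
    return os.path.exists(path)

def read_checkpoint(path: str) -> str:
    with open(path) as f:
        return f.read().strip()

def write_checkpoint(path: str, value) -> None:
    # Write to a temp file, then rename: the checkpoint is either
    # fully written or absent, never half-written
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(value))
    os.replace(tmp, path)
```

Against S3 the same interface would wrap `put_object`/`get_object`; the atomic-rename trick is what keeps a spot interruption mid-write from corrupting the checkpoint.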

Pattern 3: External Shuffle Service

# Persist shuffle data via the external shuffle service
spark.conf.set("spark.shuffle.service.enabled", "true")
spark.conf.set("spark.shuffle.service.port", "7337")
spark.conf.set("spark.dynamicAllocation.enabled", "true")

# Shuffle files served by the node-level service survive the loss
# of an individual executor (though not the loss of the whole node)
from pyspark.sql.functions import sum as sum_
df = df.repartition(200).cache()  # cache the repartitioned data for reuse
df.count()                        # materialize the cache
result = df.groupBy("key").agg(sum_("value"))

Cloud Provider Implementation

AWS EMR Spot Instances

import boto3

# Create an EMR cluster with spot core nodes
emr = boto3.client('emr')
response = emr.run_job_flow(
    Name='Spot-EMR-Cluster',
    Instances={
        'InstanceGroups': [
            {
                'Name': 'Master',
                'InstanceCount': 1,
                'InstanceType': 'r5.8xlarge',
                'Market': 'ON_DEMAND'
            },
            {
                'Name': 'Core-Spot',
                'InstanceCount': 10,
                'InstanceType': 'r5.8xlarge',
                'Market': 'SPOT',
                'BidPrice': '0.40',  # Max price per instance-hour
                'EbsConfiguration': {
                    'EbsBlockDeviceConfigs': [
                        {
                            'VolumeSpecification': {
                                'VolumeSize': 100,
                                'VolumeType': 'gp2'
                            }
                        }
                    ]
                }
            }
        ]
    }
)

GCP Dataproc Preemptible VMs

# Create a Dataproc cluster with preemptible secondary workers
from google.cloud import dataproc_v1 as dataproc

cluster = {
    'project_id': 'my-project',
    'cluster_name': 'spot-cluster',
    'config': {
        # Primary workers must be standard (non-preemptible) VMs
        'worker_config': {
            'num_instances': 10,
            'machine_type_uri': 'n1-standard-8'
        },
        # Secondary workers can be preemptible (spot equivalent)
        'secondary_worker_config': {
            'num_instances': 20,
            'machine_type_uri': 'n1-standard-8',
            'preemptibility': 'PREEMPTIBLE'
        }
    }
}

Databricks Spot Instances

# Databricks cluster spec with spot workers and on-demand fallback
cluster_spec = {
    "cluster_name": "spot-cluster",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "r5.8xlarge",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 100
    },
    "aws_attributes": {
        "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand
        "first_on_demand": 1,                  # keep the driver on-demand
        "spot_bid_price_percent": 80           # bid 80% of on-demand price
    }
}

Spot Instance Diversification

Use Multiple Instance Types

# Diversify spot instances to improve availability
instance_types = [
    "r5.8xlarge",   # Memory optimized
    "m5.8xlarge",   # General purpose
    "c5.8xlarge",   # Compute optimized
    "r5a.8xlarge"   # AMD-based (often cheaper)
]

# EMR configuration
instance_groups = [
    {
        'Name': f'Spot-{i}',
        'InstanceCount': 5,
        'InstanceType': instance_type,
        'Market': 'SPOT'
    }
    for i, instance_type in enumerate(instance_types)
]

Benefit: If one instance type becomes unavailable, others continue running.
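
The benefit can be roughly quantified: if each instance type's capacity pool is unavailable with probability p, and pools fail independently, the chance of losing all n pools at once is p^n. A sketch (the independence assumption is a simplification; real spot pools are correlated within a region):

```python
def prob_all_unavailable(p: float, n: int) -> float:
    """P(no spot capacity at all) with n independent pools, each failing with prob p."""
    return p ** n

# One pool with 20% unavailability vs. four diversified pools
single = prob_all_unavailable(0.2, 1)   # 0.2
four = prob_all_unavailable(0.2, 4)     # 0.0016
```

Even with correlation, spreading across instance families and availability zones sharply reduces the odds of a cluster-wide capacity loss.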


Monitoring and Alerting

Spot Interruption Monitoring

# Monitor spot interruption notices
import boto3

def monitor_spot_interruptions():
    """Monitor and log spot instance interruptions."""
    ec2 = boto3.client('ec2')
    # Check spot request status for termination markers
    response = ec2.describe_spot_instance_requests()
    for request in response['SpotInstanceRequests']:
        if request['Status']['Code'] == 'marked-for-termination':
            logger.warning(
                f"Spot interruption imminent: {request['SpotInstanceRequestId']}"
            )
            # Trigger graceful shutdown
            graceful_shutdown()

def graceful_shutdown():
    """Gracefully shut down: checkpoint and save state."""
    # Stop in-flight work so executors exit cleanly
    spark.sparkContext.cancelAllJobs()
    # Save application state
    save_application_state()
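
From inside the instance, AWS also issues a two-minute warning through the instance metadata service at `http://169.254.169.254/latest/meta-data/spot/instance-action`, which returns 404 until an interruption is scheduled. A polling sketch using only the standard library; the `on_interruption` callback is a placeholder for your shutdown hook:

```python
import json
import urllib.error
import urllib.request

IMDS_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def parse_instance_action(payload: str) -> str:
    """Extract the pending action ('terminate' or 'stop') from the IMDS JSON."""
    return json.loads(payload)["action"]

def check_for_interruption(on_interruption) -> bool:
    """Poll IMDS once; invoke the callback if an interruption is pending."""
    try:
        with urllib.request.urlopen(IMDS_URL, timeout=1) as resp:
            action = parse_instance_action(resp.read().decode())
    except urllib.error.HTTPError:
        return False  # 404: no interruption scheduled
    except urllib.error.URLError:
        return False  # not running on EC2
    on_interruption(action)
    return True
```

An agent that polls this endpoint every few seconds gets roughly two minutes to checkpoint and drain before termination.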

Dashboard Metrics

# Metrics to track
spot_dashboard:
  metrics:
    - name: "Spot instance count"
      query: "SELECT COUNT(*) FROM instances WHERE type = 'spot'"
    - name: "Spot interruptions (last hour)"
      query: "SELECT COUNT(*) FROM spot_interruptions WHERE timestamp > NOW() - INTERVAL 1 HOUR"
    - name: "Cost savings"
      query: "SELECT SUM(on_demand_cost - spot_cost) FROM cluster_metrics"
    - name: "Failed jobs on spot"
      query: "SELECT COUNT(*) FROM jobs WHERE status = 'failed' AND instance_type = 'spot'"

Cost Optimization

Spot Usage Calculator

def calculate_spot_savings(
    instance_count: int,
    instance_type: str,
    hours_per_month: int
):
    """Calculate spot savings vs. on-demand."""
    # Pricing lookups (helpers backed by a pricing table or API)
    on_demand_price = get_on_demand_price(instance_type)
    spot_price = get_spot_price(instance_type)

    on_demand_cost = instance_count * hours_per_month * on_demand_price
    spot_cost = instance_count * hours_per_month * spot_price
    savings = on_demand_cost - spot_cost
    savings_percent = (savings / on_demand_cost) * 100
    return {
        "on_demand_cost": on_demand_cost,
        "spot_cost": spot_cost,
        "savings": savings,
        "savings_percent": savings_percent
    }

# Example
result = calculate_spot_savings(
    instance_count=100,
    instance_type="r5.8xlarge",
    hours_per_month=720  # 24 x 30
)
# Result: $86,400-115,200/month savings, depending on the spot price

Senior Level Considerations

Anti-Patterns

Anti-Pattern 1: No fallback mechanism

# Bad: all spot, no on-demand fallback
cluster = [spot_nodes] * 100

# Good: mixed spot + on-demand
cluster = [spot_nodes] * 80 + [on_demand_nodes] * 20

Anti-Pattern 2: No checkpointing

# Bad: append without idempotency: a retried job duplicates data
df.write.mode("append").save(path)

# Good: idempotent overwrite (or, for streaming writes, a
# checkpointLocation) so an interrupted job is safe to retry
df.write.mode("overwrite").save(path)

Anti-Pattern 3: Spot for everything

# Bad: spot for critical streaming
streaming_job = spot_cluster

# Good: spot for batch, on-demand for streaming
batch_job = spot_cluster
streaming_job = on_demand_cluster

Key Takeaways

  1. 60-80% savings: Spot instances offer massive cost reduction
  2. Fault tolerance required: Workloads must handle interruption
  3. Checkpoints: Enable restart from interruption
  4. Idempotency: Safe to retry operations
  5. Mixed strategy: Spot + on-demand for reliability
  6. Diversification: Use multiple instance types
  7. Monitoring: Track interruption rates, job failures
  8. Not for everything: Avoid for latency-sensitive workloads

Next: Cluster Rightsizing