Spot and Preemptible Instances
60-80% Compute Cost Savings
Overview
Spot Instances (AWS) and preemptible VMs (GCP) sell unused cloud capacity at discounts of roughly 60-80%. The provider can reclaim them on short notice (a two-minute warning on AWS, 30 seconds on GCP), which makes them ideal for fault-tolerant batch workloads and a poor fit for latency-sensitive ones. This document covers patterns for using spot instances effectively in data platforms.
The Spot Instance Model
Spot instance: unused cloud capacity offered at a steep discount.
Trade-off: 60-80% savings in exchange for the risk of interruption at any time.
Pricing Comparison
On-Demand vs. Spot
| Instance Type | On-Demand | Spot (60-80% off) | Savings (per instance, 720h month) |
|---|---|---|---|
| r5.8xlarge (32 vCPU, 256GB) | $2.00/hour | $0.40-0.80/hour | $864-1,152/month |
| r5.16xlarge (64 vCPU, 512GB) | $4.00/hour | $0.80-1.60/hour | $1,728-2,304/month |
Example: 100-node cluster, r5.8xlarge
- On-demand: $200/hour = $144,000/month
- Spot: $40-80/hour = $28,800-57,600/month
- Savings: $86,400-115,200/month (60-80%)
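The figures above can be reproduced with a few lines of Python. Working in cents keeps the arithmetic exact; the prices are the illustrative rates from the table, not live AWS quotes.

```python
NODES = 100
HOURS = 720                  # 24 hours x 30 days
ON_DEMAND_CENTS = 200        # $2.00/hour, in cents
SPOT_CENTS = (40, 80)        # $0.40-0.80/hour, in cents

on_demand_monthly = NODES * ON_DEMAND_CENTS * HOURS // 100           # dollars
spot_monthly = tuple(NODES * c * HOURS // 100 for c in SPOT_CENTS)   # dollars
savings = tuple(on_demand_monthly - s for s in reversed(spot_monthly))

print(f"on-demand: ${on_demand_monthly:,}/month")                     # $144,000/month
print(f"spot:      ${spot_monthly[0]:,}-{spot_monthly[1]:,}/month")   # $28,800-57,600/month
print(f"savings:   ${savings[0]:,}-{savings[1]:,}/month")             # $86,400-115,200/month
```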
Use Cases
Ideal for Spot
- Batch ETL (retryable, checkpointed)
- ML training (checkpointed between epochs)
- Ad-hoc analytics and backfills
- Dev/test clusters
Not Ideal for Spot
- Interactive queries (user-facing, latency-sensitive)
- Streaming with strict SLA (can’t afford interruption)
- Low-latency serving (databases, APIs)
- Short critical jobs (interruption unacceptable)
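The two lists above boil down to a pair of questions: can the workload survive interruption, and can it resume cheaply? A toy helper (illustrative only, not from any cloud SDK) encodes the rule of thumb:

```python
def suitable_for_spot(fault_tolerant: bool, latency_sensitive: bool,
                      checkpointed: bool) -> bool:
    """Rule of thumb: spot fits workloads that survive interruption and
    can resume from a checkpoint; never latency-sensitive serving."""
    return fault_tolerant and checkpointed and not latency_sensitive

# Batch ETL with checkpoints: good spot candidate
print(suitable_for_spot(True, False, True))   # True
# User-facing interactive queries: keep on on-demand
print(suitable_for_spot(False, True, False))  # False
```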
Spot Instance Strategies
Strategy 1: Spot + On-Demand Mix
Configuration:
- 70-80% spot (for cost savings)
- 20-30% on-demand (for fallback)
- On-demand handles critical workloads
- Spot handles fault-tolerant workloads
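A small sizing helper (illustrative, not tied to any provider API) makes the split concrete:

```python
def split_capacity(total_nodes: int, spot_fraction: float = 0.8) -> tuple[int, int]:
    """Split a target cluster size into (spot, on_demand) node counts."""
    spot = round(total_nodes * spot_fraction)
    on_demand = total_nodes - spot   # on-demand remainder acts as the fallback
    return spot, on_demand

print(split_capacity(100))        # (80, 20)
print(split_capacity(100, 0.7))   # (70, 30)
```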
Strategy 2: Spot for Specific Node Pools
```yaml
# Kubernetes node pools: spot for fault-tolerant work, on-demand for the rest
node_pools:
  spot_pool:
    instance_type: "r5.8xlarge"
    spot: true
    min_nodes: 10
    max_nodes: 100
    workloads:
      - "batch-etl"
      - "ml-training"
  on_demand_pool:
    instance_type: "r5.8xlarge"
    spot: false
    min_nodes: 5
    max_nodes: 20
    workloads:
      - "streaming-jobs"
      - "interactive-queries"
```
Strategy 3: Spot with Checkpointing
```python
# Configure dynamic allocation so the job can scale as spot capacity changes
spark.conf.set("spark.dynamicAllocation.enabled", "true")
spark.conf.set("spark.dynamicAllocation.minExecutors", "10")
spark.conf.set("spark.dynamicAllocation.maxExecutors", "100")

# checkpointLocation is a Structured Streaming option: the stream records its
# progress there and resumes from it after an interruption. (Batch writes do
# not take this option; they rely on idempotent overwrites instead.)
df = spark.readStream.format("delta").load("s3://bucket/input")

query = (df.writeStream
         .format("delta")
         .option("checkpointLocation", "s3://bucket/checkpoint/")
         .start("s3://bucket/output"))

# If interrupted, restarting the job resumes from the checkpoint
```
Fault Tolerance Patterns
Pattern 1: Idempotent Operations
```python
# Idempotent writes (safe to retry)
def write_to_delta(df, path):
    """Write to Delta Lake idempotently: overwrite replaces the target,
    so rerunning after an interruption yields the same result."""
    df.write.format("delta") \
        .mode("overwrite") \
        .save(path)

# If interrupted, just retry; the result will be the same
```
Pattern 2: Checkpoint and Resume
```python
from pyspark.sql.functions import col, max as max_, sum as sum_

# checkpoint_exists / read_checkpoint / write_checkpoint are placeholder
# helpers for whatever checkpoint store you use (e.g. a small table or S3 key)
def process_with_checkpoint(input_path, output_path, checkpoint_path):
    """Process data with checkpoint/resume capability."""
    # Resume from the last checkpoint if one exists
    if checkpoint_exists(checkpoint_path):
        last_processed = read_checkpoint(checkpoint_path)
        df = spark.read.parquet(input_path) \
            .filter(col("timestamp") > last_processed)
    else:
        df = spark.read.parquet(input_path)

    # Process
    result = df.groupBy("key").agg(sum_("value").alias("total"))

    # Write results first, then record progress: an interruption between the
    # two steps only causes reprocessing, never data loss
    result.write.format("delta") \
        .mode("append") \
        .save(output_path)

    high_water_mark = df.agg(max_("timestamp")).collect()[0][0]
    write_checkpoint(checkpoint_path, high_water_mark)
```
Pattern 3: Shuffle Partitioning
```python
from pyspark.sql.functions import sum as sum_

# Run an external shuffle service so shuffle files outlive their executors
spark.conf.set("spark.shuffle.service.enabled", "true")
spark.conf.set("spark.shuffle.service.port", "7337")
spark.conf.set("spark.dynamicAllocation.enabled", "true")

# With the external shuffle service, losing a spot executor does not force
# upstream stages to be recomputed
df = df.repartition(200).cache()  # cache the repartitioned data
df.count()                        # materialize the cache
result = df.groupBy("key").agg(sum_("value"))
```
Cloud Provider Implementation
AWS EMR Spot Instances
import boto3
```python
# Create an EMR cluster with an on-demand master and spot core nodes
emr = boto3.client('emr')

response = emr.run_job_flow(
    Name='Spot-EMR-Cluster',
    Instances={
        'InstanceGroups': [
            {
                'Name': 'Master',
                'InstanceCount': 1,
                'InstanceType': 'r5.8xlarge',
                'Market': 'ON_DEMAND'
            },
            {
                'Name': 'Core-Spot',
                'InstanceCount': 10,
                'InstanceType': 'r5.8xlarge',
                'Market': 'SPOT',
                'BidPrice': '0.40',  # Max hourly price; optional, defaults to the on-demand price
                'EbsConfiguration': {
                    'EbsBlockDeviceConfigs': [
                        {
                            'VolumeSpecification': {
                                'VolumeSize': 100,
                                'VolumeType': 'gp2'
                            }
                        }
                    ]
                }
            }
        ]
    }
)
```
GCP Dataproc Preemptible VMs
```python
# Create a Dataproc cluster with preemptible secondary workers
from google.cloud import dataproc_v1 as dataproc

# Primary workers stay standard; preemptible capacity goes in the
# secondary worker group (primary workers cannot be preemptible)
cluster_config = {
    'project_id': 'my-project',
    'cluster_name': 'spot-cluster',
    'config': {
        'worker_config': {
            'num_instances': 10,
            'machine_type_uri': 'n1-standard-8'
        },
        'secondary_worker_config': {
            'num_instances': 20,
            'machine_type_uri': 'n1-standard-8',
            'preemptibility': 'PREEMPTIBLE'  # GCP's spot equivalent
        }
    }
}
```
Databricks Spot Instances
```python
# Databricks cluster spec using spot with on-demand fallback
cluster_spec = {
    "cluster_name": "spot-cluster",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "r5.8xlarge",
    "autoscale": {
        "min_workers": 2,
        "max_workers": 100
    },
    "aws_attributes": {
        "first_on_demand": 1,                  # keep the driver on on-demand
        "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand if spot is unavailable
        "spot_bid_price_percent": 100          # max bid as a percent of the on-demand price
    }
}
```
Spot Instance Diversification
Use Multiple Instance Types
```python
# Diversify across instance types to improve spot availability
instance_types = [
    "r5.8xlarge",   # Memory optimized
    "m5.8xlarge",   # General purpose
    "c5.9xlarge",   # Compute optimized
    "r5a.8xlarge"   # AMD-based (slightly cheaper)
]

# EMR configuration: one spot instance group per type
instance_groups = [
    {
        'Name': f'Spot-{instance_type}',
        'InstanceCount': 5,
        'InstanceType': instance_type,
        'Market': 'SPOT'
    }
    for instance_type in instance_types
]
```
Benefit: if one instance type becomes unavailable, the others keep running.
Monitoring and Alerting
Spot Interruption Monitoring
On AWS, each spot instance receives a two-minute interruption notice through the instance metadata service; polling that endpoint is the standard way to detect it from on-instance code.

```python
import requests

# The spot/instance-action metadata path returns 200 with a JSON body once an
# interruption is scheduled, and 404 otherwise
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def spot_interruption_imminent() -> bool:
    """Return True if AWS has scheduled this spot instance for interruption."""
    try:
        response = requests.get(METADATA_URL, timeout=1)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        return False

def monitor_spot_interruptions():
    """Poll for interruption notices and trigger a graceful shutdown."""
    if spot_interruption_imminent():
        logger.warning("Spot interruption imminent; starting graceful shutdown")
        graceful_shutdown()

def graceful_shutdown():
    """Gracefully shut down: checkpoint and save state before reclaim."""
    # Cancel in-flight work so partial output is not committed
    spark.sparkContext.cancelJobGroup("batch-etl")
    # Save application state (placeholder helper)
    save_application_state()
```
Dashboard Metrics
```yaml
# Metrics to track
spot_dashboard:
  metrics:
    - name: "Spot instance count"
      query: "SELECT COUNT(*) FROM instances WHERE type = 'spot'"
    - name: "Spot interruption rate"
      query: "SELECT COUNT(*) FROM spot_interruptions WHERE timestamp > NOW() - INTERVAL '1 hour'"
    - name: "Cost savings"
      query: "SELECT SUM(on_demand_cost - spot_cost) FROM cluster_metrics"
    - name: "Job failure rate"
      query: "SELECT COUNT(*) FROM jobs WHERE status = 'failed' AND instance_type = 'spot'"
```
Cost Optimization
Spot Usage Calculator
```python
def calculate_spot_savings(instance_count: int,
                           instance_type: str,
                           hours_per_month: int) -> dict:
    """Calculate spot savings versus on-demand."""
    # Pricing lookups (placeholder helpers backed by current price data)
    on_demand_price = get_on_demand_price(instance_type)
    spot_price = get_spot_price(instance_type)

    on_demand_cost = instance_count * hours_per_month * on_demand_price
    spot_cost = instance_count * hours_per_month * spot_price

    savings = on_demand_cost - spot_cost
    savings_percent = (savings / on_demand_cost) * 100

    return {
        "on_demand_cost": on_demand_cost,
        "spot_cost": spot_cost,
        "savings": savings,
        "savings_percent": savings_percent
    }

# Example
result = calculate_spot_savings(
    instance_count=100,
    instance_type="r5.8xlarge",
    hours_per_month=720  # 24 x 30
)
# At 60-80% spot discounts: roughly $86,400-115,200/month in savings
```
Senior-Level Considerations
Anti-Patterns
Anti-Pattern 1: No fallback mechanism
```python
# Bad: all spot, no on-demand fallback
cluster = [spot_node] * 100

# Good: mixed spot + on-demand
cluster = [spot_node] * 80 + [on_demand_node] * 20
```
Anti-Pattern 2: No checkpointing
```python
# Bad: streaming write without a checkpoint; progress is lost on interruption
df.writeStream.format("delta").start(path)

# Good: checkpointed streaming write; a restart resumes where it left off
df.writeStream.format("delta") \
    .option("checkpointLocation", checkpoint_path) \
    .start(path)
```
Anti-Pattern 3: Spot for everything
```python
# Bad: spot for critical streaming
streaming_job = spot_cluster

# Good: spot for batch, on-demand for streaming
batch_job = spot_cluster
streaming_job = on_demand_cluster
```
Key Takeaways
- 60-80% savings: Spot instances offer massive cost reduction
- Fault tolerance required: Workloads must handle interruption
- Checkpoints: Enable restart from interruption
- Idempotency: Safe to retry operations
- Mixed strategy: Spot + on-demand for reliability
- Diversification: Use multiple instance types
- Monitoring: Track interruption rates, job failures
- Not for everything: Avoid for latency-sensitive workloads
Next: Cluster Rightsizing