Kubernetes for Data Engineering

Container Orchestration for Data Platforms


Overview

Kubernetes (K8s) automates deployment, scaling, and management of containerized applications. For data platforms, K8s provides scalable infrastructure for Spark, Airflow, Jupyter, and data services.


Kubernetes Basics
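
Kubernetes workloads are described declaratively in YAML manifests and applied to the cluster, which then drives the running state toward what was declared. As a minimal, illustrative sketch (the image and pod name are hypothetical; the data-platform namespace is reused by the manifests later in this page), a one-off data task could be declared as a namespace plus a single pod:

kubernetes-basics.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: data-platform
---
apiVersion: v1
kind: Pod
metadata:
  name: ingest-once                      # hypothetical one-off task
  namespace: data-platform
spec:
  restartPolicy: Never
  containers:
    - name: ingest
      image: my-company/ingest:1.0.0     # hypothetical image
      command: ["python", "/app/ingest.py"]
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"

The sections below apply the same pattern to the core components of a data platform.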

Spark on Kubernetes

spark-operator-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: spark-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-operator
  namespace: spark-operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-operator
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["*"]
  - apiGroups: ["sparkoperator.k8s.io"]
    resources: ["sparkapplications", "scheduledsparkapplications"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-operator
subjects:
  - kind: ServiceAccount
    name: spark-operator
    namespace: spark-operator
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-operator
  namespace: spark-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-operator
  template:
    metadata:
      labels:
        app: spark-operator
    spec:
      serviceAccountName: spark-operator
      containers:
        - name: spark-operator
          image: googlecloudplatform/spark-operator:v1beta2-1.3.0-3.1.1
          imagePullPolicy: Always
          env:
            - name: SPARK_OPERATOR_IMAGE
              value: googlecloudplatform/spark-operator:v1beta2-1.3.0-3.1.1
            - name: METRICS_GENERATOR_PORT
              value: "10254"
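
Apply the manifests with kubectl apply -f spark-operator-deployment.yaml. Note that in practice the operator is usually installed from its Helm chart, which also registers the SparkApplication and ScheduledSparkApplication CRDs; the Deployment above assumes those CRDs already exist in the cluster.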

Spark Application

spark-application.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job
  namespace: data-platform
spec:
  type: Python
  pythonVersion: "3"                 # the operator accepts "2" or "3"
  mode: cluster
  image: my-company/spark-py:3.5.0
  imagePullPolicy: Always
  mainApplicationFile: local:///app/etl_job.py
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 10
  driver:
    cores: 2
    coreLimit: "2000m"
    memory: "2g"
    serviceAccount: spark-driver     # a ServiceAccount with pod-create permissions in this namespace (not shown)
    envSecretKeyRefs:
      DATABASE_URL:
        name: database-secrets
        key: connection-string
  executor:
    cores: 2
    coreLimit: "2000m"
    instances: 4
    memory: "4g"
    envSecretKeyRefs:
      DATABASE_URL:
        name: database-secrets
        key: connection-string
  deps:
    packages:
      - "org.postgresql:postgresql:42.6.0"
      - "com.amazonaws:aws-java-sdk-bundle:1.12.262"
    jars:
      - "local:///app/lib/my-custom.jar"
    files:
      - "local:///app/config/etl-config.json"
  hadoopConf:
    fs.s3a.access.key: "{{ AWS_ACCESS_KEY_ID }}"
    fs.s3a.secret.key: "{{ AWS_SECRET_ACCESS_KEY }}"
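
The operator also provides a ScheduledSparkApplication CRD (already covered by the RBAC rules above) for running the same job on a cron schedule. A minimal sketch that reuses the hypothetical etl-job settings:

scheduled-spark-application.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: etl-job-nightly
  namespace: data-platform
spec:
  schedule: "0 2 * * *"              # every night at 02:00
  concurrencyPolicy: Forbid          # skip a run if the previous one is still going
  successfulRunHistoryLimit: 3
  failedRunHistoryLimit: 3
  template:                          # same fields as a SparkApplication spec
    type: Python
    pythonVersion: "3"
    mode: cluster
    image: my-company/spark-py:3.5.0
    mainApplicationFile: local:///app/etl_job.py
    sparkVersion: "3.5.0"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "2g"
      serviceAccount: spark-driver
    executor:
      cores: 2
      instances: 2
      memory: "4g"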

JupyterHub on Kubernetes

jupyterhub-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: jupyterhub
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: jupyterhub-config
  namespace: jupyterhub
data:
  jupyterhub_config.py: |
    import os
    from kubespawner import KubeSpawner

    c.JupyterHub.spawner_class = KubeSpawner

    # CPU and memory limits
    c.KubeSpawner.cpu_limit = 2.0
    c.KubeSpawner.mem_limit = "4G"
    c.KubeSpawner.cpu_guarantee = 0.5
    c.KubeSpawner.mem_guarantee = "1G"

    # Image
    c.KubeSpawner.image = "my-company/jupyter-datascience:latest"
    c.KubeSpawner.image_pull_policy = "Always"

    # Service account
    c.KubeSpawner.service_account = "jupyterhub"

    # Volumes
    c.KubeSpawner.volumes = [
        {
            "name": "data",
            "persistentVolumeClaim": {
                "claimName": "jupyter-data"
            }
        }
    ]
    c.KubeSpawner.volume_mounts = [
        {
            "name": "data",
            "mountPath": "/home/jovyan/data"
        }
    ]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyterhub
  namespace: jupyterhub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyterhub
  template:
    metadata:
      labels:
        app: jupyterhub
    spec:
      serviceAccountName: jupyterhub
      containers:
        - name: jupyterhub
          image: jupyterhub/jupyterhub:latest   # pin a specific version in production (see best practices below)
          command: ["jupyterhub", "-f", "/etc/jupyterhub/jupyterhub_config.py"]
          ports:
            - containerPort: 8000
              name: http
          volumeMounts:
            - name: config
              mountPath: /etc/jupyterhub
          env:
            - name: JUPYTERHUB_CRYPT_KEY
              valueFrom:
                secretKeyRef:
                  name: jupyterhub-secrets
                  key: crypt-key
            - name: OAUTH_CALLBACK_URL
              value: "https://jupyterhub.my-company.com/hub/oauth-callback"
      volumes:
        - name: config
          configMap:
            name: jupyterhub-config
---
apiVersion: v1
kind: Service
metadata:
  name: jupyterhub
  namespace: jupyterhub
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: jupyterhub
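
The spawner config above mounts a PersistentVolumeClaim named jupyter-data that is not defined in the manifest. A minimal sketch of that claim, assuming the fast-ssd storage class defined in the Storage Classes section below and a hypothetical 100Gi size:

jupyter-data-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jupyter-data
  namespace: jupyterhub
spec:
  accessModes:
    - ReadWriteOnce              # single-node access; use an RWX-capable class if many user pods share it
  storageClassName: fast-ssd     # defined in the Storage Classes section below
  resources:
    requests:
      storage: 100Gi             # hypothetical size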

Airflow on Kubernetes

airflow-helm-values.yaml
# Airflow Helm chart values
# (top-level keys vary between charts; adjust to the schema of the chart you use)

# Executor
executor: "KubernetesExecutor"

# Environment variables
env:
  - name: AIRFLOW__CORE__EXECUTOR
    value: "KubernetesExecutor"
  - name: AIRFLOW__CORE__DAGS_FOLDER
    value: "/opt/airflow/dags"
  - name: AIRFLOW__CORE__LOAD_EXAMPLES
    value: "False"
  # In Airflow 2.7+ the [kubernetes] config section is named [kubernetes_executor];
  # the legacy variable names below still work but emit deprecation warnings.
  - name: AIRFLOW__KUBERNETES__NAMESPACE
    value: "airflow"
  - name: AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME
    value: "airflow-worker"
  - name: AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY
    value: "apache/airflow"
  - name: AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG
    value: "2.7.0-python3.10"
  - name: AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE
    value: "/opt/airflow/pod_templates/pod_template.yaml"
  - name: AIRFLOW__DATABASE__SQL_ALCHEMY_CONN   # [core] sql_alchemy_conn moved to [database] in Airflow 2.3
    valueFrom:
      secretKeyRef:
        name: airflow-secrets
        key: connection-string

# DAGs repository (Git sync)
dags:
  git:
    repo: "https://github.com/my-company/airflow-dags.git"
    branch: "main"
    subPath: "."

# Pod template
podTemplate: "pod_template.yaml"

# S3 logging
logging:
  s3:
    enabled: true
    bucket: "my-company-airflow-logs"
    key_prefix: "logs/"

# StatsD monitoring
statsd:
  enabled: true
  host: "statsd-service"
  port: 9125

# Prometheus metrics
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
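
The values above point the KubernetesExecutor at a pod template file. A minimal sketch of what that pod_template.yaml might contain (the resource figures are illustrative; the main container must be named base for the executor to inject the task command):

pod_template.yaml
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-template
spec:
  serviceAccountName: airflow-worker
  restartPolicy: Never
  containers:
    - name: base                           # required container name for KubernetesExecutor workers
      image: apache/airflow:2.7.0-python3.10
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "2Gi"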

Kubernetes Resources

Resource Management

resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: data-platform-quota
  namespace: data-platform
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    persistentvolumeclaims: "20"
    requests.storage: "1Ti"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: data-platform-limits
  namespace: data-platform
spec:
  limits:
    - max:
        cpu: "4"
        memory: "8Gi"
      min:
        cpu: "100m"
        memory: "128Mi"
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "250m"
        memory: "256Mi"
      type: Container
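
Quotas cap what a namespace may consume; scaling within that budget is typically handled by a HorizontalPodAutoscaler (the auto-scaling mentioned in the comparison and takeaways below). A minimal sketch for a hypothetical stateless data API deployment (name and thresholds are illustrative):

hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: data-api-hpa
  namespace: data-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-api                 # hypothetical deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70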

Storage Classes

storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com       # gp3 and the throughput parameter require the EBS CSI driver
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-hdd
provisioner: ebs.csi.aws.com
parameters:
  type: st1
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Kubernetes Monitoring

Prometheus Monitoring

prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: data-platform
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'spark-applications'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - data-platform
        relabel_configs:
          # the pod label "spark-app-name" is exposed with dashes converted to underscores
          - source_labels: [__meta_kubernetes_pod_label_spark_app_name]
            action: keep
            regex: .+
      - job_name: 'airflow'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - airflow
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: airflow
      - job_name: 'jupyter'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - jupyterhub
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: jupyter
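
The Airflow values earlier enable a Prometheus ServiceMonitor; if the cluster runs the Prometheus Operator instead of a hand-rolled scrape config, the equivalent of the static config above is one ServiceMonitor per component. A sketch for Airflow (the label selector and metrics port name are assumptions about how the chart labels its metrics Service):

airflow-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: airflow
  namespace: airflow
spec:
  selector:
    matchLabels:
      app: airflow                 # assumed label on the Airflow metrics Service
  namespaceSelector:
    matchNames:
      - airflow
  endpoints:
    - port: metrics                # assumed metrics port name
      interval: 15s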

Kubernetes Best Practices

DO

# 1. Use resource limits
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1000m"
    memory: "2Gi"

# 2. Use liveness and readiness probes
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

# 3. Use pod disruption budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airflow-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: airflow

# 4. Use secrets for credentials
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: database-secrets
        key: connection-string

# 5. Use network policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: data-platform-network-policy
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: allowed-app
      ports:
        - protocol: TCP
          port: 8080

DON’T

# 1. Don't ignore resource limits
#    Always set requests and limits

# 2. Don't use the latest image tag; pin specific versions
image: my-app:v1.0.0   # Good
image: my-app:latest   # Bad

# 3. Don't ignore pod disruption budgets
#    Set a PDB for availability

# 4. Don't run as root; use security contexts
securityContext:
  runAsNonRoot: true
  runAsUser: 1000

# 5. Don't ignore networking
#    Use network policies

Kubernetes vs. Docker Compose

Feature             Kubernetes             Docker Compose
Scaling             Auto-scaling           Manual scaling
Resilience          Self-healing           Manual recovery
Load Balancing      Built-in               Limited
Service Discovery   Built-in               Limited
Complexity          High                   Low
Best For            Production, scaling    Development, testing

Key Takeaways

  1. Spark Operator: Run Spark on K8s
  2. JupyterHub: Data science notebooks
  3. Airflow: Orchestration on K8s
  4. Resources: Set limits and requests
  5. Storage: Use persistent volumes
  6. Monitoring: Prometheus metrics
  7. Scaling: Auto-scaling based on load
  8. Use When: Production, scalability, resilience needed
