Kubernetes for Data Engineering

Container Orchestration for Data Platforms


Overview

Kubernetes (K8s) automates deployment, scaling, and management of containerized applications. For data platforms, K8s provides scalable infrastructure for Spark, Airflow, Jupyter, and data services.


Kubernetes Basics
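
Kubernetes workloads are described declaratively in YAML manifests and applied to the cluster, which then drives the running state toward what was declared. As a minimal, illustrative sketch (the image and pod name are hypothetical; the data-platform namespace is reused by the manifests later in this page), a one-off data task could be declared as a namespace plus a single pod:

kubernetes-basics.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: data-platform
---
apiVersion: v1
kind: Pod
metadata:
  name: ingest-once                      # hypothetical one-off task
  namespace: data-platform
spec:
  restartPolicy: Never
  containers:
    - name: ingest
      image: my-company/ingest:1.0.0     # hypothetical image
      command: ["python", "/app/ingest.py"]
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
        limits:
          cpu: "500m"
          memory: "512Mi"

The sections below apply the same pattern to the core components of a data platform.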

Spark on Kubernetes

spark-operator-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: spark-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-operator
  namespace: spark-operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-operator
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["*"]
  - apiGroups: ["sparkoperator.k8s.io"]
    resources: ["sparkapplications", "scheduledsparkapplications"]
    verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-operator
subjects:
  - kind: ServiceAccount
    name: spark-operator
    namespace: spark-operator
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-operator
  namespace: spark-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-operator
  template:
    metadata:
      labels:
        app: spark-operator
    spec:
      serviceAccountName: spark-operator
      containers:
        - name: spark-operator
          image: googlecloudplatform/spark-operator:v1beta2-1.3.0-3.1.1
          imagePullPolicy: Always
          env:
            - name: SPARK_OPERATOR_IMAGE
              value: googlecloudplatform/spark-operator:v1beta2-1.3.0-3.1.1
            - name: METRICS_GENERATOR_PORT
              value: "10254"
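
Apply the manifests with kubectl apply -f spark-operator-deployment.yaml. Note that in practice the operator is usually installed from its Helm chart, which also registers the SparkApplication and ScheduledSparkApplication CRDs; the Deployment above assumes those CRDs already exist in the cluster.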

Spark Application

spark-application.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job
  namespace: data-platform
spec:
  type: Python
  pythonVersion: "3"                 # the operator accepts "2" or "3"
  mode: cluster
  image: my-company/spark-py:3.5.0
  imagePullPolicy: Always
  mainApplicationFile: local:///app/etl_job.py
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 10
  driver:
    cores: 2
    coreLimit: "2000m"
    memory: "2g"
    serviceAccount: spark-driver     # a ServiceAccount with pod-create permissions in this namespace (not shown)
    envSecretKeyRefs:
      DATABASE_URL:
        name: database-secrets
        key: connection-string
  executor:
    cores: 2
    coreLimit: "2000m"
    instances: 4
    memory: "4g"
    envSecretKeyRefs:
      DATABASE_URL:
        name: database-secrets
        key: connection-string
  deps:
    packages:
      - "org.postgresql:postgresql:42.6.0"
      - "com.amazonaws:aws-java-sdk-bundle:1.12.262"
    jars:
      - "local:///app/lib/my-custom.jar"
    files:
      - "local:///app/config/etl-config.json"
  hadoopConf:
    fs.s3a.access.key: "{{ AWS_ACCESS_KEY_ID }}"
    fs.s3a.secret.key: "{{ AWS_SECRET_ACCESS_KEY }}"
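
The operator also provides a ScheduledSparkApplication CRD (already covered by the RBAC rules above) for running the same job on a cron schedule. A minimal sketch that reuses the hypothetical etl-job settings:

scheduled-spark-application.yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: etl-job-nightly
  namespace: data-platform
spec:
  schedule: "0 2 * * *"              # every night at 02:00
  concurrencyPolicy: Forbid          # skip a run if the previous one is still going
  successfulRunHistoryLimit: 3
  failedRunHistoryLimit: 3
  template:                          # same fields as a SparkApplication spec
    type: Python
    pythonVersion: "3"
    mode: cluster
    image: my-company/spark-py:3.5.0
    mainApplicationFile: local:///app/etl_job.py
    sparkVersion: "3.5.0"
    restartPolicy:
      type: Never
    driver:
      cores: 1
      memory: "2g"
      serviceAccount: spark-driver
    executor:
      cores: 2
      instances: 2
      memory: "4g"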

JupyterHub on Kubernetes

jupyterhub-deployment.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: jupyterhub
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: jupyterhub-config
  namespace: jupyterhub
data:
  jupyterhub_config.py: |
    import os
    from kubespawner import KubeSpawner

    c.JupyterHub.spawner_class = KubeSpawner

    # CPU and memory limits
    c.KubeSpawner.cpu_limit = 2.0
    c.KubeSpawner.mem_limit = "4G"
    c.KubeSpawner.cpu_guarantee = 0.5
    c.KubeSpawner.mem_guarantee = "1G"

    # Image
    c.KubeSpawner.image = "my-company/jupyter-datascience:latest"
    c.KubeSpawner.image_pull_policy = "Always"

    # Service account
    c.KubeSpawner.service_account = "jupyterhub"

    # Volumes
    c.KubeSpawner.volumes = [
        {
            "name": "data",
            "persistentVolumeClaim": {
                "claimName": "jupyter-data"
            }
        }
    ]
    c.KubeSpawner.volume_mounts = [
        {
            "name": "data",
            "mountPath": "/home/jovyan/data"
        }
    ]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyterhub
  namespace: jupyterhub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyterhub
  template:
    metadata:
      labels:
        app: jupyterhub
    spec:
      serviceAccountName: jupyterhub
      containers:
        - name: jupyterhub
          image: jupyterhub/jupyterhub:latest   # pin a specific version in production (see best practices below)
          command: ["jupyterhub", "-f", "/etc/jupyterhub/jupyterhub_config.py"]
          ports:
            - containerPort: 8000
              name: http
          volumeMounts:
            - name: config
              mountPath: /etc/jupyterhub
          env:
            - name: JUPYTERHUB_CRYPT_KEY
              valueFrom:
                secretKeyRef:
                  name: jupyterhub-secrets
                  key: crypt-key
            - name: OAUTH_CALLBACK_URL
              value: "https://jupyterhub.my-company.com/hub/oauth-callback"
      volumes:
        - name: config
          configMap:
            name: jupyterhub-config
---
apiVersion: v1
kind: Service
metadata:
  name: jupyterhub
  namespace: jupyterhub
spec:
  type: LoadBalancer
  ports:
    - port: 80
      targetPort: 8000
  selector:
    app: jupyterhub
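
The spawner config above mounts a PersistentVolumeClaim named jupyter-data that is not defined in the manifest. A minimal sketch of that claim, assuming the fast-ssd storage class defined in the Storage Classes section below and a hypothetical 100Gi size:

jupyter-data-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jupyter-data
  namespace: jupyterhub
spec:
  accessModes:
    - ReadWriteOnce              # single-node access; use an RWX-capable class if many user pods share it
  storageClassName: fast-ssd     # defined in the Storage Classes section below
  resources:
    requests:
      storage: 100Gi             # hypothetical size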

Airflow on Kubernetes

airflow-helm-values.yaml
# Airflow Helm chart values
# (top-level keys vary between charts; adjust to the schema of the chart you use)

# Executor
executor: "KubernetesExecutor"

# Environment variables
env:
  - name: AIRFLOW__CORE__EXECUTOR
    value: "KubernetesExecutor"
  - name: AIRFLOW__CORE__DAGS_FOLDER
    value: "/opt/airflow/dags"
  - name: AIRFLOW__CORE__LOAD_EXAMPLES
    value: "False"
  # In Airflow 2.7+ the [kubernetes] config section is named [kubernetes_executor];
  # the legacy variable names below still work but emit deprecation warnings.
  - name: AIRFLOW__KUBERNETES__NAMESPACE
    value: "airflow"
  - name: AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME
    value: "airflow-worker"
  - name: AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY
    value: "apache/airflow"
  - name: AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG
    value: "2.7.0-python3.10"
  - name: AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE
    value: "/opt/airflow/pod_templates/pod_template.yaml"
  - name: AIRFLOW__DATABASE__SQL_ALCHEMY_CONN   # [core] sql_alchemy_conn moved to [database] in Airflow 2.3
    valueFrom:
      secretKeyRef:
        name: airflow-secrets
        key: connection-string

# DAGs repository (Git sync)
dags:
  git:
    repo: "https://github.com/my-company/airflow-dags.git"
    branch: "main"
    subPath: "."

# Pod template
podTemplate: "pod_template.yaml"

# S3 logging
logging:
  s3:
    enabled: true
    bucket: "my-company-airflow-logs"
    key_prefix: "logs/"

# StatsD monitoring
statsd:
  enabled: true
  host: "statsd-service"
  port: 9125

# Prometheus metrics
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
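
The values above point the KubernetesExecutor at a pod template file. A minimal sketch of what that pod_template.yaml might contain (the resource figures are illustrative; the main container must be named base for the executor to inject the task command):

pod_template.yaml
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker-template
spec:
  serviceAccountName: airflow-worker
  restartPolicy: Never
  containers:
    - name: base                           # required container name for KubernetesExecutor workers
      image: apache/airflow:2.7.0-python3.10
      resources:
        requests:
          cpu: "500m"
          memory: "1Gi"
        limits:
          cpu: "1"
          memory: "2Gi"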

Kubernetes Resources

Resource Management

resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: data-platform-quota
  namespace: data-platform
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    persistentvolumeclaims: "20"
    requests.storage: "1Ti"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: data-platform-limits
  namespace: data-platform
spec:
  limits:
    - max:
        cpu: "4"
        memory: "8Gi"
      min:
        cpu: "100m"
        memory: "128Mi"
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "250m"
        memory: "256Mi"
      type: Container
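
Quotas cap what a namespace may consume; scaling within that budget is typically handled by a HorizontalPodAutoscaler (the auto-scaling mentioned in the comparison and takeaways below). A minimal sketch for a hypothetical stateless data API deployment (name and thresholds are illustrative):

hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: data-api-hpa
  namespace: data-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: data-api                 # hypothetical deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70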

Storage Classes

storage-class.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com       # gp3 and the throughput parameter require the EBS CSI driver
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-hdd
provisioner: ebs.csi.aws.com
parameters:
  type: st1
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true

Kubernetes Monitoring

Prometheus Monitoring

prometheus-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: data-platform
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: 'spark-applications'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - data-platform
        relabel_configs:
          # the pod label "spark-app-name" is exposed with dashes converted to underscores
          - source_labels: [__meta_kubernetes_pod_label_spark_app_name]
            action: keep
            regex: .+
      - job_name: 'airflow'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - airflow
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: airflow
      - job_name: 'jupyter'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - jupyterhub
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: jupyter
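
The Airflow values earlier enable a Prometheus ServiceMonitor; if the cluster runs the Prometheus Operator instead of a hand-rolled scrape config, the equivalent of the static config above is one ServiceMonitor per component. A sketch for Airflow (the label selector and metrics port name are assumptions about how the chart labels its metrics Service):

airflow-servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: airflow
  namespace: airflow
spec:
  selector:
    matchLabels:
      app: airflow                 # assumed label on the Airflow metrics Service
  namespaceSelector:
    matchNames:
      - airflow
  endpoints:
    - port: metrics                # assumed metrics port name
      interval: 15s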

Kubernetes Best Practices

DO

# 1. Use resource limits
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1000m"
    memory: "2Gi"

# 2. Use liveness and readiness probes
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

# 3. Use pod disruption budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airflow-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: airflow

# 4. Use secrets for credentials
env:
  - name: DATABASE_URL
    valueFrom:
      secretKeyRef:
        name: database-secrets
        key: connection-string

# 5. Use network policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: data-platform-network-policy
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: allowed-app
      ports:
        - protocol: TCP
          port: 8080

DON’T

# 1. Don't ignore resource limits
#    Always set requests and limits

# 2. Don't use the latest image tag; pin specific versions
image: my-app:v1.0.0   # Good
image: my-app:latest   # Bad

# 3. Don't ignore pod disruption budgets
#    Set a PDB for availability

# 4. Don't run as root; use security contexts
securityContext:
  runAsNonRoot: true
  runAsUser: 1000

# 5. Don't ignore networking
#    Use network policies

Kubernetes vs. Docker Compose

Feature             Kubernetes             Docker Compose
Scaling             Auto-scaling           Manual scaling
Resilience          Self-healing           Manual recovery
Load Balancing      Built-in               Limited
Service Discovery   Built-in               Limited
Complexity          High                   Low
Best For            Production, scaling    Development, testing

Key Takeaways

  1. Spark Operator: Run Spark on K8s
  2. JupyterHub: Data science notebooks
  3. Airflow: Orchestration on K8s
  4. Resources: Set limits and requests
  5. Storage: Use persistent volumes
  6. Monitoring: Prometheus metrics
  7. Scaling: Auto-scaling based on load
  8. Use When: Production, scalability, resilience needed
