Kubernetes for Data Engineering
Container Orchestration for Data Platforms
Overview
Kubernetes (K8s) automates the deployment, scaling, and management of containerized applications. For data platforms, it provides a scalable, self-healing foundation for running Spark, Airflow, JupyterHub, and supporting data services.
Kubernetes Basics
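The examples on this page are built from a handful of core objects: Namespaces to isolate workloads, Deployments and Services to run and expose long-lived components, ConfigMaps and Secrets for configuration, and PersistentVolumeClaims for storage. The following is a minimal sketch of that pattern; the `data-platform` namespace matches later examples, while the `ingestion-api` Deployment and image are hypothetical and only illustrate the shape of a manifest.

```yaml
# Minimal sketch: a namespaced Deployment and Service for a hypothetical data service.
apiVersion: v1
kind: Namespace
metadata:
  name: data-platform
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ingestion-api            # hypothetical service, for illustration only
  namespace: data-platform
spec:
  replicas: 2
  selector:
    matchLabels:
      app: ingestion-api
  template:
    metadata:
      labels:
        app: ingestion-api
    spec:
      containers:
      - name: ingestion-api
        image: my-company/ingestion-api:v1.0.0   # assumed image name and tag
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"
---
apiVersion: v1
kind: Service
metadata:
  name: ingestion-api
  namespace: data-platform
spec:
  selector:
    app: ingestion-api
  ports:
  - port: 80
    targetPort: 8080
```

Applied with `kubectl apply -f`, Kubernetes keeps the declared replica count running and the Service load-balances across the pods.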
Spark on Kubernetes
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: spark-operator
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-operator
  namespace: spark-operator
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: spark-operator
rules:
- apiGroups: [""]
  resources: ["pods", "services", "configmaps"]
  verbs: ["*"]
- apiGroups: ["sparkoperator.k8s.io"]
  resources: ["sparkapplications", "scheduledsparkapplications"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: spark-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: spark-operator
subjects:
- kind: ServiceAccount
  name: spark-operator
  namespace: spark-operator
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-operator
  namespace: spark-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-operator
  template:
    metadata:
      labels:
        app: spark-operator
    spec:
      serviceAccountName: spark-operator
      containers:
      - name: spark-operator
        image: googlecloudplatform/spark-operator:v1beta2-1.3.0-3.1.1
        imagePullPolicy: Always
        env:
        - name: SPARK_OPERATOR_IMAGE
          value: googlecloudplatform/spark-operator:v1beta2-1.3.0-3.1.1
        - name: METRICS_GENERATOR_PORT
          value: "10254"
```
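The SparkApplication example below runs its driver under a `spark-driver` service account in the `data-platform` namespace, which the operator installation above does not create. A minimal sketch of that service account and a namespaced Role broad enough for the driver to manage its executor pods follows; the names come from the example below, the namespace is assumed to already exist, and the exact rules can be tightened for your Spark version.

```yaml
# Sketch: service account and RBAC for the Spark driver in the data-platform namespace.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-driver
  namespace: data-platform
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: spark-driver
  namespace: data-platform
rules:
- apiGroups: [""]
  # The driver creates executor pods, its headless service, and config maps.
  resources: ["pods", "services", "configmaps", "persistentvolumeclaims"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-driver
  namespace: data-platform
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: spark-driver
subjects:
- kind: ServiceAccount
  name: spark-driver
  namespace: data-platform
```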
Spark Application

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-job
  namespace: data-platform
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: my-company/spark-py:3.5.0
  imagePullPolicy: Always
  mainApplicationFile: local:///app/etl_job.py
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 10
  driver:
    cores: 2
    coreLimit: "2000m"
    memory: "2g"
    serviceAccount: spark-driver
    envSecretKeyRefs:
      DATABASE_URL:
        name: database-secrets
        key: connection-string
  executor:
    cores: 2
    coreLimit: "2000m"
    instances: 4
    memory: "4g"
    envSecretKeyRefs:
      DATABASE_URL:
        name: database-secrets
        key: connection-string
  deps:
    packages:
    - "org.postgresql:postgresql:42.6.0"
    - "com.amazonaws:aws-java-sdk-bundle:1.12.262"
    jars:
    - "local:///app/lib/my-custom.jar"
    files:
    - "local:///app/config/etl-config.json"
  hadoopConf:
    fs.s3a.access.key: "{{ AWS_ACCESS_KEY_ID }}"
    fs.s3a.secret.key: "{{ AWS_SECRET_ACCESS_KEY }}"
```
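The operator's ClusterRole above also covers `scheduledsparkapplications`, which wrap a SparkApplication spec in a cron schedule. A sketch of running the same ETL job nightly follows; the schedule, concurrency policy, and history limits are illustrative choices, not values from this page.

```yaml
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: etl-job-nightly
  namespace: data-platform
spec:
  schedule: "0 2 * * *"          # run daily at 02:00 (illustrative)
  concurrencyPolicy: Forbid      # don't start a new run while one is still in flight
  successfulRunHistoryLimit: 3
  failedRunHistoryLimit: 3
  template:                      # same fields as a SparkApplication spec
    type: Python
    pythonVersion: "3"
    mode: cluster
    image: my-company/spark-py:3.5.0
    mainApplicationFile: local:///app/etl_job.py
    sparkVersion: "3.5.0"
    driver:
      cores: 2
      memory: "2g"
      serviceAccount: spark-driver
    executor:
      cores: 2
      instances: 4
      memory: "4g"
```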
JupyterHub on Kubernetes

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: jupyterhub
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: jupyterhub-config
  namespace: jupyterhub
data:
  jupyterhub_config.py: |
    import os
    from kubespawner import KubeSpawner

    c.JupyterHub.spawner_class = KubeSpawner

    # CPU and memory limits
    c.KubeSpawner.cpu_limit = 2.0
    c.KubeSpawner.mem_limit = "4G"
    c.KubeSpawner.cpu_guarantee = 0.5
    c.KubeSpawner.mem_guarantee = "1G"

    # Image
    c.KubeSpawner.image = "my-company/jupyter-datascience:latest"
    c.KubeSpawner.image_pull_policy = "Always"

    # Service account
    c.KubeSpawner.service_account = "jupyterhub"

    # Volumes
    c.KubeSpawner.volumes = [
        {
            "name": "data",
            "persistentVolumeClaim": {
                "claimName": "jupyter-data"
            }
        }
    ]
    c.KubeSpawner.volume_mounts = [
        {
            "name": "data",
            "mountPath": "/home/jovyan/data"
        }
    ]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyterhub
  namespace: jupyterhub
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyterhub
  template:
    metadata:
      labels:
        app: jupyterhub
    spec:
      serviceAccountName: jupyterhub
      containers:
      - name: jupyterhub
        image: jupyterhub/jupyterhub:latest
        ports:
        - containerPort: 8000
          name: http
        volumeMounts:
        - name: config
          mountPath: /etc/jupyterhub
        env:
        - name: JUPYTERHUB_CRYPT_KEY
          valueFrom:
            secretKeyRef:
              name: jupyterhub-secrets
              key: crypt-key
        - name: OAUTH_CALLBACK_URL
          value: "https://jupyterhub.my-company.com/hub/oauth-callback"
      volumes:
      - name: config
        configMap:
          name: jupyterhub-config
---
apiVersion: v1
kind: Service
metadata:
  name: jupyterhub
  namespace: jupyterhub
spec:
  type: LoadBalancer
  ports:
  - port: 80
    targetPort: 8000
  selector:
    app: jupyterhub
```
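The KubeSpawner configuration above mounts a PersistentVolumeClaim named `jupyter-data` that is not defined elsewhere in these manifests. A minimal sketch follows; the size and access mode are assumptions to adjust for your cluster.

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jupyter-data
  namespace: jupyterhub
spec:
  accessModes:
  - ReadWriteOnce          # fine for a single node; a ReadWriteMany-capable class is
                           # needed if many user pods on different nodes share this claim
  resources:
    requests:
      storage: 20Gi        # illustrative size
  # storageClassName: fast-ssd   # optionally pin to a class such as the one defined below
```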
Airflow on Kubernetes

```yaml
# Airflow Helm chart values

# Executor
executor: "KubernetesExecutor"

# Environment variables
env:
- name: AIRFLOW__CORE__EXECUTOR
  value: "KubernetesExecutor"
- name: AIRFLOW__CORE__DAGS_FOLDER
  value: "/opt/airflow/dags"
- name: AIRFLOW__CORE__LOAD_EXAMPLES
  value: "False"
- name: AIRFLOW__KUBERNETES__NAMESPACE
  value: "airflow"
- name: AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME
  value: "airflow-worker"
- name: AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY
  value: "apache/airflow"
- name: AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG
  value: "2.7.0-python3.10"
- name: AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE
  value: "/opt/airflow/pod_templates/pod_template.yaml"
- name: AIRFLOW__CORE__SQL_ALCHEMY_CONN
  valueFrom:
    secretKeyRef:
      name: airflow-secrets
      key: connection-string

# DAGs repository (Git sync)
dags:
  git:
    repo: "https://github.com/my-company/airflow-dags.git"
    branch: "main"
    subPath: "."

# Pod template
podTemplate: "pod_template.yaml"

# S3 logging
logging:
  s3:
    enable: true
    bucket: "my-company-airflow-logs"
    key_prefix: "logs/"

# StatsD monitoring
statsd:
  enabled: true
  host: "statsd-service"
  port: 9125

# Prometheus metrics
prometheus:
  enabled: true
  serviceMonitor:
    enabled: true
```
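Both `AIRFLOW__KUBERNETES__POD_TEMPLATE_FILE` and `podTemplate` above point at a `pod_template.yaml` that is not shown. A minimal sketch of a KubernetesExecutor worker pod template follows, reusing the `apache/airflow:2.7.0-python3.10` image and `airflow-worker` service account from the values above; the resource figures are assumptions.

```yaml
# pod_template.yaml -- sketch of a KubernetesExecutor worker pod template
apiVersion: v1
kind: Pod
metadata:
  name: airflow-worker
spec:
  serviceAccountName: airflow-worker
  restartPolicy: Never
  containers:
  - name: base                      # the executor expects the main container to be named "base"
    image: apache/airflow:2.7.0-python3.10
    resources:
      requests:
        cpu: "500m"
        memory: "1Gi"
      limits:
        cpu: "1"
        memory: "2Gi"
```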
Kubernetes Resources
Resource Management
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: data-platform-quota
  namespace: data-platform
spec:
  hard:
    requests.cpu: "20"
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    persistentvolumeclaims: "20"
    requests.storage: "1Ti"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: data-platform-limits
  namespace: data-platform
spec:
  limits:
  - max:
      cpu: "4"
      memory: "8Gi"
    min:
      cpu: "100m"
      memory: "128Mi"
    default:
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:
      cpu: "250m"
      memory: "256Mi"
    type: Container
```
Storage Classes

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
# gp3 with custom IOPS/throughput parameters requires the EBS CSI driver,
# not the deprecated in-tree kubernetes.io/aws-ebs provisioner.
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard-hdd
provisioner: ebs.csi.aws.com
parameters:
  type: st1
  encrypted: "true"
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
```
Kubernetes Monitoring
Prometheus Monitoring
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: data-platform
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      - job_name: 'spark-applications'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - data-platform
        relabel_configs:
          # Pod label names are sanitized by Prometheus, so "spark-app-name"
          # is exposed as __meta_kubernetes_pod_label_spark_app_name.
          - source_labels: [__meta_kubernetes_pod_label_spark_app_name]
            action: keep
            regex: .+

      - job_name: 'airflow'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - airflow
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: airflow

      - job_name: 'jupyter'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - jupyterhub
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: jupyter
```

Kubernetes Best Practices
DO
```yaml
# 1. Use resource limits
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1000m"
    memory: "2Gi"

# 2. Use liveness and readiness probes
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5

# 3. Use pod disruption budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: airflow-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: airflow

# 4. Use secrets for credentials (the database-secrets Secret is sketched after this list)
env:
- name: DATABASE_URL
  valueFrom:
    secretKeyRef:
      name: database-secrets
      key: connection-string

# 5. Use network policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: data-platform-network-policy
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: allowed-app
    ports:
    - protocol: TCP
      port: 8080
```
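Several manifests on this page, including item 4 above and the SparkApplication, read credentials from a `database-secrets` Secret with a `connection-string` key, which has to be created separately. A minimal sketch follows; the connection string is a placeholder, not a real credential, and the namespace matches the Spark example.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: database-secrets
  namespace: data-platform
type: Opaque
stringData:
  # Placeholder only -- inject the real value through your secret-management
  # tooling rather than committing it to version control.
  connection-string: "postgresql://USER:PASSWORD@HOST:5432/DBNAME"
```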
DON'T

```yaml
# 1. Don't ignore resource limits
#    Always set requests and limits

# 2. Don't use the latest image tag
#    Use specific versions
image: my-app:v1.0.0   # Good
image: my-app:latest   # Bad

# 3. Don't ignore pod disruption budgets
#    Set a PDB for availability

# 4. Don't run as root
#    Use security contexts
securityContext:
  runAsNonRoot: true
  runAsUser: 1000

# 5. Don't ignore networking
#    Use network policies
```

Kubernetes vs. Docker Compose
| Feature | Kubernetes | Docker Compose |
|---|---|---|
| Scaling | Auto-scaling (see the HPA sketch below the table) | Manual scaling |
| Resilience | Self-healing | Manual recovery |
| Load Balancing | Built-in | Limited |
| Service Discovery | Built-in | Limited |
| Complexity | High | Low |
| Best For | Production, scaling | Development, testing |
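Auto-scaling in the table refers to the HorizontalPodAutoscaler. A minimal sketch follows, scaling the hypothetical `ingestion-api` Deployment from the Kubernetes Basics sketch on CPU utilization; the target, replica bounds, and threshold are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ingestion-api
  namespace: data-platform
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ingestion-api          # hypothetical Deployment from the basics sketch
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # scale out when average CPU use exceeds 70%
```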
Key Takeaways
- Spark Operator: Run Spark on K8s
- JupyterHub: Data science notebooks
- Airflow: Orchestration on K8s
- Resources: Set limits and requests
- Storage: Use persistent volumes
- Monitoring: Prometheus metrics
- Scaling: Auto-scaling based on load
- Use When: Production, scalability, resilience needed
Back to Module 3