Back to Blog
Apache SparkKubernetesSpark OperatorData EngineeringCost OptimizationResource ManagementDevOpsCloud

Running Spark on Kubernetes — Operators, Resource Management, and Cost Optimization

A comprehensive guide to running Apache Spark on Kubernetes in production: the Kubeflow Spark Operator with SparkApplication and ScheduledSparkApplication CRDs, driver and executor pod configuration with memory anatomy (heap + overhead + Python worker), pod templates for init containers, volume mounts, and node affinity, dynamic resource allocation with shuffle tracking (no external shuffle service required), Karpenter NodePool configuration for spot instance pools with multi-instance-type diversification, driver on-demand and executor spot separation for cost-safe preemption handling, graceful executor decommission on spot interruption, Volcano gang scheduler for partial-allocation deadlock elimination, JMX Prometheus exporter with PodMonitor CRDs, SLO alerting rules for job duration and executor failure rate, CI/CD pipeline with Docker image build, manifest validation, staging smoke test, and ScheduledSparkApplication for cron-based pipelines, per-job cost attribution with OpenCost label queries, and ResourceQuota + LimitRange patterns for multi-tenant namespace isolation.

2026-06-06

Why Run Spark on Kubernetes — Escaping YARN and Mesos

Apache Spark was born in the Hadoop era and carries that legacy in its native cluster managers: YARN and Mesos. Both are purpose-built for long-running cluster daemons and collocated storage (HDFS), not for ephemeral, container-first workloads. Running Spark on YARN in 2026 means operating a separate cluster management plane, dealing with NodeManager resource fragmentation, and fighting YARN's queue-based fair scheduling that doesn't understand pod-level resource isolation.

Kubernetes changes the equation. Spark's native Kubernetes scheduler backend (GA since Spark 3.1) lets you submit jobs directly to a Kubernetes cluster — the driver runs as a pod, executors are launched as pods, and everything is garbage-collected when the job ends. The benefits are concrete: unified infrastructure (same cluster running Airflow, model serving, and Spark), fine-grained CPU/memory isolation via cgroups, spot instance pools for 60–80% cost reduction (see Kubernetes cost optimization), and GitOps lifecycle management via the Spark Kubernetes Operator. This guide covers everything from the first SparkApplication CRD to production cost optimization.

Spark on Kubernetes Architecture — Driver, Executors, and the Operator

Without the operator, a bare spark-submit --master k8s:// creates a driver pod and then spawns executor pods from inside the driver. This works but has critical operational gaps: no retry logic, no CRD-based state tracking, no webhook validation, and no Prometheus metrics endpoint. The Kubeflow Spark Operator (formerly GCP Spark Operator) fills all of these gaps by wrapping every Spark job in a SparkApplication or ScheduledSparkApplication CRD.

  • SparkOperator controller — watches for SparkApplication resources, calls spark-submit on your behalf, and reconciles the desired state. On failure it applies the restartPolicy (OnFailure, Never, Always).
  • Driver pod — the JVM process that runs the SparkContext, builds the DAG, schedules tasks, and communicates with the Kubernetes API server to create executor pods. Requests should be right-sized: the driver typically needs 1–4 cores and 2–8 GB memory, plus overhead.
  • Executor pods — short-lived workers created on demand. Each executor holds a JVM process with N task slots (spark.executor.cores) and a configurable memory footprint split between heap, off-heap, and overhead.
  • Webhook — the operator installs a mutating admission webhook that injects sidecar containers, init containers, and volume mounts defined in the SparkApplication spec before the pod is created — no per-image customization required.

Installing the Spark Operator — Helm, RBAC, and Webhook Setup

The operator is distributed as a Helm chart. Install it into a dedicated namespace; the operator will watch all namespaces where you create SparkApplication resources (configurable per namespace or cluster-wide).

# Add the Spark Operator Helm repo
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update

# Install the operator — enable webhook + Prometheus metrics
helm install spark-operator spark-operator/spark-operator \
  --namespace spark-operator \
  --create-namespace \
  --set webhook.enable=true \
  --set metrics.enable=true \
  --set metrics.prometheusPort=10254 \
  --set controller.workers=10 \
  --set spark.jobNamespaces="{spark-jobs}" \
  --version 1.4.6

# Verify the operator is running
kubectl get pods -n spark-operator
# NAME                                    READY   STATUS    RESTARTS   AGE
# spark-operator-7f4d8b9c5-xk9pl          1/1     Running   0          2m
# spark-operator-webhook-6b4d9b7c9-j8k2n  1/1     Running   0          2m

# Create the namespace where jobs will run
kubectl create namespace spark-jobs

# RBAC: service account for the driver pod to manage executor pods
kubectl create serviceaccount spark -n spark-jobs
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=spark-jobs:spark \
  --namespace=spark-jobs

Note

The mutating webhook intercepts all pod creation requests in watched namespaces. If the webhook pod is unavailable, new SparkApplication jobs will fail with connection refused from the API server. Always run the operator with replicaCount: 2 and a PodDisruptionBudget in production to ensure webhook availability during node upgrades and operator rollouts.

SparkApplication CRD — Complete Configuration Reference

The SparkApplication CRD is the declarative interface to every Spark job. It covers the Docker image, Spark configuration, driver and executor resource profiles, volume mounts, environment variables, Python dependencies, restart policy, and retry logic — everything spark-submit flags can express, but in a version-controlled, GitOps-friendly YAML manifest.

# spark-application.yaml — production-ready SparkApplication manifest
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: etl-daily-orders
  namespace: spark-jobs
  labels:
    app: etl-daily-orders
    team: data-engineering
    environment: production
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster

  # Container image — built with your ETL code baked in
  image: "my-registry.io/spark-etl:3.5.1-v42"
  imagePullPolicy: IfNotPresent

  mainApplicationFile: "local:///app/etl/daily_orders.py"

  arguments:
    - "--date=2026-06-06"
    - "--env=production"

  sparkVersion: "3.5.1"

  # ── Restart policy ──────────────────────────────────────────────────────────
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 30          # seconds between retries
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20

  # ── Spark configuration ─────────────────────────────────────────────────────
  sparkConf:
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
    "spark.sql.adaptive.skewJoin.enabled": "true"
    "spark.sql.shuffle.partitions": "200"
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
    "spark.kryoserializer.buffer.max": "512m"
    "spark.sql.parquet.compression.codec": "snappy"
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    # Dynamic allocation — see Section 6
    "spark.dynamicAllocation.enabled": "true"
    "spark.dynamicAllocation.minExecutors": "2"
    "spark.dynamicAllocation.maxExecutors": "20"
    "spark.dynamicAllocation.initialExecutors": "5"
    "spark.dynamicAllocation.shuffleTracking.enabled": "true"

  # ── Driver spec ─────────────────────────────────────────────────────────────
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "2g"
    memoryOverhead: "512m"
    serviceAccount: spark
    labels:
      component: driver
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    env:
      - name: AWS_REGION
        value: us-east-1
      - name: APP_ENV
        value: production
    envFrom:
      - secretRef:
          name: spark-etl-secrets       # DB_USER, DB_PASSWORD, API_KEY
    volumeMounts:
      - name: spark-local-dir
        mountPath: /tmp/spark-local
      - name: config-volume
        mountPath: /app/config

  # ── Executor spec ────────────────────────────────────────────────────────────
  executor:
    cores: 2
    coreLimit: "2400m"
    instances: 5                        # initial count; dynamic allocation overrides
    memory: "4g"
    memoryOverhead: "1g"
    labels:
      component: executor
    env:
      - name: AWS_REGION
        value: us-east-1
    volumeMounts:
      - name: spark-local-dir
        mountPath: /tmp/spark-local

  # ── Shared volumes ───────────────────────────────────────────────────────────
  volumes:
    - name: spark-local-dir
      emptyDir:
        sizeLimit: 20Gi
    - name: config-volume
      configMap:
        name: spark-etl-config

  # ── Dependencies ─────────────────────────────────────────────────────────────
  deps:
    jars:
      - "local:///app/jars/aws-java-sdk-bundle-1.12.600.jar"
      - "local:///app/jars/hadoop-aws-3.3.4.jar"
    pyFiles:
      - "local:///app/lib/shared_utils.zip"

  # ── Monitoring ───────────────────────────────────────────────────────────────
  monitoring:
    exposeDriverMetrics: true
    exposeExecutorMetrics: true
    prometheus:
      jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.20.0.jar"
      port: 8090

Resource Management — Requests, Limits, Memory Tuning, and Pod Templates

Spark on Kubernetes translates memory settings into pod resource requests in a non-obvious way. The memory field sets the JVM heap (spark.executor.memory), and memoryOverhead covers non-heap memory (off-heap, native libs, Python worker memory, JVM overhead). The pod's actual memory request is memory + memoryOverhead. If you omit memoryOverhead, Spark defaults to max(384m, 10% of memory) — often not enough for PySpark workloads where the Python worker process adds 200–800 MB.

# ── Memory anatomy for a 4g executor with 1g overhead ────────────────────────
#
#   Kubernetes pod memory request = 4g + 1g = 5g
#
#   Inside the JVM (4g heap):
#     spark.executor.memoryFraction     (default 0.6)  → 2.4g for execution
#     spark.executor.memory.storageFraction (default 0.5) → 1.2g for storage/cache
#
#   Off-heap:
#     spark.memory.offHeap.enabled = true
#     spark.memory.offHeap.size    = 512m   (part of overhead budget)
#
#   Python worker (PySpark):
#     pyspark.worker.memory = 512m         (also part of overhead)

# ── SparkApplication pod template for init containers and extra volumes ────────
# Use podTemplateFile to inject Kubernetes-specific fields the CRD doesn't expose
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: spark-executor-template
  namespace: spark-jobs
data:
  executor-template.yaml: |
    apiVersion: v1
    kind: Pod
    spec:
      # Priority class — keeps executors on spot nodes
      priorityClassName: spark-batch-low
      tolerations:
        - key: "spot"
          operator: "Equal"
          value: "true"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values: [m5.4xlarge, m5.8xlarge, m6i.4xlarge, r5.4xlarge]
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    component: executor
                topologyKey: kubernetes.io/hostname
      initContainers:
        - name: wait-for-driver
          image: busybox:1.36
          command: ['sh', '-c', 'until nc -z ${SPARK_DRIVER_BIND_ADDRESS} 7078; do sleep 2; done']
      containers:
        - name: spark-kubernetes-executor
          resources:
            requests:
              ephemeral-storage: "5Gi"
            limits:
              ephemeral-storage: "20Gi"

# Reference in SparkApplication
---
sparkConf:
  "spark.kubernetes.executor.podTemplateFile": "/tmp/executor-template.yaml"
  "spark.kubernetes.driver.podTemplateFile": "/tmp/driver-template.yaml"

# ── Right-sizing executor memory for PySpark ──────────────────────────────────
# Rule of thumb per task slot:
#   - Input partition size:              128–256 MB (default spark.sql.files.maxPartitionBytes)
#   - Working memory per task (3x input): ~400–800 MB
#   - 2 cores → ~1.6 GB heap needed
#   - Add 40% shuffle overhead:          ~2.3 GB
#   - Round up to 3g heap + 1g overhead → 4g pod request per 2-core executor

# ── ResourceQuota for multi-tenant namespace isolation ─────────────────────────
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-data-engineering-quota
  namespace: spark-jobs-team-a
spec:
  hard:
    requests.cpu: "80"
    requests.memory: 320Gi
    limits.cpu: "160"
    limits.memory: 640Gi
    count/pods: "200"
    count/sparkoperator.k8s.io/v1beta2/sparkapplications: "20"

Note

Set CPU requests equal to the number of task slots (spark.executor.cores) and set the coreLimit 20% higher. Never set CPU limits equal to requests on executor pods — the Kubernetes CPU throttling that kicks in when the JVM GC spikes above the limit causes cascading task failures and heartbeat timeouts that look like executor crashes. Use Guaranteed QoS only for the driver pod where stability matters most.

Dynamic Resource Allocation — Scaling Executors Without an External Shuffle Service

Spark's Dynamic Resource Allocation (DRA) lets Spark scale executor count up and down within a single job based on task queue depth. On YARN this required an external shuffle service; on Kubernetes, Spark 3.1+ ships shuffleTracking — a driver-side mechanism that tracks which executors hold live shuffle data and delays decommissioning them until their blocks are consumed. No daemon process required.

# ── Dynamic Resource Allocation with shuffle tracking ─────────────────────────
sparkConf:
  # Enable DRA
  "spark.dynamicAllocation.enabled": "true"
  "spark.dynamicAllocation.shuffleTracking.enabled": "true"

  # Executor count bounds
  "spark.dynamicAllocation.minExecutors": "2"
  "spark.dynamicAllocation.maxExecutors": "50"
  "spark.dynamicAllocation.initialExecutors": "5"

  # Scale-up aggressiveness
  # Request new executors when tasks have been pending for 1s
  "spark.dynamicAllocation.schedulerBacklogTimeout": "1s"
  "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "5s"

  # Scale-down conservatism — only remove idle executors after 60s
  "spark.dynamicAllocation.executorIdleTimeout": "60s"

  # Keep cached-data executors alive longer (avoid re-reading S3 for cached RDDs)
  "spark.dynamicAllocation.cachedExecutorIdleTimeout": "300s"

  # Shuffle tracking — don't evict executors that hold un-fetched shuffle blocks
  "spark.dynamicAllocation.shuffleTracking.timeout": "300s"

# ── Node pool topology for DRA ─────────────────────────────────────────────────
# Label a spot node pool for Spark executors
# Karpenter NodePool (preferred over Cluster Autoscaler for Spark)
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spark-spot-pool
spec:
  template:
    metadata:
      labels:
        workload: spark-executor
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: spark-nodes
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m5.4xlarge
            - m5.8xlarge
            - m5.16xlarge
            - m6i.4xlarge
            - m6i.8xlarge
            - r5.4xlarge           # memory-optimized for join-heavy jobs
            - r5.8xlarge
      taints:
        - key: workload
          value: spark-executor
          effect: NoSchedule
  limits:
    cpu: 2000
    memory: 8000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 120s      # wait 2 min before removing underutilized nodes

---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: spark-nodes
spec:
  amiSelectorTerms:
    - alias: al2023@latest
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  instanceStorePolicy: RAID0   # stripe NVMe instance storage for fast shuffle
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 50Gi
        volumeType: gp3
        iops: 3000
        throughput: 125

Spot Instance Cost Optimization — Preemption Handling and Checkpoint Strategies

Spot instances offer 60–80% savings over on-demand pricing but can be reclaimed with a two-minute warning. For Spark this means potential executor loss — which Spark handles gracefully through task re-execution. The risk is the driver pod: if the driver runs on a spot node and is preempted, the entire job is lost. The architecture is straightforward: always run the driver on on-demand nodes and route executors to spot pools.

# ── Driver on on-demand, executors on spot ────────────────────────────────────
driver:
  cores: 2
  memory: "4g"
  labels:
    workload: spark-driver
  nodeSelector:
    karpenter.sh/capacity-type: on-demand
    workload: spark-driver
  tolerations: []               # no spot toleration on driver
  annotations:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"

executor:
  cores: 4
  memory: "8g"
  memoryOverhead: "2g"
  labels:
    workload: spark-executor
  tolerations:
    - key: workload
      operator: Equal
      value: spark-executor
      effect: NoSchedule
  nodeSelector:
    workload: spark-executor    # Karpenter spots this label and places on spot pool

# ── Graceful executor decommission on spot interruption ───────────────────────
sparkConf:
  # Enable graceful decommission (Spark 3.1+)
  "spark.executor.decommission.enabled": "true"
  # How long to wait for in-progress tasks to finish before forced shutdown
  "spark.storage.decommission.enabled": "true"
  "spark.storage.decommission.shuffleBlocks.enabled": "true"
  "spark.storage.decommission.rddBlocks.enabled": "true"
  "spark.decommission.timeout": "90s"

# ── Kubernetes SIGTERM handler for 2-minute spot warning ──────────────────────
# Karpenter sends SIGTERM when spot reclamation is imminent.
# Spark's executor process catches it and begins graceful decommission.
# Wire this into your preStop lifecycle hook for belt-and-suspenders:
executor:
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "sleep 25"]   # give Spark time to migrate blocks

# ── Checkpointing for long-running jobs (>2h) ─────────────────────────────────
# For iterative ML jobs or long pipelines, checkpoint RDDs to S3
# so spot preemption triggers a restart from the latest checkpoint, not t=0.
sparkConf:
  "spark.checkpoint.dir": "s3a://my-bucket/spark-checkpoints/etl-daily-orders"

# In PySpark code:
# rdd.checkpoint()          # after expensive shuffle stages
# spark.sparkContext.setCheckpointDir("s3a://my-bucket/spark-checkpoints/")

Note

Spot savings are maximized when you diversify across multiple instance families and sizes. A Karpenter NodePool with 6–8 compatible instance types (m5.4xl, m5.8xl, m6i.4xl, r5.4xl, etc.) gives the spot market more room to fulfill requests and reduces the chance of a zone-wide capacity shortage interrupting your job. Avoid single-instance-type pools — they amplify spot interruption risk.

Gang Scheduling with Volcano — Eliminating Partial-Allocation Deadlocks

Kubernetes' default scheduler allocates pods one at a time. Under resource contention, a Spark job can have its driver running while all executor slots are taken by another job's executors — neither job makes progress, and both hold resources. This is the partial allocation deadlock. Volcano solves it with gang scheduling: it holds all pods for a job in a waiting state until the full minAvailable count can be placed simultaneously.

# Install Volcano scheduler
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

# ── SparkApplication with Volcano gang scheduling ─────────────────────────────
sparkConf:
  "spark.kubernetes.scheduler.name": "volcano"
  # Minimum executors that must start together (gang constraint)
  "spark.kubernetes.scheduler.volcano.podGroupPolicy.minAvailable": "5"

# ── Volcano Queue — per-team resource pools ───────────────────────────────────
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-data-engineering
spec:
  weight: 3               # relative weight vs other queues
  reclaimable: true       # allows borrowing from idle queues
  capability:
    cpu: 80
    memory: 320Gi

---
# Reference the queue in SparkApplication
sparkConf:
  "spark.kubernetes.scheduler.volcano.podGroupPolicy.queue": "team-data-engineering"

# ── PodGroup created automatically by operator ────────────────────────────────
# The Spark operator creates a PodGroup for each SparkApplication when
# spark.kubernetes.scheduler.name=volcano. Monitor pending jobs:
kubectl get podgroup -n spark-jobs
# NAME                    STATUS   AGE
# etl-daily-orders-group  Running  5m
# ml-feature-gen-group    Pending  2m   ← waiting for gang to fit

Monitoring — Prometheus Metrics, Grafana Dashboard, and SLO Alerting

Spark exposes JVM and application metrics via JMX. The Spark Operator integrates a JMX Prometheus exporter as a Java agent on both the driver and executors, creating a/metrics endpoint on port 8090. Prometheus scrapes it via PodMonitor (Prometheus Operator CRD). The operator itself exposes operator-level metrics (job submit count, failure count, active application count) on its own metrics port.

# ── PodMonitor for Spark driver and executor metrics ─────────────────────────
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: spark-applications
  namespace: spark-jobs
spec:
  selector:
    matchLabels:
      spark-app-selector: spark-etl      # set via sparkConf label
  podMetricsEndpoints:
    - port: "metrics"                    # name from SparkApplication monitoring.prometheus.port
      interval: 15s
      scheme: http
      path: /metrics

# ── PodMonitor for the Spark Operator itself ──────────────────────────────────
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: spark-operator
  namespace: spark-operator
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: spark-operator
  podMetricsEndpoints:
    - port: "metrics"
      interval: 30s

# ── Prometheus alerting rules for Spark SLOs ──────────────────────────────────
groups:
  - name: spark.rules
    rules:
      # Alert when a Spark job has been running longer than its SLA
      - alert: SparkJobExceedsSLA
        expr: |
          (time() - spark_app_submission_time_seconds) > 7200   # 2h SLA
          and spark_app_state == 1                               # 1=RUNNING
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Spark job {{ $labels.app_name }} running > 2h"

      # Alert when executor failure rate is high (job about to fail)
      - alert: SparkHighExecutorFailureRate
        expr: |
          increase(spark_app_executor_failure_count_total[10m]) > 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High executor failures in {{ $labels.app_name }}"

      # Operator: pending submissions backlog
      - alert: SparkOperatorHighSubmissionBacklog
        expr: spark_operator_spark_app_count{state="PENDING"} > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $value }} Spark apps pending — possible resource starvation"

# ── Key metrics to dashboard in Grafana ───────────────────────────────────────
# spark_app_executor_count           — current executor count (DRA scaling visible here)
# spark_app_task_failure_count_total — cumulative task failures
# spark_app_executor_failure_count_total — cumulative executor failures
# jvm_memory_bytes_used{area="heap"} — heap utilization per executor
# jvm_gc_collection_seconds_sum      — GC pressure
# spark_operator_spark_app_count     — running/pending/completed/failed apps

Production Patterns — CI/CD Pipeline, Image Building, and Multi-Tenant Namespacing

A production Spark-on-Kubernetes delivery pipeline has four stages: build a reproducible Docker image with all dependencies, validate the SparkApplication manifest with a schema linter, run a test job in a staging namespace with reduced executor count, then promote to production via ArgoCD or a GitOps pull request.

# ── Dockerfile: Spark 3.5 + Python dependencies ──────────────────────────────
FROM apache/spark:3.5.1-python3

USER root

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    curl openjdk-17-jdk-headless && \
    rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt /opt/spark/requirements.txt
RUN pip install --no-cache-dir -r /opt/spark/requirements.txt

# Copy application code
COPY src/ /app/
COPY jars/ /app/jars/
COPY config/ /app/config/

# JMX exporter agent
COPY prometheus/jmx_prometheus_javaagent-0.20.0.jar /prometheus/

USER 185    # spark user

# ── GitHub Actions CI/CD pipeline ────────────────────────────────────────────
# .github/workflows/spark-etl-deploy.yml (excerpt)
name: Spark ETL — Build and Deploy

on:
  push:
    branches: [main]
    paths: ['spark-etl/**']

jobs:
  build-and-push:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push image
        uses: docker/build-push-action@v5
        with:
          context: ./spark-etl
          push: true
          tags: my-registry.io/spark-etl:3.5.1-${{ github.sha }}

  validate-manifests:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - name: Validate SparkApplication CRD
        run: |
          helm template spark-etl ./helm/spark-etl \
            --set image.tag=${{ github.sha }} | \
            kubectl apply --dry-run=server -f -

  deploy-staging:
    needs: validate-manifests
    steps:
      - name: Deploy to staging and wait for completion
        run: |
          kubectl apply -f manifests/staging/spark-application.yaml
          kubectl wait sparkapplication/etl-daily-orders-staging \
            --for=condition=Running --timeout=300s -n spark-jobs-staging
          # Poll until COMPLETED or FAILED
          for i in $(seq 1 60); do
            STATE=$(kubectl get sparkapplication etl-daily-orders-staging \
              -n spark-jobs-staging -o jsonpath='{.status.applicationState.state}')
            if [ "$STATE" = "COMPLETED" ]; then exit 0; fi
            if [ "$STATE" = "FAILED" ]; then exit 1; fi
            sleep 30
          done

# ── ScheduledSparkApplication for recurring cron jobs ─────────────────────────
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
  name: etl-daily-orders-scheduled
  namespace: spark-jobs
spec:
  schedule: "0 2 * * *"              # 02:00 UTC daily
  concurrencyPolicy: Forbid           # skip if previous run not done
  successfulRunHistoryLimit: 5
  failedRunHistoryLimit: 5
  template:
    # Identical to SparkApplication spec above
    type: Python
    image: "my-registry.io/spark-etl:3.5.1-latest"
    mainApplicationFile: "local:///app/etl/daily_orders.py"
    sparkVersion: "3.5.1"
    restartPolicy:
      type: OnFailure
      onFailureRetries: 2
      onFailureRetryInterval: 60
    driver:
      cores: 1
      memory: "2g"
      serviceAccount: spark
    executor:
      cores: 2
      instances: 5
      memory: "4g"
      memoryOverhead: "1g"

Note

When running multiple teams on the same cluster, use one Kubernetes namespace per team and configure LimitRange objects alongside ResourceQuota to enforce minimum and maximum resource requests per pod. Without a LimitRange, a misconfigured SparkApplication with no resource requests will be scheduled as a BestEffort pod and evicted first under pressure — often at the worst moment during a critical pipeline run.

Cost Tracking — Per-Job Compute Attribution and FinOps Patterns

Running Spark on Kubernetes opens the door to precise per-job cost attribution. Label every SparkApplication with team, pipeline, and cost-center — these propagate to all driver and executor pods. Tools like OpenCost or Kubecost aggregate pod-level resource consumption by label, letting you see exactly what each team's Spark jobs cost per day, week, or pipeline run.

# ── Label propagation in SparkApplication ────────────────────────────────────
metadata:
  labels:
    team: data-engineering
    pipeline: daily-orders-etl
    cost-center: cc-1042
    environment: production

driver:
  labels:
    team: data-engineering
    pipeline: daily-orders-etl
    cost-center: cc-1042

executor:
  labels:
    team: data-engineering
    pipeline: daily-orders-etl
    cost-center: cc-1042

# ── OpenCost Prometheus query: daily cost by pipeline ─────────────────────────
# (OpenCost exposes container_cpu_allocation and container_memory_allocation metrics)

# CPU cost per pipeline per day
sum by (label_pipeline) (
  increase(container_cpu_allocation{namespace="spark-jobs"}[1d])
  * on(pod) group_left(label_pipeline)
  kube_pod_labels{namespace="spark-jobs"}
) * on() group_left()
  opencost_node_cpu_hourly_cost

# Total daily Spark cost per team
sum by (label_team) (
  (
    increase(container_cpu_allocation{namespace=~"spark-jobs.*"}[1d])
      * on() group_left() opencost_node_cpu_hourly_cost
  ) + (
    increase(container_memory_allocation_bytes{namespace=~"spark-jobs.*"}[1d])
      * on() group_left() opencost_node_ram_hourly_cost / 1073741824
  )
)

# ── Weekly cost anomaly alert ─────────────────────────────────────────────────
- alert: SparkWeeklyCostAnomaly
  expr: |
    sum by (label_pipeline) (
      increase(spark_job_cost_usd[7d])
    ) > 1.5 * sum by (label_pipeline) (
      avg_over_time(spark_job_cost_usd[4w:7d])
    )
  labels:
    severity: warning
  annotations:
    summary: "Spark pipeline {{ $labels.label_pipeline }} cost up 50% vs 4-week avg"

Work with us

Running Spark on YARN or standalone clusters and looking to migrate to Kubernetes for unified infrastructure and lower costs?

We design and implement Spark-on-Kubernetes platforms — from Spark Operator installation and SparkApplication CRD design to dynamic resource allocation configuration, Karpenter spot node pool setup, Volcano gang scheduling for high-throughput workloads, JMX Prometheus monitoring with Grafana dashboards, CI/CD pipelines for Spark image builds and manifest promotion, multi-tenant ResourceQuota and LimitRange enforcement, and per-job cost attribution with OpenCost. Let’s talk.

Get in touch

Related Articles

DataSOps Consulting

Need help implementing this in production?

We build and operate data pipelines, AI systems, and observability stacks for engineering teams. Reach out for a free 30-minute architecture review.