Why Run Spark on Kubernetes — Escaping YARN and Mesos
Apache Spark was born in the Hadoop era and carries that legacy in its native cluster managers: YARN and Mesos. Both are purpose-built for long-running cluster daemons and collocated storage (HDFS), not for ephemeral, container-first workloads. Running Spark on YARN in 2026 means operating a separate cluster management plane, dealing with NodeManager resource fragmentation, and fighting YARN's queue-based fair scheduling that doesn't understand pod-level resource isolation.
Kubernetes changes the equation. Spark's native Kubernetes scheduler backend (GA since Spark 3.1) lets you submit jobs directly to a Kubernetes cluster — the driver runs as a pod, executors are launched as pods, and everything is garbage-collected when the job ends. The benefits are concrete: unified infrastructure (same cluster running Airflow, model serving, and Spark), fine-grained CPU/memory isolation via cgroups, spot instance pools for 60–80% cost reduction (see Kubernetes cost optimization), and GitOps lifecycle management via the Spark Kubernetes Operator. This guide covers everything from the first SparkApplication CRD to production cost optimization.
Spark on Kubernetes Architecture — Driver, Executors, and the Operator
Without the operator, a bare spark-submit --master k8s:// creates a driver pod and then spawns executor pods from inside the driver. This works but has critical operational gaps: no retry logic, no CRD-based state tracking, no webhook validation, and no Prometheus metrics endpoint. The Kubeflow Spark Operator (formerly GCP Spark Operator) fills all of these gaps by wrapping every Spark job in a SparkApplication or ScheduledSparkApplication CRD.
- SparkOperator controller — watches for
SparkApplicationresources, calls spark-submit on your behalf, and reconciles the desired state. On failure it applies therestartPolicy(OnFailure, Never, Always). - Driver pod — the JVM process that runs the SparkContext, builds the DAG, schedules tasks, and communicates with the Kubernetes API server to create executor pods. Requests should be right-sized: the driver typically needs 1–4 cores and 2–8 GB memory, plus overhead.
- Executor pods — short-lived workers created on demand. Each executor holds a JVM process with N task slots (
spark.executor.cores) and a configurable memory footprint split between heap, off-heap, and overhead. - Webhook — the operator installs a mutating admission webhook that injects sidecar containers, init containers, and volume mounts defined in the SparkApplication spec before the pod is created — no per-image customization required.
Installing the Spark Operator — Helm, RBAC, and Webhook Setup
The operator is distributed as a Helm chart. Install it into a dedicated namespace; the operator will watch all namespaces where you create SparkApplication resources (configurable per namespace or cluster-wide).
# Add the Spark Operator Helm repo
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
# Install the operator — enable webhook + Prometheus metrics
helm install spark-operator spark-operator/spark-operator \
--namespace spark-operator \
--create-namespace \
--set webhook.enable=true \
--set metrics.enable=true \
--set metrics.prometheusPort=10254 \
--set controller.workers=10 \
--set spark.jobNamespaces="{spark-jobs}" \
--version 1.4.6
# Verify the operator is running
kubectl get pods -n spark-operator
# NAME READY STATUS RESTARTS AGE
# spark-operator-7f4d8b9c5-xk9pl 1/1 Running 0 2m
# spark-operator-webhook-6b4d9b7c9-j8k2n 1/1 Running 0 2m
# Create the namespace where jobs will run
kubectl create namespace spark-jobs
# RBAC: service account for the driver pod to manage executor pods
kubectl create serviceaccount spark -n spark-jobs
kubectl create clusterrolebinding spark-role \
--clusterrole=edit \
--serviceaccount=spark-jobs:spark \
--namespace=spark-jobsNote
connection refused from the API server. Always run the operator with replicaCount: 2 and a PodDisruptionBudget in production to ensure webhook availability during node upgrades and operator rollouts.SparkApplication CRD — Complete Configuration Reference
The SparkApplication CRD is the declarative interface to every Spark job. It covers the Docker image, Spark configuration, driver and executor resource profiles, volume mounts, environment variables, Python dependencies, restart policy, and retry logic — everything spark-submit flags can express, but in a version-controlled, GitOps-friendly YAML manifest.
# spark-application.yaml — production-ready SparkApplication manifest
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: etl-daily-orders
namespace: spark-jobs
labels:
app: etl-daily-orders
team: data-engineering
environment: production
spec:
type: Python
pythonVersion: "3"
mode: cluster
# Container image — built with your ETL code baked in
image: "my-registry.io/spark-etl:3.5.1-v42"
imagePullPolicy: IfNotPresent
mainApplicationFile: "local:///app/etl/daily_orders.py"
arguments:
- "--date=2026-06-06"
- "--env=production"
sparkVersion: "3.5.1"
# ── Restart policy ──────────────────────────────────────────────────────────
restartPolicy:
type: OnFailure
onFailureRetries: 3
onFailureRetryInterval: 30 # seconds between retries
onSubmissionFailureRetries: 5
onSubmissionFailureRetryInterval: 20
# ── Spark configuration ─────────────────────────────────────────────────────
sparkConf:
"spark.sql.adaptive.enabled": "true"
"spark.sql.adaptive.coalescePartitions.enabled": "true"
"spark.sql.adaptive.skewJoin.enabled": "true"
"spark.sql.shuffle.partitions": "200"
"spark.serializer": "org.apache.spark.serializer.KryoSerializer"
"spark.kryoserializer.buffer.max": "512m"
"spark.sql.parquet.compression.codec": "snappy"
"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
"spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
# Dynamic allocation — see Section 6
"spark.dynamicAllocation.enabled": "true"
"spark.dynamicAllocation.minExecutors": "2"
"spark.dynamicAllocation.maxExecutors": "20"
"spark.dynamicAllocation.initialExecutors": "5"
"spark.dynamicAllocation.shuffleTracking.enabled": "true"
# ── Driver spec ─────────────────────────────────────────────────────────────
driver:
cores: 1
coreLimit: "1200m"
memory: "2g"
memoryOverhead: "512m"
serviceAccount: spark
labels:
component: driver
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
env:
- name: AWS_REGION
value: us-east-1
- name: APP_ENV
value: production
envFrom:
- secretRef:
name: spark-etl-secrets # DB_USER, DB_PASSWORD, API_KEY
volumeMounts:
- name: spark-local-dir
mountPath: /tmp/spark-local
- name: config-volume
mountPath: /app/config
# ── Executor spec ────────────────────────────────────────────────────────────
executor:
cores: 2
coreLimit: "2400m"
instances: 5 # initial count; dynamic allocation overrides
memory: "4g"
memoryOverhead: "1g"
labels:
component: executor
env:
- name: AWS_REGION
value: us-east-1
volumeMounts:
- name: spark-local-dir
mountPath: /tmp/spark-local
# ── Shared volumes ───────────────────────────────────────────────────────────
volumes:
- name: spark-local-dir
emptyDir:
sizeLimit: 20Gi
- name: config-volume
configMap:
name: spark-etl-config
# ── Dependencies ─────────────────────────────────────────────────────────────
deps:
jars:
- "local:///app/jars/aws-java-sdk-bundle-1.12.600.jar"
- "local:///app/jars/hadoop-aws-3.3.4.jar"
pyFiles:
- "local:///app/lib/shared_utils.zip"
# ── Monitoring ───────────────────────────────────────────────────────────────
monitoring:
exposeDriverMetrics: true
exposeExecutorMetrics: true
prometheus:
jmxExporterJar: "/prometheus/jmx_prometheus_javaagent-0.20.0.jar"
port: 8090Resource Management — Requests, Limits, Memory Tuning, and Pod Templates
Spark on Kubernetes translates memory settings into pod resource requests in a non-obvious way. The memory field sets the JVM heap (spark.executor.memory), and memoryOverhead covers non-heap memory (off-heap, native libs, Python worker memory, JVM overhead). The pod's actual memory request is memory + memoryOverhead. If you omit memoryOverhead, Spark defaults to max(384m, 10% of memory) — often not enough for PySpark workloads where the Python worker process adds 200–800 MB.
# ── Memory anatomy for a 4g executor with 1g overhead ────────────────────────
#
# Kubernetes pod memory request = 4g + 1g = 5g
#
# Inside the JVM (4g heap):
# spark.executor.memoryFraction (default 0.6) → 2.4g for execution
# spark.executor.memory.storageFraction (default 0.5) → 1.2g for storage/cache
#
# Off-heap:
# spark.memory.offHeap.enabled = true
# spark.memory.offHeap.size = 512m (part of overhead budget)
#
# Python worker (PySpark):
# pyspark.worker.memory = 512m (also part of overhead)
# ── SparkApplication pod template for init containers and extra volumes ────────
# Use podTemplateFile to inject Kubernetes-specific fields the CRD doesn't expose
---
apiVersion: v1
kind: ConfigMap
metadata:
name: spark-executor-template
namespace: spark-jobs
data:
executor-template.yaml: |
apiVersion: v1
kind: Pod
spec:
# Priority class — keeps executors on spot nodes
priorityClassName: spark-batch-low
tolerations:
- key: "spot"
operator: "Equal"
value: "true"
effect: "NoSchedule"
affinity:
nodeAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 80
preference:
matchExpressions:
- key: node.kubernetes.io/instance-type
operator: In
values: [m5.4xlarge, m5.8xlarge, m6i.4xlarge, r5.4xlarge]
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 50
podAffinityTerm:
labelSelector:
matchLabels:
component: executor
topologyKey: kubernetes.io/hostname
initContainers:
- name: wait-for-driver
image: busybox:1.36
command: ['sh', '-c', 'until nc -z ${SPARK_DRIVER_BIND_ADDRESS} 7078; do sleep 2; done']
containers:
- name: spark-kubernetes-executor
resources:
requests:
ephemeral-storage: "5Gi"
limits:
ephemeral-storage: "20Gi"
# Reference in SparkApplication
---
sparkConf:
"spark.kubernetes.executor.podTemplateFile": "/tmp/executor-template.yaml"
"spark.kubernetes.driver.podTemplateFile": "/tmp/driver-template.yaml"
# ── Right-sizing executor memory for PySpark ──────────────────────────────────
# Rule of thumb per task slot:
# - Input partition size: 128–256 MB (default spark.sql.files.maxPartitionBytes)
# - Working memory per task (3x input): ~400–800 MB
# - 2 cores → ~1.6 GB heap needed
# - Add 40% shuffle overhead: ~2.3 GB
# - Round up to 3g heap + 1g overhead → 4g pod request per 2-core executor
# ── ResourceQuota for multi-tenant namespace isolation ─────────────────────────
apiVersion: v1
kind: ResourceQuota
metadata:
name: team-data-engineering-quota
namespace: spark-jobs-team-a
spec:
hard:
requests.cpu: "80"
requests.memory: 320Gi
limits.cpu: "160"
limits.memory: 640Gi
count/pods: "200"
count/sparkoperator.k8s.io/v1beta2/sparkapplications: "20"Note
spark.executor.cores) and set the coreLimit 20% higher. Never set CPU limits equal to requests on executor pods — the Kubernetes CPU throttling that kicks in when the JVM GC spikes above the limit causes cascading task failures and heartbeat timeouts that look like executor crashes. Use Guaranteed QoS only for the driver pod where stability matters most.Dynamic Resource Allocation — Scaling Executors Without an External Shuffle Service
Spark's Dynamic Resource Allocation (DRA) lets Spark scale executor count up and down within a single job based on task queue depth. On YARN this required an external shuffle service; on Kubernetes, Spark 3.1+ ships shuffleTracking — a driver-side mechanism that tracks which executors hold live shuffle data and delays decommissioning them until their blocks are consumed. No daemon process required.
# ── Dynamic Resource Allocation with shuffle tracking ─────────────────────────
sparkConf:
# Enable DRA
"spark.dynamicAllocation.enabled": "true"
"spark.dynamicAllocation.shuffleTracking.enabled": "true"
# Executor count bounds
"spark.dynamicAllocation.minExecutors": "2"
"spark.dynamicAllocation.maxExecutors": "50"
"spark.dynamicAllocation.initialExecutors": "5"
# Scale-up aggressiveness
# Request new executors when tasks have been pending for 1s
"spark.dynamicAllocation.schedulerBacklogTimeout": "1s"
"spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "5s"
# Scale-down conservatism — only remove idle executors after 60s
"spark.dynamicAllocation.executorIdleTimeout": "60s"
# Keep cached-data executors alive longer (avoid re-reading S3 for cached RDDs)
"spark.dynamicAllocation.cachedExecutorIdleTimeout": "300s"
# Shuffle tracking — don't evict executors that hold un-fetched shuffle blocks
"spark.dynamicAllocation.shuffleTracking.timeout": "300s"
# ── Node pool topology for DRA ─────────────────────────────────────────────────
# Label a spot node pool for Spark executors
# Karpenter NodePool (preferred over Cluster Autoscaler for Spark)
---
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
name: spark-spot-pool
spec:
template:
metadata:
labels:
workload: spark-executor
spec:
nodeClassRef:
group: karpenter.k8s.aws
kind: EC2NodeClass
name: spark-nodes
requirements:
- key: karpenter.sh/capacity-type
operator: In
values: ["spot", "on-demand"]
- key: kubernetes.io/arch
operator: In
values: ["amd64"]
- key: node.kubernetes.io/instance-type
operator: In
values:
- m5.4xlarge
- m5.8xlarge
- m5.16xlarge
- m6i.4xlarge
- m6i.8xlarge
- r5.4xlarge # memory-optimized for join-heavy jobs
- r5.8xlarge
taints:
- key: workload
value: spark-executor
effect: NoSchedule
limits:
cpu: 2000
memory: 8000Gi
disruption:
consolidationPolicy: WhenEmptyOrUnderutilized
consolidateAfter: 120s # wait 2 min before removing underutilized nodes
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
name: spark-nodes
spec:
amiSelectorTerms:
- alias: al2023@latest
subnetSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
securityGroupSelectorTerms:
- tags:
karpenter.sh/discovery: my-cluster
instanceStorePolicy: RAID0 # stripe NVMe instance storage for fast shuffle
blockDeviceMappings:
- deviceName: /dev/xvda
ebs:
volumeSize: 50Gi
volumeType: gp3
iops: 3000
throughput: 125Spot Instance Cost Optimization — Preemption Handling and Checkpoint Strategies
Spot instances offer 60–80% savings over on-demand pricing but can be reclaimed with a two-minute warning. For Spark this means potential executor loss — which Spark handles gracefully through task re-execution. The risk is the driver pod: if the driver runs on a spot node and is preempted, the entire job is lost. The architecture is straightforward: always run the driver on on-demand nodes and route executors to spot pools.
# ── Driver on on-demand, executors on spot ────────────────────────────────────
driver:
cores: 2
memory: "4g"
labels:
workload: spark-driver
nodeSelector:
karpenter.sh/capacity-type: on-demand
workload: spark-driver
tolerations: [] # no spot toleration on driver
annotations:
cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
executor:
cores: 4
memory: "8g"
memoryOverhead: "2g"
labels:
workload: spark-executor
tolerations:
- key: workload
operator: Equal
value: spark-executor
effect: NoSchedule
nodeSelector:
workload: spark-executor # Karpenter spots this label and places on spot pool
# ── Graceful executor decommission on spot interruption ───────────────────────
sparkConf:
# Enable graceful decommission (Spark 3.1+)
"spark.executor.decommission.enabled": "true"
# How long to wait for in-progress tasks to finish before forced shutdown
"spark.storage.decommission.enabled": "true"
"spark.storage.decommission.shuffleBlocks.enabled": "true"
"spark.storage.decommission.rddBlocks.enabled": "true"
"spark.decommission.timeout": "90s"
# ── Kubernetes SIGTERM handler for 2-minute spot warning ──────────────────────
# Karpenter sends SIGTERM when spot reclamation is imminent.
# Spark's executor process catches it and begins graceful decommission.
# Wire this into your preStop lifecycle hook for belt-and-suspenders:
executor:
lifecycle:
preStop:
exec:
command: ["/bin/sh", "-c", "sleep 25"] # give Spark time to migrate blocks
# ── Checkpointing for long-running jobs (>2h) ─────────────────────────────────
# For iterative ML jobs or long pipelines, checkpoint RDDs to S3
# so spot preemption triggers a restart from the latest checkpoint, not t=0.
sparkConf:
"spark.checkpoint.dir": "s3a://my-bucket/spark-checkpoints/etl-daily-orders"
# In PySpark code:
# rdd.checkpoint() # after expensive shuffle stages
# spark.sparkContext.setCheckpointDir("s3a://my-bucket/spark-checkpoints/")Note
Gang Scheduling with Volcano — Eliminating Partial-Allocation Deadlocks
Kubernetes' default scheduler allocates pods one at a time. Under resource contention, a Spark job can have its driver running while all executor slots are taken by another job's executors — neither job makes progress, and both hold resources. This is the partial allocation deadlock. Volcano solves it with gang scheduling: it holds all pods for a job in a waiting state until the full minAvailable count can be placed simultaneously.
# Install Volcano scheduler
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
# ── SparkApplication with Volcano gang scheduling ─────────────────────────────
sparkConf:
"spark.kubernetes.scheduler.name": "volcano"
# Minimum executors that must start together (gang constraint)
"spark.kubernetes.scheduler.volcano.podGroupPolicy.minAvailable": "5"
# ── Volcano Queue — per-team resource pools ───────────────────────────────────
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
name: team-data-engineering
spec:
weight: 3 # relative weight vs other queues
reclaimable: true # allows borrowing from idle queues
capability:
cpu: 80
memory: 320Gi
---
# Reference the queue in SparkApplication
sparkConf:
"spark.kubernetes.scheduler.volcano.podGroupPolicy.queue": "team-data-engineering"
# ── PodGroup created automatically by operator ────────────────────────────────
# The Spark operator creates a PodGroup for each SparkApplication when
# spark.kubernetes.scheduler.name=volcano. Monitor pending jobs:
kubectl get podgroup -n spark-jobs
# NAME STATUS AGE
# etl-daily-orders-group Running 5m
# ml-feature-gen-group Pending 2m ← waiting for gang to fitMonitoring — Prometheus Metrics, Grafana Dashboard, and SLO Alerting
Spark exposes JVM and application metrics via JMX. The Spark Operator integrates a JMX Prometheus exporter as a Java agent on both the driver and executors, creating a/metrics endpoint on port 8090. Prometheus scrapes it via PodMonitor (Prometheus Operator CRD). The operator itself exposes operator-level metrics (job submit count, failure count, active application count) on its own metrics port.
# ── PodMonitor for Spark driver and executor metrics ─────────────────────────
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: spark-applications
namespace: spark-jobs
spec:
selector:
matchLabels:
spark-app-selector: spark-etl # set via sparkConf label
podMetricsEndpoints:
- port: "metrics" # name from SparkApplication monitoring.prometheus.port
interval: 15s
scheme: http
path: /metrics
# ── PodMonitor for the Spark Operator itself ──────────────────────────────────
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: spark-operator
namespace: spark-operator
spec:
selector:
matchLabels:
app.kubernetes.io/name: spark-operator
podMetricsEndpoints:
- port: "metrics"
interval: 30s
# ── Prometheus alerting rules for Spark SLOs ──────────────────────────────────
groups:
- name: spark.rules
rules:
# Alert when a Spark job has been running longer than its SLA
- alert: SparkJobExceedsSLA
expr: |
(time() - spark_app_submission_time_seconds) > 7200 # 2h SLA
and spark_app_state == 1 # 1=RUNNING
for: 5m
labels:
severity: warning
annotations:
summary: "Spark job {{ $labels.app_name }} running > 2h"
# Alert when executor failure rate is high (job about to fail)
- alert: SparkHighExecutorFailureRate
expr: |
increase(spark_app_executor_failure_count_total[10m]) > 5
for: 2m
labels:
severity: critical
annotations:
summary: "High executor failures in {{ $labels.app_name }}"
# Operator: pending submissions backlog
- alert: SparkOperatorHighSubmissionBacklog
expr: spark_operator_spark_app_count{state="PENDING"} > 20
for: 10m
labels:
severity: warning
annotations:
summary: "{{ $value }} Spark apps pending — possible resource starvation"
# ── Key metrics to dashboard in Grafana ───────────────────────────────────────
# spark_app_executor_count — current executor count (DRA scaling visible here)
# spark_app_task_failure_count_total — cumulative task failures
# spark_app_executor_failure_count_total — cumulative executor failures
# jvm_memory_bytes_used{area="heap"} — heap utilization per executor
# jvm_gc_collection_seconds_sum — GC pressure
# spark_operator_spark_app_count — running/pending/completed/failed appsProduction Patterns — CI/CD Pipeline, Image Building, and Multi-Tenant Namespacing
A production Spark-on-Kubernetes delivery pipeline has four stages: build a reproducible Docker image with all dependencies, validate the SparkApplication manifest with a schema linter, run a test job in a staging namespace with reduced executor count, then promote to production via ArgoCD or a GitOps pull request.
# ── Dockerfile: Spark 3.5 + Python dependencies ──────────────────────────────
FROM apache/spark:3.5.1-python3
USER root
# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
curl openjdk-17-jdk-headless && \
rm -rf /var/lib/apt/lists/*
# Install Python dependencies
COPY requirements.txt /opt/spark/requirements.txt
RUN pip install --no-cache-dir -r /opt/spark/requirements.txt
# Copy application code
COPY src/ /app/
COPY jars/ /app/jars/
COPY config/ /app/config/
# JMX exporter agent
COPY prometheus/jmx_prometheus_javaagent-0.20.0.jar /prometheus/
USER 185 # spark user
# ── GitHub Actions CI/CD pipeline ────────────────────────────────────────────
# .github/workflows/spark-etl-deploy.yml (excerpt)
name: Spark ETL — Build and Deploy
on:
push:
branches: [main]
paths: ['spark-etl/**']
jobs:
build-and-push:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build and push image
uses: docker/build-push-action@v5
with:
context: ./spark-etl
push: true
tags: my-registry.io/spark-etl:3.5.1-${{ github.sha }}
validate-manifests:
needs: build-and-push
runs-on: ubuntu-latest
steps:
- name: Validate SparkApplication CRD
run: |
helm template spark-etl ./helm/spark-etl \
--set image.tag=${{ github.sha }} | \
kubectl apply --dry-run=server -f -
deploy-staging:
needs: validate-manifests
steps:
- name: Deploy to staging and wait for completion
run: |
kubectl apply -f manifests/staging/spark-application.yaml
kubectl wait sparkapplication/etl-daily-orders-staging \
--for=condition=Running --timeout=300s -n spark-jobs-staging
# Poll until COMPLETED or FAILED
for i in $(seq 1 60); do
STATE=$(kubectl get sparkapplication etl-daily-orders-staging \
-n spark-jobs-staging -o jsonpath='{.status.applicationState.state}')
if [ "$STATE" = "COMPLETED" ]; then exit 0; fi
if [ "$STATE" = "FAILED" ]; then exit 1; fi
sleep 30
done
# ── ScheduledSparkApplication for recurring cron jobs ─────────────────────────
apiVersion: sparkoperator.k8s.io/v1beta2
kind: ScheduledSparkApplication
metadata:
name: etl-daily-orders-scheduled
namespace: spark-jobs
spec:
schedule: "0 2 * * *" # 02:00 UTC daily
concurrencyPolicy: Forbid # skip if previous run not done
successfulRunHistoryLimit: 5
failedRunHistoryLimit: 5
template:
# Identical to SparkApplication spec above
type: Python
image: "my-registry.io/spark-etl:3.5.1-latest"
mainApplicationFile: "local:///app/etl/daily_orders.py"
sparkVersion: "3.5.1"
restartPolicy:
type: OnFailure
onFailureRetries: 2
onFailureRetryInterval: 60
driver:
cores: 1
memory: "2g"
serviceAccount: spark
executor:
cores: 2
instances: 5
memory: "4g"
memoryOverhead: "1g"Note
LimitRange objects alongside ResourceQuota to enforce minimum and maximum resource requests per pod. Without a LimitRange, a misconfigured SparkApplication with no resource requests will be scheduled as a BestEffort pod and evicted first under pressure — often at the worst moment during a critical pipeline run.Cost Tracking — Per-Job Compute Attribution and FinOps Patterns
Running Spark on Kubernetes opens the door to precise per-job cost attribution. Label every SparkApplication with team, pipeline, and cost-center — these propagate to all driver and executor pods. Tools like OpenCost or Kubecost aggregate pod-level resource consumption by label, letting you see exactly what each team's Spark jobs cost per day, week, or pipeline run.
# ── Label propagation in SparkApplication ────────────────────────────────────
metadata:
labels:
team: data-engineering
pipeline: daily-orders-etl
cost-center: cc-1042
environment: production
driver:
labels:
team: data-engineering
pipeline: daily-orders-etl
cost-center: cc-1042
executor:
labels:
team: data-engineering
pipeline: daily-orders-etl
cost-center: cc-1042
# ── OpenCost Prometheus query: daily cost by pipeline ─────────────────────────
# (OpenCost exposes container_cpu_allocation and container_memory_allocation metrics)
# CPU cost per pipeline per day
sum by (label_pipeline) (
increase(container_cpu_allocation{namespace="spark-jobs"}[1d])
* on(pod) group_left(label_pipeline)
kube_pod_labels{namespace="spark-jobs"}
) * on() group_left()
opencost_node_cpu_hourly_cost
# Total daily Spark cost per team
sum by (label_team) (
(
increase(container_cpu_allocation{namespace=~"spark-jobs.*"}[1d])
* on() group_left() opencost_node_cpu_hourly_cost
) + (
increase(container_memory_allocation_bytes{namespace=~"spark-jobs.*"}[1d])
* on() group_left() opencost_node_ram_hourly_cost / 1073741824
)
)
# ── Weekly cost anomaly alert ─────────────────────────────────────────────────
- alert: SparkWeeklyCostAnomaly
expr: |
sum by (label_pipeline) (
increase(spark_job_cost_usd[7d])
) > 1.5 * sum by (label_pipeline) (
avg_over_time(spark_job_cost_usd[4w:7d])
)
labels:
severity: warning
annotations:
summary: "Spark pipeline {{ $labels.label_pipeline }} cost up 50% vs 4-week avg"Work with us
Running Spark on YARN or standalone clusters and looking to migrate to Kubernetes for unified infrastructure and lower costs?
We design and implement Spark-on-Kubernetes platforms — from Spark Operator installation and SparkApplication CRD design to dynamic resource allocation configuration, Karpenter spot node pool setup, Volcano gang scheduling for high-throughput workloads, JMX Prometheus monitoring with Grafana dashboards, CI/CD pipelines for Spark image builds and manifest promotion, multi-tenant ResourceQuota and LimitRange enforcement, and per-job cost attribution with OpenCost. Let’s talk.
Get in touch