The Kubernetes Cost Problem Is Mostly Self-Inflicted
Kubernetes clusters are expensive by default — not because the platform is wasteful, but because teams routinely overprovision to avoid reliability risk. The average cloud-managed Kubernetes cluster runs at 20–40% effective CPU utilization; at 25%, you are paying for four times the compute you actually use. The rest is padding: inflated resource requests, idle replicas, and node capacity held in reserve against traffic spikes that mostly never materialize.
The good news is that most Kubernetes waste is recoverable without touching application code or sacrificing stability. The techniques in this guide — right-sized resource requests, autoscaling at the pod and node level, Spot instance integration, and namespace-level guardrails — routinely reduce cluster bills by 40–70% without degrading availability. The key is applying them systematically with observability at every layer.
Resource Requests and Limits — The Foundation
Every cost optimization conversation in Kubernetes starts here. Requests are the amount of CPU and memory the scheduler reserves on a node for your Pod. Limits are the hard ceiling the container runtime enforces. Requests drive cluster cost — they determine node sizing and autoscaling thresholds.
When requests are too high, nodes fill up with phantom reservations. A node with 4 vCPU might only run two Pods because each requests 2 vCPU — even if both are idle. The remaining capacity is wasted and you are billed for the full node. This is the most common and most expensive form of Kubernetes waste.
# Bad: generous padding that inflates node counts
resources:
  requests:
    cpu: "2000m"      # reserved regardless of actual usage
    memory: "4Gi"
  limits:
    cpu: "4000m"
    memory: "8Gi"
---
# Better: sized from actual p99 usage data (see VPA section)
resources:
  requests:
    cpu: "250m"       # p99 usage observed: 180m
    memory: "512Mi"   # p99 usage observed: 380Mi
  limits:
    cpu: "1000m"      # burst headroom for spikes
    memory: "1Gi"     # OOMKill protection
Note
The right process for sizing requests is: deploy with generous limits, collect p95/p99 CPU and memory usage over a representative traffic period (at least 7 days to capture weekly patterns), then set requests to p99 plus a 20% safety margin. The Vertical Pod Autoscaler automates this process in recommendation mode.
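The sizing rule above can be sketched in a few lines. This is an illustration of the arithmetic, not a VPA algorithm; the 20% margin comes from the text, and the 50m rounding step is an assumption for scheduler-friendly values:

```python
import math

def sized_request(p99_usage_millicores: int, margin: float = 0.20, step: int = 50) -> int:
    """CPU request = observed p99 plus a safety margin, rounded up
    to a scheduler-friendly granularity (here 50m increments)."""
    raw = p99_usage_millicores * (1 + margin)
    return math.ceil(raw / step) * step

# 180m observed p99 -> 216m with a 20% margin -> rounds up to 250m,
# matching the "cpu: 250m" request in the example above
print(sized_request(180))  # -> 250
```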
Vertical Pod Autoscaler — Automated Right-Sizing
The Vertical Pod Autoscaler (VPA) observes actual container resource consumption and recommends — or automatically applies — right-sized requests. It runs in three modes: Off (recommendations only), Initial (sets requests at Pod creation), and Auto (updates live Pods by evicting and recreating them).
Start with Off mode. Run it for two weeks, review the recommendations, and apply them manually. This gives you the cost savings of right-sizing without the operational risk of VPA evicting Pods during peak traffic.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"   # Recommendations only — safe for production
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: "50m"
          memory: "64Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "4Gi"
        controlledResources: ["cpu", "memory"]

# Read VPA recommendations after 1-2 weeks of observation
kubectl describe vpa api-server-vpa -n production
# Output excerpt:
#   Recommendation:
#     Container Recommendations:
#       Container Name:  api-server
#       Lower Bound:
#         Cpu:     40m
#         Memory:  100Mi
#       Target:
#         Cpu:     130m    ← set this as your request
#         Memory:  280Mi
#       Upper Bound:
#         Cpu:     500m
#         Memory:  600Mi
Note
Do not pair VPA's Auto mode with an HPA that scales on CPU or memory: both controllers adjust the same signal and fight each other. VPA in Initial mode combined with HPA on custom metrics (not CPU) is the standard pattern.
Horizontal Pod Autoscaler and KEDA
The Horizontal Pod Autoscaler (HPA) scales replica count based on metrics. The native implementation supports CPU and memory utilization; custom and external metrics require a metrics adapter (or KEDA, covered below). The key configuration mistake is setting minReplicas too high: a workload that sits idle off-hours with minReplicas: 3 wastes two-thirds of its compute at night. Scale it down to 1 (or 0 with KEDA).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 1        # not 3 — trust the autoscaler
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up when avg CPU > 70%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60   # remove max 1 pod per minute
    scaleUp:
      stabilizationWindowSeconds: 0     # scale up immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15   # can double replica count every 15s
For event-driven workloads (queue consumers, cron jobs, stream processors), KEDA (Kubernetes Event-Driven Autoscaling) scales Pods to zero when idle and back up when events arrive. A Kafka consumer that processes 100 messages/day does not need to run 24/7. KEDA can scale it from 0 to N based on consumer group lag, then back to 0 between batches.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0    # scale to zero when idle
  maxReplicaCount: 10
  pollingInterval: 15
  cooldownPeriod: 60    # seconds idle before scale to zero
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.internal:9092
        consumerGroup: order-processor-group
        topic: orders
        lagThreshold: "50"   # 1 replica per 50 messages of lag
        offsetResetPolicy: latest
Karpenter — Next-Generation Node Autoscaling
The traditional Cluster Autoscaler works within predefined node groups. When it needs to scale up, it picks from your existing instance type configurations. This means you either overprovision your node groups to cover all cases or end up with pending Pods waiting for a suitable node.
Karpenter takes a different approach: it looks at pending Pod requirements and provisions a right-sized instance type on demand — no node groups, no predefined instance types. A Pod requesting 7 vCPU and 28 GB RAM gets a node that fits. Karpenter also actively consolidates underutilized nodes by evicting and re-scheduling their Pods onto fewer, fuller, or cheaper nodes — a critical cost lever the Cluster Autoscaler largely lacks, since it only removes a node once its Pods already fit elsewhere and never swaps nodes for cheaper ones.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # prefer Spot, fallback to on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]   # compute, general, memory optimized
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"]   # only modern instance generations
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]   # include Graviton for cost savings
  limits:
    cpu: 1000   # cluster-wide CPU cap
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # v1 name (was WhenUnderutilized in beta APIs)
    consolidateAfter: 30s   # aggressive consolidation
Note
Consolidation works by evicting running Pods, so give every multi-replica production service a PodDisruptionBudget with minAvailable: 1 or maxUnavailable: 1. Karpenter respects PDBs and will not evict Pods that would violate them.
Spot Instances — 60–90% Savings With the Right Architecture
Spot instances (AWS) / preemptible VMs (GCP) / Spot VMs (Azure) offer 60–90% discounts over on-demand pricing with one trade-off: the cloud provider can reclaim them with a 2-minute warning. The teams that get burned by Spot are those who run stateful or singleton workloads on it. The teams that extract maximum savings use Spot for the right workload categories.
Good candidates for Spot
Stateless API replicas (behind a load balancer — if one node dies, traffic routes to others), batch jobs and ML training (can checkpoint and resume), CI/CD runners, development and staging environments, queue consumers (KEDA scales replacements immediately when a node is reclaimed).
Poor candidates for Spot
Stateful databases, single-replica critical services, long-running jobs without checkpointing, anything with a strict startup time budget. Keep these on on-demand nodes using node selectors or affinity rules.
# Label Spot nodes in your NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workers
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      taints:
        - key: "spot-instance"
          value: "true"
          effect: NoSchedule
---
# Workloads that tolerate Spot get scheduled there
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      tolerations:
        - key: "spot-instance"
          operator: "Equal"
          value: "true"
          effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values: ["spot"]   # prefer Spot, will fall back to on-demand
ResourceQuotas and LimitRanges — Preventing Namespace-Level Waste
In multi-team clusters, unbounded namespaces are a budget risk. One team deploys a misconfigured Deployment with requests.cpu: 8000m per Pod and triggers a node scale-out that costs thousands of dollars. ResourceQuotas put hard limits on total resource consumption per namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "20"       # total CPU requests across all Pods
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    count/pods: "50"         # max Pod count in namespace
    count/services: "20"
    persistentvolumeclaims: "10"
    requests.storage: "500Gi"
---
# LimitRange sets per-Pod defaults and bounds
# Pods without explicit requests get these defaults
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "4000m"
        memory: "8Gi"
      min:
        cpu: "10m"
        memory: "32Mi"
Pod Disruption Budgets — Stability Guard Rails
Cost optimization techniques like node consolidation and Spot instance replacement create voluntary disruptions. Without Pod Disruption Budgets (PDBs), Karpenter and the node upgrade controller can simultaneously evict all replicas of a service, causing a complete outage. PDBs are the safety mechanism that makes aggressive cost optimization safe.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  # Option 1: guarantee minimum available replicas
  minAvailable: 2       # at least 2 Pods always available
  # Option 2: allow max N disruptions at once
  # maxUnavailable: 1   # at most 1 Pod unavailable
  selector:
    matchLabels:
      app: api-server
---
# For single-replica services (dev/staging), allow full disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dev-service-pdb
  namespace: development
spec:
  maxUnavailable: 1   # single replica: allow full eviction
  selector:
    matchLabels:
      env: development
Cost Visibility with OpenCost and Kubecost
You cannot optimize what you cannot measure. Kubernetes cost visibility requires tooling that maps cloud spend to Kubernetes constructs — namespaces, Deployments, teams, and labels. Two options dominate:
OpenCost (open source)
CNCF sandbox project. Runs as a Prometheus exporter. Allocates cost per Pod/namespace/label in real time using cloud pricing APIs. Lightweight and free. Lacks advanced features like savings recommendations or team-level showback reports. Best for teams that want raw metrics integrated into existing Grafana dashboards.
Kubecost (commercial, free tier available)
Full cost management platform: allocation, savings insights, budget alerts, right-sizing recommendations, and team showback. The free tier covers single-cluster use with 15 days of data retention. Indispensable for multi-team clusters where you need to attribute costs to business units.
# Install OpenCost via Helm
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update
# cloudProviderApiKey: AWS/GCP/Azure key used to look up live pricing
helm install opencost opencost/opencost \
  --namespace opencost \
  --create-namespace \
  --set opencost.exporter.cloudProviderApiKey="" \
  --set opencost.ui.enabled=true

# OpenCost exposes Prometheus metrics at :9003/metrics
# Key metrics to scrape:
#   opencost_namespace_total_cost_hourly
#   opencost_deployment_total_cost_hourly
#   opencost_pod_total_cost_hourly
#   opencost_cluster_management_cost_hourly
Once you have per-namespace cost metrics, build a Grafana dashboard that shows cost per team per day, cost per request served (requires correlating with application metrics), and efficiency ratio (actual usage / reserved capacity). An efficiency ratio below 40% is a clear signal to right-size requests. Above 80% is a risk signal: you are running too close to your reserved capacity, with little headroom for spikes.
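The efficiency-ratio check can be sketched in a few lines. The thresholds are the ones from the text; the sample numbers are hypothetical, and in practice the inputs would come from OpenCost or kube-state-metrics:

```python
def efficiency_ratio(actual_usage: float, requested: float) -> float:
    """Fraction of reserved capacity actually used (usage / requests)."""
    return actual_usage / requested

def classify(ratio: float) -> str:
    # Thresholds from the text: below 40% is oversized,
    # above 80% leaves too little headroom for spikes
    if ratio < 0.40:
        return "oversized"
    if ratio > 0.80:
        return "insufficient headroom"
    return "healthy"

# e.g. a Pod using 180m CPU against a 2000m request (9% efficiency)
print(classify(efficiency_ratio(180, 2000)))  # -> oversized
```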
Advanced Scheduling — Bin-Packing and Topology Spread
The default Kubernetes scheduler uses a balanced placement strategy that spreads Pods across nodes. This is good for availability but bad for cost — it prevents nodes from filling up, which prevents Karpenter from consolidating them.
Configure the scheduler with a bin-packing profile (available in Kubernetes 1.23+) to prefer placing Pods on nodes that are already running workloads, leaving empty nodes eligible for termination:
# KubeSchedulerConfiguration for bin-packing
# Apply to your scheduler configuration (EKS: use custom scheduler profile)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated   # bin-pack: prefer already-loaded nodes
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
---
# Topology spread: spread across AZs for availability,
# but allow up to 1 skew (don't force perfect balance)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # availability over balance
          labelSelector:
            matchLabels:
              app: api-server
Eliminating Idle Workloads
Development and staging environments are frequently forgotten and left running 24/7. A staging cluster that mirrors production size runs at full cost for evenings, weekends, and holidays — often 70% of the week — while completely unused.
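The arithmetic behind that figure, assuming a 10-hour weekday business window (the exact window is an illustrative assumption):

```python
hours_per_week = 7 * 24            # 168 hours in a week
business_hours = 5 * 10            # Mon-Fri, a 10-hour working window
idle_fraction = 1 - business_hours / hours_per_week
print(f"{idle_fraction:.0%}")      # -> 70% of the week fully idle
```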
Tools like kube-downscaler automate off-hours scale-down (Goldilocks, by contrast, is a dashboard for VPA right-sizing recommendations, not a scheduler). Schedule non-production namespaces to scale to zero replicas outside business hours using kube-downscaler:
# kube-downscaler: scale dev/staging namespaces to zero overnight
# Annotate the namespace or individual Deployments
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  annotations:
    # Scale down outside 08:00-20:00 Mon-Fri Europe/Warsaw
    downscaler/uptime: "Mon-Fri 08:00-20:00 Europe/Warsaw"
    downscaler/downtime-replicas: "0"
---
# Or annotate per Deployment for finer control
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staging-api
  namespace: staging
  annotations:
    downscaler/uptime: "Mon-Fri 07:00-22:00 Europe/Warsaw"
    downscaler/downtime-replicas: "0"
    downscaler/exclude: "false"   # set "true" to exclude from downscaling
Putting It Together — A Prioritized Savings Roadmap
Not every technique delivers equal return for equal effort. Here is the practical prioritization based on impact and implementation complexity:
Week 1 — Quick wins (low risk)
Deploy OpenCost and identify your top 5 most expensive namespaces. Scale down minReplicas on non-production HPA configs. Add LimitRanges to every namespace missing them. Schedule staging to scale to zero overnight.
Expected savings: 15–25%
Week 2–4 — Right-sizing (medium effort)
Deploy VPA in Off mode across all production Deployments. After 2 weeks, apply recommendations to the 10 highest-cost workloads. Add PDBs to every production Deployment.
Expected savings: 20–35% additional
Month 2 — Architecture (higher effort)
Migrate from Cluster Autoscaler to Karpenter with consolidation enabled. Add a Spot NodePool for stateless workloads. Implement KEDA for queue consumers and batch jobs.
Expected savings: 30–50% additional
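Note that "additional" means each phase cuts the remaining bill, so the percentages compound multiplicatively rather than adding up. Taking the midpoint of each range (an illustrative assumption) lands inside the 40–70% total quoted in the introduction:

```python
# Midpoints of the three phases' ranges: 15-25%, 20-35%, 30-50%
phases = [0.20, 0.275, 0.40]
remaining = 1.0
for saving in phases:
    remaining *= 1 - saving   # each phase cuts the *remaining* bill
total_saved = 1 - remaining
print(f"{total_saved:.0%}")   # -> 65%
```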
Running Kubernetes at scale and spending more than you should?
We help engineering teams audit, right-size, and continuously optimize Kubernetes infrastructure — from resource requests and autoscaling to Karpenter, Spot instances, and full cost visibility. Let’s talk.