Kubernetes · Cost Optimization · FinOps · Cloud · DevOps · Autoscaling

Kubernetes Cost Optimization — Right-Sizing Without Risking Stability

A practical guide to reducing Kubernetes spend without sacrificing reliability: resource requests and limits, VPA, HPA, Karpenter, Spot instances, Kubecost, and namespace-level controls that prevent waste. Typical savings: 40–70% off your cluster bill.

2026-04-16

The Kubernetes Cost Problem Is Mostly Self-Inflicted

Kubernetes clusters are expensive by default — not because the platform is wasteful, but because teams routinely overprovision to avoid reliability risk. The average cloud-managed Kubernetes cluster runs at 20–40% effective CPU utilization; at the low end, that means paying for five times the compute you actually use. The rest is padding: inflated resource requests, idle replicas, and node capacity held in reserve against traffic spikes that rarely materialize.

The good news is that most Kubernetes waste is recoverable without touching application code or sacrificing stability. The techniques in this guide — right-sized resource requests, autoscaling at the pod and node level, Spot instance integration, and namespace-level guardrails — routinely reduce cluster bills by 40–70% without degrading availability. The key is applying them systematically with observability at every layer.

Resource Requests and Limits — The Foundation

Every cost optimization conversation in Kubernetes starts here. Requests are the amount of CPU and memory the scheduler reserves on a node for your Pod. Limits are the hard ceiling the container runtime enforces. Requests drive cluster cost — they determine node sizing and autoscaling thresholds.

When requests are too high, nodes fill up with phantom reservations. A node with 4 vCPU might only run two Pods because each requests 2 vCPU — even if both are idle. The remaining capacity is wasted and you are billed for the full node. This is the most common and most expensive form of Kubernetes waste.

# Bad: generous padding that inflates node counts
resources:
  requests:
    cpu: "2000m"      # reserved regardless of actual usage
    memory: "4Gi"
  limits:
    cpu: "4000m"
    memory: "8Gi"

---

# Better: sized from actual p99 usage data (see VPA section)
resources:
  requests:
    cpu: "250m"       # p99 usage observed: 180m
    memory: "512Mi"   # p99 usage observed: 380Mi
  limits:
    cpu: "1000m"      # burst headroom for spikes
    memory: "1Gi"     # OOMKill protection

Note

Never set CPU limits without understanding the throttling model. When a container exceeds its CPU limit, the Linux CFS scheduler throttles it; it does not get killed. CPU throttling is invisible in standard dashboards and causes latency spikes that are difficult to diagnose. Many teams remove CPU limits entirely and rely only on requests for scheduling. For memory, always set limits: exceeding the memory limit causes an OOMKill, which is at least visible.
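Throttling does leave a trace if Prometheus scrapes cAdvisor metrics (the default with most kube-prometheus setups). A query along these lines surfaces it; the 0.2 threshold is a rule of thumb, not a standard:

```promql
# Fraction of CPU periods in which each container was throttled (last 5 min)
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_throttled_periods_total[5m])
)
/
sum by (namespace, pod, container) (
  rate(container_cpu_cfs_periods_total[5m])
)
# Sustained values above ~0.2 usually mean the CPU limit is too tight
```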

The right process for sizing requests is: deploy with generous limits, collect p95/p99 CPU and memory usage over a representative traffic period (at least 7 days to capture weekly patterns), then set requests to p99 plus a 20% safety margin. The Vertical Pod Autoscaler automates this process in recommendation mode.
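With Prometheus in place, the p99 figures can be pulled directly. A sketch of the two queries, assuming cAdvisor metrics and a container named api-server in the production namespace (adjust the labels to your workload):

```promql
# p99 CPU usage (cores) for one container over the past 7 days
quantile_over_time(0.99,
  rate(container_cpu_usage_seconds_total{namespace="production", container="api-server"}[5m])[7d:5m]
)

# p99 working-set memory over the same window
quantile_over_time(0.99,
  container_memory_working_set_bytes{namespace="production", container="api-server"}[7d]
)
# Set requests to roughly these values plus a 20% safety margin
```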

Vertical Pod Autoscaler — Automated Right-Sizing

The Vertical Pod Autoscaler (VPA) observes actual container resource consumption and recommends — or automatically applies — right-sized requests. It runs in three modes: Off (recommendations only), Initial (sets requests at Pod creation), and Auto (updates live Pods by evicting and recreating them).

Start with Off mode. Run it for two weeks, review the recommendations, and apply them manually. This gives you the cost savings of right-sizing without the operational risk of VPA evicting Pods during peak traffic.

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"       # Recommendations only — safe for production
  resourcePolicy:
    containerPolicies:
      - containerName: api-server
        minAllowed:
          cpu: "50m"
          memory: "64Mi"
        maxAllowed:
          cpu: "2000m"
          memory: "4Gi"
        controlledResources: ["cpu", "memory"]
# Read VPA recommendations after 1-2 weeks of observation
kubectl describe vpa api-server-vpa -n production

# Output excerpt:
#   Recommendation:
#     Container Recommendations:
#       Container Name: api-server
#         Lower Bound:
#           Cpu:     40m
#           Memory:  100Mi
#         Target:
#           Cpu:     130m      ← set this as your request
#           Memory:  280Mi
#         Upper Bound:
#           Cpu:     500m
#           Memory:  600Mi

Note

VPA and HPA cannot both manage CPU resources on the same Deployment — they will fight each other. Use VPA for stateful services and batch workloads where vertical scaling is natural. Use HPA for stateless services where horizontal scaling is preferable. For services that need both, VPA in Initial mode combined with HPA on custom metrics (not CPU) is the standard pattern.
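That combined pattern can be sketched as follows. The metric name http_requests_per_second and its adapter are assumptions; any Pods-type metric served by your metrics adapter works:

```yaml
# VPA sizes requests at Pod creation; HPA scales replicas on a custom
# metric, so the two never contend over CPU.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Initial"     # apply recommendations only at Pod creation
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # served by a metrics adapter (assumed)
        target:
          type: AverageValue
          averageValue: "100"
```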

Horizontal Pod Autoscaler and KEDA

The Horizontal Pod Autoscaler (HPA) scales replica count based on metrics. The native autoscaling/v2 implementation supports CPU and memory utilization out of the box; custom and external metrics require a metrics adapter. The key configuration mistake is setting minReplicas too high. Idle off-hours workloads with minReplicas: 3 waste two-thirds of their compute at night. Scale them down to 1 (or 0 with KEDA).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 1          # not 3 — trust the autoscaler
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale up when avg CPU > 70%
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # wait 5 min before scaling down
      policies:
        - type: Pods
          value: 1
          periodSeconds: 60             # remove max 1 pod per minute
    scaleUp:
      stabilizationWindowSeconds: 0    # scale up immediately
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15            # can double replica count every 15s

For event-driven workloads (queue consumers, cron jobs, stream processors), KEDA (Kubernetes Event-Driven Autoscaling) scales Pods to zero when idle and back up when events arrive. A Kafka consumer that processes 100 messages/day does not need to run 24/7. KEDA can scale it from 0 to N based on consumer group lag, then back to 0 between batches.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-consumer-scaler
  namespace: production
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 0          # scale to zero when idle
  maxReplicaCount: 10
  pollingInterval: 15
  cooldownPeriod: 60          # seconds idle before scale to zero
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.internal:9092
        consumerGroup: order-processor-group
        topic: orders
        lagThreshold: "50"    # 1 replica per 50 messages of lag
        offsetResetPolicy: latest

Karpenter — Next-Generation Node Autoscaling

The traditional Cluster Autoscaler works within predefined node groups. When it needs to scale up, it picks from your existing instance type configurations. This means you either overprovision your node groups to cover all cases or end up with pending Pods waiting for a suitable node.

Karpenter takes a different approach: it looks at pending Pod requirements and provisions the exact right instance type on demand — no node groups, no predefined instance types. A Pod requesting 7 vCPU and 28 GB RAM gets a node that fits. Karpenter also consolidates underutilized nodes by evicting and re-scheduling their Pods onto fewer, fuller nodes — a critical cost lever that the Cluster Autoscaler does not perform.

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # prefer Spot, fallback to on-demand
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]         # compute, general, memory optimized
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["3"]                   # only modern instance generations
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]      # include Graviton for cost savings
  limits:
    cpu: 1000                             # cluster-wide CPU cap
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized   # v1 name (formerly WhenUnderutilized)
    consolidateAfter: 30s               # aggressive consolidation

Note

Karpenter's node consolidation is its most powerful cost feature but requires well-configured Pod Disruption Budgets to avoid outages. Before enabling consolidation, ensure every production Deployment has a PDB with minAvailable: 1 or maxUnavailable: 1. Karpenter respects PDBs and will not evict Pods that would violate them.

Spot Instances — 60–90% Savings With the Right Architecture

Spot instances (AWS) / preemptible VMs (GCP) / Spot VMs (Azure) offer 60–90% discounts over on-demand pricing with one trade-off: the cloud provider can reclaim them with a 2-minute warning. The teams that get burned by Spot are those who run stateful or singleton workloads on it. The teams that extract maximum savings use Spot for the right workload categories.
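Workloads destined for Spot should drain cleanly inside that reclaim window. A minimal sketch, assuming an HTTP service behind a load balancer (the image name is a placeholder): the preStop sleep gives the load balancer time to deregister the Pod before SIGTERM reaches the application, and the grace period stays comfortably under the 2-minute warning.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 90   # well under the 120s Spot warning
      containers:
        - name: api-server
          image: registry.example.com/api-server:latest   # placeholder
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "10"]   # let the LB deregister first
```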

Good candidates for Spot

Stateless API replicas (behind a load balancer — if one node dies, traffic routes to others), batch jobs and ML training (can checkpoint and resume), CI/CD runners, development and staging environments, queue consumers (KEDA scales replacements immediately when a node is reclaimed).

Poor candidates for Spot

Stateful databases, single-replica critical services, long-running jobs without checkpointing, anything with a strict startup time budget. Keep these on on-demand nodes using node selectors or affinity rules.

# Label Spot nodes in your NodePool
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-workers
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
      taints:
        - key: "spot-instance"
          value: "true"
          effect: NoSchedule

---
# Workloads that tolerate Spot get scheduled there
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      tolerations:
        - key: "spot-instance"
          operator: "Equal"
          value: "true"
          effect: NoSchedule
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 80
              preference:
                matchExpressions:
                  - key: karpenter.sh/capacity-type
                    operator: In
                    values: ["spot"]  # prefer Spot, will fall back to on-demand
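Conversely, workloads from the "poor candidates" list can be pinned to on-demand capacity. With Karpenter the capacity-type label makes this a one-line nodeSelector; the StatefulSet below is an illustrative sketch, not a complete Postgres deployment:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: on-demand   # never schedule on Spot
      containers:
        - name: postgres
          image: postgres:16
```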

ResourceQuotas and LimitRanges — Preventing Namespace-Level Waste

In multi-team clusters, unbounded namespaces are a budget risk. One team deploys a misconfigured Deployment with requests.cpu: 8000m per Pod and triggers a node scale-out that costs thousands of dollars. ResourceQuotas put hard limits on total resource consumption per namespace.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-payments
spec:
  hard:
    requests.cpu: "20"          # total CPU requests across all Pods
    requests.memory: "40Gi"
    limits.cpu: "40"
    limits.memory: "80Gi"
    count/pods: "50"            # max Pod count in namespace
    count/services: "20"
    persistentvolumeclaims: "10"
    requests.storage: "500Gi"

---
# LimitRange sets per-Pod defaults and bounds
# Pods without explicit requests get these defaults
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-payments
spec:
  limits:
    - type: Container
      default:
        cpu: "500m"
        memory: "512Mi"
      defaultRequest:
        cpu: "100m"
        memory: "128Mi"
      max:
        cpu: "4000m"
        memory: "8Gi"
      min:
        cpu: "10m"
        memory: "32Mi"

Note

LimitRanges solve a subtle cost problem: Pods without explicit resource requests are scheduled with zero requests, which makes them invisible to the scheduler's bin-packing and to Karpenter. They can land on overloaded nodes and trigger OOMs, or silently balloon resource consumption. Always deploy a LimitRange in every namespace to enforce sane defaults.
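Before rolling out LimitRanges, it helps to know which containers are currently running requestless. One way to audit this, assuming jq is installed:

```shell
# List every container running without a CPU request
kubectl get pods --all-namespaces -o json | jq -r '
  .items[]
  | .metadata.namespace as $ns | .metadata.name as $pod
  | .spec.containers[]
  | select(.resources.requests.cpu == null)
  | "\($ns)/\($pod)/\(.name)"'
```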

Pod Disruption Budgets — Stability Guard Rails

Cost optimization techniques like node consolidation and Spot instance replacement create voluntary disruptions. Without Pod Disruption Budgets (PDBs), Karpenter and the node upgrade controller can simultaneously evict all replicas of a service, causing a complete outage. PDBs are the safety mechanism that makes aggressive cost optimization safe.

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
  namespace: production
spec:
  # Option 1: guarantee minimum available replicas
  minAvailable: 2           # at least 2 Pods always available

  # Option 2: allow max N disruptions at once
  # maxUnavailable: 1       # at most 1 Pod unavailable

  selector:
    matchLabels:
      app: api-server

---
# For single-replica services (dev/staging), allow full disruption
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dev-service-pdb
  namespace: development
spec:
  maxUnavailable: 1         # single replica: allow full eviction
  selector:
    matchLabels:
      env: development

Cost Visibility with OpenCost and Kubecost

You cannot optimize what you cannot measure. Kubernetes cost visibility requires tooling that maps cloud spend to Kubernetes constructs — namespaces, Deployments, teams, and labels. Two options dominate:

OpenCost (open source)

CNCF sandbox project. Runs as a Prometheus exporter. Allocates cost per Pod/namespace/label in real time using cloud pricing APIs. Lightweight and free. Lacks advanced features like savings recommendations or team-level showback reports. Best for teams that want raw metrics integrated into existing Grafana dashboards.

Kubecost (commercial, free tier available)

Full cost management platform: allocation, savings insights, budget alerts, right-sizing recommendations, and team showback. The free tier covers single-cluster use with 15 days of data retention. Indispensable for multi-team clusters where you need to attribute costs to business units.

# Install OpenCost via Helm
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm repo update

# cloudProviderApiKey: AWS/GCP/Azure key OpenCost uses to fetch live pricing
helm install opencost opencost/opencost \
  --namespace opencost \
  --create-namespace \
  --set opencost.exporter.cloudProviderApiKey="" \
  --set opencost.ui.enabled=true

# OpenCost exposes Prometheus metrics at :9003/metrics
# Key metrics to scrape:
#   opencost_namespace_total_cost_hourly
#   opencost_deployment_total_cost_hourly
#   opencost_pod_total_cost_hourly
#   opencost_cluster_management_cost_hourly

Once you have per-namespace cost metrics, build a Grafana dashboard that shows cost per team per day, cost per request served (requires correlating with application metrics), and efficiency ratio (actual usage / reserved capacity). An efficiency ratio below 40% is a clear signal to right-size requests. Above 80% is a risk signal — you are running too close to limits.
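The efficiency ratio itself needs no extra tooling if kube-state-metrics and cAdvisor metrics are already scraped by Prometheus. A cluster-wide CPU version, as a sketch:

```promql
# Actual CPU usage divided by total CPU requests, cluster-wide
sum(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
/
sum(kube_pod_container_resource_requests{resource="cpu"})
# Target 0.6-0.7; below 0.4 means requests are oversized
```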

Advanced Scheduling — Bin-Packing and Topology Spread

The default Kubernetes scheduler uses a balanced placement strategy that spreads Pods across nodes. This is good for availability but bad for cost — it prevents nodes from filling up, which prevents Karpenter from consolidating them.

Configure the scheduler with a bin-packing profile (the MostAllocated scoring strategy; the v1 KubeSchedulerConfiguration API shown below requires Kubernetes 1.25 or newer) to prefer placing Pods on nodes that are already running workloads, leaving empty nodes eligible for termination:

# KubeSchedulerConfiguration for bin-packing
# Apply to your scheduler configuration (EKS: use custom scheduler profile)
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated    # bin-pack: prefer already-loaded nodes
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1

---
# Topology spread: spread across AZs for availability,
# but allow up to 1 skew (don't force perfect balance)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway   # soft constraint: never block scheduling
          labelSelector:
            matchLabels:
              app: api-server

Eliminating Idle Workloads

Development and staging environments are frequently forgotten and left running 24/7. A staging cluster that mirrors production size runs at full cost for evenings, weekends, and holidays — often 70% of the week — while completely unused.

Tools like kube-downscaler automate off-hours scale-down; Goldilocks complements it by surfacing VPA right-sizing recommendations. Schedule non-production namespaces to scale to zero replicas outside business hours using kube-downscaler:

# kube-downscaler: scale dev/staging namespaces to zero overnight
# Annotate the namespace or individual Deployments
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  annotations:
    # Scale down outside 08:00-20:00 Mon-Fri Europe/Warsaw
    downscaler/uptime: "Mon-Fri 08:00-20:00 Europe/Warsaw"
    downscaler/downtime-replicas: "0"

---
# Or annotate per Deployment for finer control
apiVersion: apps/v1
kind: Deployment
metadata:
  name: staging-api
  namespace: staging
  annotations:
    downscaler/uptime: "Mon-Fri 07:00-22:00 Europe/Warsaw"
    downscaler/downtime-replicas: "0"
    downscaler/exclude: "false"   # set "true" to exclude from downscaling

Putting It Together — A Prioritized Savings Roadmap

Not every technique delivers equal return for equal effort. Here is the practical prioritization based on impact and implementation complexity:

Week 1 — Quick wins (low risk)

Deploy OpenCost and identify your top 5 most expensive namespaces. Scale down minReplicas on non-production HPA configs. Add LimitRanges to every namespace missing them. Schedule staging to scale to zero overnight.

Expected savings: 15–25%

Week 2–4 — Right-sizing (medium effort)

Deploy VPA in Off mode across all production Deployments. After 2 weeks, apply recommendations to the 10 highest-cost workloads. Add PDBs to every production Deployment.

Expected savings: 20–35% additional

Month 2 — Architecture (higher effort)

Migrate Cluster Autoscaler to Karpenter with consolidation enabled. Add a Spot NodePool for stateless workloads. Implement KEDA for queue consumers and batch jobs.

Expected savings: 30–50% additional

Note

Track your efficiency ratio (actual CPU/memory usage ÷ total requests) before and after each phase. A cluster-wide efficiency ratio moving from 25% to 55% represents real cost reduction, not just configuration changes. Set a target of 60–70% efficiency — above 80% starts to risk reliability under traffic spikes.

Running Kubernetes at scale and spending more than you should?

We help engineering teams audit, right-size, and continuously optimize Kubernetes infrastructure — from resource requests and autoscaling to Karpenter, Spot instances, and full cost visibility. Let’s talk.

