
Kubernetes Cost Optimization: Saving 60% Without Sacrificing Reliability

By Ilir Ivezaj · 8 min read

Kubernetes makes it easy to deploy anything — and easy to overspend on everything. After optimizing clusters across three cloud providers, I've developed a systematic approach that consistently saves 50-60% without sacrificing reliability. Here's the playbook.

VPA: Run It for Two Weeks Before Trusting It

The Vertical Pod Autoscaler analyzes resource usage and recommends CPU/memory requests. The catch: its initial recommendations are always wrong. VPA needs to observe traffic patterns across business hours, weekends, and month-end spikes before its recommendations are trustworthy.

Run VPA in Off mode (recommendation only) for at least two weeks. Review the recommendations, compare against your Prometheus data, then apply selectively:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: my-app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  updatePolicy:
    updateMode: "Off"  # Recommend only, don't apply

Spot Nodes: 60-80% Savings with a Safety Net

Spot (preemptible) nodes cost 60-80% less than on-demand. The trade-off: they can be reclaimed with as little as 30 seconds' notice. For stateless workloads with proper graceful shutdown, this is free money.

The key configuration: set terminationGracePeriodSeconds: 60 and handle SIGTERM in your application. When a spot node is reclaimed, Kubernetes sends SIGTERM, your app finishes in-flight requests, and pods reschedule to available nodes.
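A minimal sketch of the relevant Deployment fragment (the workload name and image are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                     # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 60  # SIGKILL only arrives after this window
      containers:
      - name: my-app
        image: my-app:latest           # must catch SIGTERM and drain in-flight requests
```

The grace period only helps if the application actually handles SIGTERM; a process that ignores it gets killed at the 60-second mark regardless.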

Keep a small on-demand node pool (2-3 nodes) for system workloads (CoreDNS, monitoring, ingress controllers) that must never be interrupted. Everything else goes on spot.
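One common way to enforce this split is to taint the spot pool and pin system workloads to on-demand by node label. The label and taint names below are illustrative; managed providers use their own (e.g. AKS labels spot nodes with kubernetes.azure.com/scalesetpriority):

```yaml
# System workloads (CoreDNS, monitoring, ingress): pin to the on-demand pool.
spec:
  nodeSelector:
    node-pool: on-demand
---
# Stateless app workloads: opt in to the tainted spot pool.
spec:
  nodeSelector:
    node-pool: spot
  tolerations:
  - key: node-pool
    operator: Equal
    value: spot
    effect: NoSchedule
```

The taint means nothing lands on spot by accident; workloads must explicitly tolerate interruption.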

Namespace Resource Quotas: Stop the Noisy Neighbors

Without resource quotas, one team's runaway pod eats the entire cluster. I've seen a single data processing job consume 90% of cluster memory because nobody set limits.

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"

Set quotas per namespace. Teams learn to right-size their workloads when they have a finite budget. This alone reduced our cluster-wide resource consumption by 30%.
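One caveat: once a quota covers requests and limits, the API server rejects any pod that doesn't declare them. Pair each quota with a LimitRange that injects defaults for containers that omit them (values here are illustrative):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: team-a-defaults
  namespace: team-a
spec:
  limits:
  - type: Container
    defaultRequest:      # applied when a container omits resource requests
      cpu: 100m
      memory: 128Mi
    default:             # applied when a container omits resource limits
      cpu: 500m
      memory: 512Mi
```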

HPA: Stop Using CPU Metrics

CPU-based Horizontal Pod Autoscaling is almost always wrong. CPU is a lagging indicator — by the time CPU spikes, your users are already experiencing latency. The HPA scales up, but the new pods take 30-60 seconds to start, and the damage is done.

Use custom metrics instead: request rate (from Prometheus), queue depth (from Redis/RabbitMQ), or response latency (p95). These are leading indicators that predict load before CPU becomes the bottleneck.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"  # Scale when >100 rps per pod
kube-downscaler: Zero Cost Nights and Weekends

Dev and staging environments sit idle nights and weekends — that's 65% of the week at full cost for zero value. kube-downscaler scales deployments to zero replicas on a schedule.

Annotate namespaces with downtime schedules. Monday through Friday, 7am to 7pm, your dev environments are up. The rest of the time, they're scaled to zero. For a 20-node dev cluster at $0.10/hr/node, that's roughly $900/month in savings.
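With kube-downscaler installed, the schedule is just a namespace annotation. The syntax below follows the project's uptime annotation format; the namespace name and timezone are illustrative:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: dev                # hypothetical dev namespace
  annotations:
    downscaler/uptime: "Mon-Fri 07:00-19:00 America/Detroit"
```

Everything in the namespace scales to zero outside the declared uptime window and back up when it reopens.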

Right-Sizing: kubectl top Lies to You

kubectl top pods shows current usage, not peak. If you right-size based on current metrics, your pods will OOMKill during the next traffic spike.

Use Prometheus to query the 95th percentile of container_memory_working_set_bytes over 7 days. Set memory requests to this value plus a 20% buffer. For CPU, use the 95th percentile of container_cpu_usage_seconds_total rate. This accounts for traffic patterns while avoiding massive over-provisioning.
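The queries above can be sketched in PromQL roughly as follows (label matchers are illustrative; adjust them to your own metric labels):

```promql
# Memory: p95 of the working set over 7 days
quantile_over_time(0.95,
  container_memory_working_set_bytes{container="my-app"}[7d])

# CPU: p95 of the per-second usage rate over 7 days (subquery syntax)
quantile_over_time(0.95,
  rate(container_cpu_usage_seconds_total{container="my-app"}[5m])[7d:5m])
```

Multiply the memory result by 1.2 for the 20% buffer and use that as the request.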

About Ilir Ivezaj

Ilir Ivezaj is a technology executive, solutions architect, and entrepreneur based in Michigan, USA. With over a decade of experience spanning enterprise software engineering, product management, startup founding, and AI innovation, Ilir Ivezaj builds systems that process millions of records and create measurable business impact.

His technology expertise spans 100+ tools including .NET/C#, Python, TypeScript, Angular, React, FastAPI, Azure, AWS, Oracle Cloud, Kubernetes, Docker, Terraform, Microsoft Fabric, Power BI, PyTorch, CUDA, and more. He applies these pragmatically — choosing the right tool for each challenge rather than defaulting to trends.

Ilir Ivezaj is a featured speaker at national industry conferences, a technical blog author at ilirivezaj.com/blog, and founder of Albahub, a workflow automation platform. Connect on LinkedIn or get in touch.
