
I found a ticking time bomb in my Kubernetes cluster this week.
Not over-provisioned. Not under-provisioned. Just... missing.
I ran Goldilocks on my Kubernetes practice cluster expecting to find the usual CPU waste.
What I found was worse:
3 out of 4 system pods have incomplete or missing memory configuration.
kube-proxy: No memory config at all
aws-node: No memory config at all
CoreDNS: Has request (70Mi), no limit
metrics-server: Properly configured
Both kube-proxy and aws-node route ALL traffic in my cluster.
Both are one memory spike away from an OOM kill.
Both are capable of taking down the entire platform.
And this is the DEFAULT configuration.
HOW I FOUND THIS
I installed Goldilocks last week. Took 5 minutes.
I expected the usual problem: CPU over-provisioning. System pods requesting 100m but using 25m. The 4x waste that exists in every cluster.
Wasteful, but at least safe.
What I found: Missing memory limits everywhere.
Here’s the exact command:
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].resources.requests.memory == null) | .metadata.name'
Output:
kube-proxy-bv6cr
kube-proxy-l2lkc
Two critical system components with NO memory configuration at all.
CoreDNS wasn’t in this list. Why?
Because it HAS a memory request (70Mi) set by default.
But check the limits:
kubectl get pods -n kube-system -o custom-columns=\
NAME:.metadata.name,\
QOS:.status.qosClass,\
MEM-REQ:.spec.containers[0].resources.requests.memory,\
MEM-LIM:.spec.containers[0].resources.limits.memory
Output:
coredns-xxxxx: Burstable, 70Mi, <none>
kube-proxy-xxxxx: BestEffort, <none>, <none>
metrics-server-xxxxx: Guaranteed, 200Mi, 200Mi
The full breakdown:
→ 2 pods with NO memory config (BestEffort QoS)
→ 1 pod with partial config (Burstable QoS)
→ 1 pod properly configured (Guaranteed QoS)
3 out of 4 system pods are vulnerable.
This isn't a misconfiguration.
This is how Kubernetes ships.
WHAT HAPPENS WHEN MEMORY SPIKES
Let me walk through what happens when these missing configs meet a real workload problem.
This is based on how the Linux OOM Killer actually works - the mechanics are real, even if I'm using a hypothetical scenario to illustrate them.
Scenario: You deploy a workload without memory limits.
Could be anything:
→ Batch job that gets more data than expected
→ An app with a memory leak
→ A background task that spikes every Saturday morning
Normally it uses 2GB. Under load? 8GB.
Your node is running on limited resources. Let’s say 8GB RAM total.
That workload climbs: 500MB → 1GB → 2GB → 4GB → 8GB
No limit = nothing stops it.
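If you want to watch this happen, you can reproduce it on a throwaway node. This is a minimal sketch using the polinux/stress image from the Kubernetes docs; the pod name and the 2G figure are just illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: memory-hog          # hypothetical name, demo only
spec:
  containers:
  - name: stress
    image: polinux/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "2G", "--vm-hang", "1"]   # allocate ~2G and hold it
    # no resources block at all = BestEffort QoS = nothing stops the growth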
Node hits Memory Pressure.
Linux OOM Killer activates.
Looks for victims.
Here’s what it evaluates:
Your workload:
→ PriorityClass: None (priority 0)
→ QoS: BestEffort (no requests/limits)
→ Memory usage: 8GB

CoreDNS:
→ PriorityClass: system-cluster-critical (priority 2000000000)
→ QoS: Burstable (request: 70Mi, no limit)
→ Memory usage: 150Mi (spiked from 70Mi under DNS load)

kube-proxy:
→ PriorityClass: system-node-critical (priority 2000001000)
→ QoS: BestEffort (no config)
→ Memory usage: 80Mi
The OOM Killer decision process is different from Kubelet Eviction.
When a node runs low on memory, the Kubelet tries to stay in control by Evicting pods. It uses your PriorityClass to decide who to kick off. In this case, your priority 0 workload would be evicted first. This is the “clean” way to fail.
But when memory spikes too fast for the Kubelet to react, the Linux Kernel OOM Killer takes over.
The Kernel doesn’t know about your PriorityClass. It only sees the QoS Class (via oom_score_adj).
The Kernel's ranking, roughly:
→ BestEffort killed first
→ Burstable second
→ Guaranteed last
→ Then memory usage relative to request: who's exceeding theirs the most?
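You can see these scores directly on a node. This is a sketch that assumes you have SSH or a privileged debug pod on the node, using kube-proxy as the example target:

# The Kubelet writes oom_score_adj per container based on QoS:
# Guaranteed ≈ -997, BestEffort = 1000, Burstable somewhere in between (based on its request).
PID=$(pgrep -o kube-proxy)
cat /proc/$PID/oom_score_adj   # expect 1000 for a BestEffort kube-proxy
cat /proc/$PID/oom_score       # the live score the OOM Killer actually ranks by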
By priority, your workload should die first.
And most of the time, it will.
But here’s the problem:
CoreDNS has NO memory limit.
With only a request (70Mi) and no limit, it can consume unlimited memory.
Under heavy DNS load (like when apps are failing and retrying), CoreDNS can spike to 150Mi, 200Mi, 300Mi.
That’s 2x, 3x, 4x its request.
If the node hits a hard memory wall before the Kubelet can run its eviction loop, the Linux Kernel looks at kube-proxy (BestEffort) and CoreDNS (Burstable and 4x over its request) and sees them as prime targets.
What happens next:
OOM Killer picks CoreDNS (or kube-proxy, or both).
DNS resolution fails.
Services can’t be resolved.
Apps can’t find backends.
Everything breaks.
Someone gets paged.
The fix that would have prevented this:
For CoreDNS (complete the partial config):
resources:
  requests:
    memory: 70Mi
  limits:
    memory: 170Mi   # 2-3x for DNS spikes
For kube-proxy (add everything):
resources:
  requests:
    memory: 250Mi
  limits:
    memory: 250Mi
For your workload (proper configuration):
resources:
  requests:
    memory: 2Gi
  limits:
    memory: 4Gi

priorityClassName: low-priority-batch   # set at the pod spec level, not on the container
Total: 12 lines of YAML
Cost: $0
Time to implement: 15 minutes
But most teams don’t do this.
Because “it’s working fine.”
Until it isn’t.
THE PROTECTION YOU ACTUALLY NEED
Everyone tells you: Set requests = limits for Guaranteed QoS.
That’s good advice for your applications.
But it’s incomplete.
You need TWO things:
Proper QoS Class (memory request + limit) - Protects against Kernel OOM Kills
Proper PriorityClass (explicit priority hierarchy) - Protects against Kubelet Eviction
Let me explain both.
PART 1: QoS Class (Memory Configuration)
QoS = Quality of Service
Kubernetes assigns every pod one of three QoS classes:
BestEffort:
→ No requests or limits set
→ Killed FIRST by the Kernel OOM Killer
→ This is kube-proxy by default

Burstable:
→ Request set, limit missing OR request < limit
→ Killed SECOND by the Kernel OOM Killer
→ This is CoreDNS by default

Guaranteed:
→ Request = limit (for all resources)
→ Killed LAST by the Kernel OOM Killer
→ This is what you want for critical pods
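You don't have to work the class out by hand; Kubernetes computes it and stores it in the pod status:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}'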
The fix for different scenarios:
For pods with NO config (kube-proxy):
resources:
  requests:
    memory: 250Mi
    cpu: 100m
  limits:
    memory: 250Mi
    cpu: 100m
For pods with partial config (CoreDNS):
Current default:
resources:
  requests:
    memory: 70Mi
    cpu: 100m
Add limits:
resources:
  requests:
    memory: 70Mi
    cpu: 100m
  limits:
    memory: 170Mi   # Allow 2-3x for DNS spikes
    cpu: 100m
The kube-proxy config (request = limit for both memory and CPU) becomes Guaranteed. The CoreDNS config stays Burstable because the request is below the limit, but it now has a hard ceiling instead of unbounded growth.
But QoS alone isn’t enough.
PART 2: PriorityClass (Explicit Priority Hierarchy)
The Kubelet checks PriorityClass BEFORE deciding which pods to evict when the node is under pressure.
Check your system pods:
kubectl get pods -n kube-system -o custom-columns=\
NAME:.metadata.name,\
PRIORITY:.spec.priorityClassName
You’ll see:
coredns-xxxxx: system-cluster-critical
kube-proxy-xxxxx: system-node-critical
These have Kubernetes system priority classes.
Check your app pods:
kubectl get pods -n <your-namespace> -o custom-columns=\
NAME:.metadata.name,\
PRIORITY:.spec.priorityClassName
Probably shows <none> for every pod.
No PriorityClass = priority 0 (lowest).
If the Kubelet has time to act, it will evict based on these numbers. But remember: if you have a priority 0 pod with Guaranteed QoS and a priority 2 billion pod with BestEffort QoS, and the memory spikes instantly, the Kernel might still kill the high-priority system pod first because it only cares about QoS.
The complete protection requires BOTH.
Create PriorityClasses for your applications:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-app
value: 1000000
globalDefault: false
description: "High priority for critical application pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority-app
value: 100000
globalDefault: false
description: "Standard priority for application pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-batch
value: 10000
globalDefault: false
description: "Low priority for batch jobs"
Priority reference:
→ system-cluster-critical: 2000000000
→ system-node-critical: 2000001000
→ Your critical apps: 1000000
→ Your standard apps: 100000
→ Your batch jobs: 10000
→ Default (no class): 0
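Before creating new ones, check what already exists in your cluster (the two system classes ship by default; anything else came from you or an addon):

kubectl get priorityclass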
Don’t set your apps to system-critical (bad practice).
Set them BELOW system pods but ABOVE default.
Then apply to your pods:
Critical apps:
spec:
  priorityClassName: high-priority-app
  containers:
  - name: app
    resources:
      requests:
        memory: 500Mi
      limits:
        memory: 500Mi
Batch jobs:
spec:
  priorityClassName: low-priority-batch
  containers:
  - name: batch
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 4Gi
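Once the rollout finishes, admission resolves the class name into a numeric value on each pod, which you can verify directly (pod name is a placeholder):

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.priority}'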
Now the complete OOM/Eviction hierarchy:
Batch jobs (low priority) ← Evicted/Killed first
Standard apps (medium priority)
Critical apps (high priority)
System cluster-critical pods

System node-critical pods ← Killed last
Within each priority level:
→ BestEffort killed before Burstable
→ Burstable killed before Guaranteed
This is complete protection:
✅ Memory limits (QoS)
✅ Priority hierarchy (PriorityClass)
✅ System pods protected
✅ Critical apps protected
✅ Batch jobs sacrificed first
HOW TO CHECK YOUR CLUSTER RIGHT NOW
Step 1: Find pods with no memory config
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].resources.requests.memory == null) | .metadata.name'
These are BestEffort QoS (most vulnerable).
Step 2: Check the complete picture
kubectl get pods --all-namespaces -o custom-columns=\
NAME:.metadata.name,\
NAMESPACE:.metadata.namespace,\
PRIORITY:.spec.priorityClassName,\
QOS:.status.qosClass,\
MEM-REQ:.spec.containers[0].resources.requests.memory,\
MEM-LIM:.spec.containers[0].resources.limits.memory
Look for:
❌ BestEffort + no priority = most vulnerable
❌ Burstable + no priority = vulnerable
❌ System pods with Burstable = incomplete config
✅ Guaranteed + proper priority = protected
Step 3: Fix system pods first
Priority order:
System pods with incomplete config → CoreDNS (add limit)
System pods with no config → kube-proxy (add request + limit)
Critical application pods → Add memory + priority
Standard application pods
Batch jobs
Start with pods where failure = platform failure:
→ CoreDNS (DNS for entire cluster)
→ kube-proxy (network routing)
→ Your revenue-generating apps
Step 4: Find the right values
Option A: Use Goldilocks (recommended)
Wait 24-48 hours for data collection.
Check dashboard for recommendations based on actual usage.
Option B: Manual calculation
kubectl top pods -n <namespace>
Take current usage.
Add 30-50% buffer.
Set request = limit for Guaranteed QoS.
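A quick worked example with made-up numbers: say kubectl top shows the pod sitting around 180Mi.

180Mi current usage
180Mi × 1.4 (≈40% buffer) ≈ 250Mi
→ set requests.memory = limits.memory = 250Mi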
Starting points:
→ CoreDNS: 70Mi request → 170Mi limit
→ kube-proxy: 250Mi (both)
→ Your apps: Measure actual usage + buffer
Step 5: Apply the fixes
For CoreDNS (add limit to existing config):
kubectl edit deployment coredns -n kube-system
Change:
resources:
  requests:
    memory: 70Mi
To:
resources:
  requests:
    memory: 70Mi
  limits:
    memory: 170Mi
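If you'd rather not go through kubectl edit, a strategic merge patch does the same thing in one command. This is a sketch that assumes the container is named coredns (the default name in the stock deployment):

kubectl -n kube-system patch deployment coredns \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"coredns","resources":{"limits":{"memory":"170Mi"}}}]}}}}'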
For kube-proxy (add everything):
kubectl edit daemonset kube-proxy -n kube-system
Add:
resources:
  requests:
    memory: 250Mi
    cpu: 100m
  limits:
    memory: 250Mi
    cpu: 100m
For your apps (add memory + priority):
Create PriorityClasses first (see templates above).
Then update deployments:
spec:
  priorityClassName: high-priority-app
  containers:
  - name: app
    resources:
      requests:
        memory: 500Mi
      limits:
        memory: 500Mi
Step 6: Monitor for a week
Check for:
→ OOM kills:
kubectl get events --all-namespaces | grep -i oom
→ Pod restarts:
kubectl get pods --all-namespaces
→ Memory pressure:
kubectl describe nodes | grep -i pressure
→ QoS verification:
kubectl get pods -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
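One caveat: events catch most of it, but kernel-level kills also land in the node's kernel log. If you have node access (SSH or a debug pod), a cross-check along these lines is worth adding:

journalctl -k --since "24 hours ago" | grep -i "out of memory"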
If a pod hits its limit and gets OOM killed:
→ Increase limit by 25-50%
→ OR investigate for memory leak
Don’t just raise limits without investigating why.
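A quick way to confirm it really was an OOM kill (and not a crash for some other reason) before touching the limit:

kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints OOMKilled if the previous container instance was killed for memory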
THE REALITY
I found this in MY OWN practice cluster.
Not a client.
Not a consulting engagement.
My own infrastructure that I use to explore Goldilocks and K8s cost optimization.
And it has the same problems everyone's clusters have:
→ System pods ship incomplete (CoreDNS with request, no limit)
→ System pods ship empty (kube-proxy with nothing)
→ Nobody completes the configuration
→ Everything works fine
Until memory pressure hits.
The fix takes 30-60 minutes:
System pods: 15-20 minutes (add limits)
PriorityClasses: 10 minutes (create hierarchy)
Application pods: 20-30 minutes (memory + priority)
One Saturday morning of work.
But most teams don’t do it.
Because there's no incident yet.
Because it's "working fine."
Because nobody's checking.
I’m checking.
And I’m sharing what I find.
Because 3 out of 4 system pods being vulnerable to OOM kill is not “working fine.”
It’s a time bomb.
Check your cluster.
Complete the configs.
Set the priorities.
Don’t wait for the 3am page.
See you next Tuesday.
- Naveen
P.S. - If this was useful, forward it to your DevOps team. They’ll thank you when they’re NOT explaining to management why CoreDNS died despite having system-cluster-critical priority, because nobody ever added the memory limit that Kubernetes doesn’t set by default.
