
I found a ticking time bomb in my Kubernetes cluster this week.
Not over-provisioned. Not under-provisioned. Just... missing.
I ran Goldilocks on my Kubernetes practice cluster expecting to find the usual CPU waste.
What I found was worse:
3 out of 4 system pods have incomplete or missing memory configuration.
kube-proxy: No memory config at all
aws-node: No memory config at all
CoreDNS: Has request (70Mi), no limit
metrics-server: Properly configured
Both kube-proxy and aws-node route ALL traffic in my cluster.
Both are one memory spike away from an OOM kill.
Both are capable of taking down the entire platform.
And this is the DEFAULT configuration.
HOW I FOUND THIS
I installed Goldilocks last week. Took 5 minutes.
I expected the usual problem: CPU over-provisioning. System pods requesting 100m but using 25m. The 4x waste that exists in every cluster.
Wasteful, but at least safe.
What I found: Missing memory limits everywhere.
Here’s the exact command:
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].resources.requests.memory == null) | .metadata.name'
Output:
kube-proxy-bv6cr
kube-proxy-l2lkc
Two critical system components with NO memory configuration at all.
CoreDNS wasn’t in this list. Why?
Because it HAS a memory request (70Mi) set by default.
But check the limits:
kubectl get pods -n kube-system -o custom-columns=\
NAME:.metadata.name,\
QOS:.status.qosClass,\
MEM-REQ:.spec.containers[0].resources.requests.memory,\
MEM-LIM:.spec.containers[0].resources.limits.memory
Output:
coredns-xxxxx: Burstable, 70Mi, <none>
kube-proxy-xxxxx: BestEffort, <none>, <none>
metrics-server-xxxxx: Guaranteed, 200Mi, 200Mi
The full breakdown:
→ 2 pods with NO memory config (BestEffort QoS)
→ 1 pod with partial config (Burstable QoS)
→ 1 pod properly configured (Guaranteed QoS)
3 out of 4 system pods are vulnerable.
This isn't a misconfiguration.
This is how Kubernetes ships.
WHAT HAPPENS WHEN MEMORY SPIKES
Let me walk through what happens when these missing configs meet a real workload problem.
This is based on how the Linux OOM Killer actually works - the mechanics are real, even if I'm using a hypothetical scenario to illustrate them.
Scenario: You deploy a workload without memory limits.
Could be anything:
→ Batch job that gets more data than expected
→ An app with a memory leak
→ A background task that spikes every Saturday morning
Normally it uses 2GB. Under load? 8GB.
Your node is running on limited resources. Let’s say 8GB RAM total.
That workload climbs: 500MB → 1GB → 2GB → 4GB → 8GB
No limit = nothing stops it.
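If you want to watch this happen, you can reproduce it on a throwaway node. This is a minimal sketch using the polinux/stress image from the Kubernetes docs; the pod name and the 2G figure are just illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: memory-hog          # hypothetical name, demo only
spec:
  containers:
  - name: stress
    image: polinux/stress
    command: ["stress"]
    args: ["--vm", "1", "--vm-bytes", "2G", "--vm-hang", "1"]   # allocate ~2G and hold it
    # no resources block at all = BestEffort QoS = nothing stops the growth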
Node hits Memory Pressure.
Linux OOM Killer activates.
Looks for victims.
Here’s what it evaluates:
Your workload:
→ PriorityClass: None (priority 0)
→ QoS: BestEffort (no requests/limits)
→ Memory usage: 8GB

CoreDNS:
→ PriorityClass: system-cluster-critical (priority 2000000000)
→ QoS: Burstable (request: 70Mi, no limit)
→ Memory usage: 150Mi (spiked from 70Mi under DNS load)

kube-proxy:
→ PriorityClass: system-node-critical (priority 2000001000)
→ QoS: BestEffort (no config)
→ Memory usage: 80Mi
The OOM Killer decision process is different from Kubelet Eviction.
When a node runs low on memory, the Kubelet tries to stay in control by Evicting pods. It uses your PriorityClass to decide who to kick off. In this case, your priority 0 workload would be evicted first. This is the “clean” way to fail.
But when memory spikes too fast for the Kubelet to react, the Linux Kernel OOM Killer takes over.
The Kernel doesn’t know about your PriorityClass. It only sees the QoS Class (via oom_score_adj).
The Kernel's ranking, roughly:
→ BestEffort killed first
→ Burstable second
→ Guaranteed last
→ Then memory usage relative to request: who's exceeding theirs the most?
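You can see these scores directly on a node. This is a sketch that assumes you have SSH or a privileged debug pod on the node, using kube-proxy as the example target:

# The Kubelet writes oom_score_adj per container based on QoS:
# Guaranteed ≈ -997, BestEffort = 1000, Burstable somewhere in between (based on its request).
PID=$(pgrep -o kube-proxy)
cat /proc/$PID/oom_score_adj   # expect 1000 for a BestEffort kube-proxy
cat /proc/$PID/oom_score       # the live score the OOM Killer actually ranks by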
By priority, your workload should die first.
And most of the time, it will.
But here’s the problem:
CoreDNS has NO memory limit.
With only a request (70Mi) and no limit, it can consume unlimited memory.
Under heavy DNS load (like when apps are failing and retrying), CoreDNS can spike to 150Mi, 200Mi, 300Mi.
That’s 2x, 3x, 4x its request.
If the node hits a hard memory wall before the Kubelet can run its eviction loop, the Linux Kernel looks at kube-proxy (BestEffort) and CoreDNS (Burstable and 4x over its request) and sees them as prime targets.
What happens next:
OOM Killer picks CoreDNS (or kube-proxy, or both).
DNS resolution fails.
Services can’t be resolved.
Apps can’t find backends.
Everything breaks.
Someone gets paged.
The fix that would have prevented this:
For CoreDNS (complete the partial config):
resources:
  requests:
    memory: 70Mi
  limits:
    memory: 170Mi   # 2-3x for DNS spikes
For kube-proxy (add everything):
resources:
  requests:
    memory: 250Mi
  limits:
    memory: 250Mi
For your workload (proper configuration):
resources:
  requests:
    memory: 2Gi
  limits:
    memory: 4Gi

priorityClassName: low-priority-batch   # set at the pod spec level, not on the container
Total: 12 lines of YAML
Cost: $0
Time to implement: 15 minutes
But most teams don’t do this.
Because “it’s working fine.”
Until it isn’t.
THE PROTECTION YOU ACTUALLY NEED
Everyone tells you: Set requests = limits for Guaranteed QoS.
That’s good advice for your applications.
But it’s incomplete.
You need TWO things:
Proper QoS Class (memory request + limit) - Protects against Kernel OOM Kills
Proper PriorityClass (explicit priority hierarchy) - Protects against Kubelet Eviction
Let me explain both.
PART 1: QoS Class (Memory Configuration)
QoS = Quality of Service
Kubernetes assigns every pod one of three QoS classes:
BestEffort:
→ No requests or limits set
→ Killed FIRST by the Kernel OOM Killer
→ This is kube-proxy by default

Burstable:
→ Request set, limit missing OR request < limit
→ Killed SECOND by the Kernel OOM Killer
→ This is CoreDNS by default

Guaranteed:
→ Request = limit (for all resources)
→ Killed LAST by the Kernel OOM Killer
→ This is what you want for critical pods
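You don't have to work the class out by hand; Kubernetes computes it and stores it in the pod status:

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.qosClass}'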
The fix for different scenarios:
For pods with NO config (kube-proxy):
resources:
  requests:
    memory: 250Mi
    cpu: 100m
  limits:
    memory: 250Mi
    cpu: 100m
For pods with partial config (CoreDNS):
Current default:
resources:
  requests:
    memory: 70Mi
    cpu: 100m
Add limits:
resources:
  requests:
    memory: 70Mi
    cpu: 100m
  limits:
    memory: 170Mi   # Allow 2-3x for DNS spikes
    cpu: 100m
The kube-proxy config (request = limit for both memory and CPU) becomes Guaranteed. The CoreDNS config stays Burstable because the request is below the limit, but it now has a hard ceiling instead of unbounded growth.
But QoS alone isn’t enough.
PART 2: PriorityClass (Explicit Priority Hierarchy)
The Kubelet checks PriorityClass BEFORE deciding which pods to evict when the node is under pressure.
Check your system pods:
kubectl get pods -n kube-system -o custom-columns=\
NAME:.metadata.name,\
PRIORITY:.spec.priorityClassName
You’ll see:
coredns-xxxxx: system-cluster-critical
kube-proxy-xxxxx: system-node-critical
These have Kubernetes system priority classes.
Check your app pods:
kubectl get pods -n <your-namespace> -o custom-columns=\
NAME:.metadata.name,\
PRIORITY:.spec.priorityClassName
Probably shows <none> for every pod.
No PriorityClass = priority 0 (lowest).
If the Kubelet has time to act, it will evict based on these numbers. But remember: if you have a priority 0 pod with Guaranteed QoS and a priority 2 billion pod with BestEffort QoS, and the memory spikes instantly, the Kernel might still kill the high-priority system pod first because it only cares about QoS.
The complete protection requires BOTH.
Create PriorityClasses for your applications:
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-app
value: 1000000
globalDefault: false
description: "High priority for critical application pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: medium-priority-app
value: 100000
globalDefault: false
description: "Standard priority for application pods"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low-priority-batch
value: 10000
globalDefault: false
description: "Low priority for batch jobs"
Priority reference:
→ system-cluster-critical: 2000000000
→ system-node-critical: 2000001000
→ Your critical apps: 1000000
→ Your standard apps: 100000
→ Your batch jobs: 10000
→ Default (no class): 0
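Before creating new ones, check what already exists in your cluster (the two system classes ship by default; anything else came from you or an addon):

kubectl get priorityclass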
Don’t set your apps to system-critical (bad practice).
Set them BELOW system pods but ABOVE default.
Then apply to your pods:
Critical apps:
spec:
  priorityClassName: high-priority-app
  containers:
  - name: app
    resources:
      requests:
        memory: 500Mi
      limits:
        memory: 500Mi
Batch jobs:
spec:
  priorityClassName: low-priority-batch
  containers:
  - name: batch
    resources:
      requests:
        memory: 1Gi
      limits:
        memory: 4Gi
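Once the rollout finishes, admission resolves the class name into a numeric value on each pod, which you can verify directly (pod name is a placeholder):

kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.priority}'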
Now the complete OOM/Eviction hierarchy:
Batch jobs (low priority) ← Evicted/Killed first
Standard apps (medium priority)
Critical apps (high priority)
System cluster-critical pods

System node-critical pods ← Killed last
Within each priority level:
→ BestEffort killed before Burstable
→ Burstable killed before Guaranteed
This is complete protection:
✅ Memory limits (QoS)
✅ Priority hierarchy (PriorityClass)
✅ System pods protected
✅ Critical apps protected
✅ Batch jobs sacrificed first
HOW TO CHECK YOUR CLUSTER RIGHT NOW
Step 1: Find pods with no memory config
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].resources.requests.memory == null) | .metadata.name'
These are BestEffort QoS (most vulnerable).
Step 2: Check the complete picture
kubectl get pods --all-namespaces -o custom-columns=\
NAME:.metadata.name,\
NAMESPACE:.metadata.namespace,\
PRIORITY:.spec.priorityClassName,\
QOS:.status.qosClass,\
MEM-REQ:.spec.containers[0].resources.requests.memory,\
MEM-LIM:.spec.containers[0].resources.limits.memory
Look for:
❌ BestEffort + no priority = most vulnerable
❌ Burstable + no priority = vulnerable
❌ System pods with Burstable = incomplete config
✅ Guaranteed + proper priority = protected
Step 3: Fix system pods first
Priority order:
System pods with incomplete config → CoreDNS (add limit)
System pods with no config → kube-proxy (add request + limit)
Critical application pods → Add memory + priority
Standard application pods
Batch jobs
Start with pods where failure = platform failure:
→ CoreDNS (DNS for entire cluster)
→ kube-proxy (network routing)
→ Your revenue-generating apps
Step 4: Find the right values
Option A: Use Goldilocks (recommended)
Wait 24-48 hours for data collection.
Check dashboard for recommendations based on actual usage.
Option B: Manual calculation
kubectl top pods -n <namespace>
Take current usage.
Add 30-50% buffer.
Set request = limit for Guaranteed QoS.
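A quick worked example with made-up numbers: say kubectl top shows the pod sitting around 180Mi.

180Mi current usage
180Mi × 1.4 (≈40% buffer) ≈ 250Mi
→ set requests.memory = limits.memory = 250Mi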
Starting points:
→ CoreDNS: 70Mi request → 170Mi limit
→ kube-proxy: 250Mi (both)
→ Your apps: Measure actual usage + buffer
Step 5: Apply the fixes
For CoreDNS (add limit to existing config):
kubectl edit deployment coredns -n kube-system
Change:
resources:
  requests:
    memory: 70Mi
To:
resources:
  requests:
    memory: 70Mi
  limits:
    memory: 170Mi
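If you'd rather not go through kubectl edit, a strategic merge patch does the same thing in one command. This is a sketch that assumes the container is named coredns (the default name in the stock deployment):

kubectl -n kube-system patch deployment coredns \
  -p '{"spec":{"template":{"spec":{"containers":[{"name":"coredns","resources":{"limits":{"memory":"170Mi"}}}]}}}}'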
For kube-proxy (add everything):
kubectl edit daemonset kube-proxy -n kube-system
Add:
resources:
  requests:
    memory: 250Mi
    cpu: 100m
  limits:
    memory: 250Mi
    cpu: 100m
For your apps (add memory + priority):
Create PriorityClasses first (see templates above).
Then update deployments:
spec:
  priorityClassName: high-priority-app
  containers:
  - name: app
    resources:
      requests:
        memory: 500Mi
      limits:
        memory: 500Mi
Step 6: Monitor for a week
Check for:
→ OOM kills:
kubectl get events --all-namespaces | grep -i oom
→ Pod restarts:
kubectl get pods --all-namespaces
→ Memory pressure:
kubectl describe nodes | grep -i pressure
→ QoS verification:
kubectl get pods -o custom-columns=NAME:.metadata.name,QOS:.status.qosClass
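One caveat: events catch most of it, but kernel-level kills also land in the node's kernel log. If you have node access (SSH or a debug pod), a cross-check along these lines is worth adding:

journalctl -k --since "24 hours ago" | grep -i "out of memory"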
If a pod hits its limit and gets OOM killed:
→ Increase limit by 25-50%
→ OR investigate for memory leak
Don’t just raise limits without investigating why.
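A quick way to confirm it really was an OOM kill (and not a crash for some other reason) before touching the limit:

kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints OOMKilled if the previous container instance was killed for memory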
THE REALITY
I found this in MY OWN practice cluster.
Not a client.
Not a consulting engagement.
My own infrastructure that I use to explore Goldilocks and K8s cost optimization.
And it has the same problems everyone's clusters have:
→ System pods ship incomplete (CoreDNS with request, no limit)
→ System pods ship empty (kube-proxy with nothing)
→ Nobody completes the configuration
→ Everything works fine
Until memory pressure hits.
The fix takes 30-60 minutes:
System pods: 15-20 minutes (add limits)
PriorityClasses: 10 minutes (create hierarchy)
Application pods: 20-30 minutes (memory + priority)
One Saturday morning of work.
But most teams don’t do it.
Because there's no incident yet.
Because it's "working fine."
Because nobody's checking.
I’m checking.
And I’m sharing what I find.
Because 3 out of 4 system pods being vulnerable to OOM kill is not “working fine.”
It’s a time bomb.
Check your cluster.
Complete the configs.
Set the priorities.
Don’t wait for the 3am page.
See you next Tuesday.
- Naveen
P.S. - If this was useful, forward it to your DevOps team. They’ll thank you when they’re NOT explaining to management why CoreDNS died despite having system-cluster-critical priority, because nobody ever added the memory limit that Kubernetes doesn’t set by default.
