Three days of debugging a cgroup memory accounting bug
A backend service started getting OOM-killed in staging. The memory limit was 4 GiB, the resident set was plainly 1.8 GiB according to every tool we had, and yet the kernel was confident the cgroup was over its limit. I lost three days to this.
What I saw first
The pod status said OOMKilled. The events were clear:
kubectl get events --field-selector reason=OOMKilling -n app
# LAST SEEN TYPE REASON OBJECT MESSAGE
# 2m Warning OOMKilling Pod/api-7b8c6 Memory cgroup out of memory
But the working set from the metrics was nowhere near the limit. I pulled memory.current from the cgroup directly:
cat /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/.../memory.current
# 4294836224
That is the 4 GiB limit, give or take a few pages. So the kernel was not lying. Something was being charged to this cgroup that did not show up in the usual RSS number.
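A quick headroom check makes it concrete; this is just a throwaway sketch run against the same cgroup directory, with the slice path elided as above:
cd /sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/...
current=$(cat memory.current)
max=$(cat memory.max)              # prints the literal string "max" when no limit is set
echo "headroom: $(( max - current )) bytes"
# headroom: 131072 bytes
About 128 KiB of room in a 4 GiB cgroup. The next allocation spike had nowhere to go.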
Where the memory was hiding
The answer is almost always page cache or kernel memory when ps and the cgroup disagree. I read memory.stat, which is gold:
cat /sys/fs/cgroup/.../memory.stat
# anon 1824923648
# file 2387432448
# kernel_stack 12582912
# slab 44301312
# sock 1048576
# anon_thp 0
# file_mapped 83886080
# ...
There it was: file at 2.2 GiB. The service was reading a large set of files repeatedly, and the page cache for those reads was being charged to the cgroup. Our tooling grew up around the v1 memory controller and had learned to report an anon-centric “working set”, and that habit carried over. But page cache is charged against memory.max just like anonymous memory, and when it cannot be reclaimed fast enough it is what pushes the cgroup into OOM territory.
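memory.stat mixes byte counters with event counters (the pg* lines are counts, not bytes), so a blind numeric sort is misleading; pulling out just the big byte-valued buckets is enough to see the shape:
grep -E '^(anon|file|slab|kernel_stack|sock) ' /sys/fs/cgroup/.../memory.stat | sort -k2 -nr
# file 2387432448
# anon 1824923648
# slab 44301312
# kernel_stack 12582912
# sock 1048576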
Why it was not reclaimable
This is the part that took me two of the three days. Normally page cache is reclaimable under memory pressure, and the OOM killer is supposed to kick in only after reclaim fails. I pulled memory.events:
cat /sys/fs/cgroup/.../memory.events
# low 0
# high 0
# max 12
# oom 1
# oom_kill 1
high 0 was the tell. We had not set memory.high at all, so there was no soft-pressure threshold: the cgroup went straight from “under the limit” to “at the limit, kill something”. max 12 says allocations ran into the hard limit a dozen times and forced direct reclaim; oom_kill 1 is the time reclaim lost. Meanwhile the workload was bursty: in a tight loop it mmapped a couple of large files, scanned them, unmapped, repeat. Each iteration pulled pages into the cache, and with no memory.high there was nothing nudging the kernel to reclaim between iterations. When a spike hit, the kernel had to reclaim synchronously inside the allocation path, could not free enough in time, and invoked the OOM killer.
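We could not hand-edit the cgroup under Kubernetes, but on a test box the experiment is a one-liner; the 3.5 GiB value is only an illustration of “a bit below memory.max”:
# inside a test cgroup that already has a 4 GiB memory.max
echo $((3584 * 1024 * 1024)) > memory.high
cat memory.high
# 3758096384
Above memory.high the kernel throttles the cgroup and reclaims aggressively instead of jumping straight to the OOM killer, which is exactly the soft-pressure stage we were missing.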
I confirmed the reclaim-pressure story with pressure stall information:
cat /sys/fs/cgroup/.../memory.pressure
# some avg10=78.20 avg60=41.11 avg300=12.02 total=183742991
# full avg10=23.90 avg60=11.03 avg300=3.01 total=45112887
full pressure at nearly 24% over the 10-second window: for roughly a quarter of the time, every runnable task in the cgroup was stalled on memory at once. That is an unhappy cgroup.
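Our real alerting now watches the same signal through the metrics pipeline (see the reflection below), but while debugging, a crude watcher over the file itself is enough. A throwaway sketch, with the slice path elided as before:
while sleep 10; do
  full=$(awk '/^full/ { split($2, a, "="); print a[2] }' /sys/fs/cgroup/.../memory.pressure)
  # fire whenever the 10-second average of "full" stall time goes above 10%
  awk -v f="$full" 'BEGIN { exit !(f > 10) }' && echo "$(date -Is) full avg10=${full}%"
done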
Fix
Two changes:
- Set memory.high a bit below memory.max. In theory that is done in the pod spec via a memory.limits.enable setting our cluster admin had hidden behind a feature gate; in practice it meant tuning the kubelet's --memory-manager-policy and adding a cgroup-level high via the containerd config. After a few hours I gave up on that route and instead added MADV_DONTNEED calls in the service after each scan, so it gave up its claim on those pages proactively.
- Set vm.vfs_cache_pressure higher on the node. The default of 100 is fine for most machines, but pushing it to 200 on these made the kernel more willing to drop inode/dentry caches. (The sysctl change is sketched after this list.)
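The node change itself is a plain sysctl; the drop-in file name below is just a convention, use whatever your node image expects:
sysctl -w vm.vfs_cache_pressure=200
# persist across reboots
echo "vm.vfs_cache_pressure = 200" > /etc/sysctl.d/90-vfs-cache-pressure.conf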
The posix_fadvise approach would also work:
/* tell the kernel we are done with this file's pages so the clean cache can be dropped */
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
That was cleaner than the madvise route: it acts on the file's page cache directly, and the service did not want to keep the mapping around anyway.
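A cheap sanity check that the advice is doing anything is to watch the cgroup's file counter across one scan iteration; this is a sketch, not output from the original incident:
grep '^file ' /sys/fs/cgroup/.../memory.stat     # note the value
# ... let the service run one mmap, scan, unmap iteration ...
grep '^file ' /sys/fs/cgroup/.../memory.stat     # should settle rather than ratchet up every pass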
Reflection
The hard lesson was that cgroup v2 charges more to the cgroup than our tooling was showing me. In v1, kernel memory sat behind separate kmem knobs and our dashboards were built around that split. When the cluster moved to v2, our working-set metric became a lie on this particular workload. We rebuilt the dashboard to show anon, file, slab, and kernel_stack separately, and we alert on memory.pressure full > 10% for 5 minutes regardless of the RSS number. That would have saved me two of the three days.
Related posts: see my posts on an OOM killer picking postgres and on the btrfs scrub that found what smartctl missed for more “the kernel sees what you do not” stories.