On our staging VM, a test runner process would occasionally leak memory. When the VM ran out of memory, the OOM killer would pick postgres and take down every service that depended on it. The test process, the actual culprit, would happily keep running. Here is how I finally understood and fixed this.

The scene

Single VM, 16 GB RAM. Runs:

  • postgres (legitimate RSS around 3-4 GB)
  • a handful of app processes (RSS around 200 MB each)
  • a test runner that occasionally balloons to 10 GB

When the runner balloons, the kernel OOM handler fires. It picks a victim based on an internal score. Counter to my intuition, it picked postgres instead of the leaking test runner.

Looking at the scores

The kernel stores a per-process OOM score in /proc/<pid>/oom_score. Higher means more likely to be picked. It also has oom_score_adj, a knob in /proc/<pid>/oom_score_adj you can set to bias the decision (range -1000 to +1000; -1000 means “never kill this”).

For postgres:

pgrep -a postgres
# 2341 /usr/lib/postgresql/15/bin/postgres -D ...

cat /proc/2341/oom_score
# 820
cat /proc/2341/oom_score_adj
# 0

For the test runner in its happy state:

cat /proc/8812/oom_score
# 120
cat /proc/8812/oom_score_adj
# 0

Then the runner balloons and its score goes way up. You would expect the runner to be picked. But the scoring algorithm treats some processes more harshly than you might expect, and postgres was on the wrong side of that.

The algorithm, roughly

Simplified version of what Linux does:

  1. Compute badness for each process as a function of its RSS, swap usage, and page-table footprint.
  2. Apply the oom_score_adj bias.
  3. Skip kernel threads, init, and anything with adj=-1000.
  4. Pick the highest.

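You can eyeball that ranking yourself by reading every /proc/<pid>/oom_score and sorting. A rough snapshot loop (the kernel recomputes badness at kill time, so treat the output as approximate):

for d in /proc/[0-9]*; do
  printf '%6s  %s\n' "$(cat "$d/oom_score" 2>/dev/null)" "$(cat "$d/comm" 2>/dev/null)"
done | sort -rn | head
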
In practice the postgres score was high because postgres is a parent process with many children that collectively hold shared memory segments. The shared memory ends up accounted to the main process, which is one of many postgres quirks. In this specific case postgres’s score was 820 at idle, and when the runner ballooned to 10 GB its score was around 950, so on those numbers the runner should have been picked. But there is a second factor: when postgres allocates via huge pages, a chunk of its RSS is counted twice in certain kernel paths. That was enough to tip the balance.

I actually learned about the huge-pages double-count when I read the kernel source during a particularly grumpy evening. For our config it was decisive.
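
If you want to see how much of a process’s footprint is shared rather than private, /proc/<pid>/smaps_rollup breaks RSS down (pid 2341 is the postmaster from above; the file exists on reasonably recent kernels):

grep -E '^(Rss|Shared|Private)' /proc/2341/smaps_rollup
# Shared_Clean/Shared_Dirty count pages mapped by more than one process,
# which for postgres is mostly its shared buffers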

The fix

Three layers, from cheapest to most thorough:

  1. Protect postgres with oom_score_adj. Set it to -500 so it is still killable in a true emergency, but any reasonably greedy process will be picked ahead of it.

    # /etc/systemd/system/postgresql.service.d/oom.conf
    [Service]
    OOMScoreAdjust=-500
    

    After systemctl daemon-reload && systemctl restart postgresql:

    cat /proc/$(pgrep postgres | head -1)/oom_score_adj
    # -500
    

    Now under memory pressure, postgres is demoted significantly. The OOM killer has to find something meaner to eat first.
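
    The drop-in is the persistent route; for a process that is already running you can also write the value directly (pid 2341 from earlier, as root), though it only lasts until the next restart:

    echo -500 > /proc/2341/oom_score_adj
    cat /proc/2341/oom_score_adj
    # -500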

  2. Put the runner in a cgroup with a hard memory limit. Systemd makes this one line:

    # /etc/systemd/system/test-runner.service.d/mem.conf
    [Service]
    MemoryMax=8G
    MemoryHigh=6G
    

    MemoryHigh starts throttling and aggressively reclaiming the runner’s memory at 6G, and MemoryMax is the hard cap at 8G. If the runner pushes past 8G, the kernel OOMs within the runner’s cgroup, killing the runner specifically and leaving the rest of the system alone.
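
    To confirm where the limit landed, you can read it back from systemd or straight from the cgroup (the cgroupfs path assumes cgroup v2 and the default system.slice placement):

    systemctl show test-runner.service -p MemoryMax
    # MemoryMax=8589934592
    cat /sys/fs/cgroup/system.slice/test-runner.service/memory.max
    # 8589934592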

  3. Add pressure-based alerting. /proc/pressure/memory tells me when the system is struggling before a kill happens:

    cat /proc/pressure/memory
    # some avg10=18.00 avg60=9.03 avg300=2.02 total=783742991
    # full avg10=6.00 avg60=3.01 avg300=0.77 total=221112887
    

    I wired a small Prometheus exporter that reads these and alerts at full avg10 > 10%. That gives me a heads-up before the kernel starts picking victims.
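
    The exporter is only a few dozen lines; the interesting part is parsing that file. A rough sketch of the threshold check in shell rather than the exporter’s actual code (the 10% cut-off is just the alert level mentioned above):

    awk '/^full/ {
        split($2, f, "=")          # f[2] is the avg10 value
        if (f[2] + 0 > 10)
            print "memory pressure: full avg10 is " f[2] "%, above the 10% threshold"
    }' /proc/pressure/memory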

Caveats

  • Setting OOMScoreAdjust=-1000 makes a process un-kill-able. That is a big knob. If you set it on postgres and postgres itself starts allocating boundlessly (a bug), the kernel will eventually OOM the entire node. Leave postgres killable as a last resort.
  • Cgroup memory limits interact with the kernel page cache in subtle ways, which I wrote about in my post on a cgroup memory accounting bug.
  • On a node running k8s, the kubelet already puts pods in cgroups and sets oom_score_adj based on QoS class. Running postgres as a plain systemd service next to k8s pods on the same node gets confusing; the non-pod processes sit outside the kubelet’s hierarchy and get none of those QoS-based protections.

Reflection

The OOM killer is one of those subsystems you do not think about until you have to. I used to think of it as a randomizer. It is not. It has a consistent, legible, sometimes slightly surprising scoring function. Reading /proc/<pid>/oom_score and understanding why it is what it is, for your particular workloads, is a rewarding afternoon. Adjusting OOMScoreAdjust for your known-important processes and cgroup-boxing your known-sloppy ones is the durable answer.

Related: see my post on cgroup memory accounting and the three days I lost to it.