Redis maxmemory, eviction, and the day we served stale for 20 minutes
Redis has been boring for us for years. One day it suddenly wasn’t.
What users saw
New signups showed a blank feed for about 20 minutes. Existing users were fine. Our dashboards showed everything green. Our SRE was very confused.
What was actually happening
The service writes into a Redis cache during user onboarding. Specifically, it populates a “default feed” for a new user based on a recommendation computation. That write was failing — silently — because Redis was rejecting new keys with:
OOM command not allowed when used memory > 'maxmemory'.
This is Redis telling you: “you told me I have a memory budget, and I am at that budget, and you didn’t tell me what to evict, so I’m refusing new writes.”
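To make the failure mode concrete, here is roughly how that rejection looks from a client. This is a sketch, assuming a throwaway local instance and redis-py; the tiny maxmemory is only there to trigger the error quickly, and older redis-py versions raise a plain ResponseError rather than the OutOfMemoryError subclass.

# Sketch: trigger the OOM rejection against a throwaway local Redis.
# Do not run this against an instance you care about.
import redis

r = redis.Redis(host="localhost", port=6379)
r.config_set("maxmemory", "8mb")                 # deliberately tiny budget
r.config_set("maxmemory-policy", "noeviction")   # the default: reject writes when full

try:
    for i in range(100_000):
        r.set(f"feed:{i}", "x" * 1024, ex=3600)
except redis.exceptions.ResponseError as e:
    # Recent redis-py raises OutOfMemoryError (a ResponseError subclass);
    # either way the message is the "OOM command not allowed" line above.
    print(type(e).__name__, e)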
Why didn’t eviction save us
Because maxmemory-policy was set to noeviction. Which is the default. Which I didn’t know.
For a cache you almost always want allkeys-lru or allkeys-lfu. For a general-purpose Redis where some keys are durable and some are caches, you want volatile-lru and set TTLs on the cache keys. noeviction is the right choice only when every key matters and you’d rather refuse writes than lose any data.
Our Redis was sized for the cache use case. Nobody had explicitly set a policy. It was being used as a write-through cache with no fallback. When it hit maxmemory, it just started failing new writes.
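If you don't know what a given instance is actually running with, it's two CONFIG GETs to find out. A minimal sketch, assuming redis-py and a client pointed at the instance in question:

import redis

r = redis.Redis()  # point this at the instance in question

policy = r.config_get("maxmemory-policy")["maxmemory-policy"]
limit = int(r.config_get("maxmemory")["maxmemory"])  # 0 means no memory limit at all

print(f"maxmemory-policy={policy}, maxmemory={limit} bytes")
# noeviction plus a non-zero maxmemory is exactly the write-reject-on-full setup above.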
The service wasn’t observing the failure
The write path did something morally equivalent to:
try:
    redis.set(key, json.dumps(feed), ex=3600)
except RedisError as e:
    logger.warning("cache write failed", exc_info=True)
Which is… fine? Except the error happened inside the pipeline context, and the actual OOM errors were swallowed by a broader exception handler a layer up that just logged them at INFO level. And INFO-level logs were being sampled heavily for cost reasons.
So the service kept “succeeding” in its own view, with blank caches. New users saw blank feeds. Existing users were reading from already-populated caches and were fine.
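For illustration, here is a hypothetical reconstruction of that layer-up handler. None of these function names are from our codebase; the shape of the catch-all is the point.

import logging

logger = logging.getLogger("onboarding")

def build_and_cache_feed(user_id: str) -> None:
    # Stand-in for the recommendation + cache-write step that raised the OOM.
    raise RuntimeError("OOM command not allowed when used memory > 'maxmemory'.")

def populate_default_feed(user_id: str) -> None:
    try:
        build_and_cache_feed(user_id)
    except Exception:
        # The catch-all "onboarding must never fail" guard: the OOM from the
        # cache write lands here at INFO, INFO logs were heavily sampled,
        # and onboarding reports success with an empty cache behind it.
        logger.info("non-fatal error during onboarding", exc_info=True)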
The fix
Three changes, all obvious in retrospect.
First, set an actual eviction policy:
maxmemory-policy allkeys-lru
For our workload (everything in Redis is a cache, no durable data) this is correct.
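The conf line takes effect on the next restart; to apply it to the running instance immediately, CONFIG SET works too. A sketch assuming redis-py; CONFIG REWRITE persists the change back to the config file and only succeeds if the server was started from one.

import redis

r = redis.Redis()  # the instance being fixed

r.config_set("maxmemory-policy", "allkeys-lru")  # apply to the running server
r.config_rewrite()  # persist it so a restart doesn't silently revert to noeviction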
Second, maxmemory sizing. We were at 95% of maxmemory utilization steady-state, which was asking for trouble. We bumped the size and set alerts on utilization > 80%.
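The alert itself is just a ratio of two fields from INFO memory. A minimal sketch, assuming redis-py; the alerting hookup is whatever you already use, with a print as the placeholder:

import redis

ALERT_THRESHOLD = 0.80

def redis_memory_utilization(r: redis.Redis) -> float:
    info = r.info("memory")
    maxmemory = info.get("maxmemory", 0)
    if not maxmemory:            # 0 means no limit configured, nothing to alert on
        return 0.0
    utilization = info["used_memory"] / maxmemory
    if utilization > ALERT_THRESHOLD:
        # Wire this into your real alerting; printing is just the placeholder.
        print(f"redis memory at {utilization:.0%} of maxmemory")
    return utilization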
Third, the service’s cache write path now escalates OOM errors to ERROR log level and emits a metric. Any cache write failure counts against an SLO budget; sustained failures page.
from redis.exceptions import OutOfMemoryError, RedisError

try:
    redis.set(key, json.dumps(feed), ex=3600)
except OutOfMemoryError as e:
    logger.error("redis oom during cache write", extra={"key": key})
    metrics.increment("cache.write.oom")
    raise
except RedisError as e:
    logger.warning("cache write failed", exc_info=True)
    metrics.increment("cache.write.error")
Why the OOM happened in the first place
The onboarding feed had grown. A few weeks earlier we had started including more items per user (500 instead of 100), so each user’s cached feed was now 5x larger. Redis wasn’t seeing any more QPS; it was just storing more per key, and used memory crept up to maxmemory gradually.
This is a pattern. Cache-friendly changes often look innocuous in isolation — “just store a bit more per user.” Multiplied by all users, over time, they’re the biggest drivers of cache size drift.
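One cheap way to catch that drift before it reaches maxmemory is to track average bytes per key for the big key families. A sketch assuming redis-py, Redis 4.0+ for MEMORY USAGE, and a hypothetical feed:* key pattern:

import random
import redis

def avg_feed_key_bytes(r: redis.Redis, sample_size: int = 100) -> float:
    # SCAN (not KEYS) so we don't block the server; collect matching keys first.
    keys = list(r.scan_iter(match="feed:*", count=1000))
    if not keys:
        return 0.0
    sample = random.sample(keys, min(sample_size, len(keys)))
    sizes = [r.memory_usage(k) or 0 for k in sample]
    return sum(sizes) / len(sizes)

# Emit this as a metric over time; a 5x jump per key shows up here weeks
# before it shows up as an OOM error.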
Things I’d put in a runbook
- Alert on `info memory` -> `used_memory` approaching `maxmemory`, at 75% / 85%.
- Alert on the `info stats` -> `evicted_keys` rate. A sustained non-zero eviction rate means you’re thrashing the cache.
- Alert on `info stats` -> `rejected_connections`, and on OOM errors in your application logs.
- Never run Redis with `noeviction` unless you’ve explicitly decided you want write-reject-on-full behavior. Write that decision down.
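Pulled together, those checks read from two INFO sections plus a CONFIG GET. A minimal sketch, assuming redis-py; the thresholds are placeholders, and evicted_keys is a counter since restart, so in practice you'd alert on its rate rather than its absolute value.

import redis

def runbook_checks(r: redis.Redis) -> list[str]:
    warnings = []
    mem = r.info("memory")
    stats = r.info("stats")

    if mem.get("maxmemory"):
        utilization = mem["used_memory"] / mem["maxmemory"]
        if utilization > 0.75:
            warnings.append(f"used_memory at {utilization:.0%} of maxmemory")

    if stats.get("evicted_keys", 0) > 0:
        warnings.append(f"evicted_keys={stats['evicted_keys']} (eviction is happening)")

    if stats.get("rejected_connections", 0) > 0:
        warnings.append(f"rejected_connections={stats['rejected_connections']}")

    if r.config_get("maxmemory-policy")["maxmemory-policy"] == "noeviction":
        warnings.append("maxmemory-policy is noeviction - was that a deliberate choice?")

    return warnings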
Related: The label that killed Prometheus is a sibling story — “silent drift accumulates, then everything breaks.”