Logical replication slot lag ate our WAL
Logical replication in Postgres is one of those features that’s magical when it works and terrifying when it doesn’t. Last quarter we had a subscriber go offline for 36 hours. When I finally looked at the primary, pg_wal had grown to 142GB. We had two hours of runway before the disk filled and the whole database stopped accepting writes.
What happened
We run a CDC pipeline that uses a logical replication slot to stream changes to a Kafka bridge. The bridge had a bug (spoiler: cert renewal + an overconfident retry loop) and stopped consuming. But the slot stayed alive — which is correct behavior, because Postgres doesn’t know whether the consumer is “paused” or “dead forever.” It has to retain WAL.
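For reference, here is what creating a slot like ours looks like by hand (our bridge manages its own slot; the `pgoutput` plugin below is an assumption, since your decoder might be `wal2json` or something tool-specific):

```sql
-- Creates a logical slot. WAL is retained from this moment on,
-- whether or not anything ever connects to consume it.
SELECT pg_create_logical_replication_slot('cdc_bridge', 'pgoutput');
```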
The thing I didn’t appreciate is that retention is per slot. One idle slot can hold back the whole server’s WAL.
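When you’re hunting for the culprit, sort slots by how far each one’s `restart_lsn` trails the current write position. A sketch:

```sql
-- The top row is the slot pinning the most WAL.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC NULLS LAST;
```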
The timeline
- Tuesday 02:00: bridge starts failing. Logs quietly pile up. Nobody paged.
- Tuesday 14:00: I check on something else, notice the disk usage graph has a 45-degree line.
- Tuesday 14:05: SSH in, `du -sh /var/lib/postgresql/*/pg_wal` = 142G. Confusion.
- Tuesday 14:08: `SELECT slot_name, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) FROM pg_replication_slots` shows one slot behind by 142GB.
- Tuesday 14:30: Fix the bridge. Slot catches up in about 40 minutes.
It was fine in the end. It was not fine that we were two hours from catastrophe and nobody knew.
The monitoring that’s obvious in retrospect
We now alert on:
```sql
SELECT
    slot_name,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag_pretty,
    pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes_num,
    active
FROM pg_replication_slots;
```
Two alerts:
- `lag_bytes_num` > 10GB: warn.
- `lag_bytes_num` > 50GB: page.
These thresholds depend on your volume. For us 10GB means a few minutes behind — normal consumer hiccup. 50GB means something’s genuinely wrong.
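If your alerting layer runs SQL checks directly, the page-level check is the same query with a predicate; the threshold below is just our number:

```sql
-- Any row returned here should page someone.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots
WHERE pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) > pg_size_bytes('50GB');
```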
We also alert on active = false for any slot. A slot with no consumer is either misconfigured or abandoned.
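That check is a one-liner. On Postgres 13+ it’s worth selecting `wal_status` too, since it tells you whether the slot is already eating into WAL headroom:

```sql
-- Slots nobody is consuming from. wal_status reads 'extended' or
-- 'unreserved' as a slot falls behind, and 'lost' once it's invalidated.
SELECT slot_name, wal_status
FROM pg_replication_slots
WHERE NOT active;
```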
The guardrails we added
Beyond monitoring, two changes.
First, max_slot_wal_keep_size in postgresql.conf. This is a hard cap: if a slot falls behind by more than this, the slot is invalidated (consumer will need to resync). We set it to a conservative value:
```
max_slot_wal_keep_size = 100GB
```
This means we’ll lose a misbehaving slot before we lose the database. That’s the right tradeoff: a consumer can recover from “slot invalidated” with some painful resync work, but nothing recovers gracefully from “database down.”
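The setting is reloadable, so no restart is needed, and on Postgres 13+ you can watch each slot’s remaining headroom afterward (`safe_wal_size` is how much further a slot can fall behind before invalidation):

```sql
-- Apply the cap without a restart.
ALTER SYSTEM SET max_slot_wal_keep_size = '100GB';
SELECT pg_reload_conf();

-- Per-slot headroom. wal_status = 'lost' means the slot was invalidated.
SELECT slot_name, wal_status, pg_size_pretty(safe_wal_size) AS headroom
FROM pg_replication_slots;
```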
Second, a janitor that periodically lists slots and cross-references against a known-slots allowlist. Unknown slot? Alert. Helps with “engineer created a slot for testing and forgot.”
```python
import os
import psycopg  # psycopg 3

DSN = os.environ["PG_DSN"]  # assumed env var holding the primary's connection string
KNOWN_SLOTS = {"cdc_bridge", "analytics_snapshot", "replica_us_west"}

def alert(msg: str) -> None:
    print(f"ALERT: {msg}")  # stub: wire this into your paging system

with psycopg.connect(DSN) as conn:
    rows = conn.execute("SELECT slot_name FROM pg_replication_slots")
    unknown = {r[0] for r in rows} - KNOWN_SLOTS
    if unknown:
        alert(f"unknown replication slots: {sorted(unknown)}")
```
The things I still don’t love about logical replication
- Slot invalidation recovery is a real pain. Resyncing a large table over a logical slot is slow and involves a lot of operational coordination.
- There’s no built-in way to say “if the consumer is away for more than X, drop the slot.” The slot outlives the consumer by default.
- Monitoring isn’t baked in. You have to set up the queries above; nothing alerts you out of the box.
- The `pg_replication_slots.active` flag is flap-prone with client libraries that reconnect frequently. Alert on sustained inactivity, not an instantaneous snapshot (see the query after this list).
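On that last point: if you’re on Postgres 17 or newer, `pg_replication_slots` has an `inactive_since` timestamp that makes the sustained-inactivity alert easy to express in SQL (the 15-minute window is arbitrary):

```sql
-- Slots that have had no consumer for 15+ minutes (Postgres 17+).
SELECT slot_name, inactive_since
FROM pg_replication_slots
WHERE NOT active
  AND inactive_since < now() - interval '15 minutes';
```

On older versions, do the windowing in the alerting layer instead, e.g. only fire when the instantaneous check has failed for several consecutive scrapes.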
Reflection
I had read all of this in the docs. I had even used logical replication on a smaller project. What I didn’t have was the operational instinct that “slot lag = disk fills” — it’s written in the docs but it’s hidden behind four levels of abstraction and you don’t feel it until it’s your problem.
If you’re setting up logical replication today, write the monitoring query first. Put it in Grafana. Set max_slot_wal_keep_size. Allowlist your slots. Then build the consumer.
Related: “Postgres partitioning attach locks” has more on Postgres operational surprises.