When GIN on JSONB is wrong
I keep seeing teams reach for CREATE INDEX ... USING GIN (payload) as their default response to “we need to index a JSONB column.” It often works, sometimes spectacularly, and sometimes it ruins their afternoon. Here’s the mental model I’ve settled on.
GIN is for containment and full-scan-within-doc
GIN is great for queries that look like:
SELECT * FROM events WHERE payload @> '{"type": "signup"}';
SELECT * FROM events WHERE payload ? 'user_id';
That’s what it’s built for. The operators @>, ?, ?&, ?|, jsonb_path_ops equivalents — these are fast with GIN.
It is not fast for:
SELECT * FROM events WHERE (payload->>'user_id')::uuid = 'abc...';
SELECT * FROM events WHERE payload->'user'->>'email' = 'x@y.com';
These are extraction queries. You’re treating a single path in the JSONB as a typed value and matching it. GIN can’t use its inverted-index structure for these. You’ll get a seq scan or worse, a bitmap scan that falls back to filter on every row.
Expression indexes for extraction
The fix is a boring B-tree on the extracted expression:
CREATE INDEX events_user_id_idx
ON events ((payload->>'user_id'));
This lets:
SELECT * FROM events WHERE payload->>'user_id' = 'abc...';
do an index scan. It’s typed as text, but you can cast:
CREATE INDEX events_user_id_uuid_idx
ON events (((payload->>'user_id')::uuid));
Caveats:
The query has to use the same expression as the index. Case, whitespace, explicit casts — all matter. Postgres doesn’t know that
payload->>'user_id'andpayload->> 'user_id'are the same thing if you add a cast to one and not the other.Expression indexes only cover one path each. If you have 10 paths you query on, that’s 10 indexes. If you have 100 and don’t know in advance which, you might actually want GIN with
jsonb_path_ops. It’s a tradeoff.
A story from the trenches
Team had a table events (id, tenant_id, created_at, payload jsonb). They’d indexed with USING GIN (payload jsonb_path_ops). Queries like payload @> '{"foo": "bar"}' were snappy.
Then product wanted to show “events for user X.” That query looked like:
SELECT * FROM events
WHERE tenant_id = $1
AND (payload->>'user_id')::uuid = $2
AND created_at > now() - interval '7 days'
ORDER BY created_at DESC
LIMIT 100;
GIN couldn’t help. There was a b-tree on (tenant_id, created_at DESC) which scoped the search somewhat. But within a tenant for a week, there could be 500k events, and Postgres filter-on-rows-from-index for the user_id was the bottleneck. 1.2 seconds.
Adding:
CREATE INDEX CONCURRENTLY events_tenant_user_created_idx
ON events (tenant_id, ((payload->>'user_id')::uuid), created_at DESC);
…took us to 18ms. An 80x improvement by adding a specific index for a specific extraction, not by tuning GIN.
When GIN really is right
- You genuinely query by containment with varying shapes. Example: a webhook event store where downstream consumers filter by arbitrary path matches.
- The document structure is not known in advance. Think heterogeneous vendor data.
- You do full-text or lexeme-style search inside a JSONB (
to_tsvector(payload::text)etc.).
In those cases GIN (particularly jsonb_path_ops, which is smaller and faster for containment) is the right call.
Size and maintenance cost
GIN indexes are slower to update than B-tree. For a high-write JSONB column, GIN index maintenance can be a significant fraction of write cost. We’ve measured ~30% write amplification on some tables.
Also: GIN indexes on large JSONB documents can be enormous. One of our tables had a 400GB GIN index on top of a 600GB table. We dropped it after confirming nobody actually used the containment queries it was built for.
Check your index usage before you trust it:
SELECT indexrelname, idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE schemaname = 'public'
ORDER BY idx_scan DESC;
Zero idx_scan after a few days means the index is dead weight.
Reflection
“JSONB column, therefore GIN” is the default advice. It’s often right, but it masks the real question: how are you actually querying this column? For containment and shape-matching, GIN. For extraction, expression B-tree. For unknown future queries, you probably want GIN-as-safety-net and targeted expression indexes as you learn the real query patterns.