Notes: pgvector HNSW m parameter rule of thumb
Spent part of today benchmarking pgvector HNSW against our dataset. I went in thinking m (max number of connections per layer) was the biggest knob. It kinda is. It also isn’t.
Rule of thumb I settled on:
m = 8: tiny datasets (<100k vectors), low-dimensional. Fast build, small index, recall ~85-90%.
m = 16 (default): general purpose. Works well up to ~5M vectors. Recall ~92-95% at reasonable ef_search.
m = 32 or m = 48: high-dim (>512 dim) or large datasets (>10M). Recall target of 97%+.
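Same rule of thumb as a small Python helper, so it can sit in a project setup script. This is only a sketch of the tiers above: suggest_m and low_dim_max are names made up for illustration, the 128-dim cutoff for "low-dimensional" is an assumption (the notes don't pin it down), and the 5M-10M gap is assumed to fall back to the default.

```python
def suggest_m(n_vectors: int, dims: int, low_dim_max: int = 128) -> int:
    """Return a starting HNSW m following the rule of thumb in these notes."""
    # High-dim (>512) or large (>10M) datasets: start at 32; 48 is the next step up.
    if dims > 512 or n_vectors > 10_000_000:
        return 32
    # Tiny and low-dimensional: the small, fast-to-build index is enough.
    # low_dim_max is an assumed cutoff; the notes only say "low-dimensional".
    if n_vectors < 100_000 and dims <= low_dim_max:
        return 8
    # Everything else, including the unspecified 5M-10M band, gets the default.
    return 16


print(suggest_m(50_000, 64))        # 8
print(suggest_m(2_000_000, 384))    # 16
print(suggest_m(20_000_000, 384))   # 32
```

Treat the result as a starting point to benchmark, not a final answer.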
The catch is build time. Build time is roughly linear in m. Index size is also linear in m.
Also: ef_construction is at least as important. Bumping from 64 to 128 gets you 2-3 percentage points of recall without changing m. Cheaper to tune than m, in terms of build cost.
My current defaults for new projects:
CREATE INDEX ON docs USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 128);
Query time:
SET hnsw.ef_search = 80;
80 is a reasonable all-rounder. Bump to 120 if recall is critical, drop to 40 if latency is critical.
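To check whether a given ef_search is actually enough, recall has to be measured against exact results. A minimal sketch in Python, assuming you already have two id lists per test query: one from the HNSW index and one from an exact scan (as far as I know, running the same query with enable_indexscan turned off inside a transaction gives the exact answer). recall_at_k and mean_recall are hypothetical helper names, and the ids below are made up.

```python
def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the exact top-k neighbours that the approximate query also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        raise ValueError("exact_ids must not be empty")
    hits = len(exact_top & set(approx_ids[:k]))
    return hits / len(exact_top)


def mean_recall(pairs, k=10):
    """Average recall@k over (approx_ids, exact_ids) pairs, one pair per test query."""
    scores = [recall_at_k(a, e, k) for a, e in pairs]
    return sum(scores) / len(scores)


# Made-up ids for illustration only, not benchmark results.
approx = [1, 2, 3, 4, 6]
exact = [1, 2, 3, 4, 5]
print(recall_at_k(approx, exact, k=5))  # 0.8
```

Average this over a few hundred real queries at each ef_search value before deciding between 40, 80, and 120.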
Future me: don’t start by tuning m. Start with defaults and ef_construction=128. If recall still isn’t high enough, then look at m.