docs/services/prometheus.md

# prometheus

Runbook for Prometheus. Config lives under
`stacks/monitoring/prometheus/`.

## URLs

- Web UI: https://prometheus.home.arpa
- Healthz: https://prometheus.home.arpa/-/healthy
- Ready:   https://prometheus.home.arpa/-/ready
- Metrics: https://prometheus.home.arpa/metrics

## Config reload

Rules and scrape configs are reloaded without a restart via SIGHUP:

```
docker compose -f stacks/monitoring/docker-compose.yml kill -s HUP prometheus
```

A failed reload does not stop Prometheus - it logs an error and keeps the
previous config, usually because a rules file has a syntax problem. Check
the journal, validate with `promtool check rules
stacks/monitoring/prometheus/rules/*.yml` (promtool ships in the Prometheus
image, so `docker exec prometheus promtool ...` also works), fix, and retry.

## Disk pressure

TSDB retention is capped at 50 GiB. If `/prometheus/wal` grows fast, the
usual cause is a new scrape target emitting unbounded label values:

1. Check cardinality: in the UI, Status -> TSDB Status lists the metric
   names and label pairs with the most series.
2. For the worst offender, add a `metric_relabel_configs` drop rule to
   that job in `prometheus.yml`.
3. Reload.
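
Step 2 in practice - a sketch of a drop rule, assuming the offender is a
hypothetical `http_request_duration_seconds_bucket` series from a job named
`myapp` (substitute the metric name TSDB Status actually reports):

```yaml
scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ["myapp:8080"]
    metric_relabel_configs:
      # metric_relabel_configs runs after the scrape but before samples
      # are written, so the drop saves WAL/TSDB space, not just query noise.
      - source_labels: [__name__]
        regex: "http_request_duration_seconds_bucket"
        action: drop
```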

## Target is DOWN

1. `curl -sk https://prometheus.home.arpa/api/v1/targets | jq '.data.activeTargets[]|select(.health=="down")'`
2. Reach the service from the `prometheus` container:
   `docker exec prometheus wget -qO- http://<service>:<port>/metrics | head`
3. If that fails (503, timeout, connection refused), the service itself is
   the problem; see the service-specific runbook in `docs/services/`.
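
The `jq` filter from step 1 can be sanity-checked offline against a stub of
the `/api/v1/targets` payload (the stub below is an assumption, trimmed to
the fields the filter touches):

```shell
# Stub of /api/v1/targets, trimmed to the fields we filter on (assumption).
cat > /tmp/targets.json <<'EOF'
{"data":{"activeTargets":[
  {"health":"up","labels":{"job":"node"},"lastError":""},
  {"health":"down","labels":{"job":"blackbox"},"lastError":"connection refused"}
]}}
EOF

# Same select() as step 1, plus the error message per down target.
jq -r '.data.activeTargets[] | select(.health=="down")
       | "\(.labels.job): \(.lastError)"' /tmp/targets.json
```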

## Restoring after a corrupt WAL

If the container loops restarting with "cannot process WAL":

```
docker compose stop prometheus
sudo rm -rf /srv/homelab/stacks/monitoring/prometheus/data/wal
docker compose up -d prometheus
```

This drops every sample not yet compacted into a block - up to the last
~2-3 hours, not just one scrape interval. Alerting state is rebuilt on the
next rule evaluation from the surviving on-disk blocks.

## Federation

There is no second Prometheus to federate with. If we ever add one, keep
the `remote_write` stanza in a clearly separated section of
`prometheus.yml` (Prometheus has no generic config include mechanism) so
the two setups stay diffable.
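
If a second instance ever materializes, the stanza would look roughly like
this sketch (the URL and label values are placeholders, not a real
deployment):

```yaml
# Hypothetical: ship samples to a second Prometheus / remote store.
remote_write:
  - url: "https://prometheus2.home.arpa/api/v1/write"
    write_relabel_configs:
      # Tag outgoing samples with the source instance so the receiver
      # can tell the two Prometheuses apart.
      - target_label: origin_prometheus
        replacement: prometheus-1
```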

## Capacity numbers

- ~180 series/scrape on average across jobs
- ~2.5k samples/s steady state
- Peak disk write ~400 KiB/s
- 30d retention -> ~22 GiB on disk
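
These figures are mutually consistent, which a quick back-of-envelope check
confirms:

```shell
# ~2.5k samples/s over 30 days, and the implied bytes/sample for ~22 GiB.
awk 'BEGIN {
  s = 2500 * 86400 * 30              # samples in 30 days
  printf "%.1f Gsamples, %.1f bytes/sample\n", s / 1e9, 22 * 2^30 / s
}'
```

~3.6 bytes/sample is above the 1-2 bytes often quoted for compressed
samples; the remainder is plausibly index and churn overhead, so the
50 GiB cap leaves headroom over the 30d figure.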