# prometheus
Runbook for Prometheus. Config lives under
`stacks/monitoring/prometheus/`.
See also: mercemay.top/src/homelab-compose/
## URLs
- Web UI: https://prometheus.home.arpa
- Healthz: https://prometheus.home.arpa/-/healthy
- Ready: https://prometheus.home.arpa/-/ready
- Metrics: https://prometheus.home.arpa/metrics
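Quick check from a shell; `-sk` matches the self-signed/internal-cert assumption used elsewhere in this runbook. Both endpoints return HTTP 200 with a one-line status when the server is up:
```
curl -sk https://prometheus.home.arpa/-/healthy
curl -sk https://prometheus.home.arpa/-/ready
```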
## Config reload
Rules and scrape configs are reloaded without a restart via SIGHUP:
```
docker compose -f stacks/monitoring/docker-compose.yml kill -s HUP prometheus
```
If the reload fails, Prometheus keeps the previous config and logs the
error - usually a syntax problem in a rules file. Check the journal,
fix `stacks/monitoring/prometheus/rules/*.yml`, and retry; the snippet
below validates before you reload.
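`promtool` ships in the official image; the `/etc/prometheus` paths below assume that is where this stack mounts the config inside the container:
```
# checks prometheus.yml plus every rule file it references
docker exec prometheus promtool check config /etc/prometheus/prometheus.yml
# or check the rule files directly (sh -c so the glob expands inside the container)
docker exec prometheus sh -c 'promtool check rules /etc/prometheus/rules/*.yml'
```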
## Disk pressure
TSDB is capped at 50 GiB retention. If `/prometheus/wal` grows fast, it
is usually because a new scrape job emits labels with unbounded values:
1. Look at cardinality: open the UI -> Status -> TSDB status.
2. Find the biggest series and add a `metric_relabel_configs` drop
rule in `prometheus.yml` (see the sketch after this list).
3. Reload.
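A sketch of such a drop rule; the job name and metric regex are placeholders, not anything currently in `prometheus.yml`:
```
scrape_configs:
  - job_name: noisy-service            # hypothetical job emitting high-cardinality metrics
    static_configs:
      - targets: ["noisy-service:9100"]
    metric_relabel_configs:
      # drop the offending series before they reach the TSDB
      - source_labels: [__name__]
        regex: "noisy_histogram_bucket.*"
        action: drop
```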
## Target is DOWN
1. `curl -sk https://prometheus.home.arpa/api/v1/targets | jq '.data.activeTargets[]|select(.health=="down")'`
2. Reach the service from the `prometheus` container:
`docker exec prometheus wget -qO- http://<service>:<port>/metrics | head`
3. If that fails (connection refused, timeout, or a 5xx), the service is
the problem; see the service-specific runbook in `docs/services/`.
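To see why Prometheus marked the scrape down in the first place, the same targets API carries a `lastError` field (field names per the Prometheus v1 API):
```
curl -sk https://prometheus.home.arpa/api/v1/targets \
  | jq -r '.data.activeTargets[] | select(.health=="down") | "\(.scrapeUrl)  \(.lastError)"'
```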
## Restoring after a corrupt WAL
If the container loops restarting with "cannot process WAL":
```
docker compose -f stacks/monitoring/docker-compose.yml stop prometheus
sudo rm -rf /srv/homelab/stacks/monitoring/prometheus/data/wal
docker compose -f stacks/monitoring/docker-compose.yml up -d prometheus
```
Deleting the WAL loses everything not yet compacted into an on-disk
block - up to roughly the last two to three hours, not just one scrape
interval. Once the server is back, alerting rules evaluate against the
remaining blocks plus fresh scrapes.
## Federation
There is no second Prometheus to federate with. If we ever add one,
keep the `remote_write` config in a separate file and merge it into the
main config at deploy time (Prometheus has no include directive) so the
two setups stay diffable.
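For reference, a minimal sketch of the `remote_write` stanza that separate file would hold; the receiver URL is hypothetical:
```
# remote-write.yml (hypothetical) - merged into prometheus.yml at deploy time
remote_write:
  - url: https://prometheus-2.home.arpa/api/v1/write
```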
## Capacity numbers
- ~180 series/scrape on average across jobs
- ~2.5k samples/s steady state
- Peak disk write ~400 KiB/s
- 30d retention -> ~22 GiB on disk