# Runbook
"When X breaks, do Y." The way I keep this up to date is: every
time something goes wrong at home, I fix it, and then I add a
paragraph here so I don't have to think through it again. This is
the document that keeps the homelab sustainable.
Big picture is in
[docs/architecture.md](/src/homelab-compose/docs-architecture-md/).
Backup procedure is
[docs/backup-strategy.md](/src/homelab-compose/docs-backup-strategy-md/).
## General triage
Start here no matter which service looks broken.
1. `docker compose ps` - all services healthy?
2. `df -h` - any disk near full?
3. `uptime` and `dmesg | tail -50` - kernel unhappy?
4. Pi-hole working? If DNS is broken everything looks broken.
5. Check `/var/log/homelab-backup.log` and
`/var/log/homelab-health.log` - the health check may already
have pinpointed it.
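A one-shot sketch of checks 1, 2, 3, and 5 (log paths from this doc):

```sh
docker compose ps
df -h | awk 'NR==1 || $5+0 > 85'   # only show filesystems over 85% full
dmesg | tail -20
tail -5 /var/log/homelab-backup.log /var/log/homelab-health.log
```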
If all green and the problem persists, go to the per-service
sections below.
## Caddy
### Symptom: every service returns 502
Caddy is the only ingress. If it's down, everything is down from
the outside.
```sh
docker compose logs --tail 100 caddy
```
Most common causes:
- Bad Caddyfile change. Check with
`docker compose exec caddy caddy validate --config /etc/caddy/Caddyfile`.
- Expired certs. `/data/certs/home.example.net.crt` older than 90
days? Re-run `renew-cert.sh` and `docker compose restart caddy`.
- Docker bridge flap. `docker network inspect homelab` - are the
  services listed? If not, `docker compose down && docker compose up -d`.
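All three causes as one copy-paste pass. The host-side cert path is an
assumption (wherever `/data` is mounted on your box):

```sh
# 1. Config syntax
docker compose exec caddy caddy validate --config /etc/caddy/Caddyfile
# 2. Cert expiry (adjust the path to wherever /data is mounted)
openssl x509 -enddate -noout -in /srv/data/caddy/certs/home.example.net.crt
# 3. Are the services on the bridge?
docker network inspect homelab \
  --format '{{range .Containers}}{{.Name}} {{end}}'
```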
### Symptom: one service returns 502, others fine
The client is reaching Caddy fine; the upstream isn't responding. Go to
that service's section.
## Jellyfin
### Symptom: transcodes are slow
- Confirm `/dev/dri` is passed through:
`docker compose exec jellyfin ls /dev/dri` should list
`renderD128`.
- Check transcode directory size: `du -sh /srv/data/jellyfin/transcodes`.
  Clean if > 5 GB (sketch after this list).
- `intel_gpu_top` on the host - is the GPU actually being used?
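To clear the cache safely, stop Jellyfin first so an active transcode
isn't holding the files open (a sketch; path from this doc):

```sh
docker compose stop jellyfin
rm -rf /srv/data/jellyfin/transcodes/*
docker compose start jellyfin
```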
### Symptom: library scan stuck
- `docker compose logs --tail 200 jellyfin | grep -i scan`
- Corrupt SQLite is rare but has happened. Stop Jellyfin, run
  `sqlite3 library.db 'PRAGMA integrity_check;'`. If it's bad, restore
  from last night's snapshot (see backup-strategy.md). Sketch below.
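The concrete version. The DB path is an assumption - `library.db` lives
under Jellyfin's config/data directory, wherever that's mounted:

```sh
docker compose stop jellyfin
sqlite3 /srv/data/jellyfin/config/data/library.db 'PRAGMA integrity_check;'
# Expected output: "ok". Anything else: restore from snapshot.
docker compose start jellyfin
```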
### Symptom: DRM-related errors in logs
I don't play DRM content. If you see DRM errors, some metadata
scraper is attaching them from upstream. Usually transient; if
persistent, disable the misbehaving plugin.
## Immich
### Symptom: uploads time out
- Check the ML container is healthy: it's the most common culprit.
`docker compose ps immich-ml` and its logs.
- If the ML container is restarting, it's probably OOM - the model
is pinned in `12fce10` to a smaller variant for this reason.
Larger model versions want more RAM than this box has.
- Upload size limits: Caddy has `request_body max_size 2GB` for
the immich host; not usually the issue.
### Symptom: Postgres won't start
```sh
docker compose logs immich-db
```
- Usually disk full (see "General triage"). Postgres's WAL grows if a
  big upload gets interrupted.
- `docker volume inspect homelab_immich_db_data` to find the
location. Check space there.
- As a last resort, restore the DB from last night's dump. Note `exec`,
  not `run`: the commands must hit the still-running db container, not a
  fresh one with no server in it.

```sh
docker compose stop immich-server immich-ml
# Two -c flags: DROP DATABASE can't run inside the implicit transaction
# a single multi-statement -c would use
docker compose exec immich-db \
  psql -U postgres -c 'DROP DATABASE immich;' -c 'CREATE DATABASE immich;'
cat /srv/backup/$(date -d yesterday +%F)/immich.sql \
  | docker compose exec -T immich-db psql -U postgres immich
docker compose up -d
```
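Afterwards, a quick sanity check that the dump actually loaded:

```sh
# An empty table list means the restore didn't take
docker compose exec immich-db psql -U postgres -d immich -c '\dt'
```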
## Paperless
### Symptom: documents stuck in "pending"
- The consume directory is `/srv/docs/consume`. If files are sitting
there, Paperless's worker isn't running.
- `docker compose logs paperless | grep -i consumer`.
- Usually a permissions problem on a file dropped by a different
user: `chown -R 1000:1000 /srv/docs/consume`.
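To find the offenders before blanket-chowning (uid 1000 is the container
user in this setup):

```sh
# List anything in the consume dir not owned by uid 1000
find /srv/docs/consume ! -user 1000 -ls
chown -R 1000:1000 /srv/docs/consume
```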
### Symptom: OCR on a single file fails forever
- Paperless retries with backoff. Find the file in the admin UI
and delete it; re-drop the source file. Once in a blue moon a
PDF has something OCR can't handle and you end up with a
poison-pill entry.
## Gitea
### Symptom: `git push` hangs
- Gitea's SSH listener is on a high port (2222), forwarded to the
container. If that port is blocked, pushes hang instead of
failing. Confirm from another machine: `nc -zv nuc 2222`.
- Check Gitea logs for "lfs" errors - LFS storage may have filled.
`du -sh /srv/data/gitea/lfs` and prune if needed.
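Both checks as copy-paste (hostname and paths from this doc):

```sh
nc -zv nuc 2222                                    # SSH port reachable?
docker compose logs --tail 200 gitea | grep -i lfs # LFS errors?
du -sh /srv/data/gitea/lfs                         # LFS storage size
```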
### Symptom: can't clone HTTPS
- Caddy issue, not Gitea. See Caddy section.
## Syncthing
### Symptom: a folder is stuck "out of sync"
- Go to the folder in the UI, look at "Errors."
- Usual suspects:
- Permissions on the shared folder got flipped by an editor:
`chown -R 1000:1000 /srv/sync`.
- A conflict file needs manual resolution.
- A peer is offline and has the only copy of a file.
### Symptom: discovery doesn't find peers
- Syncthing relies on public discovery servers and relays. Sync traffic
  runs on 22000 TCP and UDP; local discovery uses 21027/UDP. Confirm
  LAN egress on those ports.
- Force re-announce: `docker compose restart syncthing`.
## Pi-hole
### Symptom: all DNS stopped working
The panic situation. If Pi-hole is down, nothing resolves.
- `docker compose ps pihole` - is it running?
- `dig @192.168.1.10 google.com` from another machine.
- Fallback: point your router's DNS at 9.9.9.9 temporarily.
- `docker compose restart pihole`. 99% of the time this is it.
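To separate "Pi-hole is down" from "upstream DNS is down" (IPs from
this doc):

```sh
dig @192.168.1.10 google.com +short   # does Pi-hole answer?
dig @9.9.9.9 google.com +short        # does upstream answer at all?
```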
### Symptom: Pi-hole blocking something it shouldn't
- Admin UI, "Tools" -> "Query Log" to see which domain and list.
- Whitelist the domain, not the IP.
- Update blocklists manually: `pihole -g` inside the container.
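The same as commands, assuming the Pi-hole v5 CLI (v6 moved some of
this to the API):

```sh
docker compose exec pihole pihole -w example.com   # whitelist a domain
docker compose exec pihole pihole -g               # rebuild blocklists
```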
## Prometheus / Grafana
### Symptom: Grafana shows "no data"
- Prometheus container healthy? Scrape targets up?
`metrics.home.example.net/targets` in browser.
- Check time sync on the NUC. Clock skew between Grafana and Prometheus
  of more than a few seconds shows up as gaps.
  `systemctl status systemd-timesyncd`.
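Quick clock check on the host:

```sh
timedatectl   # look for "System clock synchronized: yes"
```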
### Symptom: Prometheus disk filling
- TSDB retention default is 15 days. Check
`/srv/data/prometheus/wal` size.
- If over, reduce `--storage.tsdb.retention.time` or prune heavy
metrics (Jellyfin emits a lot; node_exporter per-CPU series add
up on bigger boxes).
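To see current usage and confirm which retention flag the running
container actually has:

```sh
du -sh /srv/data/prometheus /srv/data/prometheus/wal
# Print the args Prometheus was started with
docker inspect --format '{{.Args}}' "$(docker compose ps -q prometheus)"
```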
## Docker host issues
### Symptom: container can't start, "no space left on device"
- Usually `/var/lib/docker` is on the same volume as something
filling up. `docker system df`.
- `docker image prune` to clean dangling.
- `docker system prune -a --volumes` if you're brave; this will
wipe things, including any volume not referenced by a running
container.
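A safer escalation ladder than jumping straight to `system prune`:

```sh
docker system df -v | head -30   # what is actually using the space?
docker image prune -f            # dangling images only: always safe
docker builder prune -f          # build cache: safe
# Last resort -- reread the warning above first:
# docker system prune -a --volumes
```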
### Symptom: container keeps OOMing
- `docker compose logs <svc> | grep -i killed`.
- Check per-container limits in
[`docker-compose.yml`](/src/homelab-compose/docker-compose-yml/).
Jellyfin's transcode buffers and Immich's ML are the two that
hit limits.
- Host RAM: `free -h`. If the whole box is out, increase the limit
of the biggest offender or stop something.
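To see live memory use against the compose limits:

```sh
docker stats --no-stream \
  --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'
```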
## UPS / power
### Symptom: UPS alarming
- `upsc serverups` on the host. Look at `ups.status`.
- `OB` = on battery. Top concern: graceful shutdown if it goes on
for long.
- `LB` = low battery. NUT is configured to `shutdown -h now` at
10% remaining. Confirm with `cat /etc/nut/upsmon.conf`.
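Useful NUT variables (UPS name from this doc):

```sh
upsc serverups ups.status      # OL=online, OB=on battery, LB=low battery
upsc serverups battery.charge  # percent remaining
upsc serverups battery.runtime # estimated seconds of runtime left
```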
### Symptom: UPS just switched to mains from battery
- Power blip. Nothing to do.
- Recurring? Check wiring, consider a better battery.
## Storage / RAID
### Symptom: RAID reports a failing drive
`btrfs device stats` prints per-device error counters:

```sh
btrfs device stats /srv/media
```

If non-zero, scrub:

```sh
btrfs scrub start /srv/media
```

If scrub errors climb, the drive is going. Replace it with:

```sh
btrfs replace start <old-devid> /dev/sdX /srv/media
```
Do not panic; RAID1 gives us time. Confirm backups are running (see
backup-strategy.md) before doing anything invasive.
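Scrub runs in the background; check progress and error counts with:

```sh
btrfs scrub status /srv/media
```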
### Symptom: filesystem read-only
btrfs goes read-only on severe errors. Check `journalctl -k | tail -100`.
Usually it's metadata corruption. Stop all services, run
`btrfs check --readonly /dev/sdX`, and decide whether to risk
`btrfs check --repair` (dangerous) or reformat and restore from backup.
I have never had to. Documenting the path anyway.
## Backups
See [backup-strategy.md](/src/homelab-compose/docs-backup-strategy-md/).
Basic checks:
```sh
tail -50 /var/log/homelab-backup.log   # did it run?
ls -la /srv/backup/ | tail -5          # snapshot exists?
rclone size b2:homelab-backup          # remote in sync?
```
### Symptom: B2 upload failing
- Check rclone credentials in
[`.env`](/src/homelab-compose/env-example/).
- B2 bucket retention policy - maybe you hit the "keep only N"
cap. Adjust in the B2 UI or in the backup script's retention
policy.
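A quick auth test that fails fast if the key in `.env` is wrong:

```sh
rclone lsd b2:homelab-backup   # lists the bucket, or errors on bad creds
```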
## Common "everything went wrong" sequence
A real one I've used:
1. NUC came back from a power cut with no network.
2. `systemctl status networking` - dhcp client silently dead.
`dhclient enp0s31f6` brought it back.
3. Pi-hole then came up, and everything else's health checks stopped
   failing with name-not-resolved.
4. Immich needed a couple of minutes for its DB to settle.
5. Run `scripts/health-check.sh` to confirm all green.
Total time maybe 10 minutes. Knowing the order made the difference;
before I wrote this runbook I used to poke services that couldn't
work because DNS was still down.
## Adding to this runbook
New incident? Paragraph goes:
```markdown
### Symptom: <what the user sees>
- What to check first
- What the usual cause has been
- How to fix it
- (Optional) how to prevent it recurring
```
Keep it short. Every word here is maintenance debt.
## Quarterly checks
Once a quarter, I walk through:
- Restore last month's B2 snapshot to `/tmp/restore` and diff a
  handful of files (sketch at the end of this section). See
  [backup-strategy.md](/src/homelab-compose/docs-backup-strategy-md/).
- `btrfs scrub start` on both RAIDs. Takes a few hours.
- Review `docker compose config` for any old image tags, remove
unused services, bump compose file schema if needed.
- Read this runbook end to end. Any step that no longer applies?
Remove it.
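A sketch of the restore spot-check. The snapshot layout on B2 is an
assumption; use whatever backup-strategy.md actually produces:

```sh
rclone lsd b2:homelab-backup                          # pick a snapshot
rclone copy b2:homelab-backup/<snapshot> /tmp/restore
diff -r -q /tmp/restore /srv/backup/<snapshot> | head
```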
## Things that used to be problems
Kept for reference:
- Watchtower auto-updated Paperless to a breaking release mid-day.
Removed in `5512de8`. Not doing it again.
- Grafana dashboards were unprotected on LAN. Added basic_auth in
`9a0bdf4`.
- Health check exited 0 on warning, so cron's MAILTO didn't fire.
Now exits 2 on warn (`eab2c71`), mail arrives.