docs/runbook.md

# Runbook

"When X breaks, do Y." The way I keep this up to date is: every
time something goes wrong at home, I fix it, and then I add a
paragraph here so I don't have to think through it again. This is
the document that keeps the homelab sustainable.

Big picture is in
[docs/architecture.md](/src/homelab-compose/docs-architecture-md/).
Backup procedure is
[docs/backup-strategy.md](/src/homelab-compose/docs-backup-strategy-md/).

## General triage

Start here no matter which service looks broken.

1. `docker compose ps` - all services healthy?
2. `df -h` - any disk near full?
3. `uptime` and `dmesg | tail -50` - kernel unhappy?
4. Pi-hole working? If DNS is broken everything looks broken.
5. Check `/var/log/homelab-backup.log` and
   `/var/log/homelab-health.log` - the health check may already
   have pinpointed it.

If all green and the problem persists, go to the per-service
sections below.
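
The same pass as one copy-paste block (the 85% threshold is my own
habit, not something the health check enforces):

    docker compose ps                        # anything unhealthy or restarting?
    df -h | awk 'NR==1 || $5+0 > 85'         # filesystems above 85% used
    dmesg | tail -50                         # recent kernel complaints
    dig @192.168.1.10 google.com +short      # is Pi-hole answering?
    tail -20 /var/log/homelab-health.log     # did the health check already flag it?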

## Caddy

### Symptom: every service returns 502

Caddy is the only ingress. If it's down, everything is down from
the outside.

    docker compose logs --tail 100 caddy

Most common causes:

- Bad Caddyfile change. Check with
  `docker compose exec caddy caddy validate --config /etc/caddy/Caddyfile`.
- Expired certs. `/data/certs/home.example.net.crt` older than 90
  days? Re-run `renew-cert.sh` and `docker compose restart caddy`.
- Docker bridge flap. `docker network inspect homelab` - services
  listed? If not, `docker compose down && docker compose up -d`.
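
For the first cause (a bad Caddyfile), the safe order is validate
first, then reload in place instead of restarting. A minimal sketch:

    docker compose exec caddy caddy validate --config /etc/caddy/Caddyfile
    # only if validate passes:
    docker compose exec caddy caddy reload --config /etc/caddy/Caddyfile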

### Symptom: one service returns 502, others fine

The request is reaching Caddy fine; the upstream isn't responding.
Go to that service's section.

## Jellyfin

### Symptom: transcodes are slow

- Confirm `/dev/dri` is passed through:
  `docker compose exec jellyfin ls /dev/dri` should list
  `renderD128`.
- Check transcode directory size: `du -sh /srv/data/jellyfin/transcodes`.
  Clean if > 5 GB.
- `intel_gpu_top` on the host - is the GPU actually being used?

### Symptom: library scan stuck

- `docker compose logs --tail 200 jellyfin | grep -i scan`
- Corrupt SQLite database: rare, but it has happened. Stop Jellyfin,
  run `sqlite3 library.db 'PRAGMA integrity_check;'` (sketch below).
  If it's bad, restore from last night's snapshot (see
  backup-strategy.md).
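
The integrity check, spelled out. The library.db path here is a
guess; check where the config volume is actually mounted before
running it:

    docker compose stop jellyfin
    sqlite3 /srv/data/jellyfin/config/data/library.db 'PRAGMA integrity_check;'
    docker compose start jellyfin    # only if the check came back "ok"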

### Symptom: DRM-related errors in logs

I don't play DRM content. If you see DRM errors, some metadata
scraper is attaching them from upstream. Usually transient; if
persistent, disable the misbehaving plugin.

## Immich

### Symptom: uploads time out

- Check the ML container is healthy: it's the most common culprit.
  `docker compose ps immich-ml` and its logs.
- If the ML container is restarting, it's probably OOM - the model
  is pinned in `12fce10` to a smaller variant for this reason.
  Larger model versions want more RAM than this box has.
- Upload size limits: Caddy has `request_body max_size 2GB` for
  the immich host; not usually the issue.
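
To confirm the OOM theory rather than guess, ask Docker whether the
ML container was actually OOM-killed (service name as above; adjust
if `docker compose ps` says otherwise):

    docker inspect -f 'OOMKilled={{.State.OOMKilled}} restarts={{.RestartCount}}' \
      $(docker compose ps -q immich-ml)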

### Symptom: Postgres won't start

    docker compose logs immich-db

- Usually disk full (see "General triage"). Immich's WAL grows if
  you interrupt a big upload.
- `docker volume inspect homelab_immich_db_data` to find the
  location. Check space there.
- As a last resort, restore the DB from last night's dump:

        docker compose stop immich-server immich-ml
        docker compose run --rm immich-db \
          psql -U postgres -c 'DROP DATABASE immich; CREATE DATABASE immich;'
        cat /srv/backup/$(date -d yesterday +%F)/immich.sql \
          | docker compose run --rm -T immich-db psql -U postgres immich
        docker compose up -d
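
After a restore I do a quick sanity check before calling it done - a
sketch, assuming the same psql access the restore commands above use:

    docker compose exec immich-db psql -U postgres immich -c '\dt' | head -20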

## Paperless

### Symptom: documents stuck in "pending"

- The consume directory is `/srv/docs/consume`. If files are sitting
  there, Paperless's worker isn't running.
- `docker compose logs paperless | grep -i consumer`.
- Usually a permissions problem on a file dropped by a different
  user: `chown -R 1000:1000 /srv/docs/consume`.
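
Check and fix in one pass (the restart at the end just kicks the
consumer if it doesn't notice the ownership change on its own):

    ls -la /srv/docs/consume                 # who owns the stuck files?
    chown -R 1000:1000 /srv/docs/consume
    docker compose restart paperless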

### Symptom: OCR on a single file fails forever

- Paperless retries with backoff. Find the file in the admin UI
  and delete it; re-drop the source file. Once in a blue moon a
  PDF has something OCR can't handle and you end up with a
  poison-pill entry.

## Gitea

### Symptom: `git push` hangs

- Gitea's SSH listener is on a high port (2222), forwarded to the
  container. If that port is blocked, pushes hang instead of
  failing. Confirm from another machine: `nc -zv nuc 2222`.
- Check Gitea logs for "lfs" errors - LFS storage may have filled.
  `du -sh /srv/data/gitea/lfs` and prune if needed.
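
From the host itself, a quick check that the forward is actually
listening and that LFS hasn't filled (if nothing is listening, the
port mapping in the compose file is the place to look):

    ss -tlnp | grep 2222                     # anything listening for the forward?
    du -sh /srv/data/gitea/lfs               # has LFS storage filled?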

### Symptom: can't clone HTTPS

- Caddy issue, not Gitea. See Caddy section.

## Syncthing

### Symptom: a folder is stuck "out of sync"

- Go to the folder in the UI, look at "Errors."
- Usual suspects:
  - Permissions on the shared folder got flipped by an editor:
    `chown -R 1000:1000 /srv/sync`.
  - A conflict file needs manual resolution.
  - A peer is offline and has the only copy of a file.
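
Conflict copies are easy to miss; Syncthing names them
`*.sync-conflict-*`. A sketch for finding them and fixing the
permissions case:

    find /srv/sync -name '*.sync-conflict-*'
    chown -R 1000:1000 /srv/sync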

### Symptom: discovery doesn't find peers

- Syncthing relies on public discovery servers (and relays when a
  direct connection fails). Confirm egress on port 22000 TCP and UDP
  for data, and that outbound HTTPS isn't blocked for global
  discovery.
- Force re-announce: `docker compose restart syncthing`.

## Pi-hole

### Symptom: all DNS stopped working

The panic situation. If Pi-hole is down, nothing resolves.

- `docker compose ps pihole` - is it running?
- `dig @192.168.1.10 google.com` from another machine.
- Fallback: point your router's DNS at 9.9.9.9 temporarily.
- `docker compose restart pihole`. 99% of the time this is it.
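
To tell a dead Pi-hole apart from a dead network, test the resolver
and an upstream side by side:

    dig @192.168.1.10 google.com +short      # Pi-hole
    dig @9.9.9.9 google.com +short           # upstream, bypassing Pi-hole
    docker compose restart pihole            # the usual fix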

### Symptom: Pi-hole blocking something it shouldn't

- Admin UI, "Tools" -> "Query Log" to see which domain and list.
- Whitelist the domain, not the IP.
- Update blocklists manually: `pihole -g` inside the container.
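
The same from the CLI, assuming the compose service is named
`pihole`; the domain below stands in for whichever one got blocked:

    docker compose exec pihole pihole -w some.blocked-domain.example   # whitelist
    docker compose exec pihole pihole -g                               # refresh blocklists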

## Prometheus / Grafana

### Symptom: Grafana shows "no data"

- Prometheus container healthy? Scrape targets up?
  `metrics.home.example.net/targets` in browser.
- Check time sync on the NUC. Grafana vs Prometheus skew > a few
  seconds shows as gaps. `systemctl status systemd-timesyncd`.
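
The targets check works from the CLI too, assuming Prometheus is on
its default port 9090 on this host:

    curl -s http://localhost:9090/api/v1/targets \
      | jq -r '.data.activeTargets[] | "\(.labels.job)\t\(.health)"'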

### Symptom: Prometheus disk filling

- TSDB retention default is 15 days. Check
  `/srv/data/prometheus/wal` size.
- If over, reduce `--storage.tsdb.retention.time` or prune heavy
  metrics (Jellyfin emits a lot; node_exporter per-CPU series add
  up on bigger boxes).
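
To see which metric names carry the most series before deciding what
to prune (default port assumed again):

    curl -s http://localhost:9090/api/v1/status/tsdb \
      | jq '.data.seriesCountByMetricName'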

## Docker host issues

### Symptom: container can't start, "no space left on device"

- Usually `/var/lib/docker` is on the same volume as something
  filling up. `docker system df`.
- `docker image prune` to clean dangling.
- `docker system prune -a --volumes` if you're brave; this will
  wipe things, including unused images and any volume not attached
  to any container, running or stopped.

### Symptom: container keeps OOMing

- `docker compose logs <svc> | grep -i killed`.
- Check per-container limits in
  [`docker-compose.yml`](/src/homelab-compose/docker-compose-yml/).
  Jellyfin's transcode buffers and Immich's ML are the two that
  hit limits.
- Host RAM: `free -h`. If the whole box is out, increase the limit
  of the biggest offender or stop something.
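
To see who's eating the RAM right now, and whether anything was
actually OOM-killed, rather than guessing:

    docker stats --no-stream --format 'table {{.Name}}\t{{.MemUsage}}\t{{.MemPerc}}'
    docker inspect -f '{{.Name}} OOMKilled={{.State.OOMKilled}}' $(docker ps -aq) | grep true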

## UPS / power

### Symptom: UPS alarming

- `upsc serverups` on the host. Look at `ups.status`.
- `OB` = on battery. Top concern: graceful shutdown if it goes on
  for long.
- `LB` = low battery. NUT is configured to `shutdown -h now` at
  10% remaining. Confirm with `cat /etc/nut/upsmon.conf`.
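
Just the fields that matter, assuming the UPS is registered in NUT
as `serverups` like above:

    upsc serverups 2>/dev/null | grep -E '^(ups\.status|battery\.charge|battery\.runtime):'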

### Symptom: UPS just switched to mains from battery

- Power blip. Nothing to do.
- Recurring? Check wiring, consider a better battery.

## Storage / RAID

### Symptom: RAID reports a failing drive

`btrfs device stats` prints per-device error counters.

    btrfs device stats /srv/media

If non-zero, scrub:

    btrfs scrub start /srv/media

If scrub errors climb, the drive is going. Replace with:

    btrfs replace start <old-devid> /dev/sdX /srv/media

Do not panic; RAID1 gives us time. Confirm backups are running (see
backup-strategy.md) before doing anything invasive.
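
Both scrub and replace report progress, so there's no need to guess
how far along they are:

    btrfs scrub status /srv/media
    btrfs replace status /srv/media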

### Symptom: filesystem read-only

btrfs goes read-only on severe errors. `journalctl -k | tail -100`.
Usually means a metadata corruption. Stop all services, run
`btrfs check --readonly /dev/sdX`, and decide whether to `btrfs
check --repair` (dangerous) or reformat and restore from backup.

I have never had to. Documenting the path anyway.

## Backups

See [backup-strategy.md](/src/homelab-compose/docs-backup-strategy-md/).
Basic checks:

    tail -50 /var/log/homelab-backup.log        # did it run?
    ls -la /srv/backup/ | tail -5                # snapshot exists?
    rclone size b2:homelab-backup                # remote in sync?

### Symptom: B2 upload failing

- Check rclone credentials in
  [`.env`](/src/homelab-compose/env-example/).
- B2 bucket retention policy - maybe you hit the "keep only N"
  cap. Adjust in the B2 UI or in the backup script's retention
  policy.
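
A quick way to separate "credentials are wrong" from "retention ate
the snapshot" (same remote name the backup script uses):

    rclone lsd b2:homelab-backup             # does auth work at all?
    rclone size b2:homelab-backup            # roughly matches /srv/backup?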

## Common "everything went wrong" sequence

A real one I've used:

1. NUC came back from a power cut with no network.
2. `systemctl status networking` - dhcp client silently dead.
   `dhclient enp0s31f6` brought it back.
3. Pi-hole then came up, which stopped everything else's health
   checks from failing with name-not-resolved errors.
4. Immich needed a couple of minutes for its DB to settle.
5. Run `scripts/health-check.sh` to confirm all green.

Total time maybe 10 minutes. Knowing the order made the difference;
before I wrote this runbook I used to poke services that couldn't
work because DNS was still down.

## Adding to this runbook

New incident? The new paragraph follows this shape:

    ### Symptom: <what the user sees>
    - What to check first
    - What the usual cause has been
    - How to fix it
    - (Optional) how to prevent it recurring

Keep it short. Every word here is maintenance debt.

## Quarterly checks

Once a quarter, I walk through:

- Restore last month's B2 snapshot to `/tmp/restore` and diff a
  handful of files. See
  [backup-strategy.md](/src/homelab-compose/docs-backup-strategy-md/).
- `btrfs scrub start` on both RAIDs. Takes a few hours.
- Review `docker compose config` for any old image tags, remove
  unused services, bump compose file schema if needed.
- Read this runbook end to end. Any step that no longer applies?
  Remove it.

## Things that used to be problems

Kept for reference:

- Watchtower auto-updated Paperless to a breaking release mid-day.
  Removed in `5512de8`. Not doing it again.
- Grafana dashboards were unprotected on LAN. Added basic_auth in
  `9a0bdf4`.
- Health check exited 0 on warning, so cron's MAILTO didn't fire.
  Now exits 2 on warn (`eab2c71`), mail arrives.