docs/backup-strategy.md

# Backup Strategy

The 3-2-1 rule: three copies, two kinds of media, one offsite. For
a single-NUC homelab the specific mapping is:

1. Primary storage on btrfs RAID1 (live data)
2. Local snapshot backups on NVMe via rsync --link-dest
3. Offsite copy on Backblaze B2 via rclone

This doc describes what gets backed up, how, how long it's kept,
and how I verify restores actually work.

Restore specifics are in
[docs/runbook.md](/src/homelab-compose/docs-runbook-md/); the
overall architecture is in
[docs/architecture.md](/src/homelab-compose/docs-architecture-md/).

## What gets backed up

Three categories:

### Irreplaceable

- Paperless document archive (`/srv/docs/archive`)
- Immich originals (`/srv/photos/upload`)
- Gitea repositories and LFS (`/srv/data/gitea/git`, `/srv/data/gitea/lfs`)
- Syncthing shared folder (`/srv/sync`)
- Postgres dumps for Immich, Paperless, Gitea

### Reproducible but annoying

- Jellyfin library database (`/srv/data/jellyfin/library.db`) - I
  could rebuild by re-scanning, but that takes hours and loses
  watched state.
- Pi-hole blocklist and custom entries.
- Grafana dashboards.

### Don't bother

- Media library (Jellyfin catalog) itself. I can re-rip from discs
  or re-download. Backing up 4 TB of video to B2 at ~$5/TB/month
  is not worth it.
- Docker container filesystems. Images are pulled from the
  registry; volumes map to `/srv/data`.

The distinction matters because everything in "irreplaceable" gets
backed up nightly with aggressive retention; "reproducible" gets a
looser schedule; "don't bother" doesn't get touched.

## Nightly flow

Cron triggers
[`scripts/backup.sh`](/src/homelab-compose/scripts-backup-sh/) at
03:15.

    15 3 * * * /srv/homelab/scripts/backup.sh >> /var/log/homelab-backup.log 2>&1

The script's steps:

1. **Stop nothing.** Everything is hot-backed. Postgres dumps use
   `pg_dump`, not filesystem copies.
2. **Dump databases.** `docker compose exec immich-db pg_dump ...`
   writes to `/srv/backup/current/dumps/immich.sql`. Same for
   Paperless and Gitea.
3. **Rsync data directories.** `rsync -aH --delete --link-dest=...`
   into `/srv/backup/YYYY-MM-DD/`. The `--link-dest` hardlinks
   unchanged files against yesterday's snapshot, so a 30-day
   retention of 500 GB of data is ~550 GB of disk, not 15 TB.
4. **Rclone to B2.** `rclone copy /srv/backup/current/
   crypt:homelab-backup/YYYY-MM-DD/` uploads the newest snapshot
   through the crypt remote (see Encryption below). rclone
   compares hashes and skips files the destination already has.
5. **Prune locally and remote.** The `30 daily + 12 monthly`
   retention rules are enforced in both places via `rclone
   lsjson`-parsed age checks. Commit `77d1a46` switched this from
   filename-date parsing to lsjson, which handles timezones
   correctly.
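Compressed into a sketch, the flow looks roughly like the
following. This is not the real `scripts/backup.sh`; the paths,
container names, and the `DRY_RUN` guard are my assumptions:

```shell
#!/usr/bin/env bash
# Rough sketch of the nightly flow; DRY_RUN=1 (the default here)
# prints each command instead of running it.
set -euo pipefail
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

ROOT=/srv/backup
TODAY=$(date +%F)                  # snapshot dir name, YYYY-MM-DD
PREV=$(date -d yesterday +%F)      # hardlink target for --link-dest

# 2. Hot database dumps - pg_dump, no services stopped
for svc in immich paperless gitea; do
  run docker compose exec -T "${svc}-db" pg_dump -U postgres "$svc"
  # the real script redirects into $ROOT/current/dumps/$svc.sql
done

# 3. Hardlinked snapshot against yesterday's tree
run rsync -aH --delete --link-dest="$ROOT/$PREV/" \
  /srv/data/ "$ROOT/$TODAY/data/"

# 4. Push the newest snapshot offsite through the crypt remote
run rclone copy "$ROOT/current/" "crypt:homelab-backup/$TODAY/"
```

Running it with `DRY_RUN=0` would execute the commands for real;
the default just prints them, which is handy for eyeballing the
paths before the first real run.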

The log output is noisy, but each step emits a single line, so a
`grep FAIL` is enough for the health checks.

## Retention policy

`3b7a2c9` settled the numbers after a couple of revisions:

- **Local**: 30 daily + 12 monthly. First-of-month snapshots are
  pinned (created with a different file name suffix that the prune
  step skips).
- **B2**: 30 daily + 12 monthly. Same policy, different machine.
- **Verification**: quarterly restore drill; see below.

That gives me:
- 30 days of granular recovery
- a year of monthly history for "when did this file disappear"
  questions
- total local footprint about 600 GB
- total B2 footprint about 350 GB (dumps are compressed before
  upload and rclone skips files the remote already has, so the
  numbers don't match)
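The keep/drop decision behind those numbers can be sketched as a
small predicate. Snapshot names are assumed to be `YYYY-MM-DD`
here; the real prune step parses `rclone lsjson` timestamps rather
than names:

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the retention decision: keep the last 30
# dailies, plus first-of-month snapshots from the last 12 months.
set -euo pipefail

should_keep() {  # $1 = snapshot date YYYY-MM-DD, $2 = today YYYY-MM-DD
  local snap=$1 today=$2
  local age=$(( ($(date -d "$today" +%s) - $(date -d "$snap" +%s)) / 86400 ))
  if [ "$age" -le 30 ]; then return 0; fi                      # daily window
  case $snap in *-01) [ "$age" -le 366 ] && return 0 ;; esac   # pinned monthly
  return 1
}

should_keep 2024-05-20 2024-06-01 && echo keep || echo drop   # 12 days: keep
should_keep 2024-03-15 2024-06-01 && echo keep || echo drop   # stale daily: drop
should_keep 2024-01-01 2024-06-01 && echo keep || echo drop   # monthly: keep
```

The real script also has to handle the pinned-suffix naming
mentioned above; this sketch folds that into the `*-01` check for
brevity.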

## Encryption

B2 uploads go through rclone's crypt backend. Two layers:

- `b2:homelab-backup-crypt` - the raw bucket (visible, with random
  encrypted filenames)
- `crypt:homelab-backup` - the crypt remote layered on top

My backup script uses `crypt:homelab-backup`. The passphrase and
salt are in `rclone.conf` on the NUC, which is itself backed up to
a password manager. Losing the NUC and the password manager at the
same time means losing backups; I accept this.
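For reference, the `rclone.conf` shape for this layering looks
roughly like this. The remote names match the ones above, but
every value is a placeholder, and `password`/`password2` are the
rclone-obscured passphrase and salt, not plaintext:

```ini
[b2]
type = b2
account = <application-key-id>
key = <application-key>

[crypt]
type = crypt
remote = b2:homelab-backup-crypt
filename_encryption = standard
directory_name_encryption = true
password = <obscured-passphrase>
password2 = <obscured-salt>
```

Anything written to a `crypt:` path is encrypted client-side
before it reaches the bucket; the bucket only ever sees random
names and ciphertext.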

Local snapshots are on-disk plaintext. The NUC disk is LUKS
encrypted at rest.

## Restore

### Single file

    cp /srv/backup/YYYY-MM-DD/relative/path ~

or from B2 directly:

    rclone copy crypt:homelab-backup/YYYY-MM-DD/relative/path ~/restore/

### Whole service

1. Stop the service: `docker compose stop <svc>`.
2. Replace its data directory from a snapshot:

        rsync -aH --delete /srv/backup/YYYY-MM-DD/data/<svc>/ \
          /srv/data/<svc>/

3. If the service has a Postgres DB, restore the dump:

        docker compose up -d <svc>-db
        cat /srv/backup/YYYY-MM-DD/dumps/<svc>.sql \
          | docker compose exec -T <svc>-db psql -U postgres <svc>

4. Start: `docker compose up -d <svc>`.

Test this against a throwaway service name at least once before
you need it in anger.
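A hypothetical wrapper for those steps, with a dry-run guard so it
can be exercised safely first. The paths are assumptions, and the
Postgres restore is left to step 3 above:

```shell
#!/usr/bin/env bash
# Hypothetical whole-service restore wrapper. DRY_RUN=1 (default)
# prints the commands instead of running them.
set -euo pipefail
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = 1 ]; then echo "+ $*"; else "$@"; fi; }

restore_service() {  # $1 = service name, $2 = snapshot date
  local svc=$1 snap=/srv/backup/$2
  run docker compose stop "$svc"
  run rsync -aH --delete "$snap/data/$svc/" "/srv/data/$svc/"
  # if the service has a Postgres DB, restore the dump (step 3)
  run docker compose up -d "$svc"
}

restore_service scratch-svc 2024-06-01   # dry run, throwaway name
```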

### Whole NUC

The worst case: NUC is a brick. Rebuild path:

1. Install Debian stable on new hardware.
2. Install docker + docker compose, enable cron, restore
   `/etc/cron.d/homelab`, `/etc/docker/daemon.json`.
3. Mount storage (new disks: restore `/srv/data` from B2; old disks:
   just mount them).
4. `git clone` this repo into `/srv/homelab`.
5. `cp .env.example .env` and fill in.
6. `docker compose up -d`.
7. For each service that doesn't auto-restore from volume:
   run the restore flow above.

I timed this once in a VM: about three hours from a blank image.

## Verifying restores actually work

Backups you haven't tested aren't backups. Quarterly drill:

1. Pick a date 2-8 weeks ago.
2. `rclone copy crypt:homelab-backup/YYYY-MM-DD/ /tmp/restore/` -
   confirm B2 is readable.
3. Spot-check 10 random files for byte equality against current
   state (assuming the file hasn't changed) or plausible shape
   (if it has).
4. Load `/tmp/restore/dumps/immich.sql` into a scratch Postgres
   container with `psql -f`, then check row counts.
5. Delete `/tmp/restore`.
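Step 3 can be scripted. A sketch (the random sampling and the
skip-if-missing behaviour are my assumptions) that byte-compares a
sample of restored files against the live tree:

```shell
#!/usr/bin/env bash
# Hypothetical spot-check: pick N random files from the restored
# tree and byte-compare each against its live counterpart.
set -euo pipefail

spot_check() {  # $1 = restored tree, $2 = live tree, $3 = sample size
  local restored=$1 live=$2 n=$3 rel fails=0
  while IFS= read -r f; do
    rel=${f#"$restored"/}
    # files missing from the live tree are skipped - they may have
    # legitimately changed or moved since the snapshot
    if [ -e "$live/$rel" ] && ! cmp -s "$f" "$live/$rel"; then
      echo "DIFFERS: $rel"; fails=$((fails + 1))
    fi
  done < <(find "$restored" -type f | shuf -n "$n")
  return "$fails"
}
```

Something like `spot_check /tmp/restore/data /srv/data 10` would
cover the drill; any `DIFFERS` line then needs the "byte equality
vs plausible shape" judgment call by hand.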

Quarterly, not monthly, because B2 egress costs a few dollars per
drill. Worth it.

## What can go wrong

### The backup script fails silently

Every step is wrapped:

    if ! rsync ...; then
      echo "FAIL: rsync failed at $(date)"
      exit 1
    fi

Cron's `MAILTO` is configured to mail on non-zero exit. An hourly
health check also verifies that the most recent snapshot is less
than 26 hours old and alerts if not.
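The freshness check can be sketched like this (the backup root and
the alert wiring are assumptions; snapshot dirs sort
chronologically because they're named `YYYY-MM-DD`):

```shell
#!/usr/bin/env bash
# Hypothetical freshness check: fail if the newest snapshot dir
# under the backup root is older than the given number of hours.
set -euo pipefail

snapshot_fresh() {  # $1 = backup root, $2 = max age in hours
  local newest
  newest=$(find "$1" -mindepth 1 -maxdepth 1 -type d 2>/dev/null \
    | sort | tail -n 1)
  [ -n "$newest" ] || return 1           # no snapshots at all
  local age=$(( $(date +%s) - $(stat -c %Y "$newest") ))
  [ "$age" -le $(( $2 * 3600 )) ]
}

snapshot_fresh /srv/backup 26 || echo "FAIL: backup stale"
```

The `FAIL:` prefix keeps it compatible with the `grep FAIL`
convention the rest of the logging uses.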

### B2 credentials rotate

I rotate the app key yearly (calendar reminder). Update
`rclone.conf` and test with `rclone ls crypt:homelab-backup`. If
the credentials become stale, the script fails and the health
check catches it.

### btrfs RAID1 silently corrupts

A btrfs scrub every two weeks detects bit rot, and on RAID1 it can
usually repair from the good mirror copy. If a scrub turns up
damage anyway, B2 and the local snapshots are unaffected: the
snapshots are independent copies on the backup NVMe, and
`--link-dest` hardlinks only link snapshot generations to each
other, never to the live data.

### rclone sync deletion

I use `copy`, not `sync`, for the B2 leg. A local wipeout does
not propagate to B2. Prune is explicit and date-ranged; it won't
wipe "current" even if local is empty.

## Cost

- B2 storage: ~$1.75/month for 350 GB
- B2 class-B downloads (restore drills + occasional pulls):
  effectively zero
- Local storage: bought once, amortised over years
- Time: backup script runs in ~12 minutes nightly while I'm
  asleep

At these prices, there's no argument for not having offsite.

## Non-cloud offsite?

I considered rotating a USB drive to a relative's house. Rejected:

- Manual. I forget things.
- No effective encryption story without more effort.
- B2 is cheap and works.

If I had something irreplaceable beyond what I already back up -
say, video recordings of my kids - I would add a second offsite
(Arq to another cloud provider) for belt-and-suspenders. For the
current data set, one offsite is enough.

## Summary

- Nightly local + offsite. Hot-backed. Dumps for Postgres.
- 30 daily + 12 monthly, enforced on both sides.
- Quarterly restore drill. Catch breakage before it matters.
- Encrypted upload. Plaintext local on LUKS.

This is the lowest-effort backup story I trust. Anything simpler
skipped testing; anything fancier added failure modes. It is the
Goldilocks shape after a few years of iteration.