# Backup Strategy
The 3-2-1 rule: three copies, two kinds of media, one offsite. For a
single-NUC homelab the specific mapping is:
1. Primary storage on btrfs RAID1 (live data)
2. Local snapshot backups on NVMe via `rsync --link-dest`
3. Offsite copy on Backblaze B2 via rclone
This doc describes what gets backed up, how, how long it's kept,
and how I verify restores actually work.
Runbook for restore specifics in
[docs/runbook.md](/src/homelab-compose/docs-runbook-md/). Overall
architecture in
[docs/architecture.md](/src/homelab-compose/docs-architecture-md/).
## What gets backed up
Three categories:
### Irreplaceable
- Paperless document archive (`/srv/docs/archive`)
- Immich originals (`/srv/photos/upload`)
- Gitea repositories and LFS (`/srv/data/gitea/git`, `/srv/data/gitea/lfs`)
- Syncthing shared folder (`/srv/sync`)
- Postgres dumps for Immich, Paperless, Gitea
### Reproducible but annoying
- Jellyfin library database (`/srv/data/jellyfin/library.db`) - I
could rebuild by re-scanning, but that takes hours and loses
watched state.
- Pi-hole blocklist and custom entries.
- Grafana dashboards.
### Don't bother
- Media library (Jellyfin catalog) itself. I can re-rip from discs
or re-download. Backing up 4 TB of video to B2 at ~$5/TB/month
is not worth it.
- Docker container filesystems. Images are pulled from the
registry; volumes map to `/srv/data`.
The distinction matters because everything in "irreplaceable" gets
backed up nightly with aggressive retention; "reproducible" gets a
looser schedule; "don't bother" doesn't get touched.
## Nightly flow
Cron triggers
[`scripts/backup.sh`](/src/homelab-compose/scripts-backup-sh/) at
03:15.
```
15 3 * * * /srv/homelab/scripts/backup.sh >> /var/log/homelab-backup.log 2>&1
```
The script's steps (a condensed sketch of steps 2-4 follows the list):
1. **Stop nothing.** Everything is hot-backed. Postgres dumps use
`pg_dump`, not filesystem copies.
2. **Dump databases.** `docker compose exec immich-db pg_dump ...`
writes to `/srv/backup/current/dumps/immich.sql`. Same for
Paperless and Gitea.
3. **Rsync data directories.** `rsync -aH --delete --link-dest=...`
into `/srv/backup/YYYY-MM-DD/`. The `--link-dest` hardlinks
unchanged files against yesterday's snapshot, so a 30-day
retention of 500 GB of data is ~550 GB of disk, not 15 TB.
4. **Rclone to B2.** `rclone copy /srv/backup/current/
   crypt:homelab-backup/YYYY-MM-DD/` uploads the newest snapshot
   through the encrypted remote (see Encryption below). `copy` only
   transfers files that are missing or changed on the remote.
5. **Prune locally and remote.** The `30 daily + 12 monthly` rules
   are enforced in both places via `rclone lsjson`-parsed age checks
   (sketched below). Commit `77d1a46` switched this from filename-date
   parsing to `lsjson`, which handles timezones correctly.
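Steps 2-4 condensed into a sketch. Paths come from this doc; the
database and user names, the `current` symlink handling, and the exact
rsync source list are my assumptions, not the real `scripts/backup.sh`:

```sh
#!/usr/bin/env bash
# Condensed sketch of the nightly core. Error handling and the other
# services' dumps are elided; see scripts/backup.sh for the real thing.
set -euo pipefail

today=$(date +%F)                     # YYYY-MM-DD
yesterday=$(date -d yesterday +%F)
dest=/srv/backup/$today

# 2. Hot database dump - pg_dump, never a filesystem copy.
mkdir -p "$dest/dumps"
docker compose exec -T immich-db pg_dump -U postgres immich \
    > "$dest/dumps/immich.sql"        # db/user names assumed

# 3. Snapshot the data dirs; unchanged files hardlink against
#    yesterday's snapshot instead of being copied again.
rsync -aH --delete --link-dest="/srv/backup/$yesterday/data" \
    /srv/data/ "$dest/data/"
# (same pattern for /srv/docs/archive, /srv/photos/upload, /srv/sync)

# Keep a stable name for the newest snapshot (assumed to be a symlink).
ln -sfn "$dest" /srv/backup/current

# 4. Push the newest snapshot offsite through the encrypted remote.
rclone copy /srv/backup/current/ "crypt:homelab-backup/$today/"
```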
Log output is noisy but one line per step, so a `grep FAIL` is
enough for the health checks.
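A sketch of the prune routine, run against both sides: `rclone lsjson`
works on local paths as well as remotes, and its `ModTime` field avoids
the timezone bugs of parsing dates out of directory names. The
`-pinned` suffix stands in for the doc's first-of-month marker, and the
jq plumbing is my own:

```sh
#!/usr/bin/env bash
# Prune sketch: keep 30 dailies plus 12 monthlies, skip pinned
# snapshots. One routine covers local and remote, per step 5.
set -euo pipefail

prune() {
    root=$1
    daily_cutoff=$(date -d '30 days ago' +%s)
    monthly_cutoff=$(date -d '12 months ago' +%s)

    rclone lsjson --dirs-only "$root" |
    jq -r '.[] | [.ModTime, .Name] | @tsv' |
    while IFS=$'\t' read -r modtime name; do
        case "$name" in *-pinned) continue ;; esac   # pinned survives prune
        ts=$(date -d "$modtime" +%s)
        if (( ts >= daily_cutoff )); then
            continue                                 # inside the 30-day window
        elif [[ "$name" == *-01 ]] && (( ts >= monthly_cutoff )); then
            continue                                 # first-of-month, keep a year
        fi
        rclone purge "$root/$name"
    done
}

prune /srv/backup                # local snapshots
prune crypt:homelab-backup       # offsite snapshots
```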
## Retention policy
Commit `3b7a2c9` settled the numbers after a couple of revisions:
- **Local**: 30 daily + 12 monthly. First-of-month snapshots are
pinned (created with a different file name suffix that the prune
step skips).
- **B2**: 30 daily + 12 monthly. Same policy, different machine.
- **Verification**: quarterly restore drill; see below.
That gives me:
- 30 days of granular recovery
- a year of monthly history for "when did this file disappear"
questions
- total local footprint about 600 GB
- total B2 footprint about 350 GB (the numbers don't match because I
  compress dumps before upload and unchanged files are skipped rather
  than re-uploaded)
## Encryption
B2 uploads go through rclone's crypt backend, in two layers:
- `b2:homelab-backup-crypt` (the visible bucket; random encrypted filenames)
- an rclone crypt remote on top (`crypt:homelab-backup`)
My backup script talks to `crypt:homelab-backup`. The passphrase and
salt live in `rclone.conf` on the NUC, and a copy of that config is
stored in a password manager. Losing the NUC and the password manager
at the same time means losing the backups; I accept this.
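For reference, the shape of that config. The remote names follow this
doc; everything else is illustrative (real values come from
`rclone config`, with passwords stored obscured):

```ini
# rclone.conf (illustrative shape - do not copy values)
[b2]
type = b2
account = <application key id>
key = <application key>

[crypt]
type = crypt
remote = b2:homelab-backup-crypt
filename_encryption = standard
password = <passphrase, obscured by rclone config>
password2 = <salt, obscured by rclone config>
```

With this, `crypt:homelab-backup/...` paths encrypt transparently into
`b2:homelab-backup-crypt`.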
Local snapshots are on-disk plaintext. The NUC disk is LUKS
encrypted at rest.
## Restore
### Single file
```sh
cp /srv/backup/YYYY-MM-DD/relative/path ~
```
or from B2 directly:
```sh
rclone copy crypt:homelab-backup/YYYY-MM-DD/relative/path ~/restore/
```
### Whole service
1. Stop the service: `docker compose stop <svc>`.
2. Replace its data directory from a snapshot:

   ```sh
   rsync -aH --delete /srv/backup/YYYY-MM-DD/data/<svc>/ \
       /srv/data/<svc>/
   ```
3. If the service has a Postgres DB, restore the dump:

   ```sh
   docker compose up -d <svc>-db
   cat /srv/backup/YYYY-MM-DD/dumps/<svc>.sql \
       | docker compose exec -T <svc>-db psql -U postgres <svc>
   ```
4. Start: `docker compose up -d <svc>`.
Test this against a throwaway service name at least once before
you need it in anger.
### Whole NUC
The worst case: NUC is a brick. Rebuild path:
1. Install Debian stable on new hardware.
2. Install docker + docker compose, enable cron, restore
`/etc/cron.d/homelab`, `/etc/docker/daemon.json`.
3. Mount storage (new disks: restore `/srv/data` from B2; old disks:
just mount them).
4. `git clone` this repo into `/srv/homelab`.
5. `cp .env.example .env` and fill in.
6. `docker compose up -d`.
7. For each service that doesn't auto-restore from volume:
run the restore flow above.
I timed this once in a VM: about three hours from a blank image.
## Verifying restores actually work
Backups you haven't tested aren't backups. Quarterly drill:
1. Pick a date 2-8 weeks ago.
2. `rclone copy crypt:homelab-backup/YYYY-MM-DD/ /tmp/restore/` -
confirm B2 is readable.
3. Spot-check 10 random files for byte equality against current
state (assuming the file hasn't changed) or plausible shape
(if it has).
4. Load `/tmp/restore/dumps/immich.sql` into a scratch Postgres
   container with `psql` and check row counts (sketched after this list).
5. Delete `/tmp/restore`.
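Step 4 as commands, roughly; the container name, image tag, password,
and the table I count are all illustrative:

```sh
# Load the dump into a throwaway Postgres and eyeball a row count.
docker run -d --name scratch-pg -e POSTGRES_PASSWORD=scratch postgres:16
until docker exec scratch-pg pg_isready -U postgres; do sleep 1; done
docker exec scratch-pg psql -U postgres -c 'CREATE DATABASE immich'
docker exec -i scratch-pg psql -U postgres immich \
    < /tmp/restore/dumps/immich.sql
docker exec scratch-pg psql -U postgres immich \
    -c 'SELECT count(*) FROM assets'    # compare against the live DB
docker rm -f scratch-pg
```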
Quarterly, not monthly, because B2 egress costs a few dollars per
drill. Worth it.
## What can go wrong
### The backup script fails silently
Every step is wrapped:
```sh
if ! rsync ...; then
    echo "FAIL: rsync failed at $(date)"
    exit 1
fi
```
Cron's `MAILTO` is set as a backstop (cron mails on job output, not
on exit status). The sturdier net is an hourly health check that
verifies the most recent snapshot is less than 26 hours old and
alerts if not (sketched below).
### B2 credentials rotate
I rotate the app key yearly (calendar reminder). Update
`rclone.conf` and test with `rclone ls crypt:homelab-backup`. If
the credentials become stale, the script fails and the health
check catches it.
### btrfs RAID1 silently corrupts
btrfs scrub every two weeks detects bit rot, and on RAID1 it can
usually repair in place from the intact mirror copy. Even if a scrub
turns up unrecoverable damage, B2 and the local snapshots still hold
good copies: a bit-flipped file keeps its size and mtime, so rsync
treats it as unchanged and hardlinks to the older, clean copy rather
than propagating the corrupted bytes into the snapshots.
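The scrub itself is just a cron entry; the schedule, binary path, and
log location here are illustrative:

```sh
# /etc/cron.d/btrfs-scrub (illustrative)
# -B keeps scrub in the foreground so cron sees the exit status;
# -d prints per-device stats into the log.
0 4 1,15 * * root /usr/bin/btrfs scrub start -Bd /srv >> /var/log/btrfs-scrub.log 2>&1
```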
### rclone sync deletion
I use `copy`, not `sync`, for the B2 leg. A local wipeout does
not propagate to B2. Prune is explicit and date-ranged; it won't
wipe "current" even if local is empty.
## Cost
- B2 storage: ~$1.75/month (350 GB at ~$5/TB/month)
- B2 downloads (restore drills + occasional pulls): a few dollars per
  quarterly drill, effectively zero otherwise
- Local storage: bought once, amortised over years
- Time: backup script runs in ~12 minutes nightly while I'm
asleep
At these prices, there's no argument for not having offsite.
## Non-cloud offsite?
I considered rotating a USB drive to a relative's house. Rejected:
- Manual. I forget things.
- No effective encryption story without more effort.
- B2 is cheap and works.
If I had something irreplaceable beyond what I already back up -
say, video recordings of my kids - I would add a second offsite
(Arq to another cloud provider) for belt-and-suspenders. For the
current data set, one offsite is enough.
## Summary
- Nightly local + offsite. Hot-backed. Dumps for Postgres.
- 30 daily + 12 monthly, enforced on both sides.
- Quarterly restore drill. Catch breakage before it matters.
- Encrypted upload. Plaintext local on LUKS.
This is the lowest-effort backup story I trust. Anything simpler
skipped testing; anything fancier added failure modes. It is the
Goldilocks shape after a few years of iteration.