A disk in my homelab had been showing “Passed” on SMART short and long tests for months. I swap out drives on a rotation because I like predictability. But the trigger for replacing this particular drive was not SMART; it was a btrfs scrub that started showing checksum errors on files I had not touched in a year.

Setup

My NAS runs btrfs RAID-1 across four 4 TB drives. A scrub is scheduled weekly via a systemd timer:

# /etc/systemd/system/btrfs-scrub.timer
[Unit]
Description=Weekly btrfs scrub

[Timer]
OnCalendar=Sun 03:00
Persistent=true
Unit=btrfs-scrub.service

[Install]
WantedBy=timers.target

The corresponding service runs btrfs scrub start -B /mnt/tank so it blocks until done and logs to the journal.
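
The service itself is nothing more than a oneshot wrapper; a minimal sketch that matches the timer above would be (on some distros the btrfs binary lives in /usr/sbin rather than /usr/bin):

# /etc/systemd/system/btrfs-scrub.service
[Unit]
Description=btrfs scrub of /mnt/tank

[Service]
Type=oneshot
ExecStart=/usr/bin/btrfs scrub start -B /mnt/tank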

The errors

One Sunday the scrub took much longer than usual and finished with a non-zero error count:

btrfs scrub status /mnt/tank
# UUID:             86aabc...
# Scrub started:    Sun Sep 22 03:00:01 2024
# Status:           finished
# Duration:         6:42:11
# Total to scrub:   3.31TiB
# Rate:             144.24MiB/s
# Error summary:    csum=312
# Corrected:        312
# Uncorrectable:    0
# Unverified:       0

312 checksum errors, all corrected from the RAID-1 mirror, none uncorrectable. Good news: btrfs had done its job. But 312 is not zero, and it was not a stable number either. I ran another scrub a few hours later; the same drive produced 318 new errors. The count was growing.
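
The check itself is trivial: kick off another blocking scrub and read the summary line again.

btrfs scrub start -B /mnt/tank
btrfs scrub status /mnt/tank | grep -i 'error summary'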

Drill down

btrfs device stats gives per-device counters:

btrfs device stats /mnt/tank
# [/dev/sdd1].write_io_errs   0
# [/dev/sdd1].read_io_errs    0
# [/dev/sdd1].flush_io_errs   0
# [/dev/sdd1].corruption_errs 312
# [/dev/sdd1].generation_errs 0

All 312 corruption errors on /dev/sdd1. SMART on that disk:

smartctl -a /dev/sdd | grep -iE 'reallocated|pending|uncorrectable|health'
# SMART overall-health self-assessment test result: PASSED
# 5  Reallocated_Sector_Ct  0x0033 100 100 010 Pre-fail Always - 0
# 197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
# 198 Offline_Uncorrectable  0x0030 100 100 000 Old_age Always - 0

Zero reallocated sectors. Zero pending sectors. Zero uncorrectable. SMART had no idea anything was wrong. The drive was returning data without complaint, but the data no longer matched the checksums btrfs had stored when it was written.

What was actually happening

Silent bit rot. Possibly one of:

  • The drive’s internal ECC had a bug and was returning incorrect data without flagging it.
  • The SATA cable had a marginal connection and occasional bits were flipping. The link’s own CRC should catch this, and it would normally show up as a CRC error count in SMART, but that counter is easy to overlook (a cheap check for it is shown after this list).
  • A firmware bug on the drive.
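
For the cable hypothesis there is one cheap check: the SATA link protects transfers with a CRC, and CRC failures are counted in SMART attribute 199 (UDMA_CRC_Error_Count; some vendors name it slightly differently). A non-zero or rising value there points at the cable or the port rather than the drive itself.

smartctl -A /dev/sdd | grep -i crc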

For my purposes the diagnosis did not need to be more precise than “this drive is returning bad data”. btrfs knew because it checksums every data block and compares the stored checksum against what it reads back. SMART did not know because it was designed to detect mechanical and electronic failures that present as read errors, not silent corruption.
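
Scrub does tell you which files were hit, though: each checksum error is logged to the kernel log, and when btrfs can resolve the block to a file it includes the path. The exact wording varies by kernel version, but something like this pulls the messages out:

journalctl -k --since today | grep -i 'checksum error'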

Replacement

I have learned to just replace any drive whose corruption errors keep growing. I ordered a new one and did the swap:

btrfs replace start /dev/sdd1 /dev/sde1 /mnt/tank
btrfs replace status /mnt/tank
# 5.2% done, 14:23 left

btrfs replace runs online and incrementally; the filesystem stays mounted the whole time. It took about 8 hours for this drive.

Once done I ran another scrub across the whole array. Zero errors.
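
One footnote on that verification pass: the per-device counters are persistent, and depending on kernel version the old corruption_errs value may or may not carry over to the replacement device, so resetting the stats with -z before the scrub removes any ambiguity about whether the new numbers are really zero.

btrfs device stats -z /mnt/tank     # print the old counters and reset them
btrfs scrub start -B /mnt/tank
btrfs device stats /mnt/tank        # should be all zeros after a clean pass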

What I now do

  1. Scrub weekly, alert on non-zero. I have a tiny shell script that parses btrfs scrub status and sends a message if the csum count is nonzero; a systemd timer on the NAS calls it. A sketch of the script follows after this list.

  2. Replace on growth, not on threshold. A single corruption error might be a cosmic ray. A growing count is the drive.

  3. Track device stats in Prometheus. A smartctl_exporter plus a btrfs exporter gives you a dashboard. The corruption_errs counter is the one that saved me.

  4. Don’t trust SMART alone. SMART is necessary but not sufficient. RAID with checksums is sufficient. If you are running a filesystem that doesn’t checksum your data (ext4 or xfs, neither of which checksums file data), you are taking the drive’s word for what it returns. That word is sometimes a lie.
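
A sketch of the script from item 1. The mail command and address are placeholders for whatever notifier you actually use (a webhook, ntfy, etc.):

#!/bin/sh
# Post-scrub check: send an alert when the last scrub reported csum errors.
set -eu

MOUNT=/mnt/tank

# Pull the csum count out of the "Error summary" line. A clean scrub prints
# "no errors found" on that line instead, which evaluates to 0 here.
errors=$(btrfs scrub status "$MOUNT" \
  | awk -F'csum=' '/Error summary/ {print $2 + 0}')
errors=${errors:-0}

if [ "$errors" -gt 0 ]; then
  # Placeholder notifier -- swap in your own mailer or webhook.
  printf 'btrfs scrub on %s: %s csum errors\n' "$MOUNT" "$errors" \
    | mail -s 'NAS scrub found csum errors' admin@example.com
fi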

A defense of btrfs

I know btrfs has a mixed reputation, especially for RAID-5/6. For RAID-1 on a homelab NAS it has been rock solid for me. I have had two silent-corruption events over four years across half a dozen drives; both were caught by scrub, both were resolved by replace, neither lost data. ZFS would have been fine too. ext4 or xfs would have silently served corrupt files.

Reflection

The feature that matters most in a filesystem holding data you actually care about is end-to-end checksumming. Redundancy without checksumming leaves you asking “which mirror is right?”. Checksumming without redundancy can only tell you “you lost this block”. Both together give you silent repair, which is the dream.

Related: see my post on tuning ZFS ARC on my Proxmox box for the other end of the same spectrum.