systemd timers and the clock drift that ate our backups
I got paged on a Tuesday morning because our nightly backup hadn’t landed in S3 for nine days. The on-call rotation found nothing wrong on the surface: the backup job was green on the last run we had a record of, and the dashboard just had a gap. The machines were up, disks were fine, nothing else had changed.
The timer
The job was a systemd timer on a small utility VM. Roughly:
[Unit]
Description=Nightly borg backup
[Timer]
OnCalendar=*-*-* 02:30:00
RandomizedDelaySec=45m
Persistent=true
Unit=borg-backup.service
[Install]
WantedBy=timers.target
Nothing clever. The job sometimes ran at 02:35, sometimes at 03:10, which was fine. Persistent=true meant that even if the VM was off at 02:30, the job would catch up on next boot.
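For completeness, the service side was equally boring. This is a sketch from memory, not the real unit: the wrapper path and the niceness settings are illustrative.

```ini
[Unit]
Description=Nightly borg backup

[Service]
Type=oneshot
# Hypothetical wrapper path; the real unit ran a script around borg create/prune.
ExecStart=/usr/local/bin/run-borg-backup.sh
# Keep the backup from starving interactive work on the box.
Nice=10
IOSchedulingClass=idle
```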
What I saw
First thing I looked at:
systemctl list-timers --all | grep borg
# LAST UNIT ACTIVATES
# Sun 2024-04-14 02:58:12 UTC borg-backup.timer borg-backup.service
Last run was nine days earlier. Next run showed nothing. That’s strange, because the timer unit was still active (waiting):
systemctl status borg-backup.timer
# Active: active (waiting)
# Trigger: n/a
Trigger: n/a on a nominally healthy timer is a red flag. I checked the service:
systemctl status borg-backup.service
# Main PID: 0 (exited)
# Result: exit-code
The last run had failed. But Persistent=true should have queued it. And OnCalendar should have scheduled the next nightly run regardless. So why no trigger?
The clock
Just out of habit I checked the clock:
timedatectl
# Local time: Tue 2024-04-23 09:45:12 UTC
# Universal time: Tue 2024-04-23 09:45:12 UTC
# RTC time: Tue 2024-04-23 09:48:43
# ...
# System clock synchronized: yes
# NTP service: active
Three and a half minutes of drift on the RTC, but the system clock was fine and NTP was running. That normally doesn’t matter. But I noticed:
timedatectl timesync-status
# Leap: not in progress
# Version: 4
# Stratum: 16
# Reference: 00000000
Stratum 16 is NTP's "unsynchronized" value. systemd-timesyncd was up and running, but it was not actually getting time from anyone. journalctl -u systemd-timesyncd showed:
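That mismatch, NTP "active" but stratum 16, is exactly the kind of thing worth checking mechanically. Here is a minimal sketch of the check we could have had. It is fed the sample output above so it runs anywhere; in real use you would capture `timedatectl timesync-status` instead of the canned string.

```shell
# Canned timesync-status output from the incident; in production, replace with:
#   status=$(timedatectl timesync-status)
status='Leap: not in progress
Version: 4
Stratum: 16
Reference: 00000000'

# Stratum 16 is the "unsynchronized" sentinel in NTP.
stratum=$(printf '%s\n' "$status" | awk -F': *' '$1 ~ /Stratum/ {print $2}')
if [ "$stratum" -ge 16 ]; then
  echo "ALERT: clock source unsynchronized (stratum $stratum)"
fi
```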
journalctl -u systemd-timesyncd --since '2 weeks ago' | tail -20
# Apr 14 02:58:12 utilbox systemd-timesyncd[412]: Timed out waiting for reply from time1.example.com
# Apr 14 02:59:14 utilbox systemd-timesyncd[412]: Network configuration changed, trying to establish connection.
# ... a lot of retries ...
# Apr 15 00:00:00 utilbox systemd-timesyncd[412]: Contacted time server time1.example.com (time1.example.com).
It had eventually reconnected. But here is the kicker: between the last failed backup on April 14 and the recovery on April 15, the system time had jumped backward by several hours, because timesyncd finally managed to slam the clock to the correct value after running unsynchronized for a day. A backward time jump in the middle of the night left the OnCalendar evaluator thinking the next 02:30 was many hours in the future, and, importantly, RandomizedDelaySec was computed against the new time.
That collided with a subtler issue: Persistent=true depends on the last-activation timestamp stored in /var/lib/systemd/timers/stamp-borg-backup.timer. When the clock jumps backward, systemd sees a stamp that is "in the future" and becomes cautious; reading through the source, it effectively does not reschedule until the stamp is in the past again. A forward-moving clock eventually fixes this on its own, but our clock had jittered around in a way that kept the stamp ahead.
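To make that failure mode concrete, here is a toy re-creation of the stamp check. This is not systemd's actual code, just the same comparison done by hand with a temp file whose mtime is deliberately set ahead of the clock.

```shell
# Simulate a persistence stamp left "in the future" by a backward clock step.
stamp=$(mktemp)
touch -d '+6 hours' "$stamp"   # stamp mtime: 6 hours ahead of current time

now=$(date +%s)
last=$(stat -c %Y "$stamp")    # stamp mtime as epoch seconds

if [ "$last" -gt "$now" ]; then
  echo "stamp is ahead of the clock: no catch-up run until real time passes it"
else
  echo "stamp is in the past: a catch-up run would be scheduled"
fi
rm -f "$stamp"
```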
The fix
Short term:
rm /var/lib/systemd/timers/stamp-borg-backup.timer
systemctl restart borg-backup.timer   # re-arm the timer so it recomputes the next elapse
systemctl start borg-backup.service   # run the overdue backup now
systemctl list-timers borg-backup.timer
# NEXT LEFT
# Wed 2024-04-24 02:30:00 UTC 16h left
Long term, three changes:
1. Replaced systemd-timesyncd with chrony on this and our other time-sensitive boxes. timesyncd is fine for laptops. For servers, chrony handles jitter, multiple servers, and long-running drift much better. Our config now points at two internal stratum-2 servers and our local Pi on the NTP network.
2. Added a check in our monitoring that alerts if chronyc tracking shows a Leap status other than Normal, or if the offset exceeds 500 ms.
3. Added an external "did we actually get the backup" check. A serverless function lists the S3 bucket once an hour and pages if nothing landed in the last 30 hours. Trust no timer.
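The external check boils down to "how old is the newest object". Here is a sketch of the core logic with canned values so it runs anywhere: in the real function, newest comes from an S3 listing, and "now" is the actual clock. The bucket, timestamps, and 30-hour window below are from the incident, not a general recommendation.

```shell
# Newest backup object's LastModified; the real check gets this from an S3
# listing (e.g. via the AWS SDK or CLI). Canned here for reproducibility.
newest='2024-04-14T02:58:12Z'
now='2024-04-23T09:45:12Z'     # frozen "now" from the incident; normally $(date -u +%FT%TZ)

newest_s=$(date -u -d "$newest" +%s)
now_s=$(date -u -d "$now" +%s)
age_h=$(( (now_s - newest_s) / 3600 ))

if [ "$age_h" -gt 30 ]; then
  echo "PAGE: newest backup is ${age_h}h old"
fi
```

Note the check deliberately lives outside the backup host, so a wedged timer, a wedged clock, or a wedged VM all look the same to it: no fresh object.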
Reflection
The cute lesson is “monitor your backups from outside the thing doing the backups”. The technical lesson is more interesting to me. systemd timers are very good when your clock is well-behaved. When it is not, their behavior is subtle. Persistent=true is useful but not a substitute for actual clock discipline. If you are running VMs on a host that hibernates, or on a cloud provider that does strange things when the hypervisor migrates, or anywhere with flaky NTP upstream, chrony is worth the small effort of installation.
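For what it's worth, the chrony side of the fix is small. A hedged sketch of the kind of config we ended up with; the hostnames are placeholders, and makestep limited to early startup is the piece that prevents this class of mid-run clock slam.

```
# /etc/chrony/chrony.conf (sketch; server names are placeholders)
server ntp1.internal.example.com iburst
server ntp2.internal.example.com iburst
# Step the clock only during the first 3 updates after boot; slew after that,
# so a long NTP outage never causes a large jump on a running system.
makestep 1.0 3
# Keep the RTC trimmed so boots start closer to true time.
rtcsync
driftfile /var/lib/chrony/drift
```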
Related: see my post on journald eating my disk, another “systemd behaves exactly as documented but not as I expected” story.