The integrity of user data is of paramount importance to SourceHut. Most of our data-critical systems are triple-redundant or better.
All of our data-critical systems use ZFS with at least 3 drives, which allows up to one drive to fail without data loss. Our large storage systems use 5+ drives, which allows several drives to fail.
Our standard hardware loadout calls for drives (hard drives or SSDs) sourced from a variety of vendors and drive models, so that no system relies on several drives from the same production batch. This reduces the risk of cascading failures during RAID recovery.
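One way to spot-check this on a running host is to look for repeated drive models. The snippet below is only a sketch and not part of our tooling: the function name `duplicate_models` is hypothetical, and it assumes a Linux host where `lsblk` reports a model string for each disk.

```shell
#!/bin/sh -eu
# Hypothetical check, not part of SourceHut's tooling: report drive models
# that appear more than once in a "NAME MODEL" listing read from stdin.
duplicate_models() {
    # Skip lines without a model, drop the device name (first field),
    # then print each repeated model once.
    awk 'NF > 1 { $1 = ""; sub(/^ +/, ""); print }' | sort | uniq -d
}

# Usage on Linux (-d lists whole disks only, -n omits the header):
#   lsblk -d -n -o NAME,MODEL | duplicate_models
```

Empty output means no model is repeated. Distinct models rule out a shared production batch, but a repeated model does not prove one, so any hit is a prompt to compare serial numbers rather than a verdict.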
We do an automatic scrub of all ZFS pools on the 1st of each month and forward a report to the ops mailing list.
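As a rough illustration of how such a job could be wired up (this is a sketch, not our actual script): one function starts a scrub on every pool, and another formats `zpool status` as a mail. The function names, the script path, and the crontab line are assumptions; only the ops list address comes from this page.

```shell
#!/bin/sh -eu
# Hypothetical sketch of a monthly scrub job; not SourceHut's actual script.
# Assumed crontab entry (00:00 on the 1st of each month):
#   0 0 1 * * /usr/local/bin/zfs-scrub

# Start a scrub on every imported pool.
scrub_all() {
    zpool list -H -o name | while read -r pool
    do
        zpool scrub "$pool"
    done
}

# Print a mail (headers, blank line, body) with the state of all pools.
scrub_report() {
    printf 'To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>\n'
    printf 'Subject: ZFS scrub report %s\n\n' "$(date -u +%Y-%m-%d)"
    zpool status
}

# Usage, once the scrubs have finished:
#   scrub_report | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
```

Scrubs run in the background, so the report is only meaningful once they have completed; how the real job waits for that is not described here.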
We have an off-site backup system in a separate datacenter (in a different city) from our primary datacenter. We use borg backup to send backups to this server, typically hourly. The standard backup script looks something like this, but is tweaked for each service:
#!/bin/sh -eu
export BORG_REPO='ssh://gitsrht@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='redacted'
backup_start="$(date -u +'%s')"
echo "borg create"
borg create \
    ::git.sr.ht-repos-"$(date +"%Y-%m-%d_%H:%M")" \
    /var/lib/git \
    -e /var/lib/git/.ssh \
    -e /var/lib/git/.gnupg \
    -e /var/lib/git/.ash_history \
    -e /var/lib/git/.viminfo \
    -e '/var/lib/git/*/*/objects/incoming-*' \
    -e '*.keep' \
    --compression lz4 \
    --one-file-system \
    --info --stats "$@"
echo "borg prune"
borg prune \
    --keep-hourly 48 \
    --keep-daily 60 \
    --keep-weekly -1 \
    --info --stats
stats() {
    backup_end="$(date -u +'%s')"
    printf '# TYPE last_backup gauge\n'
    printf '# HELP last_backup Unix timestamp of last backup\n'
    printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
    printf '# TYPE backup_duration gauge\n'
    printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
    printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
}
stats | curl --data-binary @- https://push.metrics.srht.network/metrics/job/git.sr.ht
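Before changing the retention flags, it can help to preview what `borg prune` would remove. The wrapper below is a sketch and not part of the production script: the function name `prune_preview` is hypothetical, and it assumes `BORG_REPO` and `BORG_PASSPHRASE` are exported as in the script above. With `--dry-run`, borg only reports what it would keep and delete.

```shell
#!/bin/sh -eu
# Hypothetical helper: list which archives the retention policy from the
# backup script would keep or prune, without deleting anything.
# Assumes BORG_REPO and BORG_PASSPHRASE are already exported.
prune_preview() {
    borg prune \
        --dry-run --list \
        --keep-hourly 48 \
        --keep-daily 60 \
        --keep-weekly -1 \
        "$@"
}
```

With these flags borg keeps the 48 most recent hourly archives, then 60 daily archives, and, because a negative count means no limit, every weekly archive.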
Our check script is:
#!/bin/sh -eu
export BORG_REPO='ssh://gitsrht@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='redacted'
check() {
cat <<-EOF
To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>
From: git.sr.ht backups <borg@git.sr.ht>
Subject: git.sr.ht backups report $(date)

EOF
borg check --last 2 --info 2>&1
}
check | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'
Each backup reports its timestamp and duration to our Prometheus Pushgateway (see monitoring). We have an alarm configured for when the backup age exceeds 48 hours. The age of all borg backups may be viewed here (in hours).
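The alarm condition amounts to comparing the current time with the pushed `last_backup` timestamp. The sketch below reproduces that arithmetic in shell purely for illustration; the real alarm is evaluated by the monitoring system, and the function and variable names here are hypothetical.

```shell
#!/bin/sh -eu
# Illustration only: the 48-hour threshold comes from the text above;
# the function and variable names are hypothetical.
max_age_hours=48

# backup_age_hours NOW LAST_BACKUP: whole hours between two Unix timestamps.
backup_age_hours() {
    echo $(( ($1 - $2) / 3600 ))
}

# backup_is_stale NOW LAST_BACKUP: succeeds when the age exceeds the threshold.
backup_is_stale() {
    [ $(( $1 - $2 )) -gt $(( max_age_hours * 3600 )) ]
}

# Usage: compare the current time with a last_backup value.
#   backup_is_stale "$(date -u +%s)" 1700000000 && echo "backup is stale"
```

The staleness test compares seconds, not rounded hours, so a backup that is 48 hours and one second old already counts as stale.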
We also conduct a weekly borg check (on Sunday night, UTC) and forward the results to the ops mailing list.