The integrity of user data is of paramount importance to SourceHut. Most of our data-critical systems are triple-redundant or better.

Local redundancy

All of our data-critical systems use ZFS with at least 3 drives, which allows up to one drive to fail without data loss. Our large storage systems use 5+ drives, which allows several drives to fail.

Our standard hardware loadout calls for hard drives (or SSDs) sourced from a variety of vendors and drive models, so that a single system does not contain several drives from the same production batch. This reduces the risk of cascading failures during RAID recovery.

Monitoring

We do an automatic scrub of all ZFS pools on the 1st of each month and forward a report to the ops mailing list.
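
As a rough illustration of the monthly scrub described above, the following is a minimal sketch and not the job we actually run. The function names `scrub_all_pools` and `scrub_report` are hypothetical, and it assumes a cron entry such as `0 0 1 * *` starts the scrub. Because `zpool scrub` returns immediately and the scrub continues in the background, the report would have to be generated by a later job.

```shell
#!/bin/sh -eu
# Hypothetical sketch, not the production scrub job.

# Start a scrub on every imported pool. `zpool scrub` returns right away;
# the scrub itself keeps running in the background.
scrub_all_pools() {
	zpool list -H -o name | while read -r pool; do
		echo "starting scrub of $pool"
		zpool scrub "$pool"
	done
}

# Print a mail-style report: a Subject header, a blank line, then the
# current status of all pools (which includes the last scrub result).
scrub_report() {
	printf 'Subject: ZFS scrub report %s\n\n' "$(date -u +'%Y-%m-%d')"
	zpool status
}
```

The output of `scrub_report` could then be piped to `sendmail` in the same way as the borg check script further down this page.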

Areas for improvement
  1. Automatic ZFS snapshots are only configured for off-site backup hosts. We should configure this on the primary as well. We also need monitoring to ensure that our snapshots are actually being taken.
  2. Investigate something like repospanner to block git pushes until the data is known to be received and stored across multiple servers. This would make git backups real-time.
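
The snapshot monitoring mentioned in item 1 does not exist yet. One possible shape for it is sketched below, under the assumption that checking the age of the newest snapshot is enough. The function names, the dataset name, and the threshold are all hypothetical.

```shell
#!/bin/sh -eu
# Hypothetical sketch of a snapshot freshness check.

# Print the creation time (Unix seconds) of the newest snapshot of dataset $1.
# -H drops the header, -p prints exact numeric values, -d 1 limits the
# listing to the dataset's own snapshots, and -s creation sorts oldest
# first, so the last line belongs to the newest snapshot.
newest_snapshot_time() {
	zfs list -H -p -t snapshot -d 1 -o creation -s creation "$1" | tail -n 1
}

# Succeed if the newest snapshot of dataset $1 is at most $2 seconds old.
snapshot_is_fresh() {
	newest="$(newest_snapshot_time "$1")"
	# A dataset with no snapshots at all counts as stale.
	[ -n "$newest" ] || return 1
	now="$(date -u +'%s')"
	[ "$((now - newest))" -le "$2" ]
}
```

A cron job could call, for example, `snapshot_is_fresh tank/git 7200 || echo "snapshots are stale"`, where `tank/git` and the two-hour limit are placeholders.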

Off-site backups

We have an off-site backup system in a separate datacenter (in a different city) from our primary datacenter. We use borg backup to send backups to this server, typically hourly. The standard backup script looks something like this, but is tweaked for each service:

#!/bin/sh -eu
export BORG_REPO='ssh://gitsrht@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='redacted'

backup_start="$(date -u +'%s')"

echo "borg create"
borg create \
	::git.sr.ht-repos-"$(date +"%Y-%m-%d_%H:%M")" \
	/var/lib/git \
	-e /var/lib/git/.ssh \
	-e /var/lib/git/.gnupg \
	-e /var/lib/git/.ash_history \
	-e /var/lib/git/.viminfo \
	-e '/var/lib/git/*/*/objects/incoming-*' \
	-e '*.keep' \
	--compression lz4 \
	--one-file-system \
	--info --stats "$@"

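echo "borg prune"
# Retain the last 48 hourly and 60 daily archives; a negative count
# (-1) keeps weekly archives without limit.
borg prune \
	--keep-hourly 48 \
	--keep-daily 60 \
	--keep-weekly -1 \
	--info --stats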

stats() {
	backup_end="$(date -u +'%s')"
	printf '# TYPE last_backup gauge\n'
	printf '# HELP last_backup Unix timestamp of last backup\n'
	printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
	printf '# TYPE backup_duration gauge\n'
	printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
	printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
}

stats | curl --data-binary @- https://push.metrics.srht.network/metrics/job/git.sr.ht

Our check script is:

#!/bin/sh -eu
export BORG_REPO='ssh://gitsrht@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='redacted'

check() {
	cat <<-EOF
	To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>
	From: git.sr.ht backups <borg@git.sr.ht>
	Subject: git.sr.ht backups report $(date)

	EOF
	borg check --last 2 --info 2>&1
}

check | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'

Monitoring

Each backup reports its timestamp and duration to our Prometheus Pushgateway (see monitoring). We have an alarm configured for when the backup age exceeds 48 hours. The age of all borg backups may be viewed here (in hours).
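
The real alarm lives in our monitoring system, not in a shell script. The arithmetic it performs on the `last_backup` timestamp can still be sketched in shell. The function names below are hypothetical and only illustrate the 48-hour threshold.

```shell
#!/bin/sh -eu
# Hypothetical sketch of the backup age calculation.

# Print the age, in whole hours, of a backup taken at Unix time $1,
# as measured at Unix time $2.
backup_age_hours() {
	echo "$(( ($2 - $1) / 3600 ))"
}

# Succeed if a backup taken at Unix time $1 is more than 48 hours old
# at Unix time $2.
backup_is_stale() {
	[ "$(( $2 - $1 ))" -gt "$((48 * 3600))" ]
}
```

For example, `backup_is_stale "$last_backup" "$(date -u +'%s')"` would succeed once the most recent timestamp pushed by the backup script is more than two days old.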

We also conduct a weekly borg check (on Sunday night, UTC) and forward the results to the ops mailing list.

Areas for improvement
  1. Our PostgreSQL replication strategy is somewhat poor: several different approaches have been tried on the same server, and none of them is monitored. This needs to be rethought. Related to high availability.
  2. It would be nice if we could find a way to encapsulate our borg scripts in an installable Alpine package.
