The integrity of user data is of paramount importance to SourceHut. Most of our data-critical systems are triple-redundant or better.

#Local redundancy

All of our data-critical systems use ZFS with at least 3 drives, which tolerates the failure of one drive without data loss. Our large storage systems use 5 or more drives and can tolerate several drive failures.

Our standard hardware loadout calls for hard drives (or SSDs) from a mix of vendors and drive models, so that no system relies on several drives from the same production batch. This reduces the risk of cascading failures during RAID recovery.

#Monitoring

We do an automatic scrub of all ZFS pools on the 1st of each month and forward a report to the ops mailing list.
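
A monthly scrub job along these lines could be run from cron. This is a minimal sketch rather than our actual script: the function names, the pool discovery via `zpool list`, and the mail subject are assumptions made for illustration. The recipient is the ops list used elsewhere on this page.

```shell
#!/bin/sh -eu
# Sketch only: function names and mail headers are illustrative.

OPS_LIST='~sircmpwn/sr.ht-ops@lists.sr.ht'

# Start a scrub on every imported pool. `zpool scrub` returns
# immediately; the scrub itself continues in the background.
start_scrubs() {
	for pool in $(zpool list -H -o name); do
		zpool scrub "$pool"
	done
}

# Emit a mail message with the current status of all pools, which
# includes the outcome of the most recent scrub.
scrub_report() {
	printf 'To: SourceHut Ops <%s>\n' "$OPS_LIST"
	printf 'Subject: ZFS scrub report %s\n\n' "$(date -u +'%Y-%m-%d')"
	zpool status 2>&1
}

# Possible use: start the scrubs, then send the report once they have
# had time to finish:
#   start_scrubs
#   scrub_report | sendmail "$OPS_LIST"
```

Because the scrub runs asynchronously, the report step would need to run some time after the scrub starts, or the status output will show a scrub still in progress.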

#Areas for improvement

  1. Automatic ZFS snapshots are only configured on the off-site backup hosts. We should configure them on the primary as well. We also need monitoring to confirm that snapshots are actually being taken.
  2. Investigate something like repospanner to block git pushes until the data is known to be received and stored across multiple servers. This would make git backups real-time.
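
The snapshot monitoring called for in item 1 could start from a check like the following. It is a sketch under assumptions: the dataset name `tank/git`, the two-hour threshold, and both function names are hypothetical, and the result would still need to be wired into our alerting.

```shell
#!/bin/sh -eu
# Sketch only: dataset name and threshold are hypothetical.

# Print the creation time (Unix seconds) of the newest snapshot of the
# dataset $1, or nothing if it has no snapshots.
newest_snapshot() {
	zfs list -H -p -d 1 -t snapshot -o creation -s creation "$1" | tail -n 1
}

# Succeed if a snapshot created at Unix time $1 is at most $3 seconds
# old at Unix time $2. An empty $1 (no snapshots) counts as stale.
snapshot_is_fresh() {
	[ -n "$1" ] && [ "$(( $2 - $1 ))" -le "$3" ]
}

# Possible use:
#   newest="$(newest_snapshot tank/git)"
#   snapshot_is_fresh "$newest" "$(date -u +'%s')" 7200 || echo "stale"
```

Keeping the age comparison separate from the `zfs` call makes the threshold logic easy to exercise without a live pool.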

#Off-site backups

We have an off-site backup system in a separate datacenter (in a different city) from our primary datacenter. We use borg backup to send backups to this server, typically hourly. The standard backup script looks something like this, but is tweaked for each service:

#!/bin/sh -eu
export BORG_REPO='ssh://gitsrht@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='redacted'

backup_start="$(date -u +'%s')"

echo "borg create"
borg create \
	::git.sr.ht-repos-"$(date +"%Y-%m-%d_%H:%M")" \
	/var/lib/git \
	-e /var/lib/git/.ssh \
	-e /var/lib/git/.gnupg \
	-e /var/lib/git/.ash_history \
	-e /var/lib/git/.viminfo \
	-e '/var/lib/git/*/*/objects/incoming-*' \
	-e '*.keep' \
	--compression lz4 \
	--one-file-system \
	--info --stats "$@"

echo "borg prune"
# A negative count means "no limit", so every weekly archive is kept.
borg prune \
	--keep-hourly 48 \
	--keep-daily 60 \
	--keep-weekly -1 \
	--info --stats

stats() {
	backup_end="$(date -u +'%s')"
	printf '# TYPE last_backup gauge\n'
	printf '# HELP last_backup Unix timestamp of last backup\n'
	printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
	printf '# TYPE backup_duration gauge\n'
	printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
	printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
}

stats | curl --data-binary @- https://push.metrics.sr.ht/metrics/job/git.sr.ht

Our check script is:

#!/bin/sh -eu
export BORG_REPO='ssh://gitsrht@konpaku.sr.ht/~/backup'
export BORG_PASSPHRASE='redacted'

check() {
	cat <<-EOF
	To: SourceHut Ops <~sircmpwn/sr.ht-ops@lists.sr.ht>
	From: git.sr.ht backups <borg@git.sr.ht>
	Subject: git.sr.ht backups report $(date)

	EOF
	borg check --last 2 --info 2>&1
}

check | sendmail '~sircmpwn/sr.ht-ops@lists.sr.ht'

#Monitoring

Each backup reports its timestamp and duration to our Prometheus Pushgateway (see monitoring). We have an alarm configured for when the backup age exceeds 48 hours. The age of all borg backups may be viewed here (in hours).
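
The alarm condition amounts to comparing the pushed `last_backup` timestamp with the current time. As a rough illustration in shell (the real alarm lives in our monitoring configuration, which is not shown here, and the function names below are hypothetical):

```shell
#!/bin/sh -eu
# Sketch only: mirrors the 48-hour alarm threshold described above.

MAX_AGE_HOURS=48

# Print the age, in whole hours, of a backup that finished at Unix
# time $1, as seen at Unix time $2.
backup_age_hours() {
	echo "$(( ($2 - $1) / 3600 ))"
}

# Succeed if the backup is older than the alarm threshold. The
# comparison is done in seconds so partial hours are not rounded away.
backup_is_stale() {
	[ "$(( $2 - $1 ))" -gt "$(( MAX_AGE_HOURS * 3600 ))" ]
}

# Possible use, given a last_backup value read from the metrics:
#   backup_is_stale "$last_backup" "$(date -u +'%s')" && echo "alarm"
```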

We also conduct a weekly borg check (on Sunday night, UTC) and forward the results to the ops mailing list.

#Areas for improvement

  1. Our PostgreSQL replication strategy is weak: several different approaches have been tried on the same server, and it lacks monitoring. It needs to be rethought. This is related to high availability.
  2. It would be nice to package our borg scripts as an installable Alpine package.
