High availability has not been a priority for SourceHut during early alpha development, but it is becoming more important as we head into the beta. This page describes our plans more than our current implementation.

The priorities are, in order:

  1. Highly available web services
  2. Highly available database
  3. Highly available mail system

Web services

The web services are already mostly designed to avoid keeping local state, with this eventual goal in mind. We should investigate load balancing (possibly with haproxy) so that we can bring nodes into and out of service without downtime. We should also make this the norm for deployments.

Special considerations for deployments

  • SQL migrations should be designed so that both the old and new systems work correctly on both the old and new schemas. This will often require splitting migrations over several releases.
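
As a rough illustration of splitting a migration over several releases, the sketch below replaces one column with another in stages: an "expand" release adds the new column, a later release ships code that writes both columns, and a final "contract" release removes the old column once no old code is running. Everything here is an assumption for the sake of the example: the `repo` table, the `is_public` and `visibility` columns, and the function names are hypothetical, and an in-memory SQLite database stands in for PostgreSQL so the sketch is self-contained.

```python
import sqlite3

# Hypothetical schema change: a boolean "is_public" column is replaced by a
# text "visibility" column. SQLite stands in for PostgreSQL in this sketch.

def create_old_schema(db):
    db.execute("CREATE TABLE repo (name TEXT PRIMARY KEY, is_public INTEGER NOT NULL)")

def migrate_expand(db):
    # Release N: add the new column as nullable and backfill it.
    # Old code never mentions "visibility", so it keeps working.
    db.execute("ALTER TABLE repo ADD COLUMN visibility TEXT")
    db.execute(
        "UPDATE repo SET visibility = "
        "CASE WHEN is_public THEN 'public' ELSE 'private' END"
    )

def old_insert(db, name, is_public):
    # Old code: only knows about is_public.
    db.execute(
        "INSERT INTO repo (name, is_public) VALUES (?, ?)",
        (name, int(is_public)),
    )

def new_insert(db, name, visibility):
    # New code (release N+1): writes both columns, so old nodes that are
    # still in service read consistent data.
    db.execute(
        "INSERT INTO repo (name, is_public, visibility) VALUES (?, ?, ?)",
        (name, int(visibility == "public"), visibility),
    )

def new_read(db, name):
    # New code tolerates rows that old code wrote after the backfill.
    is_public, visibility = db.execute(
        "SELECT is_public, visibility FROM repo WHERE name = ?", (name,)
    ).fetchone()
    if visibility is None:
        return "public" if is_public else "private"
    return visibility

def migrate_contract(db):
    # Release N+2: no old code remains. Backfill stragglers, then rebuild
    # the table without is_public. Code shipped with this release would
    # read only "visibility".
    db.execute(
        "UPDATE repo SET visibility = "
        "CASE WHEN is_public THEN 'public' ELSE 'private' END "
        "WHERE visibility IS NULL"
    )
    db.execute("CREATE TABLE repo_new (name TEXT PRIMARY KEY, visibility TEXT NOT NULL)")
    db.execute("INSERT INTO repo_new SELECT name, visibility FROM repo")
    db.execute("DROP TABLE repo")
    db.execute("ALTER TABLE repo_new RENAME TO repo")
```

The point of the staging is that at every step the code one release behind still runs against the current schema, which is what allows nodes to be upgraded one at a time.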

Special considerations for git.sr.ht, hg.sr.ht

We need to use something like repospanner to distribute git pushes among several nodes.

Can we do something similar for Mercurial?

Related to backups.

Special considerations for builds.sr.ht

The builds.sr.ht worker needs to be updated so that we can reboot it without terminating anyone's jobs. One idea would be to move the job supervisor into a separate process. The difficulty is that, after a restart, the new work scheduler would have to adopt the running job processes and avoid taking on new work from Celery until the resources are freed up.

A possible workaround is to stop accepting new jobs, let the running jobs drain while other build hosts pick up the slack, then reboot and start accepting new jobs once more.
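
The drain-then-reboot workaround could be sketched roughly as follows. This is not the builds.sr.ht worker: `DrainableWorker` and its methods are hypothetical names, and the Celery integration is left out entirely. It only shows the bookkeeping: refuse new jobs while draining, and block until the jobs already running have finished.

```python
import threading

class DrainableWorker:
    """Hypothetical build worker that can stop accepting jobs before a reboot."""

    def __init__(self):
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._running = set()
        self._accepting = True

    def try_accept(self, job_id):
        # Returns False while draining, so the job can go to another host.
        with self._lock:
            if not self._accepting:
                return False
            self._running.add(job_id)
            return True

    def finish(self, job_id):
        # Called when a job completes; wakes up anyone waiting in drain().
        with self._idle:
            self._running.discard(job_id)
            if not self._running:
                self._idle.notify_all()

    def drain(self, timeout=None):
        # Stop taking new work and wait for running jobs to complete.
        # Returns True once idle, or False if the timeout expires first.
        with self._idle:
            self._accepting = False
            return self._idle.wait_for(lambda: not self._running, timeout)

    def resume(self):
        # Called after the reboot (or an aborted drain) to accept work again.
        with self._lock:
            self._accepting = True
```

An operator script would call `drain()`, reboot the host once it returns `True`, and call `resume()` when the worker comes back up.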

Database

????

pgbouncer will probably be of some use. I suspect that we will find it difficult to reach zero-downtime failovers. Ideally, we would be able to do PostgreSQL major version upgrades with minimal downtime.

Care will need to be taken to avoid silently dropping writes.

We need to set up an experimental test network for testing out these ideas, and make a plan.

Highly available mail system

This should be fairly trivial. We need to move the work distribution Redis server from the mail host to the lists host (duh), and then just set up multiple MX records. Zero-downtime migrations can be accomplished by removing an MX record, letting the mail flush, and then doing whatever maintenance is necessary.

IRC bouncer

Because the IRC bouncer hosted on chat.sr.ht cannot be restarted without killing all connections (both upstream and downstream), the restarts should be performed as seldom as possible. During the public beta, our goal is to not reboot more than once per week, except in cases of security vulnerabilities and other urgent issues. The reboot is announced 1h in advance via an IRC bouncer-wide broadcast (/msg BouncerServ server notice <message>).

Before upgrading soju on chat.sr.ht, the new version should be tested for a while on a smaller instance to catch regressions.

After the beta, our goal is to stick to quarterly upgrades.
