On January 11th at approximately 13:38 UTC our main database server, remilia.sr.ht, suffered a hardware failure and became unreachable. The lack of database connectivity caused outages across all services.
Around 13:40 UTC on January 11th, multiple alerts started firing, indicating that most SourceHut services were unavailable. One of these alerts also reported that remilia.sr.ht, a physical host and our main database server, was unavailable. Drew determined that the database server itself was completely unreachable remotely.
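For context, the host-level alert is driven by Prometheus' built-in `up` metric, which is 0 for any scrape target that cannot be reached. Below is a minimal sketch of checking that signal against the Prometheus HTTP API; the server address and instance label are illustrative assumptions, not our actual monitoring configuration:

```python
import requests

# Minimal sketch: query Prometheus for the "up" metric of a single host.
# The Prometheus address and the instance label below are assumptions for
# illustration; they are not our production configuration.
PROMETHEUS = "http://localhost:9090"
QUERY = 'up{instance="remilia.sr.ht:9100"}'

resp = requests.get(
    f"{PROMETHEUS}/api/v1/query",
    params={"query": QUERY},
    timeout=10,
)
resp.raise_for_status()

for result in resp.json()["data"]["result"]:
    _timestamp, value = result["value"]
    if value == "0":
        # A value of 0 means the last scrape of this target failed,
        # i.e. the host is unreachable from Prometheus' point of view.
        print(f"{result['metric']['instance']} is unreachable")
```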
At this point it became clear that we needed someone on-site. Our options for this are the DC operator staff ("remote hands") and our own emergency contact in that area, Calvin. Drew called the DC at 8:45 AM local time; the datacenter's official daytime availability begins at 9 AM, and no one picked up the phone. We have the ability to page them outside of business hours, but it seemed unlikely that this would decrease the response time. Instead, Drew dispatched Calvin, who was immediately responsive but located an hour's travel from the DC, and he started making his way to the datacenter.
While waiting, we explored various scenarios and our options to deal with them. The plan was to attempt to diagnose the issue with the database server, and, if it was unrecoverable, move the hard drives to another host and bring it up as a clone of the database server. Our choice of secondary server was cirno1.sr.ht, one of our build workers, because it is redundant with cirno2.sr.ht and pulling it out of service would not affect service availability.
Once the DC staff was on-site, we had them power-cycle the server. It did not come back up and remained unresponsive; however, the failure mode suggested that the disks were not at fault. Shortly thereafter, Calvin arrived on-site. After some further diagnostic work, he proceeded to swap the disks into cirno1.sr.ht. This brought remilia back in its old state, running in cirno1's chassis.
Once the new remilia was up and running, all services started to recover. The absence of cirno1 leaves builds.sr.ht with reduced build capacity, but repurposing it enabled us to quickly restore general service.
See the timeline below for more details.
We'd like to especially thank Calvin for his work on-site.
All times in UTC.
2023-01-11 13:38: Prometheus starts failing to scrape metrics from remilia.
2023-01-11 13:40: Alerts start to fire for a wide range of services.
2023-01-11 13:42: Finding no means to reach the server remotely, Drew starts making phone calls.
2023-01-11 13:44: Postfix is shut down to prevent the mail delivery pipeline from backing up, and alertmanager is disabled to silence the flood of alerts.
2023-01-11 13:52: Being unable to reach anyone in the DC, Drew activates emergency on-call contact in DC area. Calvin starts heading to the datacenter.
2023-01-11 15:03: Drew reaches DC ops and determines that the hardware (sans the disks) is likely a lost cause.
2023-01-11 15:39: On-call contact arrives at the DC. After confirming DC ops' diagnosis, Calvin swaps the disks into cirno1.sr.ht and brings it online.
2023-01-11 16:04: Prometheus successfully scrapes metrics from remilia (running in cirno1's chassis).
2023-01-11 16:06: Affected services begin to come back up.
2023-01-11 16:09: All services back online.