So, everything is on fire. What do you do?

Don't panic

Take a deep breath. Panicked sysadmins make mistakes. Just relax! An outage is just a bug that needs to be fixed sooner than most.

If you're feeling overwhelmed, stop and take a break. Brew a cup of coffee. Top up your water bottle.

Urgency levels

We have three levels of urgency:

  • urgent: requires an immediate response
  • important: requires a response within 24 hours
  • interesting: does not require a timely response

Use urgent when, for example, a service is down. Use important when there is an imminent (but not immediate) concern, such as a disk that is almost full. An important incident requires a response within 24 hours, but that response may be a plan which takes more than 24 hours to carry out in full.

Alarms from metrics.sr.ht have an urgency level associated with them which automatically prioritizes the incident accordingly.
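
To make the response windows concrete, here is a minimal sketch of the three levels as a lookup table. It only illustrates the rules above and is not how metrics.sr.ht prioritizes alarms; the names Urgency, RESPONSE_WINDOW, and response_due are hypothetical.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Urgency(Enum):
    URGENT = "urgent"
    IMPORTANT = "important"
    INTERESTING = "interesting"


# Time allowed before a first response is due (hypothetical table).
# None means no timely response is required.
RESPONSE_WINDOW = {
    Urgency.URGENT: timedelta(0),           # respond immediately
    Urgency.IMPORTANT: timedelta(hours=24),  # respond within 24 hours
    Urgency.INTERESTING: None,
}


def response_due(urgency: Urgency, raised_at: datetime) -> Optional[datetime]:
    """Return when a response is due, or None if there is no deadline."""
    window = RESPONSE_WINDOW[urgency]
    return None if window is None else raised_at + window
```

Note that the deadline covers the response, not the resolution: an important incident answered with a plan inside the window has met it, even if the plan takes longer to finish.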

Incident response process

The goals of an incident response are to:

  1. Understand the problem
  2. Solve the problem
  3. Prevent it from happening again

In the case of an urgent incident, step 3 can be treated as an important-priority task once service has been restored.

Point admin

If the problem is in your domain, notify #sr.ht.ops that you'll be taking point and start investigating. The point person is the only one who can take actions which change the state of the system, such as:

  • Updating status.sr.ht
  • Restarting services
  • SQL mutations
  • Editing config files
  • etc

So long as the deployment pipeline is still working (including the availability of an authorized person to merge into the affected repository), prefer to deploy new releases to solve problems rather than hotfixing.

Peanut gallery

Onlookers can offer a support role to the point person, be they other SourceHut sysadmins or members of the SourceHut community. (The term "peanut gallery" is used lovingly, I promise.) Their role is to do independent research and offer useful information and suggestions to the point person based on their findings. This includes:

  • Reading service logs
  • Reading relevant source code
  • Reading documentation
  • Executing read-only SQL queries
  • Acting as a sounding board for the point person

The point person is in charge. They also do not have to listen or respond to your messages while they are busy dealing with the problem.

The point person may delegate any of their tasks to a member of the peanut gallery if they see fit.
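
The division of labor above can be sketched as a small permission check. This is an illustration under assumptions, not tooling that SourceHut runs; the action names, the Role enum, and may_perform are hypothetical labels for the items in the two lists.

```python
from enum import Enum


class Role(Enum):
    POINT = "point"
    ONLOOKER = "onlooker"


# Actions that change the state of the system (hypothetical labels).
MUTATING = {"update_status_page", "restart_service", "sql_mutation", "edit_config"}
# Research tasks that anyone may take on (hypothetical labels).
READ_ONLY = {"read_logs", "read_source", "read_docs", "sql_select"}


def may_perform(role: Role, action: str, delegated: bool = False) -> bool:
    """Return True if the role may carry out the action.

    Read-only research is open to everyone. State-changing actions belong
    to the point person unless they have explicitly delegated the task.
    """
    if action in READ_ONLY:
        return True
    if action in MUTATING:
        return role is Role.POINT or delegated
    raise ValueError(f"unknown action: {action}")
```

For example, may_perform(Role.ONLOOKER, "restart_service") is False until the point person delegates that task, while may_perform(Role.ONLOOKER, "read_logs") is always True.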

After the incident

A brief explanation of the problem and its solution is generally welcome on status.sr.ht, to keep users in the loop. Aim for honesty and transparency, and don't shy away from technical details.

Generally, everyone who needs to be in the loop will be in the loop during an incident. But, if not, make sure that any stakeholders understand what happened, how it was resolved, and what mitigations were established to prevent it from happening again.
