So, everything is on fire. What do you do?
Take a deep breath. Panicked sysadmins make mistakes. Just relax! An outage is just a bug that needs to be fixed sooner than most.
If you're feeling overwhelmed, stop and take a break. Brew a cup of coffee. Top up your water bottle.
We have three levels of urgency:
The "urgent" level is used, for example, when a service is down. The "important" level is used when there is an imminent (but not immediate) concern, such as a disk that is almost full. Important incidents require a response within 24 hours, but that response may consist of a plan that takes more than 24 hours to fully resolve the problem.
Alarms from metrics.srht.network have an urgency level associated with them which automatically prioritizes the incident accordingly.
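To make the prioritization concrete, here is a minimal sketch of how urgency levels could map to response deadlines. It is not how metrics.srht.network works internally. The names `Urgency`, `RESPONSE_WINDOW`, `respond_by`, and `triage_order` are hypothetical, only the two levels described above are modeled, and treating "urgent" as "respond immediately" is an assumption. Only the 24-hour window for important incidents comes from the text.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Urgency(Enum):
    # Only the two levels described in the text are modeled here.
    URGENT = "urgent"        # e.g. a service is down
    IMPORTANT = "important"  # e.g. a disk is almost full


# Response windows. The 24-hour figure comes from the text; a zero
# window for "urgent" (respond immediately) is an assumption.
RESPONSE_WINDOW = {
    Urgency.URGENT: timedelta(0),
    Urgency.IMPORTANT: timedelta(hours=24),
}


def respond_by(urgency: Urgency, raised_at: datetime) -> datetime:
    """Return the latest time by which a first response is expected."""
    return raised_at + RESPONSE_WINDOW[urgency]


def triage_order(incidents: list[tuple[str, Urgency, datetime]]) -> list[str]:
    """Return incident names sorted by earliest response deadline."""
    ordered = sorted(incidents, key=lambda i: respond_by(i[1], i[2]))
    return [name for name, _urgency, _raised_at in ordered]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(triage_order([
        ("disk almost full", Urgency.IMPORTANT, now),
        ("web service down", Urgency.URGENT, now),
    ]))
```

Sorting by deadline rather than by level alone means an important incident that has been waiting for close to 24 hours is not starved by newer alarms.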
The goal of an incident response is:
In the case of an urgent incident, step 3 can be treated as important priority following the restoration of service.
If the problem is in your domain, notify #sr.ht.ops that you'll be taking point and start investigating. The point person is the only one who can take actions which change the state of the system, such as:
So long as the deployment pipeline is still working (including the availability of an authorized person to merge into the affected repository), prefer to deploy new releases to solve problems rather than hotfixing.
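The rule above can be written as a small decision function. This is only an illustrative sketch: `PipelineStatus` and `preferred_fix` are hypothetical names, and the two boolean inputs are a simplification of what the point person would judge by hand.

```python
from dataclasses import dataclass


@dataclass
class PipelineStatus:
    # True if the deployment pipeline can still ship a release.
    pipeline_working: bool
    # True if an authorized person is available to merge into the
    # affected repository.
    merger_available: bool


def preferred_fix(status: PipelineStatus) -> str:
    """Return the preferred way to ship a fix during an incident."""
    if status.pipeline_working and status.merger_available:
        return "deploy a new release"
    # Hotfixing is the fallback when a normal release is not possible.
    return "hotfix"


if __name__ == "__main__":
    print(preferred_fix(PipelineStatus(True, True)))
    print(preferred_fix(PipelineStatus(True, False)))
```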
The term "peanut gallery" is used lovingly, I promise. Onlookers can offer a support role to the point person, be they other SourceHut sysadmins or members of the SourceHut community. Their role is doing independent research and offering useful information and suggestions to the point person based on their findings. This includes:
The point person is in charge. They also do not have to listen or respond to your messages while they are busy dealing with the problem.
The point person may delegate any of their tasks to a member of the peanut gallery if they see fit.
A brief explanation of the problem and its solution is generally welcome on status.sr.ht, to keep users in the loop. Aim for honesty and transparency, and don't shy away from technical details.
Generally, everyone who needs to be in the loop will be in the loop during an incident. But, if not, make sure that any stakeholders understand what happened, how it was resolved, and what mitigations were established to prevent it from happening again.