On several occasions, we have simulated outages and gone through the motions of resolving them. This is useful for:

  1. Testing that our systems can tolerate or recover from such failures
  2. Familiarizing operators with the resolution procedures

So far, this has been done informally. We should add more structure and plan these events regularly.

Ideas:

  • Simulate disk failures (yank out a hard drive!)
  • Simulate outages for redundant services (see availability)
  • Kill Celery workers and see how well they catch up once restarted
  • Restore systems from backup, then put the restored system into normal service and tear down the original
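
For the worker drill, it may help to estimate beforehand how long catching up should take, so the observed recovery can be compared against an expectation. The following is a minimal sketch, not a description of our actual setup: the function name `catch_up_seconds` and its parameters are hypothetical, and it assumes a constant task arrival rate and a constant processing rate once the workers are back.

```python
def catch_up_seconds(downtime_s: float, arrival_rate: float, service_rate: float) -> float:
    """Estimate how long workers need to drain the backlog from an outage.

    All inputs are hypothetical planning figures:
      downtime_s   -- how long the workers were down, in seconds
      arrival_rate -- tasks queued per second (assumed constant)
      service_rate -- tasks processed per second once workers are back
    """
    if downtime_s < 0 or arrival_rate < 0 or service_rate < 0:
        raise ValueError("inputs must be non-negative")

    # Tasks that piled up in the queue while no worker was consuming them.
    backlog = arrival_rate * downtime_s
    if backlog == 0:
        return 0.0

    # New tasks keep arriving during recovery, so only the spare
    # capacity (service rate minus arrival rate) reduces the backlog.
    spare = service_rate - arrival_rate
    if spare <= 0:
        return float("inf")  # the workers never catch up

    return backlog / spare


if __name__ == "__main__":
    # Example figures: 10 minutes of downtime, 5 tasks/s arriving, 15 tasks/s processed.
    print(catch_up_seconds(600, 5, 15))
```

If the measured recovery during a drill is much slower than this estimate, that suggests the assumed processing rate is too optimistic or that something other than worker capacity is the bottleneck.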
