On several occasions, we have simulated outages and walked through the procedures for resolving them. This is useful for:

  1. Testing that our systems can tolerate or recover from such failures
  2. Familiarizing operators with the resolution procedures

So far, these exercises have been conducted informally. We should give them more structure and schedule them regularly.

Ideas:

  • Simulate disk failures (yank out a hard drive!)
  • Simulate outages for redundant services (see availability)
  • Kill celery workers and see how well they catch up on the backlog once restarted
  • Restore systems from backup, then put the restored system into normal service and tear down the original
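
One way to put structure around these ideas is to generate a recurring drill calendar that rotates through the scenarios. The following is only a minimal sketch: the `Drill` type, the `plan_drills` function, the scenario labels, and the 30-day interval are all hypothetical names and values chosen for illustration, not part of any existing sr.ht tooling.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from itertools import cycle, islice

# Hypothetical labels for the ideas listed above.
SCENARIOS = [
    "disk failure",
    "redundant service outage",
    "celery worker kill",
    "restore from backup",
]


@dataclass(frozen=True)
class Drill:
    when: date
    scenario: str


def plan_drills(start, interval_days, count, scenarios=SCENARIOS):
    """Return `count` drills spaced `interval_days` apart, rotating scenarios."""
    if interval_days <= 0:
        raise ValueError("interval_days must be positive")
    if not scenarios:
        raise ValueError("at least one scenario is required")
    # cycle() repeats the scenario list; islice() stops after `count` items.
    rotation = islice(cycle(scenarios), count)
    return [
        Drill(start + timedelta(days=i * interval_days), scenario)
        for i, scenario in enumerate(rotation)
    ]


if __name__ == "__main__":
    # Assumed cadence: one drill every 30 days, starting on an arbitrary date.
    for drill in plan_drills(date(2024, 1, 1), 30, 6):
        print(drill.when.isoformat(), drill.scenario)
```

Rotating through a fixed list keeps every scenario exercised at the same frequency. A real schedule would likely also want an assigned operator and a place to record the outcome of each drill.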
