We monitor everything with Prometheus, and configure alarms with alertmanager.
Our Prometheus instance is publicly available at metrics.srht.network.
A pushgateway is running at push.metrics.srht.network. It's firewalled to only accept connections from our own hosts.
We use alertmanager to forward alerts to various sinks.
Some security-related alarms are sent directly to Drew and are not made public.
Our alerts are configured here:
https://git.sr.ht/~sircmpwn/metrics.sr.ht
Alarm urgency levels correspond to the appropriate response times during an incident; configure them accordingly. Alarms should not be too noisy: ideally, every alarm that fires should require attention, which reduces the risk of alarm fatigue.
Generally, we should aim to set up alarms that predict problems before they occur. How far in advance an alarm should fire is determined by the lead time on a solution. For example, the lead time on securing new hard drives is a few weeks, so "drive full" alarms are planned, based on the expected growth rate of the filesystem, to fire a few weeks before the drive will be full (with a generous margin for error). Similarly, chat.sr.ht has alarms that warn us when our number of users on any IRC network is on course to exceed the number of slots that network has allocated to us, with sufficient advance notice to coordinate an increase to our allotment.
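The lead-time reasoning above can be illustrated with a small sketch. This is not how our alerts are actually written (those live in the repository linked earlier, and in Prometheus this kind of extrapolation is usually expressed with PromQL's predict_linear function); it only shows the arithmetic. The names Sample, growth_rate, days_until_full and should_alarm, the 21-day lead time and the 1.5x margin are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

SECONDS_PER_DAY = 86400


@dataclass
class Sample:
    """One hypothetical disk-usage observation."""
    timestamp: float   # Unix time, in seconds
    used_bytes: float  # bytes in use at that time


def growth_rate(samples: Sequence[Sample]) -> float:
    """Least-squares slope of usage over time, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate growth")
    mean_t = sum(s.timestamp for s in samples) / n
    mean_u = sum(s.used_bytes for s in samples) / n
    num = sum((s.timestamp - mean_t) * (s.used_bytes - mean_u) for s in samples)
    den = sum((s.timestamp - mean_t) ** 2 for s in samples)
    if den == 0:
        raise ValueError("samples must span more than one timestamp")
    return num / den


def days_until_full(samples: Sequence[Sample], capacity_bytes: float) -> Optional[float]:
    """Estimated days until the filesystem fills, or None if usage is not growing."""
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    remaining = max(capacity_bytes - samples[-1].used_bytes, 0.0)
    return remaining / rate / SECONDS_PER_DAY


def should_alarm(samples: Sequence[Sample], capacity_bytes: float,
                 lead_time_days: float, margin: float = 1.5) -> bool:
    """Fire once the projected time to full drops below lead time times margin."""
    days = days_until_full(samples, capacity_bytes)
    return days is not None and days <= lead_time_days * margin


# Example: usage grows by 10 GB per day, starting from 500 GB.
history = [Sample(d * SECONDS_PER_DAY, 500e9 + d * 10e9) for d in range(11)]
print(days_until_full(history, 1000e9))
print(should_alarm(history, 1000e9, lead_time_days=21))
print(should_alarm(history, 800e9, lead_time_days=21))
```

With a three-week lead time and a 1.5x margin, the alarm in this sketch fires once the projection falls to about 31 days or less, which leaves room for both the procurement delay and errors in the growth estimate. The same shape of calculation would apply to the IRC slot alarms, with user counts in place of bytes and the network's slot allocation in place of disk capacity.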