We monitor everything with Prometheus and configure alarms with Alertmanager.

Public metrics

Our Prometheus instance is publicly available at metrics.srht.network.
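Because the instance is public, anyone can query it through the standard Prometheus HTTP API. The following is a minimal sketch, not an official client. It assumes the instance is served over HTTPS at the root path, and the `up` query mentioned below is only an illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public instance is served over HTTPS at the root path.
BASE_URL = "https://metrics.srht.network"


def instant_query_url(expr, base_url=BASE_URL):
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def instant_query(expr, base_url=BASE_URL, timeout=10):
    """Run an instant query and return the list of result samples."""
    url = instant_query_url(expr, base_url)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    return payload["data"]["result"]
```

For example, `instant_query("up")` should return one sample per scrape target, since Prometheus records `up` for every target it scrapes.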

Areas for improvement

  1. We should make dashboards. They would be pretty to look at and could be a useful tool for root cause analysis. Note that some users who have their own Grafana instance have pointed it at our public Prometheus data and made some simple dashboards; I would be open to community ownership of this.

Pushgateway

A pushgateway is running at push.metrics.srht.network. It's firewalled to only accept connections from our own hosts.
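As a rough illustration of how one of our hosts might push a metric, here is a sketch that uses the Pushgateway's documented `/metrics/job/<job>` endpoint and the Prometheus text exposition format. The HTTPS scheme and the job, label, and metric names in the usage note are assumptions made for the example, not names taken from our configuration.

```python
import urllib.parse
import urllib.request

# Assumption: the gateway speaks HTTPS on the default port.
# It only accepts connections from our own hosts.
GATEWAY = "https://push.metrics.srht.network"


def push_url(job, grouping=None, gateway=GATEWAY):
    """Build the push URL: /metrics/job/<job>{/<label>/<value>}."""
    path = "/metrics/job/" + urllib.parse.quote(job, safe="")
    for name, value in (grouping or {}).items():
        if "/" in value:
            # The Pushgateway needs a special encoding for such values;
            # this sketch does not implement it.
            raise ValueError("label values containing '/' are not supported here")
        path += "/" + urllib.parse.quote(name, safe="")
        path += "/" + urllib.parse.quote(value, safe="")
    return gateway + path


def exposition(name, value, help_text):
    """Render a single gauge in the text exposition format."""
    return "# HELP {0} {1}\n# TYPE {0} gauge\n{0} {2}\n".format(name, help_text, value)


def push(job, body, grouping=None, gateway=GATEWAY):
    """PUT the metrics, replacing everything in the same grouping key."""
    req = urllib.request.Request(
        push_url(job, grouping, gateway),
        data=body.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

A batch job could then call something like `push("backup", exposition("backup_last_success_timestamp_seconds", 1700000000, "Last successful backup"), {"instance": "host1"})`, where every name is hypothetical.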

Alertmanager

We use Alertmanager to forward alerts to various sinks.

  • interesting alerts are forwarded to the IRC channel, #sr.ht.ops
  • important alerts are sent to the ops mailing list and the IRC channel
  • urgent alerts page Drew's phone and are sent to the mailing list and the IRC channel

Some security-related alarms are sent directly to Drew and are not made public.
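The escalation described above can be pictured as a small lookup table in which each urgency level adds a sink to the previous one. This is only a sketch of the idea: the real routing is done by Alertmanager, and the sink names below are placeholders rather than our actual receiver names.

```python
# Sink names are placeholders for illustration, not real receiver names.
SINKS_BY_URGENCY = {
    "interesting": ("irc",),
    "important": ("irc", "mailing-list"),
    "urgent": ("irc", "mailing-list", "pager"),
}


def sinks_for(urgency):
    """Return the sinks an alert of the given urgency is forwarded to."""
    try:
        return SINKS_BY_URGENCY[urgency]
    except KeyError:
        raise ValueError("unknown urgency: %r" % (urgency,))
```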

Our alerts are configured here:

https://git.sr.ht/~sircmpwn/metrics.sr.ht

Configuring good alarms

Alarm urgency levels correspond to the appropriate response times during an incident; configure them accordingly. Alarms should not be too noisy: ideally, every alarm that fires should require attention, which reduces the risk of alarm fatigue.

Generally, we should aim to set up alarms that predict problems before they occur. How far in advance depends on the lead time for a solution. For example, the lead time on securing new hard drives is a few weeks, so "drive full" alarms are planned around the expected growth rate of the filesystem and should fire a few weeks before the drive would be full (with a generous margin for error). Similarly, chat.sr.ht has alarms that fire before our number of users on any IRC network exceeds the number of slots that network has allocated to us, with sufficient advance notice to coordinate an increase to our allotment.
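The arithmetic behind a lead-time-based alarm can be sketched as follows. This assumes linear growth, and the margin factor and the figures in the usage note are made up for illustration; they are not our actual thresholds.

```python
GIB = 2 ** 30


def days_until_full(free_bytes, growth_bytes_per_day):
    """Estimate the days remaining, assuming linear growth."""
    if growth_bytes_per_day <= 0:
        # Not growing (or shrinking): the filesystem never fills up.
        return float("inf")
    return free_bytes / growth_bytes_per_day


def should_alarm(free_bytes, growth_bytes_per_day, lead_time_days, margin=2.0):
    """Fire once the time remaining drops to the lead time times a margin.

    The default margin of 2.0 is an arbitrary illustrative choice.
    """
    remaining = days_until_full(free_bytes, growth_bytes_per_day)
    return remaining <= lead_time_days * margin
```

With 600 GiB free and growth of 10 GiB per day, the estimate is 60 days. Given a 21-day lead time and a margin of 2, the alarm threshold is 42 days, so `should_alarm(600 * GIB, 10 * GIB, 21)` is false, while the same call with 400 GiB free (40 days remaining) is true. Prometheus also provides a `predict_linear` function that can express this kind of projection directly in an alerting rule.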

Areas for improvement

  1. It would be nice to have centralized logging. There is sensitive information in some of our logs, so this probably can't be made public.
