We monitor everything with Prometheus, and configure alarms with alertmanager.

Public metrics

Our Prometheus instance is publicly available at metrics.sr.ht.

Areas for improvement
  1. We should make dashboards. They would be pleasant to look at and could be a useful tool for root cause analysis. Note that some users who run their own Grafana instances have pointed them at our public Prometheus data and made some simple dashboards; I would be open to community ownership of this.

Pushgateway

A pushgateway is running at push.metrics.sr.ht. It's firewalled to only accept connections from our subnet.
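As a rough sketch of how a batch job inside the subnet might report to the pushgateway: the Pushgateway accepts a PUT of Prometheus text-format samples at a URL of the form /metrics/job/<job>/<label>/<value>. The job name, metric name, host label, and the plain http:// scheme below are assumptions for illustration, not taken from our configuration, and the sketch assumes label values that contain no slashes.

```python
import urllib.request
from urllib.parse import quote


def pushgateway_url(base, job, grouping=None):
    """Build the push URL: <base>/metrics/job/<job>{/<label>/<value>}."""
    parts = [base.rstrip("/"), "metrics", "job", quote(job, safe="")]
    for label, value in (grouping or {}).items():
        parts += [quote(label, safe=""), quote(value, safe="")]
    return "/".join(parts)


def render_gauge(name, value, help_text=""):
    """Render one gauge sample in the Prometheus text exposition format."""
    lines = []
    if help_text:
        lines.append(f"# HELP {name} {help_text}")
    lines.append(f"# TYPE {name} gauge")
    lines.append(f"{name} {value}")
    # The body must end with a line feed.
    return "\n".join(lines) + "\n"


def push(base, job, body, grouping=None, timeout=10):
    """PUT the body, replacing all metrics in this job's grouping key."""
    req = urllib.request.Request(
        pushgateway_url(base, job, grouping),
        data=body.encode("utf-8"),
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


if __name__ == "__main__":
    # Hypothetical job and metric names; the scheme is assumed.
    body = render_gauge("backup_last_success_timestamp_seconds", 1700000000)
    push("http://push.metrics.sr.ht", "nightly_backup", body,
         grouping={"instance": "example-host"})
```

Because of the firewall, a push like this would only succeed from a host on our subnet.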

Aggregation gateway

prom-aggregation-gateway is running at aggr.metrics.sr.ht. It's firewalled to only accept connections from our subnet.

Alertmanager

We use alertmanager to forward alerts to various sinks.

  • interesting alerts are forwarded to the IRC channel, #sr.ht.ops
  • important alerts are sent to the ops mailing list and the IRC channel
  • urgent alerts page Drew's phone and are also sent to the mailing list and the IRC channel
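
The routing above is cumulative: each urgency level reaches every sink of the level below it, plus one more. A minimal sketch of that mapping follows. The keys and sink names are illustrative labels only, not identifiers from our alertmanager configuration.

```python
# Sinks per urgency level, mirroring the list above.
# Level and sink names are illustrative, not real config identifiers.
ROUTES = {
    "interesting": ["irc"],
    "important": ["irc", "mailing-list"],
    "urgent": ["irc", "mailing-list", "pager"],
}


def sinks_for(urgency):
    """Return the sinks an alert of the given urgency is forwarded to."""
    try:
        return ROUTES[urgency]
    except KeyError:
        raise ValueError(f"unknown urgency level: {urgency!r}") from None
```

For example, sinks_for("important") gives the IRC channel and the mailing list, but not the pager.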

Some security-related alarms are sent directly to Drew and are not made public.

Our alerts are configured here:

https://git.sr.ht/~sircmpwn/metrics.sr.ht

Configuring good alarms

Alarm urgency levels correspond to the appropriate response times during an incident; configure them accordingly. Alarms should not be too noisy: ideally, every alarm that fires should require attention, which reduces the risk of alarm fatigue.

Generally, we should aim to set up alarms that predict problems before they occur. How far in advance an alarm fires should be determined by the lead time on a solution. For example, the lead time on securing new hard drives is a few weeks, so "drive full" alarms are planned, based on the expected growth rate of the filesystem, to fire a few weeks before the drive will be full (with a generous margin for error). Similarly, chat.sr.ht has alarms that warn when the number of users on an IRC network is on track to exceed the number of slots that network has allocated to us, with sufficient advance notice to coordinate an increase to our allotment.
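
The lead-time reasoning can be sketched as a small calculation: assume linear growth, estimate how long until the resource runs out, and fire once that estimate falls within the lead time multiplied by a safety margin. The 21-day lead time, the 1.5x margin, and the sample figures are assumptions chosen for illustration, not our actual thresholds.

```python
def days_until_full(used, capacity, growth_per_day):
    """Estimate days until `used` reaches `capacity` at a linear growth rate."""
    if growth_per_day <= 0:
        return float("inf")  # not growing, so it never fills
    return max(capacity - used, 0) / growth_per_day


def should_alert(used, capacity, growth_per_day, lead_time_days, margin=1.5):
    """Fire once the time left is within the lead time plus a safety margin."""
    remaining = days_until_full(used, capacity, growth_per_day)
    return remaining <= lead_time_days * margin


# Hypothetical 10000 GB filesystem growing by 20 GB per day,
# with a 21-day lead time on new drives.
print(days_until_full(8500, 10000, 20))   # 75.0 days left
print(should_alert(8500, 10000, 20, 21))  # False: 75 > 31.5
print(should_alert(9500, 10000, 20, 21))  # True: 25 <= 31.5
```

The same check applies to any finite allotment, such as IRC user slots, by substituting users for bytes and the time needed to negotiate a larger allotment for the lead time.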

Areas for improvement

  1. Would be nice to have centralized logging. There is sensitive information in some of our logs, so this probably can't be made public.
