blumeops/docs/how-to/runbooks/first-alert-and-runbook.md

---
title: First Alert and Runbook
modified: 2026-03-22
tags:
  - how-to
  - alerting
---

# First Alert and Runbook

Create one end-to-end alert as proof of concept — an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc.

## What to Do

### 1. Choose the First Alert

The best candidate is a **blackbox probe failure** because:
- Alloy's blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals
- The metric `probe_success` is already in Prometheus
- It maps directly to what services-check does (HTTP health checks)
- A single alert rule with a `service` label can cover all probed services

### 2. Create the Alert Rule

Provision via YAML in the alerting provisioning ConfigMap. The rule should:
- Query `probe_success == 0` from Prometheus
- Fire after the condition persists for 2 minutes (avoid flapping)
- Include labels: `severity: warning`, `service: {{ $labels.instance }}`
- Include annotations: `summary`, `runbook_url` pointing to the runbook doc

### 3. Create the Runbook

Write `docs/how-to/runbooks/runbook-service-probe-failure.md` as a how-to doc explaining:
- What the alert means
- How to check which service is down
- Common causes and resolution steps
- How to silence the alert if the downtime is planned

### 4. Verify End-to-End

- Stop one of the probed services (e.g., scale miniflux to 0)
- Wait for the alert to fire (~2 minutes)
- Confirm ntfy notification arrives with correct summary and runbook link
- Click the runbook link and verify it reaches docs.eblu.me
- Scale the service back up
- Confirm "resolved" notification arrives
- Confirm no repeat notification during the 24h window

## Key Details

- Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies
- The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name
- The runbook URL format: `https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure`

## Verification

- [ ] Alert rule appears in Grafana UI under Alerting → Alert Rules
- [ ] Simulated failure triggers ntfy notification within ~3 minutes
- [ ] Notification includes service name, summary, and clickable runbook link
- [ ] Resolution triggers a "resolved" notification
- [ ] No repeat notification within 24h window

## Related

- [[configure-grafana-alerting-pipeline]] — Prerequisite: pipeline must be working
- [[deploy-infra-alerting]] — Parent goal
- [[port-services-check-alerts]] — Next: port remaining checks
- [[runbook-service-probe-failure]] — The runbook created for this alert
C2: Deploy infrastructure alerting pipeline (#303) ## Summary Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications. Design: - Grafana Unified Alerting evaluates rules against Prometheus/Loki - ntfy webhook contact point delivers iOS notifications - Anti-noise policy: page once per 24h per alert group - Every alert links to a runbook in `docs/how-to/alerts/` - services-check eventually queries the alerting API instead of doing its own probes Chain (bottom-up): 1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy 2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure 3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks 4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API 5. `deploy-infra-alerting` — goal card 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.eblu.me/eblume/blumeops/pulls/303 2026-03-22 14:52:56 -07:00			`---`
			`title: First Alert and Runbook`
			`modified: 2026-03-22`
			`tags:`
			`- how-to`
			`- alerting`
			`---`

			`# First Alert and Runbook`

			`Create one end-to-end alert as proof of concept — an alert rule that fires, delivers a notification to ntfy with a runbook link, and has a corresponding runbook doc.`

			`## What to Do`

			`### 1. Choose the First Alert`

			`The best candidate is a blackbox probe failure because:`
			`- Alloy's blackbox exporter already probes 5 services (miniflux, kiwix, transmission, devpi, argocd) at 30s intervals`
			- The metric `probe_success` is already in Prometheus
			`- It maps directly to what services-check does (HTTP health checks)`
			- A single alert rule with a `service` label can cover all probed services

			`### 2. Create the Alert Rule`

			`Provision via YAML in the alerting provisioning ConfigMap. The rule should:`
			- Query `probe_success == 0` from Prometheus
			`- Fire after the condition persists for 2 minutes (avoid flapping)`
			- Include labels: `severity: warning`, `service: {{ $labels.instance }}`
			- Include annotations: `summary`, `runbook_url` pointing to the runbook doc

			`### 3. Create the Runbook`

			Write `docs/how-to/runbooks/runbook-service-probe-failure.md` as a how-to doc explaining:
			`- What the alert means`
			`- How to check which service is down`
			`- Common causes and resolution steps`
			`- How to silence the alert if the downtime is planned`

			`### 4. Verify End-to-End`

			`- Stop one of the probed services (e.g., scale miniflux to 0)`
			`- Wait for the alert to fire (~2 minutes)`
			`- Confirm ntfy notification arrives with correct summary and runbook link`
			`- Click the runbook link and verify it reaches docs.eblu.me`
			`- Scale the service back up`
			`- Confirm "resolved" notification arrives`
			`- Confirm no repeat notification during the 24h window`

			`## Key Details`

			`- Grafana alert rules can be provisioned as YAML files alongside contact points and notification policies`
			- The blackbox probe metrics from Alloy use the job name `blackbox` and include an `instance` label with the service name
			- The runbook URL format: `https://docs.eblu.me/how-to/runbooks/runbook-service-probe-failure`

			`## Verification`

			`- [ ] Alert rule appears in Grafana UI under Alerting → Alert Rules`
			`- [ ] Simulated failure triggers ntfy notification within ~3 minutes`
			`- [ ] Notification includes service name, summary, and clickable runbook link`
			`- [ ] Resolution triggers a "resolved" notification`
			`- [ ] No repeat notification within 24h window`

			`## Related`

			`- [[configure-grafana-alerting-pipeline]] — Prerequisite: pipeline must be working`
			`- [[deploy-infra-alerting]] — Parent goal`
			`- [[port-services-check-alerts]] — Next: port remaining checks`
			`- [[runbook-service-probe-failure]] — The runbook created for this alert`