- Add ArgoCD metrics scrape target to Prometheus (argocd-metrics:8082)
- Add ArgoCDAppOutOfSync alert: fires when argocd_app_info has
sync_status != Synced for 30 minutes
- Add runbook with diagnostic steps and common fixes
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TextfileStale: fires when a .prom textfile on indri hasn't been
updated in 1 hour (node_textfile_mtime_seconds). Covers borgmatic,
zot, minikube, jellyfin exporters.
- FrigateCameraDown: fires when frigate_camera_fps drops to 0 for 5m.
- Add runbooks for both alerts.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No unhealthy pods = no query results = noData state. With noDataState
set to NoData, Grafana fires an alert with empty labels ("Pod in is
not ready"). Change to OK since no results means everything is healthy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
CronJob pods (e.g., zim-watcher) are expected to complete and become
not-ready. Exclude them with `unless on (namespace, pod) kube_pod_owner{owner_kind="Job"}`.
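
The `unless on (namespace, pod)` exclusion above behaves like a set difference keyed on two labels. A minimal Python sketch of that matching rule, using hypothetical namespace and pod names (this illustrates the semantics only, not how Prometheus evaluates the query):

```python
def unless_on(left, right, keys):
    """Keep left-hand series whose values for `keys` have no match on the right."""
    # PromQL treats a missing label as an empty string when matching.
    right_keys = {tuple(labels.get(k, "") for k in keys) for labels in right}
    return [
        labels for labels in left
        if tuple(labels.get(k, "") for k in keys) not in right_keys
    ]


# Hypothetical label sets: not-ready pods and pods owned by a Job.
not_ready = [
    {"namespace": "kiwix", "pod": "zim-watcher-28490-abcde"},
    {"namespace": "miniflux", "pod": "miniflux-0"},
]
job_owned = [
    {"namespace": "kiwix", "pod": "zim-watcher-28490-abcde", "owner_kind": "Job"},
]

remaining = unless_on(not_ready, job_owned, ("namespace", "pod"))
```

Only the pod with no Job owner survives, so completed CronJob pods no longer trip the alert.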
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use the correct provisioning field name for Grafana webhook custom
payloads: settings.payload.template (not payloadTemplate).
Found by reading the Go source (grafana/alerting receivers/webhook/v1/config.go):
Payload *CustomPayload `json:"payload,omitempty"`
CustomPayload.Template string `json:"template,omitempty"`
The template uses coll.Dict, coll.Append, and data.ToJSON to produce
ntfy-native JSON with topic, title, message, priority, and action
buttons linking to runbooks.
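
For orientation, the JSON shape the template is meant to emit can be sketched in Python. The field names follow ntfy's JSON publishing format; the function name, default priority, topic, and URL below are illustrative assumptions, not values from the provisioning file:

```python
import json


def ntfy_payload(topic, title, message, runbook_url, priority=4):
    """Build an ntfy-style message with one 'view' action button."""
    return {
        "topic": topic,
        "title": title,
        "message": message,
        "priority": priority,  # ntfy priorities range from 1 (min) to 5 (max)
        "actions": [
            {"action": "view", "label": "Runbook", "url": runbook_url},
        ],
    }


# Hypothetical alert values, for illustration only.
body = json.dumps(
    ntfy_payload(
        topic="alerts",
        title="ServiceProbeFailure",
        message="Probe failing for miniflux",
        runbook_url="https://example.invalid/runbook-service-probe-failure",
    )
)
```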
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add reduce step between Prometheus query and threshold to preserve
per-service labels. Without it, Grafana can't distinguish the 5
probe_success series and errors with "duplicate results with labels {}".
Chain: A (prometheus query) → B (reduce last) → C (threshold < 1)
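
The data flow of the A → B → C chain can be sketched in plain Python. This is a simplified model with made-up samples, not Grafana's expression engine. The point is that the reduce step emits one value per series and keeps its labels, so the threshold can be evaluated per service:

```python
def reduce_last(series):
    """B: collapse each series to its last sample, keeping its labels."""
    return [(labels, samples[-1]) for labels, samples in series]


def threshold_below(reduced, limit):
    """C: flag every labelled value that falls below `limit`."""
    return {labels["service"]: value < limit for labels, value in reduced}


# A: hypothetical probe_success samples for two of the probed services.
query_result = [
    ({"service": "miniflux"}, [1, 1, 1]),
    ({"service": "kiwix"}, [1, 0, 0]),
]

firing = threshold_below(reduce_last(query_result), limit=1)
```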
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add ServiceProbeFailure alert rule to Grafana alerting provisioning
- Queries probe_success metric from Alloy blackbox exporter
- Extracts service name from job label via label_replace
- Fires after 2 minutes of failure, noDataState=Alerting
- Annotations include summary with service name and runbook URL
- Add runbook at docs/how-to/alerts/runbook-service-probe-failure.md
- Covers all 5 probed services (miniflux, kiwix, transmission, devpi, argocd)
- Diagnostic steps, common causes, silencing instructions
- Add alerting section to observability.md reference doc
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace single aggregate camera_fps check with per-camera FPS validation
and NFS storage accessibility check. Motivated by an outage where Frigate
API responded OK but NFS mount was inaccessible, causing "no frames" in UI.
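
A rough sketch of the two checks, assuming the Frigate stats payload exposes a `cameras` mapping with a `camera_fps` value per camera (the payload shape, function names, and paths are assumptions for illustration; the real check lives in services-check):

```python
import os


def dead_cameras(stats):
    """Return the names of cameras reporting no frames."""
    cameras = stats.get("cameras", {})
    return sorted(
        name for name, cam in cameras.items() if cam.get("camera_fps", 0) <= 0
    )


def storage_accessible(path):
    """Return True if the recordings directory can be listed."""
    try:
        os.listdir(path)
        return True
    except OSError:
        return False


# Hypothetical stats payload with one healthy and one stalled camera.
sample_stats = {"cameras": {"front": {"camera_fps": 5.0}, "back": {"camera_fps": 0}}}
stalled = dead_cameras(sample_stats)
```

A hung NFS mount can block `os.listdir` indefinitely, so a production check would likely wrap the storage probe in a timeout.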
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Also bumps kiwix's transmission sidecar from v4.0.6-r4 to
v4.1.1-r1, matching the torrent service (upgraded in PR #282, where
the kiwix sidecar was missed). No breaking changes: the old RPC
protocol is supported through 4.x.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point all services at the 613f05d images which carry the new
consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4
version, no matching build) and docs/quartz (no 613f05d build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every container now carries title, description, version, source, and
vendor labels per the OCI image spec. Version is derived from the
existing CONTAINER_APP_VERSION ARG at build time.
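
For reference, the five labels map onto the `org.opencontainers.image.*` annotation keys from the OCI image spec. A small Python sketch of the resulting label set, with placeholder values and a hypothetical helper name (the actual labels are set in the container build files):

```python
def oci_labels(title, description, version, source, vendor):
    """Return the five OCI image labels as a key/value mapping."""
    prefix = "org.opencontainers.image."
    return {
        prefix + "title": title,
        prefix + "description": description,
        prefix + "version": version,  # derived from CONTAINER_APP_VERSION at build time
        prefix + "source": source,
        prefix + "vendor": vendor,
    }


# Placeholder values, for illustration only.
labels = oci_labels(
    title="example",
    description="Example service image",
    version="1.0.0",
    source="https://example.invalid/repo",
    vendor="example-vendor",
)
```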
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The top-level "uid" in Grafana dashboard JSON sits at 2-space indent
near the end of the file; it is not the first occurrence. Match on
`^  "uid"` (line start plus exactly two spaces) to avoid clobbering
nested datasource uid references.
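
The same anchoring idea, sketched in Python rather than sed. It assumes the dashboard JSON is pretty-printed with 2-space indentation, so the top-level key is the only "uid" line with exactly two leading spaces. The sample document and UID values are made up:

```python
import json
import re

# Line start, exactly two spaces, then the "uid" key and its string value.
TOP_LEVEL_UID = re.compile(r'^  "uid": "[^"]*"', flags=re.MULTILINE)


def stamp_top_level_uid(dashboard_json, new_uid):
    """Replace only the top-level dashboard uid, leaving nested uids alone."""
    return TOP_LEVEL_UID.sub(lambda m: f'  "uid": "{new_uid}"', dashboard_json, count=1)


# Made-up dashboard where a nested datasource uid appears first.
sample = """{
  "panels": [
    {
      "datasource": {
        "uid": "prometheus"
      }
    }
  ],
  "title": "Example",
  "uid": "old-uid"
}"""

stamped = json.loads(stamp_top_level_uid(sample, "stable-uid"))
```

The nested datasource line starts with more than two spaces, so the pattern skips it even though it comes first in the file.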
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous sed replaced ALL "uid" fields in dashboard JSON files,
including datasource references inside panels, causing dashboards to
go dark. Scope the replacement to only the first occurrence (the
top-level dashboard UID) using GNU sed 0,/pattern/ addressing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add init container to pre-populate ConfigMap dashboards before Grafana
starts, eliminating the race between the sidecar and the provisioner
that caused dashboard DB records to be deleted and re-created with new
IDs. Also stamp stable UIDs on TeslaMate and UnPoller dashboards
fetched from upstream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four project-scoped subagents that formalize existing mise task
workflows as constrained, specialized AI agents:
- infra-health: background health monitor (wraps services-check)
- doc-reviewer: persistent-memory documentation reviewer
- change-classifier: C0/C1/C2 triage before work begins
- mikado-navigator: C2 chain state advisor (wraps docs-mikado)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
StatefulSet volumeClaimTemplates are immutable and minikube's hostpath
provisioner doesn't enforce PVC size limits anyway. Add comments noting
the data grows freely on the 1.8TB backing disk.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename dashboard title since borgmatic is just the execution layer.
Add Backup Duration Over Time panel next to New Data Per Backup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Forgejo: show only notifications and pull requests
- Jellyfin: show only movies/series/episodes, hide now playing
- Grafana: hide data sources, show dashboards and alerts only
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add useEqualHeights: true so service tiles within each row expand to
match the tallest tile, fixing uneven layout from widget metrics.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
style: row makes each group span the full page width (one per row),
while columns: 4 tiles services horizontally within each group.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add maxGroupColumns: 1 so each category gets its own full-width row,
with service tiles arranged side-by-side within each group.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move NVR, Jellyfin, and DJ to new Home group. Move Grafana from Content
to Infrastructure. Switch all layout groups from column to row style.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Mealie SQLite dump hook used `minikube-indri` (the context name on
gilbert), but on indri itself the context is just `minikube`. The
before_backup hook therefore failed, and every backup has aborted since
the hook was added.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Increase retention: continuous 3→180d, detections 14→30d, alerts 30→730d.
Plenty of NFS headroom (~9.4 TiB free, ~6.6 GB/day for one camera).
Add frigate-recording check to services-check that verifies camera_fps > 0,
which would have caught the 6-day outage from the mqtt config removal.
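
A back-of-the-envelope check of the headroom claim, assuming the ~6.6 GB/day figure is in decimal gigabytes and applies to continuous recording from the single camera (both assumptions; detection and alert clips retained beyond 180 days are ignored here):

```python
GB = 10**9   # decimal gigabyte
TIB = 2**40  # binary tebibyte

daily_bytes = 6.6 * GB
free_bytes = 9.4 * TIB
continuous_days = 180

# Space needed to hold the full continuous-retention window, in TiB.
needed_tib = daily_bytes * continuous_days / TIB

# Days of recording the free space could absorb at the current rate.
headroom_days = free_bytes / daily_bytes

print(f"continuous window: {needed_tib:.2f} TiB")
print(f"headroom at current rate: {headroom_days:.0f} days")
```

Under these assumptions the 180-day window needs only a small fraction of the free space.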
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Frigate's config schema requires an `mqtt` field even when MQTT isn't
used. Commit 40f1568 removed it along with Mosquitto, causing Frigate
to fail validation on startup. Add `mqtt.enabled: false` to satisfy
the schema without needing a broker.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point alloy-k8s at v1.14.0-61f02a0 (Dockerfile) and both ringtail
deployments at v1.14.0-61f02a0-nix (Nix build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Add `containers/alloy/` with dual Dockerfile + Nix build files for Grafana Alloy v1.14.0
- Both builds fetch source from forge mirror (`forge.ops.eblu.me/mirrors/alloy.git`), build the web UI (Node), then compile the Go binary with `netgo embedalloyui` tags
- Update all three alloy deployments (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail) to use `registry.ops.eblu.me/blumeops/alloy`
- `promtail_journal_enabled` tag omitted — requires systemd headers and none of our configs use `loki.source.journal`
## Build verification
- **Dockerfile:** Tested locally via `docker build`, binary reports `v1.14.0` with correct tags
- **Nix:** Tested on ringtail via `nix-build`, all three hashes (fetchgit, npmDeps, goModules) resolved and build succeeds
## Post-merge steps
1. Wait for CI to build the container from main (both Dockerfile and Nix workflows)
2. `mise run container-list alloy` to find the `[main]` tagged image
3. C0 follow-up to update `newTag` in all three kustomizations from `v1.14.0-placeholder` to the real tag
4. Sync ArgoCD apps and verify pods come up healthy
Reviewed-on: #300
Mealie's orderBy=random requires a paginationSeed parameter, otherwise
the API returns 422. Added the seed to all random query examples.
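
A sketch of building such a query in Python. The `orderBy` and `paginationSeed` parameters come from the note above. The `/api/recipes` path, the `perPage` parameter, the base URL, and the function name are assumptions for illustration:

```python
import secrets
from urllib.parse import urlencode


def random_recipes_url(base_url, per_page=3, seed=None):
    """Build a random-order recipe query that always carries a seed."""
    seed = seed or secrets.token_hex(8)  # any fresh string gives a new shuffle
    query = urlencode(
        {"orderBy": "random", "paginationSeed": seed, "perPage": per_page}
    )
    return f"{base_url}/api/recipes?{query}"


# Hypothetical host; reuse the same seed to page through one shuffle.
url = random_recipes_url("https://mealie.example.invalid", seed="abc123")
```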
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Agent-facing guide for generating unified cooking timelines from
Mealie meal plans. Covers querying the API, picking balanced meals
(protein/carb/vegetable), and interleaving recipe steps into a
relative timeline so everything finishes together.
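
One way to sketch the interleaving: work backwards from the shared finish time, so each step gets a "minutes before serving" offset, then sort all steps by that offset. The recipes, step names, and durations below are invented, and the guide itself may structure this differently:

```python
def relative_timeline(recipes):
    """Merge recipe steps into one list of (minutes_before_serving, recipe, step).

    `recipes` maps a recipe name to an ordered list of (step, minutes) pairs.
    Every recipe is scheduled so that its last step ends at serving time.
    """
    events = []
    for name, steps in recipes.items():
        remaining = sum(minutes for _, minutes in steps)
        for step, minutes in steps:
            events.append((remaining, name, step))  # step starts at T-minus `remaining`
            remaining -= minutes
    # Earliest start (largest offset) first.
    return sorted(events, key=lambda event: -event[0])


# Invented example meal.
timeline = relative_timeline(
    {
        "roast chicken": [("prep", 10), ("roast", 40)],
        "rice": [("rinse", 5), ("simmer", 20)],
    }
)
```

This ignores contention for the cook's attention or for shared equipment, which a fuller timeline would need to account for.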
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>