blumeops/argocd/manifests/grafana-config/dashboards
Erich Blume 03d71544ec Add multi-cluster observability with ringtail metrics and dashboards (#270)
## Summary
- Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs
- Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests)
- Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki
- Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with:
  - **Kubernetes Clusters** dashboard — multi-cluster with `cluster` and `namespace` template variables
  - **Ringtail (k3s)** dashboard — dedicated ringtail view with GPU usage panels
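
As a rough illustration of the first bullet, a static `cluster` label on a Prometheus scrape job might look like the sketch below. The job name and target are hypothetical placeholders, not values taken from this repository; only the `cluster: indri` label comes from the summary above.

```yaml
# Sketch only: job_name and targets are hypothetical, not from this repo.
scrape_configs:
  - job_name: example-node
    static_configs:
      - targets: ["localhost:9100"]
        labels:
          # Attached to every series scraped from the targets above,
          # so queries and dashboards can filter by cluster.
          cluster: indri
```

Setting the label per job keeps it explicit for locally scraped targets, while metrics remote-written from ringtail would carry their own `cluster` value from Alloy.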

## Deployment and Testing
1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`)
2. Sync `prometheus` → verify `cluster` label on scraped metrics
3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs
4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs
5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail
6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}`
7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values
8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods
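
For step 7, the `cluster` and `namespace` template variables could be defined roughly as in the sketch below. This is an assumed shape for a dashboard ConfigMap, not the contents of `configmap-kubernetes.yaml`: the ConfigMap name, the JSON key, the dashboard title, and the choice of `kube_pod_info` as the source metric for both variables are illustrative assumptions.

```yaml
# Sketch only: metadata.name, the data key, and the dashboard title are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-example
data:
  example.json: |
    {
      "title": "Example multi-cluster dashboard",
      "templating": {
        "list": [
          {
            "name": "cluster",
            "type": "query",
            "query": "label_values(kube_pod_info, cluster)",
            "refresh": 2
          },
          {
            "name": "namespace",
            "type": "query",
            "query": "label_values(kube_pod_info{cluster=~\"$cluster\"}, namespace)",
            "refresh": 2
          }
        ]
      },
      "panels": []
    }
```

If the `cluster` variable shows only one value after syncing, that usually points back to steps 5 and 6: the variable can only list clusters for which `kube_pod_info` series exist in Prometheus.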

## Notes
- Alloy on ringtail sets `insecure_skip_verify=true` for TLS to Prometheus/Loki because the Tailscale-managed certs are not in the container trust store; this should be tightened later
- DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting the host's MagicDNS resolver; CoreDNS forwarding rules may be needed if pods can't resolve these names
- The old services dashboard (blackbox probes) is removed. Those probes still run in alloy-k8s and their data is still in Prometheus, just without a dedicated dashboard
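
If the CoreDNS forwarding mentioned above turns out to be necessary, one possible shape is sketched below. It assumes a k3s release whose bundled CoreDNS imports `*.server` keys from an optional `coredns-custom` ConfigMap, and it assumes that Tailscale's MagicDNS resolver at `100.100.100.100` is reachable from the pod network. Neither assumption is confirmed by this PR, and the key name `tailnet.server` is hypothetical.

```yaml
# Sketch only: relies on k3s's optional coredns-custom ConfigMap; unverified on ringtail.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns-custom
  namespace: kube-system
data:
  # Hypothetical key name; k3s loads keys ending in .server as extra server blocks.
  tailnet.server: |
    tail8d86e.ts.net:53 {
        # Forward tailnet names to the MagicDNS resolver (assumed reachable from pods).
        forward . 100.100.100.100
    }
```

Scoping the server block to the tailnet zone leaves resolution for cluster-internal and public names unchanged.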

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270
2026-02-25 22:01:00 -08:00
| File | Last commit | Date |
| --- | --- | --- |
| configmap-borgmatic.yaml | K8s Migration Phase 2: Grafana to Kubernetes (#30) | 2026-01-19 14:40:25 -08:00 |
| configmap-cv-apm.yaml | Fix cache hit rate on APM and Fly.io dashboards (#177) | 2026-02-12 18:40:48 -08:00 |
| configmap-devpi.yaml | Log filtering cleanup and observability improvements (#45) | 2026-01-22 17:30:08 -08:00 |
| configmap-docs-apm.yaml | Fix cache hit rate on APM and Fly.io dashboards (#177) | 2026-02-12 18:40:48 -08:00 |
| configmap-flyio.yaml | Fix cache hit rate on APM and Fly.io dashboards (#177) | 2026-02-12 18:40:48 -08:00 |
| configmap-forgejo.yaml | Add Forgejo repository health metrics and Grafana dashboard (#245) | 2026-02-22 11:16:03 -08:00 |
| configmap-frigate.yaml | Fix Frigate detection events rate metric name in Grafana dashboard | 2026-02-25 16:51:57 -08:00 |
| configmap-jellyfin.yaml | Add Jellyfin media server deployment (#77) | 2026-01-30 16:57:26 -08:00 |
| configmap-kubernetes.yaml | Add multi-cluster observability with ringtail metrics and dashboards (#270) | 2026-02-25 22:01:00 -08:00 |
| configmap-loki.yaml | K8s Migration Phase 2: Grafana to Kubernetes (#30) | 2026-01-19 14:40:25 -08:00 |
| configmap-macos.yaml | Log filtering cleanup and observability improvements (#45) | 2026-01-22 17:30:08 -08:00 |
| configmap-postgresql.yaml | Fix XID Age graph to show threshold context (#69) | 2026-01-29 07:08:21 -08:00 |
| configmap-ringtail.yaml | Add multi-cluster observability with ringtail metrics and dashboards (#270) | 2026-02-25 22:01:00 -08:00 |
| configmap-sifaka-disks.yaml | Operations and observability for sifaka NAS (#135) | 2026-02-09 17:44:05 -08:00 |
| configmap-teslamate-battery-health.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-charge-level.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-charges.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-charging-stats.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-drive-stats.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-drives.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-efficiency.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-locations.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-mileage.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-overview.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-projected-range.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-states.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-statistics.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-timeline.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-trip.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-updates.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-vampire-drain.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-teslamate-visited.yaml | Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) | 2026-01-29 06:55:44 -08:00 |
| configmap-zot.yaml | K8s Migration Phase 2: Grafana to Kubernetes (#30) | 2026-01-19 14:40:25 -08:00 |