Add multi-cluster observability with ringtail metrics and dashboards #270

Merged
eblume merged 4 commits from feature/ringtail-metrics-dashboards into main 2026-02-25 22:01:01 -08:00

Summary

  • Add cluster label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs
  • Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests)
  • Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki
  • Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with:
    • Kubernetes Clusters dashboard — multi-cluster with cluster and namespace template variables
    • Ringtail (k3s) dashboard — dedicated ringtail view with GPU usage panels
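The cluster label on remote-written series can be attached in Alloy via `external_labels`. A minimal sketch, assuming the standard `prometheus.remote_write` component — the endpoint URL is a placeholder, not this repo's actual Tailscale hostname:

```river
// Hypothetical Alloy (River) remote_write block. external_labels attaches
// the cluster label to every series sent to the central Prometheus.
// The URL below is illustrative only.
prometheus.remote_write "indri" {
  endpoint {
    url = "https://prometheus.example.ts.net/api/v1/write"
  }
  external_labels = {
    cluster = "ringtail",
  }
}
```

For per-job labels on the Prometheus scrape side, the equivalent is a static `cluster` label added in each scrape job's relabel config.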

Deployment and Testing

  1. Sync apps on indri ArgoCD to pick up new app definitions (kube-state-metrics-ringtail, alloy-ringtail)
  2. Sync prometheus → verify cluster label on scraped metrics
  3. Sync alloy-k8s → verify cluster=indri on remote-written metrics and logs
  4. Run mise run provision-indri -- --tags alloy → verify cluster=indri on host Alloy metrics/logs
  5. Sync kube-state-metrics-ringtail → verify pods running on ringtail
  6. Sync alloy-ringtail → verify pods running, check Prometheus for kube_pod_info{cluster="ringtail"}
  7. Sync grafana-config → verify dashboards appear and the cluster template variable lists both indri and ringtail
  8. Check Loki for {cluster="ringtail"} logs from ringtail pods
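The label checks in steps 2–8 can be run ad hoc in Grafana Explore or against the Prometheus HTTP API. The queries below use only labels and metrics named in this PR:

```promql
# Ringtail pods visible in indri's Prometheus (step 6):
kube_pod_info{cluster="ringtail"}

# Confirm both clusters are reporting scrape targets:
count by (cluster) (up)
```

For step 8, the corresponding LogQL stream selector in Loki is `{cluster="ringtail"}`.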

Notes

  • Alloy on ringtail uses insecure_skip_verify=true for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later
  • DNS resolution for *.tail8d86e.ts.net from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve
  • The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard
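If ringtail pods can't resolve MagicDNS names, one option is a forwarding stanza in the k3s CoreDNS Corefile (ConfigMap coredns in kube-system). 100.100.100.100 is Tailscale's standard MagicDNS resolver address; the stanza itself is a sketch, not the deployed config:

```
tail8d86e.ts.net:53 {
    forward . 100.100.100.100
    cache 30
}
```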
Add cluster labels (indri/ringtail) to all Prometheus scrape jobs,
Alloy k8s remote_write and pod logs, and Alloy host metrics/logs.
Deploy kube-state-metrics and Alloy on ringtail's k3s cluster to
collect pod metrics and logs, remote-writing to indri's Prometheus
and Loki. Replace single-cluster minikube and services dashboards
with a multi-cluster Kubernetes dashboard (cluster + namespace
variables) and a dedicated Ringtail dashboard with GPU monitoring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add prometheus.exporter.unix to ringtail Alloy with host /proc, /sys,
and rootfs mounts so node_* metrics flow from the NixOS host. Rewrite
the ringtail dashboard from k8s-only to full system health: uptime,
CPU usage by mode, memory usage, filesystem table, network traffic,
GPU overview, and k8s summary — matching the macOS dashboard pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
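The host-mount setup this commit describes can be sketched in River as follows; the mount paths and component names are assumptions following the conventional /host/* layout (hostPath volumes in the DaemonSet), not this repo's actual config:

```river
// Hedged sketch: node_exporter-style host metrics from inside a pod.
// Requires the host's /proc, /sys, and / mounted into the container
// at the paths below.
prometheus.exporter.unix "host" {
  procfs_path = "/host/proc"
  sysfs_path  = "/host/sys"
  rootfs_path = "/host/root"
}

prometheus.scrape "host" {
  targets    = prometheus.exporter.unix.host.targets
  forward_to = [prometheus.remote_write.indri.receiver]
}
```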

1Password Connect uses numeric log levels (1=error, 2=warn, 3=info,
4=debug) which Grafana's logs panel doesn't recognize, rendering all
lines with error styling. Add stage.replace rules in both Alloy
configs (indri + ringtail) to map numeric levels to standard strings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
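One way to express the numeric-to-string mapping with stage.replace; the expression is illustrative (only the error case is shown — the warn/info/debug stages follow the same pattern), and the real configs may differ:

```river
// Hypothetical loki.process pipeline mapping 1Password Connect's numeric
// log levels to strings Grafana's logs panel recognizes. stage.replace
// substitutes the matched capture group with the replace value.
loki.process "onepassword_levels" {
  forward_to = [loki.write.indri.receiver]

  stage.replace {
    expression = "\"level\":(1)"
    replace    = "\"error\""
  }
}
```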
eblume merged commit 03d71544ec into main 2026-02-25 22:01:01 -08:00