Add multi-cluster observability with ringtail metrics and dashboards #270

Merged
eblume merged 4 commits from feature/ringtail-metrics-dashboards into main 2026-02-25 22:01:01 -08:00

Summary

  • Add cluster label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs
  • Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests)
  • Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki
  • Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with:
    • Kubernetes Clusters dashboard — multi-cluster with cluster and namespace template variables
    • Ringtail (k3s) dashboard — dedicated ringtail view with GPU usage panels
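The cluster label on remote-written series can be attached in Alloy via `external_labels`. A minimal sketch, assuming the standard `prometheus.remote_write` component — the endpoint URL is a placeholder, not this repo's actual Tailscale hostname:

```river
// Hypothetical Alloy (River) remote_write block. external_labels attaches
// the cluster label to every series sent to the central Prometheus.
// The URL below is illustrative only.
prometheus.remote_write "indri" {
  endpoint {
    url = "https://prometheus.example.ts.net/api/v1/write"
  }
  external_labels = {
    cluster = "ringtail",
  }
}
```

For per-job labels on the Prometheus scrape side, the equivalent is a static `cluster` label added in each scrape job's relabel config.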

Deployment and Testing

  1. Sync apps on indri ArgoCD to pick up new app definitions (kube-state-metrics-ringtail, alloy-ringtail)
  2. Sync prometheus → verify cluster label on scraped metrics
  3. Sync alloy-k8s → verify cluster=indri on remote-written metrics and logs
  4. Run mise run provision-indri -- --tags alloy → verify cluster=indri on host Alloy metrics/logs
  5. Sync kube-state-metrics-ringtail → verify pods running on ringtail
  6. Sync alloy-ringtail → verify pods running, check Prometheus for kube_pod_info{cluster="ringtail"}
  7. Sync grafana-config → verify dashboards appear and the cluster template variable lists both indri and ringtail
  8. Check Loki for {cluster="ringtail"} logs from ringtail pods
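The label checks in steps 2–8 can be run ad hoc in Grafana Explore or against the Prometheus HTTP API. The queries below use only labels and metrics named in this PR:

```promql
# Ringtail pods visible in indri's Prometheus (step 6):
kube_pod_info{cluster="ringtail"}

# Confirm both clusters are reporting scrape targets:
count by (cluster) (up)
```

For step 8, the corresponding LogQL stream selector in Loki is `{cluster="ringtail"}`.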

Notes

  • Alloy on ringtail uses insecure_skip_verify=true for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later
  • DNS resolution for *.tail8d86e.ts.net from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve
  • The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard
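If ringtail pods can't resolve MagicDNS names, one option is a forwarding stanza in the k3s CoreDNS Corefile (ConfigMap coredns in kube-system). 100.100.100.100 is Tailscale's standard MagicDNS resolver address; the stanza itself is a sketch, not the deployed config:

```
tail8d86e.ts.net:53 {
    forward . 100.100.100.100
    cache 30
}
```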
Add cluster labels (indri/ringtail) to all Prometheus scrape jobs,
Alloy k8s remote_write and pod logs, and Alloy host metrics/logs.
Deploy kube-state-metrics and Alloy on ringtail's k3s cluster to
collect pod metrics and logs, remote-writing to indri's Prometheus
and Loki. Replace single-cluster minikube and services dashboards
with a multi-cluster Kubernetes dashboard (cluster + namespace
variables) and a dedicated Ringtail dashboard with GPU monitoring.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add prometheus.exporter.unix to ringtail Alloy with host /proc, /sys,
and rootfs mounts so node_* metrics flow from the NixOS host. Rewrite
the ringtail dashboard from k8s-only to full system health: uptime,
CPU usage by mode, memory usage, filesystem table, network traffic,
GPU overview, and k8s summary — matching the macOS dashboard pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
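The host-mount setup this commit describes can be sketched in River as follows; the mount paths and component names are assumptions following the conventional /host/* layout (hostPath volumes in the DaemonSet), not this repo's actual config:

```river
// Hedged sketch: node_exporter-style host metrics from inside a pod.
// Requires the host's /proc, /sys, and / mounted into the container
// at the paths below.
prometheus.exporter.unix "host" {
  procfs_path = "/host/proc"
  sysfs_path  = "/host/sys"
  rootfs_path = "/host/root"
}

prometheus.scrape "host" {
  targets    = prometheus.exporter.unix.host.targets
  forward_to = [prometheus.remote_write.indri.receiver]
}
```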

1Password Connect uses numeric log levels (1=error, 2=warn, 3=info,
4=debug) which Grafana's logs panel doesn't recognize, rendering all
lines with error styling. Add stage.replace rules in both Alloy
configs (indri + ringtail) to map numeric levels to standard strings.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
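One way to express the numeric-to-string mapping with stage.replace; the expression is illustrative (only the error case is shown — the warn/info/debug stages follow the same pattern), and the real configs may differ:

```river
// Hypothetical loki.process pipeline mapping 1Password Connect's numeric
// log levels to strings Grafana's logs panel recognizes. stage.replace
// substitutes the matched capture group with the replace value.
loki.process "onepassword_levels" {
  forward_to = [loki.write.indri.receiver]

  stage.replace {
    expression = "\"level\":(1)"
    replace    = "\"error\""
  }
}
```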
eblume merged commit 03d71544ec into main 2026-02-25 22:01:01 -08:00