## Summary
- Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs
- Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests)
- Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki
- Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with:
- **Kubernetes Clusters** dashboard — multi-cluster with `cluster` and `namespace` template variables
- **Ringtail (k3s)** dashboard — dedicated ringtail view with GPU usage panels
## Deployment and Testing
1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`)
2. Sync `prometheus` → verify `cluster` label on scraped metrics
3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs
4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs
5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail
6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}`
7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values
8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods
## Notes
- Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later
- DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve
- The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270
Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack.
Summary
- Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal)
- Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses
- Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics
- Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net)
- Add ACL rule for port 9187 (CNPG metrics)
- Delete obsolete ansible roles for prometheus and loki
Changes
- argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress
- argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress
- argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications
- argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS
- argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint
- argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics
- ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints
- pulumi/policy.hujson - ACL for port 9187
- Deleted ansible/roles/prometheus/ and ansible/roles/loki/
Deployment and Testing
- Stop prometheus and loki on indri
- Sync ArgoCD apps (apps, prometheus, loki, grafana)
- Run mise run provision-indri -- --tags alloy
- Verify Grafana dashboards show data
🤖 Generated with https://claude.ai/claude-code
Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42