## Summary
- Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs
- Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests)
- Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki
- Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with:
- **Kubernetes Clusters** dashboard — multi-cluster with `cluster` and `namespace` template variables
- **Ringtail (k3s)** dashboard — dedicated ringtail view with GPU usage panels
## Deployment and Testing
1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`)
2. Sync `prometheus` → verify `cluster` label on scraped metrics
3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs
4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs
5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail
6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}`
7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values
8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods
## Notes
- Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later
- DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve
- The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270
79 lines
1.8 KiB
YAML
79 lines
1.8 KiB
YAML
apiVersion: v1
|
|
kind: ServiceAccount
|
|
metadata:
|
|
name: kube-state-metrics
|
|
namespace: monitoring
|
|
---
|
|
apiVersion: rbac.authorization.k8s.io/v1
|
|
kind: ClusterRole
|
|
metadata:
|
|
name: kube-state-metrics
|
|
rules:
|
|
- apiGroups: [""]
|
|
resources:
|
|
- configmaps
|
|
- secrets
|
|
- nodes
|
|
- pods
|
|
- services
|
|
- serviceaccounts
|
|
- resourcequotas
|
|
- replicationcontrollers
|
|
- limitranges
|
|
- persistentvolumeclaims
|
|
- persistentvolumes
|
|
- namespaces
|
|
- endpoints
|
|
verbs: ["list", "watch"]
|
|
- apiGroups: ["apps"]
|
|
resources:
|
|
- statefulsets
|
|
- daemonsets
|
|
- deployments
|
|
- replicasets
|
|
verbs: ["list", "watch"]
|
|
- apiGroups: ["batch"]
|
|
resources:
|
|
- cronjobs
|
|
- jobs
|
|
verbs: ["list", "watch"]
|
|
- apiGroups: ["autoscaling"]
|
|
resources:
|
|
- horizontalpodautoscalers
|
|
verbs: ["list", "watch"]
|
|
- apiGroups: ["networking.k8s.io"]
|
|
resources:
|
|
- networkpolicies
|
|
- ingresses
|
|
verbs: ["list", "watch"]
|
|
- apiGroups: ["coordination.k8s.io"]
|
|
resources:
|
|
- leases
|
|
verbs: ["list", "watch"]
|
|
- apiGroups: ["certificates.k8s.io"]
|
|
resources:
|
|
- certificatesigningrequests
|
|
verbs: ["list", "watch"]
|
|
- apiGroups: ["storage.k8s.io"]
|
|
resources:
|
|
- storageclasses
|
|
- volumeattachments
|
|
verbs: ["list", "watch"]
|
|
- apiGroups: ["admissionregistration.k8s.io"]
|
|
resources:
|
|
- mutatingwebhookconfigurations
|
|
- validatingwebhookconfigurations
|
|
verbs: ["list", "watch"]
|
|
---
|
|
apiVersion: rbac.authorization.k8s.io/v1
|
|
kind: ClusterRoleBinding
|
|
metadata:
|
|
name: kube-state-metrics
|
|
roleRef:
|
|
apiGroup: rbac.authorization.k8s.io
|
|
kind: ClusterRole
|
|
name: kube-state-metrics
|
|
subjects:
|
|
- kind: ServiceAccount
|
|
name: kube-state-metrics
|
|
namespace: monitoring
|