## Summary
Adds the third observability pillar — **distributed tracing** — alongside existing metrics (Prometheus) and logs (Loki).
- **Grafana Tempo 2.10.1** on minikube-indri for trace storage, with 7d retention, OTLP receivers, and a `metrics_generator` that remote-writes span metrics (RED) to Prometheus
- **Beyla eBPF auto-instrumentation** via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes
- **Grafana integration** — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking
- **Prometheus** scrapes Tempo operational metrics
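
For reviewers unfamiliar with Tempo, the first bullet roughly corresponds to a config like the sketch below. This is not the manifest from this PR: the storage paths, the local backend, and the Prometheus service URL are assumptions chosen for illustration. Only port 3200, the 7d retention, the OTLP receivers, and the span-metrics remote write come from the description above.

```yaml
# Illustrative Tempo config sketch, not the file shipped in this PR.
server:
  http_listen_port: 3200          # matches the /ready check and the scrape target

distributor:
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317  # assumed: standard OTLP gRPC port
        http:
          endpoint: 0.0.0.0:4318  # assumed: standard OTLP HTTP port

compactor:
  compaction:
    block_retention: 168h         # 7d retention

metrics_generator:
  storage:
    path: /var/tempo/generator/wal   # assumed path
    remote_write:
      # Assumed URL, following the naming of the loki/tempo services
      - url: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
        send_exemplars: true

storage:
  trace:
    backend: local                # assumed backend
    wal:
      path: /var/tempo/wal        # assumed path
    local:
      path: /var/tempo/blocks     # assumed path

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
```

Remote write in this shape only works if Prometheus accepts it, which normally means running it with `--web.enable-remote-write-receiver`.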
### Architecture
```
ringtail (k3s)                               indri (minikube)
┌──────────────────────┐                     ┌─────────────────────┐
│ Alloy+Beyla (eBPF)   │──OTLP HTTP────────→ │ Tempo               │
│  ↳ Frigate, ntfy,    │   via tailnet       │  ↳ trace storage    │
│    Ollama, Immich    │                     │  ↳ RED → Prometheus │
└──────────────────────┘                     │                     │
                                             │ Grafana             │
                                             │  ↳ Tempo datasource │
                                             └─────────────────────┘
```
### New files (12)
- `docs/reference/services/tempo.md` — reference doc
- `docs/changelog.d/feature-otel-tracing.feature.md`
- `argocd/apps/tempo.yaml` + `argocd/manifests/tempo/` (6 files)
- `argocd/apps/alloy-tracing-ringtail.yaml` + `argocd/manifests/alloy-tracing-ringtail/` (4 files)
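
As a rough guide to what an `argocd/apps/*.yaml` definition looks like, here is a minimal Argo CD `Application` sketch for Tempo. The `repoURL` placeholder, the `argocd` namespace, the `default` project, and the in-cluster destination are assumptions, not values taken from this PR. The `targetRevision` field is what `argocd app set tempo --revision …` overrides during testing.

```yaml
# Illustrative Application sketch; values marked "assumed" are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tempo
  namespace: argocd                       # assumed Argo CD namespace
spec:
  project: default                        # assumed project
  source:
    repoURL: https://example.invalid/repo.git   # hypothetical placeholder
    targetRevision: main                  # temporarily set to the feature branch for testing
    path: argocd/manifests/tempo
  destination:
    server: https://kubernetes.default.svc  # assumed: Argo CD runs in the target cluster
    namespace: monitoring
```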
### Modified files (6)
- `argocd/manifests/grafana/datasources.yaml` — Tempo datasource + Loki derivedFields
- `argocd/manifests/prometheus/prometheus.yml` — Tempo scrape target
- `service-versions.yaml` — tempo + alloy-tracing-ringtail entries
- `docs/reference/services/grafana.md` — Tempo in datasources table
- `docs/reference/reference.md` — Tempo in services index
- `docs/reference/operations/observability.md` — Tempo in components list
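
The `datasources.yaml` change could look roughly like the provisioning sketch below. The datasource `uid` values and the `matcherRegex` are assumptions (the regex depends on how trace IDs actually appear in the log lines), and the Tempo and Loki URLs reuse the service names from the Prometheus scrape config on the assumption that Grafana reaches them the same way.

```yaml
# Illustrative Grafana provisioning sketch, not the file from this PR.
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo                            # assumed uid
    url: http://tempo.monitoring.svc.cluster.local:3200
    jsonData:
      tracesToLogsV2:
        datasourceUid: loki               # trace -> log correlation
      tracesToMetrics:
        datasourceUid: prometheus         # trace -> metrics correlation (assumed uid)
      serviceMap:
        datasourceUid: prometheus         # service graph from span metrics

  - name: Loki
    type: loki
    uid: loki                             # assumed uid
    url: http://loki.monitoring.svc.cluster.local:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: "trace_id=(\\w+)" # assumed log format
          url: "$${__value.raw}"          # $$ escapes $ in provisioning files
          datasourceUid: tempo            # link opens the trace in Tempo
```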
## Deployment and Testing
- [ ] Sync `apps` app to pick up new Application definitions
- [ ] `argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo`
- [ ] Verify Tempo pod: `kubectl --context=minikube-indri get pods -n monitoring -l app=tempo`
- [ ] Verify Tempo ready: port-forward 3200 and `curl localhost:3200/ready`
- [ ] Verify Tailscale ingresses: `kubectl --context=minikube-indri get ingress -n monitoring`
- [ ] `argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail`
- [ ] Check Beyla discovery in alloy-tracing logs on ringtail
- [ ] Sync grafana-config for updated datasources
- [ ] Sync prometheus for updated scrape config
- [ ] Test Grafana Tempo datasource connection
- [ ] Generate test traffic and search traces in Grafana Explore → Tempo
- [ ] After merge: reset all ArgoCD app revisions back to main
Reviewed-on: #286

83 lines, 2.3 KiB, YAML

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Indri system metrics are pushed via Alloy remote_write
# K8s services are scraped directly

scrape_configs:
  # Sifaka NAS exporters (via Caddy L4 TCP proxy on indri)
  - job_name: "node-exporter-sifaka"
    static_configs:
      - targets: ["nas.ops.eblu.me:9100"]
    metric_relabel_configs:
      - target_label: cluster
        replacement: indri

  - job_name: "smartctl-sifaka"
    scrape_interval: 60s
    static_configs:
      - targets: ["nas.ops.eblu.me:9633"]
    metric_relabel_configs:
      - target_label: cluster
        replacement: indri

  # CNPG PostgreSQL metrics (k8s internal)
  - job_name: "cnpg-postgres"
    static_configs:
      - targets: ["blumeops-pg-metrics-tailscale.databases.svc.cluster.local:9187"]
        labels:
          instance: "blumeops-pg"
    metric_relabel_configs:
      - target_label: cluster
        replacement: indri

  # Prometheus self-monitoring
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
    metric_relabel_configs:
      - target_label: cluster
        replacement: indri

  # Loki metrics
  - job_name: "loki"
    static_configs:
      - targets: ["loki.monitoring.svc.cluster.local:3100"]
    metric_relabel_configs:
      - target_label: cluster
        replacement: indri

  # Kubernetes state metrics (pods, deployments, resource usage, etc.)
  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["kube-state-metrics.monitoring.svc.cluster.local:8080"]
    metric_relabel_configs:
      - target_label: cluster
        replacement: indri

  # Transmission BitTorrent metrics (via sidecar exporter)
  - job_name: "transmission"
    static_configs:
      - targets: ["transmission.torrent.svc.cluster.local:19091"]
    metric_relabel_configs:
      - target_label: cluster
        replacement: indri

  # Tempo operational metrics
  - job_name: "tempo"
    static_configs:
      - targets: ["tempo.monitoring.svc.cluster.local:3200"]
    metric_relabel_configs:
      - target_label: cluster
        replacement: indri

  # Frigate NVR metrics (via Caddy on indri — Frigate runs on ringtail)
  - job_name: "frigate"
    scheme: https
    static_configs:
      - targets: ["nvr.ops.eblu.me"]
    metrics_path: /api/metrics
    metric_relabel_configs:
      - target_label: cluster
        replacement: ringtail
```