Add OpenTelemetry distributed tracing (Tempo + Beyla eBPF) (#286)
## Summary
Adds the third observability pillar — **distributed tracing** — alongside existing metrics (Prometheus) and logs (Loki).
- **Grafana Tempo 2.10.1** on minikube-indri for trace storage with 7-day retention, OTLP receivers, and a `metrics_generator` that remote-writes span metrics (RED) to Prometheus
- **Beyla eBPF auto-instrumentation** via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes
- **Grafana integration** — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking
- **Prometheus** scrapes Tempo operational metrics
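
The span metrics that `metrics_generator` remote-writes can be queried like any other Prometheus series. A minimal sketch, assuming Tempo's default span-metrics names (`traces_spanmetrics_calls_total`, `traces_spanmetrics_latency_bucket`) and the `service` and `status_code` labels are left unchanged. The helper function names and the `frigate` service value are illustrative and not part of this PR:

```shell
# Build PromQL for the three RED signals of one service.
# Assumes Tempo's default span-metrics metric and label names.

red_rate_query() {
  # Requests per second over the last 5 minutes.
  printf 'sum(rate(traces_spanmetrics_calls_total{service="%s"}[5m]))' "$1"
}

red_error_query() {
  # Failed requests per second (spans whose status is ERROR).
  printf 'sum(rate(traces_spanmetrics_calls_total{service="%s",status_code="STATUS_CODE_ERROR"}[5m]))' "$1"
}

red_duration_query() {
  # 95th percentile latency from the span-metrics histogram.
  printf 'histogram_quantile(0.95, sum by (le) (rate(traces_spanmetrics_latency_bucket{service="%s"}[5m])))' "$1"
}

# Usage (hypothetical port-forward to Prometheus on 9090):
#   curl -sG --data-urlencode "query=$(red_rate_query frigate)" \
#     http://localhost:9090/api/v1/query
```

The service label value depends on how Beyla names each discovered process, so it is worth checking the actual label values in Prometheus before relying on these queries.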
### Architecture
```
ringtail (k3s)                               indri (minikube)
┌──────────────────────┐                     ┌─────────────────────┐
│ Alloy+Beyla (eBPF)   │──OTLP HTTP────────→ │ Tempo               │
│  ↳ Frigate, ntfy,    │   via tailnet       │  ↳ trace storage    │
│    Ollama, Immich    │                     │  ↳ RED → Prometheus │
└──────────────────────┘                     │                     │
                                             │ Grafana             │
                                             │  ↳ Tempo datasource │
                                             └─────────────────────┘
```
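
The OTLP HTTP leg of this diagram can be exercised by hand, independently of Beyla. A minimal sketch, assuming the receiver accepts OTLP/JSON on the standard `/v1/traces` path. The function names, the `smoke-test` service name, and the one-second span duration are made up for illustration:

```shell
# Build a single-span OTLP/JSON payload for a manual smoke test.

rand_hex() {
  # Print $1 random bytes as lowercase hex (2 characters per byte).
  od -An -N"$1" -tx1 /dev/urandom | tr -d ' \n'
}

otlp_test_payload() {
  # $1 = trace ID (32 hex chars), $2 = span ID (16 hex chars)
  local now start_ns end_ns
  now="$(date +%s)"
  start_ns="$(( now - 1 ))000000000"   # span started one second ago
  end_ns="${now}000000000"             # span ends now
  cat <<EOF
{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"smoke-test"}}]},"scopeSpans":[{"spans":[{"traceId":"$1","spanId":"$2","name":"manual-smoke-test","kind":1,"startTimeUnixNano":"$start_ns","endTimeUnixNano":"$end_ns"}]}]}]}
EOF
}

# Usage (endpoint as listed in the Tempo reference doc added by this PR):
#   trace_id="$(rand_hex 16)"
#   otlp_test_payload "$trace_id" "$(rand_hex 8)" |
#     curl -sS -X POST -H 'Content-Type: application/json' \
#       --data-binary @- https://tempo-otlp.tail8d86e.ts.net/v1/traces
#   echo "search for trace $trace_id in Grafana Explore"
```

Printing the trace ID makes it easy to look up the exact span in Grafana afterwards instead of searching by time range.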
### New files (12)
- `docs/reference/services/tempo.md` — reference doc
- `docs/changelog.d/feature-otel-tracing.feature.md`
- `argocd/apps/tempo.yaml` + `argocd/manifests/tempo/` (6 files)
- `argocd/apps/alloy-tracing-ringtail.yaml` + `argocd/manifests/alloy-tracing-ringtail/` (4 files)
### Modified files (6)
- `argocd/manifests/grafana/datasources.yaml` — Tempo datasource + Loki derivedFields
- `argocd/manifests/prometheus/prometheus.yml` — Tempo scrape target
- `service-versions.yaml` — tempo + alloy-tracing-ringtail entries
- `docs/reference/services/grafana.md` — Tempo in datasources table
- `docs/reference/reference.md` — Tempo in services index
- `docs/reference/operations/observability.md` — Tempo in components list
## Deployment and Testing
- [ ] Sync `apps` app to pick up new Application definitions
- [ ] `argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo`
- [ ] Verify Tempo pod: `kubectl --context=minikube-indri get pods -n monitoring -l app=tempo`
- [ ] Verify Tempo ready: port-forward 3200 and `curl localhost:3200/ready`
- [ ] Verify Tailscale ingresses: `kubectl --context=minikube-indri get ingress -n monitoring`
- [ ] `argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail`
- [ ] Check Beyla discovery in alloy-tracing logs on ringtail
- [ ] Sync grafana-config for updated datasources
- [ ] Sync prometheus for updated scrape config
- [ ] Test Grafana Tempo datasource connection
- [ ] Generate test traffic and search traces in Grafana Explore → Tempo
- [ ] After merge: reset all ArgoCD app revisions back to main
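
The readiness step in the checklist can be wrapped in a retry loop, since Tempo may report not-ready for a short time after the pod starts. A sketch assuming `curl` is installed and the port-forward to 3200 is already running. `wait_for_ready` and its defaults are hypothetical:

```shell
# Poll an HTTP readiness endpoint until it returns a success status.

wait_for_ready() {
  # $1 = URL, $2 = max attempts (default 30), $3 = seconds between attempts (default 2)
  local url attempts delay i
  url="$1"
  attempts="${2:-30}"
  delay="${3:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors such as 503.
    if curl -fsS -o /dev/null "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}

# Usage, with the port-forward running in another terminal:
#   wait_for_ready http://localhost:3200/ready
```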
Reviewed-on: #286
Parent: d15071aaf9
Commit: c281fb5403
23 changed files with 1077 additions and 2 deletions

@@ -36,6 +36,7 @@ The OIDC client secret is injected via [[external-secrets]] (`grafana-authentik-
|------|------|--------|
| Prometheus | prometheus | `prometheus.monitoring.svc.cluster.local:9090` |
| Loki | loki | `loki.monitoring.svc.cluster.local:3100` |
| Tempo | tempo | `tempo.monitoring.svc.cluster.local:3200` |
| TeslaMate | postgres | `blumeops-pg-rw.databases.svc.cluster.local:5432` |

## Dashboard Provisioning

@@ -64,4 +65,5 @@ Optional annotation: `grafana_folder: "FolderName"`
- [[authentik]] - OIDC identity provider for SSO
- [[prometheus]] - Metrics datasource
- [[loki]] - Logs datasource
- [[tempo]] - Traces datasource
- [[alloy|Alloy]] - Data collector

docs/reference/services/tempo.md (normal file, 70 lines added)

@@ -0,0 +1,70 @@
---
title: Tempo
modified: 2026-03-05
tags:
- service
- observability
---

# Grafana Tempo

Distributed tracing backend for BlumeOps infrastructure. Receives traces via OTLP, stores them locally, and generates RED metrics (rate, error, duration) for [[prometheus]].

## Quick Reference

| Property | Value |
|----------|-------|
| **URL** | https://tempo.ops.eblu.me (when Caddy route added) |
| **Tailscale URL** | https://tempo.tail8d86e.ts.net |
| **OTLP Endpoint** | https://tempo-otlp.tail8d86e.ts.net |
| **Namespace** | `monitoring` |
| **Image** | `grafana/tempo:2.10.1` |
| **Storage** | 10Gi PVC (local filesystem) |
| **Retention** | 7 days |

## Architecture

- Single-node deployment with local filesystem storage
- OTLP receivers: gRPC (4317) and HTTP (4318)
- `metrics_generator` produces span-metrics and service-graphs, remote-written to [[prometheus]]
- Queried via [[grafana]] Tempo datasource
- Two Tailscale Ingresses: one for query API (3200), one for OTLP HTTP receiver (4318)

## Trace Sources

**From ringtail (via Beyla eBPF in Alloy):**

| Service | Protocol | Coverage |
|---------|----------|----------|
| [[frigate]] | HTTP REST | Request rate, error rate, latency, trace spans |
| [[ntfy]] | HTTP | Same |
| [[ollama]] | HTTP REST | Same (model inference latency) |
| [[immich]] | HTTP REST | Same |

Beyla auto-instruments HTTP services via eBPF kernel hooks; no code changes needed. MQTT (Mosquitto) is not instrumented (no eBPF parser for MQTT).

**Future: SDK instrumentation**
Services with OTel SDK support (e.g., Hermes) can send traces directly to the OTLP endpoint for deeper internal spans (DB queries, business logic) alongside eBPF envelope traces.

## Storage Monitoring

Tempo exposes `tempodb_backend_bytes_total` via its `/metrics` endpoint (scraped by [[prometheus]]). To check storage utilization against the 10Gi PVC:

```promql
tempodb_backend_bytes_total / 10737418240 * 100
```

Full PVC-level monitoring (via kubelet volume stats) is not yet available — see backlog.
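
The divisor in the query above is 10 GiB expressed in bytes (10 × 1024³). A small sketch of that arithmetic, with hypothetical helper names, which may help when adjusting the query after a PVC resize:

```shell
gib_to_bytes() {
  # 1 GiB = 1024^3 bytes.
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

usage_percent() {
  # $1 = bytes used, $2 = capacity in bytes.
  # Prints a whole-number percentage, rounded down.
  echo $(( $1 * 100 / $2 ))
}

gib_to_bytes 10                                  # prints 10737418240
usage_percent 2684354560 "$(gib_to_bytes 10)"    # prints 25 (2.5 GiB of 10 GiB)
```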

## Grafana Integration

- **Tempo datasource** with trace-to-log and trace-to-metrics correlation
- **Service map** and **node graph** visualization
- **Loki derived fields** link trace IDs in logs back to Tempo

## Related

- [[alloy|Alloy]] - Trace collector (Beyla eBPF on ringtail)
- [[prometheus]] - Receives span-metrics from Tempo
- [[loki]] - Log correlation via trace IDs
- [[grafana]] - Trace visualization