diff --git a/docs/changelog.d/feature-otel-tracing.feature.md b/docs/changelog.d/feature-otel-tracing.feature.md
new file mode 100644
index 0000000..5d5d4ab
--- /dev/null
+++ b/docs/changelog.d/feature-otel-tracing.feature.md
@@ -0,0 +1 @@
+Add distributed tracing via Grafana Tempo and Beyla eBPF auto-instrumentation. Tempo runs on minikube-indri for trace storage, while a privileged Alloy DaemonSet on ringtail uses Beyla to instrument HTTP services (Frigate, ntfy, Ollama, Immich) without code changes. Grafana gets trace-to-log and trace-to-metrics correlation.
diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md
index 6c42193..5890147 100644
--- a/docs/reference/operations/observability.md
+++ b/docs/reference/operations/observability.md
@@ -7,11 +7,12 @@ tags:
 
 # Observability
 
-Metrics, logs, and dashboards for BlumeOps infrastructure.
+Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
 
 ## Components
 
 - [[prometheus]] - Metrics storage and querying
 - [[loki]] - Log aggregation
-- [[alloy|Alloy]] - Metrics and log collection
+- [[tempo]] - Distributed tracing
+- [[alloy|Alloy]] - Metrics, log, and trace collection
 - [[grafana]] - Dashboards and visualization
diff --git a/docs/reference/reference.md b/docs/reference/reference.md
index 9faa8e2..e9baa20 100644
--- a/docs/reference/reference.md
+++ b/docs/reference/reference.md
@@ -27,6 +27,7 @@ Individual service reference cards with URLs and configuration details.
 | [[jellyfin]] | Media server | indri |
 | [[kiwix]] | Offline Wikipedia & ZIM archives | k8s |
 | [[loki]] | Log aggregation | k8s |
+| [[tempo]] | Distributed tracing | k8s |
 | [[miniflux]] | RSS feed reader | k8s |
 | [[navidrome]] | Music streaming | k8s |
 | [[ntfy]] | Push notifications | k8s (ringtail) |
diff --git a/docs/reference/services/tempo.md b/docs/reference/services/tempo.md
new file mode 100644
index 0000000..3aea029
--- /dev/null
+++ b/docs/reference/services/tempo.md
@@ -0,0 +1,60 @@
+---
+title: Tempo
+modified: 2026-03-05
+tags:
+  - service
+  - observability
+---
+
+# Grafana Tempo
+
+Distributed tracing backend for BlumeOps infrastructure. Receives traces via OTLP, stores them locally, and generates RED metrics (rate, error, duration) for [[prometheus]].
+
+## Quick Reference
+
+| Property | Value |
+|----------|-------|
+| **URL** | https://tempo.ops.eblu.me (when Caddy route added) |
+| **Tailscale URL** | https://tempo.tail8d86e.ts.net |
+| **OTLP Endpoint** | https://tempo-otlp.tail8d86e.ts.net |
+| **Namespace** | `monitoring` |
+| **Image** | `grafana/tempo:2.7.2` |
+| **Storage** | 10Gi PVC (local filesystem) |
+| **Retention** | 7 days |
+
+## Architecture
+
+- Single-node deployment with local filesystem storage
+- OTLP receivers: gRPC (4317) and HTTP (4318)
+- `metrics_generator` produces span-metrics and service-graphs, remote-written to [[prometheus]]
+- Queried via [[grafana]] Tempo datasource
+- Two Tailscale Ingresses: one for query API (3200), one for OTLP HTTP receiver (4318)
+
+## Trace Sources
+
+**From ringtail (via Beyla eBPF in Alloy):**
+
+| Service | Protocol | Coverage |
+|---------|----------|----------|
+| [[frigate]] | HTTP REST | Request rate, error rate, latency, trace spans |
+| [[ntfy]] | HTTP | Same |
+| [[ollama]] | HTTP REST | Same (model inference latency) |
+| [[immich]] | HTTP REST | Same |
+
+Beyla auto-instruments HTTP services via eBPF kernel hooks, so no code changes are needed. MQTT (Mosquitto) is not instrumented (no eBPF parser for MQTT).
+
+**Future: SDK instrumentation**
+Services with OTel SDK support (e.g., Hermes) can send traces directly to the OTLP endpoint for deeper internal spans (DB queries, business logic) alongside eBPF envelope traces.
+
+## Grafana Integration
+
+- **Tempo datasource** with trace-to-log and trace-to-metrics correlation
+- **Service map** and **node graph** visualization
+- **Loki derived fields** link trace IDs in logs back to Tempo
+
+## Related
+
+- [[alloy|Alloy]] - Trace collector (Beyla eBPF on ringtail)
+- [[prometheus]] - Receives span-metrics from Tempo
+- [[loki]] - Log correlation via trace IDs
+- [[grafana]] - Trace visualization
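
Note on the "Future: SDK instrumentation" paragraph in `tempo.md`: the direct-to-OTLP path it describes can be smoke-tested without an SDK. The sketch below is not part of this change. It builds a single-span OTLP/HTTP JSON request and posts it to the OTLP endpoint from the Quick Reference table. It assumes the Tailscale Ingress forwards the standard OTLP path `/v1/traces` unchanged to the receiver on port 4318, and that the receiver accepts unauthenticated requests from tailnet hosts. The function names, the scope name `manual-smoke-test`, and the service name used in the usage note are illustrative.

```python
import json
import os
import urllib.request

# OTLP endpoint taken from the Quick Reference table in tempo.md.
# Assumption: the Ingress passes /v1/traces through to port 4318.
OTLP_BASE_URL = "https://tempo-otlp.tail8d86e.ts.net"


def build_otlp_payload(service_name, span_name, start_ns, end_ns):
    """Build a one-span OTLP/HTTP JSON export request body (hypothetical helper)."""
    return {
        "resourceSpans": [{
            "resource": {
                "attributes": [
                    {"key": "service.name",
                     "value": {"stringValue": service_name}},
                ],
            },
            "scopeSpans": [{
                "scope": {"name": "manual-smoke-test"},
                "spans": [{
                    # OTLP JSON encodes IDs as hex: 16-byte trace ID, 8-byte span ID.
                    "traceId": os.urandom(16).hex(),
                    "spanId": os.urandom(8).hex(),
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    # 64-bit integers are carried as decimal strings in OTLP JSON.
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }],
    }


def send_span(payload, base_url=OTLP_BASE_URL):
    """POST the payload to the OTLP/HTTP traces path and return the HTTP status."""
    request = urllib.request.Request(
        base_url + "/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

From a host on the tailnet, something like `send_span(build_otlp_payload("hermes", "smoke-test", start, end))` with `start` and `end` taken from `time.time_ns()` should yield a span that can be looked up by trace ID in the Grafana Tempo datasource, provided the assumptions above hold. A real service would use an OTel SDK exporter rather than hand-built JSON.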