Add OpenTelemetry distributed tracing (Tempo + Beyla eBPF) #286

Merged
eblume merged 9 commits from feature/otel-tracing into main 2026-03-05 10:51:07 -08:00
Summary

Adds the third observability pillar — distributed tracing — alongside existing metrics (Prometheus) and logs (Loki).

  • Grafana Tempo 2.10.1 on minikube-indri for trace storage with 7d retention, OTLP receivers, and metrics_generator that remote-writes span-metrics (RED) to Prometheus
  • Beyla eBPF auto-instrumentation via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes
  • Grafana integration — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking
  • Prometheus scrapes Tempo operational metrics

Architecture

ringtail (k3s)                                indri (minikube)
┌──────────────────────┐                      ┌─────────────────────┐
│ Alloy+Beyla (eBPF)   │──OTLP HTTP────────→ │ Tempo               │
│  ↳ Frigate, ntfy,    │  via tailnet         │  ↳ trace storage    │
│    Ollama, Immich     │                      │  ↳ RED → Prometheus │
└──────────────────────┘                      │                     │
                                              │ Grafana             │
                                              │  ↳ Tempo datasource │
                                              └─────────────────────┘

New files (12)

  • docs/reference/services/tempo.md — reference doc
  • docs/changelog.d/feature-otel-tracing.feature.md
  • argocd/apps/tempo.yaml + argocd/manifests/tempo/ (6 files)
  • argocd/apps/alloy-tracing-ringtail.yaml + argocd/manifests/alloy-tracing-ringtail/ (4 files)

Modified files (6)

  • argocd/manifests/grafana/datasources.yaml — Tempo datasource + Loki derivedFields
  • argocd/manifests/prometheus/prometheus.yml — Tempo scrape target
  • service-versions.yaml — tempo + alloy-tracing-ringtail entries
  • docs/reference/services/grafana.md — Tempo in datasources table
  • docs/reference/reference.md — Tempo in services index
  • docs/reference/operations/observability.md — Tempo in components list

Deployment and Testing

  • Sync the apps app to pick up the new Application definitions
  • argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo
  • Verify Tempo pod: kubectl --context=minikube-indri get pods -n monitoring -l app=tempo
  • Verify Tempo is ready: port-forward port 3200, then curl localhost:3200/ready
  • Verify Tailscale ingresses: kubectl --context=minikube-indri get ingress -n monitoring
  • argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail
  • Check Beyla discovery in alloy-tracing logs on ringtail
  • Sync grafana-config for updated datasources
  • Sync prometheus for updated scrape config
  • Test Grafana Tempo datasource connection
  • Generate test traffic and search traces in Grafana Explore → Tempo
  • After merge: reset all ArgoCD app revisions back to main
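
The final checklist step could be scripted roughly as follows. This is a minimal sketch, not part of the PR: it covers only the two apps the checklist explicitly pins to the feature branch, and `DRY_RUN` is a hypothetical convenience variable of this sketch, not an argocd feature.

```shell
# Sketch: point the apps pinned to feature/otel-tracing back at main.
# APPS lists only the apps the checklist sets a revision on; extend it
# if other apps were pinned by hand.
APPS="tempo alloy-tracing-ringtail"

run() {
  # In dry-run mode print the command; otherwise execute it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

reset_revisions() {
  # $1: target revision (defaults to main)
  target="${1:-main}"
  for app in $APPS; do
    run argocd app set "$app" --revision "$target"
    run argocd app sync "$app"
  done
}
```

Running `DRY_RUN=1 reset_revisions main` first prints the four argocd commands without touching the cluster; dropping `DRY_RUN` executes them.
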
Tempo is the new distributed tracing backend for BlumeOps,
completing the third observability pillar alongside Prometheus
(metrics) and Loki (logs).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deploys Grafana Tempo 2.10.1 on minikube-indri for distributed
trace storage. Includes OTLP receivers (gRPC + HTTP), local
filesystem storage with 7d retention, and metrics_generator
that remote-writes span-metrics to Prometheus.

Two Tailscale Ingresses: tempo (query API) and tempo-otlp
(OTLP HTTP receiver for cross-cluster trace ingestion).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
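
One way to smoke-test the OTLP HTTP receiver described above is to post a single hand-built span to the standard `/v1/traces` path. The sketch below assumes the receiver accepts OTLP/JSON; the `OTLP_ENDPOINT` default (localhost on 4318, the conventional OTLP HTTP port) and the `otlp-smoke-test` service name are placeholders, since the real tempo-otlp ingress URL is not given here.

```shell
# Hypothetical default; replace with the real tempo-otlp ingress URL.
OTLP_ENDPOINT="${OTLP_ENDPOINT:-http://localhost:4318}"

rand_hex() {
  # Print $1 random bytes as lowercase hex (two characters per byte).
  od -An -tx1 -N"$1" /dev/urandom | tr -dc '0-9a-f'
}

build_span_payload() {
  # Emit a one-span OTLP/JSON document lasting one second.
  trace_id="$(rand_hex 16)"   # 32 hex chars
  span_id="$(rand_hex 8)"     # 16 hex chars
  now="$(date +%s)"
  cat <<EOF
{"resourceSpans":[{"resource":{"attributes":[{"key":"service.name","value":{"stringValue":"otlp-smoke-test"}}]},"scopeSpans":[{"spans":[{"traceId":"${trace_id}","spanId":"${span_id}","name":"smoke-test","kind":1,"startTimeUnixNano":"${now}000000000","endTimeUnixNano":"$((now + 1))000000000"}]}]}]}
EOF
}

send_test_span() {
  # POST the payload to the OTLP HTTP traces path.
  build_span_payload | curl -sS -X POST "$OTLP_ENDPOINT/v1/traces" \
    -H 'Content-Type: application/json' --data-binary @-
}
```

After `send_test_span` succeeds, searching for the `otlp-smoke-test` service in Grafana Explore → Tempo should show the span if ingestion works end to end.
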
Grafana: Tempo datasource with trace-to-log (Loki) and
trace-to-metrics (Prometheus) correlation. Loki gets
derivedFields to link trace IDs back to Tempo.

Prometheus: scrape Tempo operational metrics on port 3200.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deploys a privileged Alloy DaemonSet on ringtail's k3s that
uses Beyla eBPF to auto-instrument HTTP services (Frigate,
ntfy, Ollama, Immich) without code changes. Traces are
exported via OTLP HTTP to Tempo on indri.

Separate from the existing unprivileged alloy-ringtail to
preserve least-privilege for metrics/logs collection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Updates service-versions.yaml, Grafana datasources table,
ArgoCD apps registry, and Tempo image version to 2.10.1.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add PromQL query for checking Tempo storage utilization
against PVC capacity using tempodb_backend_bytes_total.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
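
A storage-utilization check along these lines can be run against the Prometheus HTTP API. This is a sketch under assumptions: the exact PromQL added by the commit is not shown here, so the expression below is one plausible form built from `tempodb_backend_bytes_total` and a fixed PVC size, and `PROM_URL` is a hypothetical placeholder.

```shell
# Hypothetical default; replace with the real Prometheus URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"

pvc_bytes() {
  # Convert a size in GiB ($1) to bytes.
  echo "$(( $1 * 1024 * 1024 * 1024 ))"
}

utilization_query() {
  # Build a PromQL expression: percent of a $1-GiB PVC used by Tempo blocks.
  echo "100 * sum(tempodb_backend_bytes_total) / $(pvc_bytes "$1")"
}

query_utilization() {
  # Run the instant query; the response is the JSON from /api/v1/query.
  curl -sS -G "$PROM_URL/api/v1/query" \
    --data-urlencode "query=$(utilization_query "$1")"
}
```

Calling `query_utilization 10` evaluates the expression against a 10 GiB volume; the size is passed as an argument so the sketch does not depend on kubelet volume metrics being scraped.
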
Panels: Storage Used, PVC Utilization (% of 10Gi), Total
Blocks, Heap Usage, Storage Over Time, Span Ingestion Rate,
Ingestion Throughput, and Query Latency (p50/p95).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The local-blocks processor is required by Grafana's TraceQL
metrics queries. It keeps recent traces in memory for query-time
aggregation without duplicating data to storage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The local-blocks processor requires its own dedicated traces WAL
(traces_storage.path), separate from the ingester WAL and the
metrics generator WAL. Without it, the processor fails with
"local blocks processor requires traces wal".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
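
To confirm the failure described above is gone after a sync, the Tempo logs can be scanned for the quoted error string. A minimal sketch, assuming the message appears verbatim in the pod logs:

```shell
# Error text quoted in the commit message above.
TRACES_WAL_ERROR='local blocks processor requires traces wal'

has_traces_wal_error() {
  # Read log lines on stdin; succeed if the error string is present.
  grep -qF "$TRACES_WAL_ERROR"
}
```

For example, `kubectl --context=minikube-indri logs -n monitoring -l app=tempo | has_traces_wal_error && echo "traces WAL still missing"` reuses the context, namespace, and label selector from the deployment checklist.
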
eblume merged commit c281fb5403 into main 2026-03-05 10:51:07 -08:00
Reference
eblume/blumeops!286