blumeops/docs/reference/services/tempo.md
Erich Blume bb55fa9566 Recurring review sweep: 4 doc cards + nvidia-device-plugin v0.19.2 (#366)
2026-06-04 13:37:02 -07:00

---
title: Tempo
modified: 2026-06-04
last-reviewed: 2026-06-04
tags:
- service
- observability
---
# Grafana Tempo
Distributed tracing backend for BlumeOps infrastructure. Receives traces via OTLP, stores them locally, and generates RED metrics (rate, error, duration) for [[prometheus]].
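To make the RED terminology concrete, here is a minimal Python sketch of how rate, error ratio, and duration could be derived from a batch of finished spans. It is illustrative only and is not Tempo's `metrics_generator` implementation; the `Span` type and the `red_summary` helper are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Span:
    """Hypothetical, simplified span record (not Tempo's data model)."""
    service: str
    duration_ms: float
    is_error: bool


def red_summary(spans: List[Span], window_seconds: float) -> Dict[str, float]:
    """Compute rate, error ratio, and mean duration over one time window."""
    if not spans or window_seconds <= 0:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "mean_duration_ms": 0.0}
    total = len(spans)
    errors = sum(1 for span in spans if span.is_error)
    return {
        "rate_per_s": total / window_seconds,            # R: requests per second
        "error_ratio": errors / total,                   # E: share of failed requests
        "mean_duration_ms": sum(s.duration_ms for s in spans) / total,  # D: latency
    }


if __name__ == "__main__":
    sample = [
        Span("ntfy", 12.0, False),
        Span("ntfy", 30.0, False),
        Span("ntfy", 48.0, True),
    ]
    print(red_summary(sample, window_seconds=60))
```

In practice the duration signal is usually exported as a histogram rather than a mean, so that percentiles can be computed at query time; the mean is used here only to keep the sketch short.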
## Quick Reference
| Property | Value |
|----------|-------|
| **URL** | https://tempo.ops.eblu.me (available once a Caddy route is added) |
| **Tailscale URL** | https://tempo.tail8d86e.ts.net |
| **OTLP Endpoint** | https://tempo-otlp.tail8d86e.ts.net |
| **Namespace** | `monitoring` |
| **Image** | `registry.ops.eblu.me/blumeops/tempo:v2.10.3-75f9ba4` (locally built) |
| **Storage** | 10Gi PVC (local filesystem) |
| **Retention** | 7 days |
## Architecture
- Single-node deployment with local filesystem storage
- OTLP receivers: gRPC (4317) and HTTP (4318)
- `metrics_generator` produces span-metrics and service-graphs, remote-written to [[prometheus]]
- Queried via [[grafana]] Tempo datasource
- Two Tailscale Ingresses: one for query API (3200), one for OTLP HTTP receiver (4318)
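As a rough illustration of the query API mentioned above, the following Python sketch builds the trace-by-ID URL and fetches a trace through the Tailscale Ingress. Tempo serves traces at `GET /api/traces/<trace id>`; the sketch assumes the Ingress exposes that API at the root path without extra authentication, which this page does not confirm. The helper names `trace_url` and `fetch_trace` are hypothetical.

```python
import json
import urllib.request

# Tailscale URL from the Quick Reference table; assumed to front port 3200.
TEMPO_QUERY_URL = "https://tempo.tail8d86e.ts.net"

HEX_DIGITS = set("0123456789abcdef")


def trace_url(trace_id: str, base_url: str = TEMPO_QUERY_URL) -> str:
    """Build the trace-by-ID URL after validating the hex trace ID."""
    trace_id = trace_id.strip().lower()
    if not trace_id or len(trace_id) > 32 or not set(trace_id) <= HEX_DIGITS:
        raise ValueError(f"not a hex trace ID: {trace_id!r}")
    return f"{base_url.rstrip('/')}/api/traces/{trace_id}"


def fetch_trace(trace_id: str, timeout: float = 10.0) -> dict:
    """Fetch one trace as JSON; requires network access to the tailnet."""
    request = urllib.request.Request(
        trace_url(trace_id), headers={"Accept": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Only `trace_url` is safe to call offline; `fetch_trace` performs a real HTTP request and will fail outside the tailnet.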
## Trace Sources
**From ringtail (via Beyla eBPF in Alloy):**
| Service | Protocol | Coverage |
|---------|----------|----------|
| [[frigate]] | HTTP REST | Request rate, error rate, latency, trace spans |
| [[ntfy]] | HTTP | Same as frigate |
| [[ollama]] | HTTP REST | Same as frigate (includes model inference latency) |
| [[immich]] | HTTP REST | Same as frigate |
Beyla auto-instruments HTTP services via eBPF kernel hooks — no code changes needed.
**Future: SDK instrumentation**
Services with OTel SDK support (e.g., Hermes) can send traces directly to the OTLP endpoint for deeper internal spans (DB queries, business logic) alongside eBPF envelope traces.
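For a sense of what "sending traces directly to the OTLP endpoint" involves on the wire, here is a minimal Python sketch that builds a single-span OTLP/HTTP JSON payload and posts it to the `/v1/traces` path. A real service would normally use an OTel SDK exporter instead of hand-built JSON. The service name, scope name, and helper names are hypothetical, and the sketch assumes the Ingress forwards to the HTTP receiver (4318) unchanged and accepts JSON-encoded OTLP.

```python
import json
import secrets
import time
import urllib.request

# OTLP endpoint from the Quick Reference table.
OTLP_ENDPOINT = "https://tempo-otlp.tail8d86e.ts.net"


def build_otlp_payload(service_name: str, span_name: str, duration_ms: int) -> dict:
    """Build an OTLP/HTTP JSON body containing one span that ends now."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ms * 1_000_000
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-sketch"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 random bytes, hex-encoded
                    "spanId": secrets.token_hex(8),    # 8 random bytes, hex-encoded
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    # OTLP JSON encodes 64-bit integers as strings.
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }],
    }


def send_span(payload: dict, timeout: float = 10.0) -> int:
    """POST the payload to the OTLP HTTP receiver and return the HTTP status."""
    request = urllib.request.Request(
        f"{OTLP_ENDPOINT}/v1/traces",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `send_span(build_otlp_payload("hermes-demo", "db.query", 25))` from a tailnet host should make the span searchable in Grafana under the hypothetical `hermes-demo` service name, provided the assumptions above hold.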
## Storage Monitoring
Tempo exposes `tempodb_backend_bytes_total` via its `/metrics` endpoint (scraped by [[prometheus]]). To check storage utilization against the 10Gi PVC:
```promql
# 10737418240 bytes = 10 GiB (10 * 1024^3), the PVC capacity; result is a percentage
tempodb_backend_bytes_total / 10737418240 * 100
```
Full PVC-level monitoring (via kubelet volume stats) is not yet available — see backlog.
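The arithmetic behind the query above can be checked with a small Python sketch. It only reproduces the division; the `utilization_percent` helper is a hypothetical name, and the input would come from whatever value Prometheus reports for `tempodb_backend_bytes_total`.

```python
PVC_CAPACITY_BYTES = 10 * 1024**3  # 10Gi PVC, i.e. 10737418240 bytes


def utilization_percent(backend_bytes: float,
                        capacity_bytes: int = PVC_CAPACITY_BYTES) -> float:
    """Return backend usage as a percentage of the PVC capacity."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    return backend_bytes / capacity_bytes * 100


if __name__ == "__main__":
    # 2.5 GiB of stored trace data on a 10 GiB volume
    print(f"{utilization_percent(2.5 * 1024**3):.1f}%")  # prints 25.0%
```

Note that this measures bytes Tempo reports for its backend, not filesystem usage on the volume, so it can differ from what kubelet volume stats would show.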
## Grafana Integration
- **Tempo datasource** with trace-to-log and trace-to-metrics correlation
- **Service map** and **node graph** visualization
- **Loki derived fields** link trace IDs in logs back to Tempo
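Derived fields in Grafana's Loki datasource work by applying a regular expression with one capture group to each log line and turning the captured value into a link. The Python sketch below shows the same matching idea. The `trace_id=<32 hex chars>` log format and the pattern itself are assumptions for illustration; the regex actually configured for BlumeOps is not stated on this page.

```python
import re
from typing import Optional

# Assumed log format: "... trace_id=<32 lowercase hex chars> ..."
TRACE_ID_PATTERN = re.compile(r"trace_id=([0-9a-f]{32})")


def extract_trace_id(log_line: str) -> Optional[str]:
    """Return the first trace ID found in a log line, or None if absent."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


if __name__ == "__main__":
    line = "level=info msg=published trace_id=0123456789abcdef0123456789abcdef"
    print(extract_trace_id(line))
```

Testing a candidate pattern against real log lines this way, before pasting it into the datasource settings, helps catch regexes that match too little or too much.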
## Related
- [[alloy|Alloy]] - Trace collector (Beyla eBPF on ringtail)
- [[prometheus]] - Receives span-metrics from Tempo
- [[loki]] - Log correlation via trace IDs
- [[grafana]] - Trace visualization