Knocks out the two daily recurring review tasks (doc review + service review) in one PR.

## Doc review (4 never-reviewed reference cards, `last-reviewed: 2026-06-04`)

- **cluster.md**: Kubernetes version v1.34.0 → **v1.35.0**; refreshed the stale ringtail workload list and noted the in-progress minikube→k3s migration (points to `[[ringtail]]` as the canonical list).
- **ntfy.md / tempo.md / alloy.md**: corrected image references. These are now **locally-built `registry.ops.eblu.me/blumeops/*` nix containers** (ntfy v2.19.2, tempo v2.10.3, alloy-k8s v1.16.0), not upstream Docker Hub. Fly.io alloy binary bumped to v1.16.1.

## Service review

- **nvidia-device-plugin** (ringtail GPU): v0.19.0 → **v0.19.2**. Upstream patch releases: CDI/Tegra fixes + dependency bumps, no breaking changes for our manifest-based CDI + RuntimeClass setup (the service-account change in the notes is helm-only).

## Not in this PR (need container rebuilds, deferred)

The other stale services are locally-built nix images, so upgrading them is a forge-runner rebuild rather than a clean tag bump. They are left untouched (not date-bumped, so they resurface): **prometheus** (v3.10.0→v3.12.0), **loki** (3.6.7→3.7.2), **kube-state-metrics**, **homepage**. Happy to do these as a follow-up rebuild PR.

## Deploy / verify

Not yet deployed: `nvidia-device-plugin` still points at `main`. After review:

```
argocd app set nvidia-device-plugin --revision reviews-jun4 && argocd app sync nvidia-device-plugin
# after merge:
argocd app set nvidia-device-plugin --revision main && argocd app sync nvidia-device-plugin
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #366
| title | modified | last-reviewed | tags |
|---|---|---|---|
| Tempo | 2026-06-04 | 2026-06-04 | |
Grafana Tempo
Distributed tracing backend for BlumeOps infrastructure. Receives traces via OTLP, stores them locally, and generates RED metrics (rate, error, duration) for prometheus.
Quick Reference
| Property | Value |
|---|---|
| URL | https://tempo.ops.eblu.me (once a Caddy route is added) |
| Tailscale URL | https://tempo.tail8d86e.ts.net |
| OTLP Endpoint | https://tempo-otlp.tail8d86e.ts.net |
| Namespace | monitoring |
| Image | registry.ops.eblu.me/blumeops/tempo:v2.10.3-75f9ba4 (locally built) |
| Storage | 10Gi PVC (local filesystem) |
| Retention | 7 days |
Architecture
- Single-node deployment with local filesystem storage
- OTLP receivers: gRPC (4317) and HTTP (4318)
- metrics_generator produces span-metrics and service-graphs, remote-written to prometheus
- Queried via grafana Tempo datasource
- Two Tailscale Ingresses: one for query API (3200), one for OTLP HTTP receiver (4318)
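To make the RED metrics mentioned above concrete, here is a minimal conceptual sketch of how rate, error ratio, and duration could be aggregated per service from a batch of spans. It is an illustration only, not Tempo's metrics_generator implementation; the `Span` type, the `red_metrics` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, simplified span: only the fields RED metrics need."""
    service: str
    duration_ms: float
    is_error: bool


def red_metrics(spans, window_seconds):
    """Aggregate per-service rate, error ratio, and mean duration over one window."""
    totals = {}
    for span in spans:
        agg = totals.setdefault(
            span.service, {"count": 0, "errors": 0, "duration_ms": 0.0}
        )
        agg["count"] += 1
        agg["errors"] += 1 if span.is_error else 0
        agg["duration_ms"] += span.duration_ms
    return {
        service: {
            "rate_per_s": agg["count"] / window_seconds,
            "error_ratio": agg["errors"] / agg["count"],
            "mean_duration_ms": agg["duration_ms"] / agg["count"],
        }
        for service, agg in totals.items()
    }


# Made-up sample spans for two of the traced services.
spans = [
    Span("ntfy", 12.0, False),
    Span("ntfy", 48.0, True),
    Span("frigate", 30.0, False),
]
summary = red_metrics(spans, window_seconds=60)
print(summary["ntfy"])
```

In practice the generated series are usually exposed as counters and histograms rather than precomputed averages, so rates and latency percentiles are derived at query time in prometheus.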
Trace Sources
From ringtail (via Beyla eBPF in Alloy):
| Service | Protocol | Coverage |
|---|---|---|
| frigate | HTTP REST | Request rate, error rate, latency, trace spans |
| ntfy | HTTP | Same |
| ollama | HTTP REST | Same (model inference latency) |
| immich | HTTP REST | Same |
Beyla auto-instruments HTTP services via eBPF kernel hooks — no code changes needed.
Future: SDK instrumentation
Services with OTel SDK support (e.g., Hermes) can send traces directly to the OTLP endpoint for deeper internal spans (DB queries, business logic) alongside eBPF envelope traces.
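As a rough sketch of what a direct OTLP submission looks like on the wire, the snippet below builds a single-span payload in the OTLP/HTTP JSON encoding using only the standard library. It assumes the OTLP ingress serves the default `/v1/traces` path; the `build_otlp_payload` helper, the `hermes` service name, and the `db.query` span name are hypothetical. A real service would normally use an OTel SDK exporter instead of hand-built JSON.

```python
import json
import secrets
import time

# Assumption: the Tailscale OTLP ingress exposes the default OTLP/HTTP traces path.
OTLP_TRACES_URL = "https://tempo-otlp.tail8d86e.ts.net/v1/traces"


def build_otlp_payload(service_name, span_name, duration_ns):
    """Build a one-span trace export request in OTLP JSON encoding."""
    end_ns = time.time_ns()
    start_ns = end_ns - duration_ns
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{
                "scope": {"name": "manual-example"},  # hypothetical scope name
                "spans": [{
                    "traceId": secrets.token_hex(16),  # 16 bytes, hex-encoded
                    "spanId": secrets.token_hex(8),    # 8 bytes, hex-encoded
                    "name": span_name,
                    "kind": 1,  # SPAN_KIND_INTERNAL
                    "startTimeUnixNano": str(start_ns),
                    "endTimeUnixNano": str(end_ns),
                }],
            }],
        }]
    }


payload = build_otlp_payload("hermes", "db.query", duration_ns=5_000_000)
body = json.dumps(payload)
# Sending is left out here: it would be an HTTP POST of `body` to OTLP_TRACES_URL
# with the header Content-Type: application/json.
```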
Storage Monitoring
Tempo exposes tempodb_backend_bytes_total via its /metrics endpoint (scraped by prometheus). To check storage utilization against the 10Gi PVC:
tempodb_backend_bytes_total / 10737418240 * 100
Full PVC-level monitoring (via kubelet volume stats) is not yet available — see backlog.
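The constant in the expression above is the PVC capacity in bytes (10Gi = 10 × 1024³). The same arithmetic as a small sketch, with a hypothetical helper name and a made-up sample value:

```python
PVC_CAPACITY_BYTES = 10 * 1024**3  # 10Gi, i.e. 10737418240 bytes


def pvc_utilization_percent(backend_bytes, capacity_bytes=PVC_CAPACITY_BYTES):
    """Return backend storage as a percentage of PVC capacity."""
    return backend_bytes / capacity_bytes * 100


# Made-up sample: 5 GiB of trace data on the 10Gi volume.
print(pvc_utilization_percent(5 * 1024**3))  # → 50.0
```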
Grafana Integration
- Tempo datasource with trace-to-log and trace-to-metrics correlation
- Service map and node graph visualization
- Loki derived fields link trace IDs in logs back to Tempo
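A derived field works by matching a regex against each log line and turning the captured value into a link to the trace. The sketch below shows that matching step in isolation. The `trace_id=<32 hex chars>` log format and the pattern are assumptions for illustration; the regex actually configured on the Loki datasource is not shown in this card.

```python
import re

# Assumption: logs carry the trace ID as trace_id=<32 lowercase hex characters>.
TRACE_ID_PATTERN = re.compile(r"trace_id=([0-9a-f]{32})")


def extract_trace_id(log_line):
    """Return the trace ID captured from a log line, or None if absent."""
    match = TRACE_ID_PATTERN.search(log_line)
    return match.group(1) if match else None


line = 'level=info msg="request done" trace_id=4bf92f3577b34da6a3ce929d0e0e4736 status=200'
print(extract_trace_id(line))  # → 4bf92f3577b34da6a3ce929d0e0e4736
```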
Related
- alloy - Trace collector (Beyla eBPF on ringtail)
- prometheus - Receives span-metrics from Tempo
- loki - Log correlation via trace IDs
- grafana - Trace visualization