Erich Blume bb55fa9566 Recurring review sweep: 4 doc cards + nvidia-device-plugin v0.19.2 (#366)

Knocks out the two daily recurring review tasks (doc review + service review) in one PR.

## Doc review (4 never-reviewed reference cards, `last-reviewed: 2026-06-04`)
- **cluster.md** — Kubernetes version v1.34.0 → **v1.35.0**; refreshed the stale ringtail workload list and noted the in-progress minikube→k3s migration (points to `[[ringtail]]` as the canonical list).
- **ntfy.md / tempo.md / alloy.md** — corrected image references: these are now **locally-built `registry.ops.eblu.me/blumeops/*` nix containers** (ntfy v2.19.2, tempo v2.10.3, alloy-k8s v1.16.0), not upstream Docker Hub. Fly.io alloy binary bumped to v1.16.1.

## Service review
- **nvidia-device-plugin** (ringtail GPU): v0.19.0 → **v0.19.2**. Upstream patch releases — CDI/Tegra fixes + dependency bumps, no breaking changes for our manifest-based CDI + RuntimeClass setup (the service-account change in the notes is helm-only).

## Not in this PR (need container rebuilds, deferred)
The other stale services are locally-built nix images, so upgrading them is a forge-runner rebuild rather than a clean tag bump — left untouched (not date-bumped, so they resurface): **prometheus** (v3.10.0→v3.12.0), **loki** (3.6.7→3.7.2), **kube-state-metrics**, **homepage**. Happy to do these as a follow-up rebuild PR.

## Deploy / verify
Not yet deployed — `nvidia-device-plugin` still points at `main`. After review:
```
argocd app set nvidia-device-plugin --revision reviews-jun4 && argocd app sync nvidia-device-plugin
# after merge:
argocd app set nvidia-device-plugin --revision main && argocd app sync nvidia-device-plugin
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #366

2026-06-04 13:37:02 -07:00

2.4 KiB

Grafana Tempo

Distributed tracing backend for BlumeOps infrastructure. Receives traces via OTLP, stores them locally, and generates RED metrics (rate, error, duration) for prometheus.

Quick Reference

| Property | Value |
|---|---|
| URL | https://tempo.ops.eblu.me (when Caddy route added) |
| Tailscale URL | https://tempo.tail8d86e.ts.net |
| OTLP Endpoint | https://tempo-otlp.tail8d86e.ts.net |
| Namespace | `monitoring` |
| Image | `registry.ops.eblu.me/blumeops/tempo:v2.10.3-75f9ba4` (locally built) |
| Storage | 10Gi PVC (local filesystem) |
| Retention | 7 days |

Architecture

- Single-node deployment with local filesystem storage
- OTLP receivers: gRPC (4317) and HTTP (4318)
- metrics_generator produces span-metrics and service-graphs, remote-written to prometheus
- Queried via grafana Tempo datasource
- Two Tailscale Ingresses: one for query API (3200), one for OTLP HTTP receiver (4318)
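
To make the span-metrics idea concrete, here is a minimal Python sketch of how RED metrics can be derived from a batch of spans. It is a conceptual illustration only, not Tempo's implementation: the real metrics_generator emits Prometheus counters and histograms, and the `Span` type, `red_metrics` function, and sample values below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical, simplified span: just what RED metrics need."""
    service: str
    duration_ms: float
    is_error: bool


def red_metrics(spans, window_seconds):
    """Aggregate spans into per-service rate, error ratio, and mean duration."""
    by_service = {}
    for span in spans:
        by_service.setdefault(span.service, []).append(span)

    result = {}
    for service, group in by_service.items():
        count = len(group)
        errors = sum(1 for s in group if s.is_error)
        result[service] = {
            "rate_per_s": count / window_seconds,     # Rate
            "error_ratio": errors / count,            # Errors
            "mean_duration_ms": sum(s.duration_ms for s in group) / count,  # Duration
        }
    return result


# Made-up sample spans observed over a 60-second window.
spans = [
    Span("frigate", 12.0, False),
    Span("frigate", 48.0, True),
    Span("ntfy", 5.0, False),
]
metrics = red_metrics(spans, window_seconds=60)
print(metrics["frigate"])
```

In practice the duration signal is usually a histogram rather than a mean, so percentiles can be computed in prometheus; the mean is used here only to keep the sketch short.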

Trace Sources

From ringtail (via Beyla eBPF in Alloy):

| Service | Protocol | Coverage |
|---|---|---|
| frigate | HTTP REST | Request rate, error rate, latency, trace spans |
| ntfy | HTTP | Same |
| ollama | HTTP REST | Same (model inference latency) |
| immich | HTTP REST | Same |

Beyla auto-instruments HTTP services via eBPF kernel hooks — no code changes needed.

Future (SDK instrumentation): Services with OTel SDK support (e.g., Hermes) can send traces directly to the OTLP endpoint, adding deeper internal spans (DB queries, business logic) alongside the eBPF envelope traces.
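
As a rough illustration of what "sending traces directly" involves, the Python sketch below builds an OTLP/HTTP JSON payload for a single span without sending it. Assumptions: the OTLP ingress forwards the standard `/v1/traces` path unchanged, and the `hermes` / `db.query` names and the `build_trace_payload` helper are hypothetical. A real service would normally use the OTel SDK's OTLP exporter instead of hand-built JSON.

```python
import json
import secrets
import time

OTLP_ENDPOINT = "https://tempo-otlp.tail8d86e.ts.net"  # from the Quick Reference table
TRACES_PATH = "/v1/traces"  # standard OTLP/HTTP path; assumed to be routed as-is


def build_trace_payload(service_name, span_name, duration_ms):
    """Build an OTLP/HTTP JSON body containing one span that ends now."""
    end_ns = time.time_ns()
    start_ns = end_ns - int(duration_ms * 1_000_000)
    span = {
        "traceId": secrets.token_hex(16),  # 16 random bytes -> 32 hex characters
        "spanId": secrets.token_hex(8),    # 8 random bytes -> 16 hex characters
        "name": span_name,
        "kind": 1,  # SPAN_KIND_INTERNAL
        "startTimeUnixNano": str(start_ns),
        "endTimeUnixNano": str(end_ns),
    }
    return {
        "resourceSpans": [{
            "resource": {"attributes": [
                {"key": "service.name", "value": {"stringValue": service_name}},
            ]},
            "scopeSpans": [{"scope": {"name": "manual-example"}, "spans": [span]}],
        }]
    }


payload = build_trace_payload("hermes", "db.query", duration_ms=25)
body = json.dumps(payload)
url = OTLP_ENDPOINT + TRACES_PATH
print(url)
```

The body would be sent as an HTTP POST with `Content-Type: application/json`; the request itself is left out so the sketch runs without network access.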

Storage Monitoring

Tempo exposes `tempodb_backend_bytes_total` via its `/metrics` endpoint (scraped by prometheus). To express storage utilization as a percentage of the 10Gi PVC (10737418240 bytes):

`tempodb_backend_bytes_total / 10737418240 * 100`

Full PVC-level monitoring (via kubelet volume stats) is not yet available — see backlog.
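
The arithmetic behind the utilization query can be checked with a short Python sketch. The sample byte counts are made up, and the summing step assumes the metric may be reported as more than one series; that is an assumption about label layout, not something this card confirms.

```python
GIB = 1024 ** 3
PVC_CAPACITY_BYTES = 10 * GIB  # 10Gi PVC


def utilization_percent(series_bytes, capacity_bytes=PVC_CAPACITY_BYTES):
    """Return total backend bytes as a percentage of the PVC capacity.

    series_bytes is an iterable of per-series byte values (hypothetical samples).
    """
    return sum(series_bytes) / capacity_bytes * 100


print(PVC_CAPACITY_BYTES)                       # 10737418240
print(utilization_percent([3 * GIB, 2 * GIB]))  # 50.0
```

The constant in the query is simply 10 × 1024³, so the divisor has to be updated if the PVC is ever resized.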

Grafana Integration

- Tempo datasource with trace-to-log and trace-to-metrics correlation
- Service map and node graph visualization
- Loki derived fields link trace IDs in logs back to Tempo
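
Derived fields work by matching a regex against each log line and turning the captured value into a link. The Python sketch below mimics that mechanism under stated assumptions: the `trace_id=<32 hex chars>` log format and the `trace_link` helper are hypothetical, and the link targets Tempo's trace-by-ID API directly, whereas Grafana would normally link to the Tempo datasource.

```python
import re

TEMPO_QUERY_URL = "https://tempo.tail8d86e.ts.net"  # from the Quick Reference table

# Assumed log format: key=value pairs with a 32-character hex trace ID.
TRACE_ID_PATTERN = re.compile(r"trace_id=([0-9a-f]{32})")


def trace_link(log_line):
    """Return a Tempo trace URL for the trace ID in log_line, or None if absent."""
    match = TRACE_ID_PATTERN.search(log_line)
    if match is None:
        return None
    return f"{TEMPO_QUERY_URL}/api/traces/{match.group(1)}"


line = 'level=info msg="request done" trace_id=4bf92f3577b34da6a3ce929d0e0e4736 status=200'
print(trace_link(line))
print(trace_link("level=info msg=startup"))
```

The regex would need adjusting to whatever format the services actually log, for example a JSON field instead of a key=value pair.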

Related

- alloy - Trace collector (Beyla eBPF on ringtail)
- prometheus - Receives span-metrics from Tempo
- loki - Log correlation via trace IDs
- grafana - Trace visualization
