blumeops/docs/reference/services/grafana.md
Erich Blume c281fb5403 Add OpenTelemetry distributed tracing (Tempo + Beyla eBPF) (#286)
## Summary

Adds the third observability pillar — **distributed tracing** — alongside existing metrics (Prometheus) and logs (Loki).

- **Grafana Tempo 2.10.1** on minikube-indri for trace storage with 7d retention, OTLP receivers, and `metrics_generator` that remote-writes span-metrics (RED) to Prometheus
- **Beyla eBPF auto-instrumentation** via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes
- **Grafana integration** — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking
- **Prometheus** scrapes Tempo operational metrics

### Architecture

```
ringtail (k3s)                                indri (minikube)
┌──────────────────────┐                      ┌─────────────────────┐
│ Alloy+Beyla (eBPF)   │──OTLP HTTP────────→ │ Tempo               │
│  ↳ Frigate, ntfy,    │  via tailnet         │  ↳ trace storage    │
│    Ollama, Immich    │                      │  ↳ RED → Prometheus │
└──────────────────────┘                      │                     │
                                              │ Grafana             │
                                              │  ↳ Tempo datasource │
                                              └─────────────────────┘
```

### New files (12)
- `docs/reference/services/tempo.md` — reference doc
- `docs/changelog.d/feature-otel-tracing.feature.md`
- `argocd/apps/tempo.yaml` + `argocd/manifests/tempo/` (6 files)
- `argocd/apps/alloy-tracing-ringtail.yaml` + `argocd/manifests/alloy-tracing-ringtail/` (4 files)

### Modified files (6)
- `argocd/manifests/grafana/datasources.yaml` — Tempo datasource + Loki derivedFields
- `argocd/manifests/prometheus/prometheus.yml` — Tempo scrape target
- `service-versions.yaml` — tempo + alloy-tracing-ringtail entries
- `docs/reference/services/grafana.md` — Tempo in datasources table
- `docs/reference/reference.md` — Tempo in services index
- `docs/reference/operations/observability.md` — Tempo in components list

## Deployment and Testing

- [ ] Sync `apps` app to pick up new Application definitions
- [ ] `argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo`
- [ ] Verify Tempo pod: `kubectl --context=minikube-indri get pods -n monitoring -l app=tempo`
- [ ] Verify Tempo ready: port-forward 3200 and `curl localhost:3200/ready`
- [ ] Verify Tailscale ingresses: `kubectl --context=minikube-indri get ingress -n monitoring`
- [ ] `argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail`
- [ ] Check Beyla discovery in alloy-tracing logs on ringtail
- [ ] Sync grafana-config for updated datasources
- [ ] Sync prometheus for updated scrape config
- [ ] Test Grafana Tempo datasource connection
- [ ] Generate test traffic and search traces in Grafana Explore → Tempo
- [ ] After merge: reset all ArgoCD app revisions back to main

Reviewed-on: #286
2026-03-05 10:51:07 -08:00


---
title: Grafana
modified: 2026-02-28
tags:
- service
- observability
---
# Grafana
Dashboards and visualization for BlumeOps observability.
## Quick Reference
| Property | Value |
|----------|-------|
| **URL** | https://grafana.ops.eblu.me |
| **Tailscale URL** | https://grafana.tail8d86e.ts.net |
| **Namespace** | `monitoring` |
| **Deployment** | Kustomize (`argocd/manifests/grafana/`) |
| **Image** | `registry.ops.eblu.me/blumeops/grafana` |
| **Sidecar Image** | `registry.ops.eblu.me/blumeops/grafana-sidecar` |
## Authentication
Grafana supports two login methods:
- **SSO via [[authentik]]** — OIDC login through Authentik (`auth.generic_oauth`). Users click "Sign in with Authentik", authenticate at Authentik, and are redirected back as Admin.
- **Local admin** — break-glass login using the password from 1Password ("Grafana (blumeops)"). Always available if Authentik is down.

The OIDC client secret is injected via [[external-secrets]] (the `grafana-authentik-oauth` Secret in the `monitoring` namespace).
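
As a rough sketch of how these pieces could fit together, the excerpt below sets `auth.generic_oauth` options through Grafana's `GF_<SECTION>_<KEY>` environment overrides and reads the client secret from the `grafana-authentik-oauth` Secret. The actual manifests may configure this through `grafana.ini` instead. The client ID, the Secret key name `client-secret`, and the host `authentik.example.com` are assumptions for illustration, and the role mapping that grants Admin is not shown.

```yaml
# Hypothetical excerpt from the Grafana container spec (not the repo's actual manifest).
env:
  - name: GF_AUTH_GENERIC_OAUTH_ENABLED
    value: "true"
  - name: GF_AUTH_GENERIC_OAUTH_NAME
    value: "Authentik"                  # shown on the login button as "Sign in with Authentik"
  - name: GF_AUTH_GENERIC_OAUTH_CLIENT_ID
    value: "grafana"                    # assumed client ID
  - name: GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET
    valueFrom:
      secretKeyRef:
        name: grafana-authentik-oauth   # Secret named on this page
        key: client-secret              # assumed key name
  - name: GF_AUTH_GENERIC_OAUTH_SCOPES
    value: "openid profile email"
  # Authentik's standard OAuth2 endpoints; the host is a placeholder.
  - name: GF_AUTH_GENERIC_OAUTH_AUTH_URL
    value: "https://authentik.example.com/application/o/authorize/"
  - name: GF_AUTH_GENERIC_OAUTH_TOKEN_URL
    value: "https://authentik.example.com/application/o/token/"
  - name: GF_AUTH_GENERIC_OAUTH_API_URL
    value: "https://authentik.example.com/application/o/userinfo/"
```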
## Datasources
| Name | Type | Target |
|------|------|--------|
| Prometheus | prometheus | `prometheus.monitoring.svc.cluster.local:9090` |
| Loki | loki | `loki.monitoring.svc.cluster.local:3100` |
| Tempo | tempo | `tempo.monitoring.svc.cluster.local:3200` |
| TeslaMate | postgres | `blumeops-pg-rw.databases.svc.cluster.local:5432` |
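
For orientation, the first three rows of the table could be expressed in Grafana's datasource provisioning format roughly as follows. This is a minimal sketch, not a copy of `argocd/manifests/grafana/datasources.yaml`: the `http://` scheme and `access: proxy` are assumptions, and the TeslaMate entry is left out because its database name and credentials are not documented here.

```yaml
# Illustrative provisioning file; field values beyond name, type, and target are assumed.
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy          # Grafana's backend proxies the queries
    url: http://prometheus.monitoring.svc.cluster.local:9090
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.monitoring.svc.cluster.local:3100
  - name: Tempo
    type: tempo
    access: proxy
    url: http://tempo.monitoring.svc.cluster.local:3200
```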
## Dashboard Provisioning
Dashboards are ConfigMaps with label `grafana_dashboard: "1"`.

Location: `argocd/manifests/grafana-config/dashboards/`

Optional annotation: `grafana_folder: "FolderName"`
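
A minimal sketch of a dashboard ConfigMap that follows the convention above. The ConfigMap name, the `monitoring` namespace, the folder name, and the dashboard JSON are all placeholders; only the label and annotation keys come from this page.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-dashboard          # hypothetical name
  namespace: monitoring            # assumed namespace
  labels:
    grafana_dashboard: "1"         # marks this ConfigMap as a dashboard
  annotations:
    grafana_folder: "Examples"     # optional: target folder in Grafana
data:
  # The key becomes the dashboard file name; the JSON is a bare-bones placeholder.
  example-dashboard.json: |
    {
      "title": "Example",
      "uid": "example",
      "panels": []
    }
```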
## Key Dashboards
- macOS System - Host metrics for indri
- Minikube - Kubernetes cluster overview
- Borgmatic Backups - Backup status and trends
- Services Health - HTTP probe results
- Docs APM - Request rate, latency, cache for docs.eblu.me
- Fly.io Proxy Health - Aggregate proxy health across all upstream services
- TeslaMate (18 dashboards) - Vehicle data
## Related
- [[build-grafana-container]] - Home-built container image
- [[build-grafana-sidecar]] - Home-built sidecar container
- [[kustomize-grafana-deployment]] - Kustomize manifest structure
- [[authentik]] - OIDC identity provider for SSO
- [[prometheus]] - Metrics datasource
- [[loki]] - Logs datasource
- [[tempo]] - Traces datasource
- [[alloy|Alloy]] - Data collector