From 3be9fa948c2297e76ef40e18f181b1c2374fdc61 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 5 Mar 2026 10:01:47 -0800
Subject: [PATCH 1/9] Add Tempo reference doc and changelog fragment (docs-first)

Tempo is the new distributed tracing backend for BlumeOps, completing the
third observability pillar alongside Prometheus (metrics) and Loki (logs).

Co-Authored-By: Claude Opus 4.6
---
 .../feature-otel-tracing.feature.md | 1 +
 docs/reference/operations/observability.md | 5 +-
 docs/reference/reference.md | 1 +
 docs/reference/services/tempo.md | 60 +++++++++++++++++++
 4 files changed, 65 insertions(+), 2 deletions(-)
 create mode 100644 docs/changelog.d/feature-otel-tracing.feature.md
 create mode 100644 docs/reference/services/tempo.md

diff --git a/docs/changelog.d/feature-otel-tracing.feature.md b/docs/changelog.d/feature-otel-tracing.feature.md
new file mode 100644
index 0000000..5d5d4ab
--- /dev/null
+++ b/docs/changelog.d/feature-otel-tracing.feature.md
@@ -0,0 +1 @@
+Add distributed tracing via Grafana Tempo and Beyla eBPF auto-instrumentation. Tempo runs on minikube-indri for trace storage, while a privileged Alloy DaemonSet on ringtail uses Beyla to instrument HTTP services (Frigate, ntfy, Ollama, Immich) without code changes. Grafana gets trace-to-log and trace-to-metrics correlation.
diff --git a/docs/reference/operations/observability.md b/docs/reference/operations/observability.md
index 6c42193..5890147 100644
--- a/docs/reference/operations/observability.md
+++ b/docs/reference/operations/observability.md
@@ -7,11 +7,12 @@ tags:
 
 # Observability
 
-Metrics, logs, and dashboards for BlumeOps infrastructure.
+Metrics, logs, traces, and dashboards for BlumeOps infrastructure.
 
 ## Components
 
 - [[prometheus]] - Metrics storage and querying
 - [[loki]] - Log aggregation
-- [[alloy|Alloy]] - Metrics and log collection
+- [[tempo]] - Distributed tracing
+- [[alloy|Alloy]] - Metrics, log, and trace collection
 - [[grafana]] - Dashboards and visualization
diff --git a/docs/reference/reference.md b/docs/reference/reference.md
index 9faa8e2..e9baa20 100644
--- a/docs/reference/reference.md
+++ b/docs/reference/reference.md
@@ -27,6 +27,7 @@ Individual service reference cards with URLs and configuration details.
 | [[jellyfin]] | Media server | indri |
 | [[kiwix]] | Offline Wikipedia & ZIM archives | k8s |
 | [[loki]] | Log aggregation | k8s |
+| [[tempo]] | Distributed tracing | k8s |
 | [[miniflux]] | RSS feed reader | k8s |
 | [[navidrome]] | Music streaming | k8s |
 | [[ntfy]] | Push notifications | k8s (ringtail) |
diff --git a/docs/reference/services/tempo.md b/docs/reference/services/tempo.md
new file mode 100644
index 0000000..3aea029
--- /dev/null
+++ b/docs/reference/services/tempo.md
@@ -0,0 +1,60 @@
+---
+title: Tempo
+modified: 2026-03-05
+tags:
+  - service
+  - observability
+---
+
+# Grafana Tempo
+
+Distributed tracing backend for BlumeOps infrastructure. Receives traces via OTLP, stores them locally, and generates RED metrics (rate, error, duration) for [[prometheus]].
+
+## Quick Reference
+
+| Property | Value |
+|----------|-------|
+| **URL** | https://tempo.ops.eblu.me (when Caddy route added) |
+| **Tailscale URL** | https://tempo.tail8d86e.ts.net |
+| **OTLP Endpoint** | https://tempo-otlp.tail8d86e.ts.net |
+| **Namespace** | `monitoring` |
+| **Image** | `grafana/tempo:2.7.2` |
+| **Storage** | 10Gi PVC (local filesystem) |
+| **Retention** | 7 days |
+
+## Architecture
+
+- Single-node deployment with local filesystem storage
+- OTLP receivers: gRPC (4317) and HTTP (4318)
+- `metrics_generator` produces span-metrics and service-graphs, remote-written to [[prometheus]]
+- Queried via [[grafana]] Tempo datasource
+- Two Tailscale Ingresses: one for query API (3200), one for OTLP HTTP receiver (4318)
+
+## Trace Sources
+
+**From ringtail (via Beyla eBPF in Alloy):**
+
+| Service | Protocol | Coverage |
+|---------|----------|----------|
+| [[frigate]] | HTTP REST | Request rate, error rate, latency, trace spans |
+| [[ntfy]] | HTTP | Same |
+| [[ollama]] | HTTP REST | Same (model inference latency) |
+| [[immich]] | HTTP REST | Same |
+
+Beyla auto-instruments HTTP services via eBPF kernel hooks — no code changes needed. MQTT (Mosquitto) is not instrumented (no eBPF parser for MQTT).
+
+**Future: SDK instrumentation**
+Services with OTel SDK support (e.g., Hermes) can send traces directly to the OTLP endpoint for deeper internal spans (DB queries, business logic) alongside eBPF envelope traces.
+
+## Grafana Integration
+
+- **Tempo datasource** with trace-to-log and trace-to-metrics correlation
+- **Service map** and **node graph** visualization
+- **Loki derived fields** link trace IDs in logs back to Tempo
+
+## Related
+
+- [[alloy|Alloy]] - Trace collector (Beyla eBPF on ringtail)
+- [[prometheus]] - Receives span-metrics from Tempo
+- [[loki]] - Log correlation via trace IDs
+- [[grafana]] - Trace visualization
-- 
2.50.1 (Apple Git-155)

From 3fc06cda889358ae92546ead09ee697ae1b0acd6 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 5 Mar 2026 10:03:17 -0800
Subject: [PATCH 2/9] Add Tempo manifests and ArgoCD Application

Deploys Grafana Tempo 2.10.1 on minikube-indri for distributed trace
storage. Includes OTLP receivers (gRPC + HTTP), local filesystem storage
with 7d retention, and metrics_generator that remote-writes span-metrics
to Prometheus. Two Tailscale Ingresses: tempo (query API) and tempo-otlp
(OTLP HTTP receiver for cross-cluster trace ingestion).

Co-Authored-By: Claude Opus 4.6
---
 argocd/apps/tempo.yaml | 17 +++++
 .../tempo/ingress-tailscale-otlp.yaml | 27 +++++++
 argocd/manifests/tempo/ingress-tailscale.yaml | 26 +++++++
 argocd/manifests/tempo/kustomization.yaml | 19 +++++
 argocd/manifests/tempo/service.yaml | 22 ++++++
 argocd/manifests/tempo/statefulset.yaml | 70 +++++++++++++++++++
 argocd/manifests/tempo/tempo.yaml | 53 ++++++++++++++
 7 files changed, 234 insertions(+)
 create mode 100644 argocd/apps/tempo.yaml
 create mode 100644 argocd/manifests/tempo/ingress-tailscale-otlp.yaml
 create mode 100644 argocd/manifests/tempo/ingress-tailscale.yaml
 create mode 100644 argocd/manifests/tempo/kustomization.yaml
 create mode 100644 argocd/manifests/tempo/service.yaml
 create mode 100644 argocd/manifests/tempo/statefulset.yaml
 create mode 100644 argocd/manifests/tempo/tempo.yaml

diff --git a/argocd/apps/tempo.yaml b/argocd/apps/tempo.yaml
new file mode 100644
index 0000000..b04d297
--- /dev/null
+++ b/argocd/apps/tempo.yaml
@@ -0,0 +1,17 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: tempo
+  namespace: argocd
+spec:
+  project: default
+  source:
+    repoURL: ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git
+    targetRevision: main
+    path: argocd/manifests/tempo
+  destination:
+    server: https://kubernetes.default.svc
+    namespace: monitoring
+  syncPolicy:
+    syncOptions:
+      - CreateNamespace=true
diff --git a/argocd/manifests/tempo/ingress-tailscale-otlp.yaml b/argocd/manifests/tempo/ingress-tailscale-otlp.yaml
new file mode 100644
index 0000000..ed65113
--- /dev/null
+++ b/argocd/manifests/tempo/ingress-tailscale-otlp.yaml
@@ -0,0 +1,27 @@
+# Tailscale Ingress for Tempo OTLP HTTP receiver
+# Used by ringtail Alloy to push traces across tailnet
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: tempo-otlp-tailscale
+  namespace: monitoring
+  annotations:
+    tailscale.com/funnel: "false"
+    tailscale.com/proxy-group: "ingress"
+    tailscale.com/tags: "tag:k8s"
+    gethomepage.dev/enabled: "false"
+spec:
+  ingressClassName: tailscale
+  rules:
+    - http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: tempo
+                port:
+                  number: 4318
+  tls:
+    - hosts:
+        - tempo-otlp
diff --git a/argocd/manifests/tempo/ingress-tailscale.yaml b/argocd/manifests/tempo/ingress-tailscale.yaml
new file mode 100644
index 0000000..660d77a
--- /dev/null
+++ b/argocd/manifests/tempo/ingress-tailscale.yaml
@@ -0,0 +1,26 @@
+# Tailscale Ingress for Tempo query API
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name: tempo-tailscale
+  namespace: monitoring
+  annotations:
+    tailscale.com/funnel: "false"
+    tailscale.com/proxy-group: "ingress"
+    tailscale.com/tags: "tag:k8s"
+    gethomepage.dev/enabled: "false"
+spec:
+  ingressClassName: tailscale
+  rules:
+    - http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name: tempo
+                port:
+                  number: 3200
+  tls:
+    - hosts:
+        - tempo
diff --git a/argocd/manifests/tempo/kustomization.yaml b/argocd/manifests/tempo/kustomization.yaml
new file mode 100644
index 0000000..68a209c
--- /dev/null
+++ b/argocd/manifests/tempo/kustomization.yaml
@@ -0,0 +1,19 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+namespace: monitoring
+
+resources:
+  - statefulset.yaml
+  - service.yaml
+  - ingress-tailscale.yaml
+  - ingress-tailscale-otlp.yaml
+
+images:
+  - name: grafana/tempo
+    newTag: "2.10.1"
+
+configMapGenerator:
+  - name: tempo-config
+    files:
+      - tempo.yaml
diff --git a/argocd/manifests/tempo/service.yaml b/argocd/manifests/tempo/service.yaml
new file mode 100644
index 0000000..37b25df
--- /dev/null
+++ b/argocd/manifests/tempo/service.yaml
@@ -0,0 +1,22 @@
+apiVersion: v1
+kind: Service
+metadata:
+  name: tempo
+  namespace: monitoring
+spec:
+  selector:
+    app: tempo
+  ports:
+    - name: http
+      port: 3200
+      targetPort: 3200
+    - name: grpc
+      port: 9095
+      targetPort: 9095
+    - name: otlp-grpc
+      port: 4317
+      targetPort: 4317
+    - name: otlp-http
+      port: 4318
+      targetPort: 4318
+  type: ClusterIP
diff --git a/argocd/manifests/tempo/statefulset.yaml b/argocd/manifests/tempo/statefulset.yaml
new file mode 100644
index 0000000..7975347
--- /dev/null
+++ b/argocd/manifests/tempo/statefulset.yaml
@@ -0,0 +1,70 @@
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  name: tempo
+  namespace: monitoring
+spec:
+  serviceName: tempo
+  replicas: 1
+  selector:
+    matchLabels:
+      app: tempo
+  template:
+    metadata:
+      labels:
+        app: tempo
+    spec:
+      securityContext:
+        fsGroup: 10001
+        runAsNonRoot: true
+        runAsUser: 10001
+      containers:
+        - name: tempo
+          image: grafana/tempo
+          args:
+            - -config.file=/etc/tempo/tempo.yaml
+          ports:
+            - name: http
+              containerPort: 3200
+            - name: grpc
+              containerPort: 9095
+            - name: otlp-grpc
+              containerPort: 4317
+            - name: otlp-http
+              containerPort: 4318
+          volumeMounts:
+            - name: config
+              mountPath: /etc/tempo
+            - name: data
+              mountPath: /var/tempo
+          resources:
+            requests:
+              memory: "256Mi"
+              cpu: "100m"
+            limits:
+              memory: "1Gi"
+              cpu: "500m"
+          livenessProbe:
+            httpGet:
+              path: /ready
+              port: 3200
+            initialDelaySeconds: 45
+            periodSeconds: 10
+          readinessProbe:
+            httpGet:
+              path: /ready
+              port: 3200
+            initialDelaySeconds: 10
+            periodSeconds: 5
+      volumes:
+        - name: config
+          configMap:
+            name: tempo-config
+  volumeClaimTemplates:
+    - metadata:
+        name: data
+      spec:
+        accessModes: ["ReadWriteOnce"]
+        resources:
+          requests:
+            storage: 10Gi
diff --git a/argocd/manifests/tempo/tempo.yaml b/argocd/manifests/tempo/tempo.yaml
new file mode 100644
index 0000000..da26cbe
--- /dev/null
+++ b/argocd/manifests/tempo/tempo.yaml
@@ -0,0 +1,53 @@
+stream_over_http_enabled: true
+
+server:
+  http_listen_port: 3200
+  grpc_listen_port: 9095
+
+distributor:
+  receivers:
+    otlp:
+      protocols:
+        grpc:
+          endpoint: "0.0.0.0:4317"
+        http:
+          endpoint: "0.0.0.0:4318"
+
+storage:
+  trace:
+    backend: local
+    wal:
+      path: /var/tempo/wal
+    local:
+      path: /var/tempo/blocks
+
+compactor:
+  compaction:
+    block_retention: 168h # 7 days
+
+metrics_generator:
+  registry:
+    external_labels:
+      source: tempo
+  storage:
+    path: /var/tempo/generator/wal
+    remote_write:
+      - url: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
+        send_exemplars: true
+  processor:
+    span_metrics:
+      dimensions:
+        - service.name
+        - http.method
+        - http.status_code
+        - http.target
+    service_graphs:
+      dimensions:
+        - service.name
+
+overrides:
+  defaults:
+    metrics_generator:
+      processors:
+        - span-metrics
+        - service-graphs
-- 
2.50.1 (Apple Git-155)

From ba1c0645e3dcae61e07534f852b04f850bbcc2e8 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 5 Mar 2026 10:06:18 -0800
Subject: [PATCH 3/9] Add Tempo datasource to Grafana and Prometheus scrape target

Grafana: Tempo datasource with trace-to-log (Loki) and trace-to-metrics
(Prometheus) correlation. Loki gets derivedFields to link trace IDs back
to Tempo. Prometheus: scrape Tempo operational metrics on port 3200.

Co-Authored-By: Claude Opus 4.6
---
 argocd/manifests/grafana/datasources.yaml | 33 ++++++++++++++++++++++
 argocd/manifests/prometheus/prometheus.yml | 8 ++++++
 2 files changed, 41 insertions(+)

diff --git a/argocd/manifests/grafana/datasources.yaml b/argocd/manifests/grafana/datasources.yaml
index 864fcb1..5a3d0f3 100644
--- a/argocd/manifests/grafana/datasources.yaml
+++ b/argocd/manifests/grafana/datasources.yaml
@@ -15,6 +15,39 @@ datasources:
   type: loki
   uid: loki
   url: http://loki.monitoring.svc.cluster.local:3100
+  jsonData:
+    derivedFields:
+      - datasourceUid: tempo
+        matcherRegex: '"traceID":"(\w+)"'
+        name: TraceID
+        url: "$${__value.raw}"
+- access: proxy
+  editable: false
+  name: Tempo
+  orgId: 1
+  type: tempo
+  uid: tempo
+  url: http://tempo.monitoring.svc.cluster.local:3200
+  jsonData:
+    tracesToLogsV2:
+      datasourceUid: loki
+      filterByTraceID: true
+      filterBySpanID: false
+    tracesToMetrics:
+      datasourceUid: prometheus
+      spanStartTimeShift: "-1h"
+      spanEndTimeShift: "1h"
+      queries:
+        - name: Request rate
+          query: "sum(rate(traces_spanmetrics_calls_total{$$__tags}[5m]))"
+        - name: Error rate
+          query: "sum(rate(traces_spanmetrics_calls_total{$$__tags, status_code=\"STATUS_CODE_ERROR\"}[5m]))"
+        - name: Duration (p95)
+          query: "histogram_quantile(0.95, sum(rate(traces_spanmetrics_duration_seconds_bucket{$$__tags}[5m])) by (le))"
+    serviceMap:
+      datasourceUid: prometheus
+    nodeGraph:
+      enabled: true
 - access: proxy
   database: teslamate
   editable: false
diff --git a/argocd/manifests/prometheus/prometheus.yml b/argocd/manifests/prometheus/prometheus.yml
index 3197ca6..2fd3252 100644
--- a/argocd/manifests/prometheus/prometheus.yml
+++ b/argocd/manifests/prometheus/prometheus.yml
@@ -64,6 +64,14 @@ scrape_configs:
       - target_label: cluster
         replacement: indri
 
+  # Tempo operational metrics
+  - job_name: "tempo"
+    static_configs:
+      - targets: ["tempo.monitoring.svc.cluster.local:3200"]
+    metric_relabel_configs:
+      - target_label: cluster
+        replacement: indri
+
   # Frigate NVR metrics (via Caddy on indri — Frigate runs on ringtail)
   - job_name: "frigate"
     scheme: https
-- 
2.50.1 (Apple Git-155)

From 3512eb10b679950508a6f2126033ea38d50bfdc5 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 5 Mar 2026 10:08:17 -0800
Subject: [PATCH 4/9] Add Beyla eBPF tracing DaemonSet for ringtail

Deploys a privileged Alloy DaemonSet on ringtail's k3s that uses Beyla
eBPF to auto-instrument HTTP services (Frigate, ntfy, Ollama, Immich)
without code changes. Traces are exported via OTLP HTTP to Tempo on
indri. Separate from the existing unprivileged alloy-ringtail to
preserve least-privilege for metrics/logs collection.

Co-Authored-By: Claude Opus 4.6
---
 argocd/apps/alloy-tracing-ringtail.yaml | 17 ++++
 .../alloy-tracing-ringtail/config.alloy | 93 +++++++++++++++++++
 .../alloy-tracing-ringtail/daemonset.yaml | 56 +++++++++++
 .../alloy-tracing-ringtail/kustomization.yaml | 17 ++++
 .../alloy-tracing-ringtail/rbac.yaml | 30 ++++++
 5 files changed, 213 insertions(+)
 create mode 100644 argocd/apps/alloy-tracing-ringtail.yaml
 create mode 100644 argocd/manifests/alloy-tracing-ringtail/config.alloy
 create mode 100644 argocd/manifests/alloy-tracing-ringtail/daemonset.yaml
 create mode 100644 argocd/manifests/alloy-tracing-ringtail/kustomization.yaml
 create mode 100644 argocd/manifests/alloy-tracing-ringtail/rbac.yaml

diff --git a/argocd/apps/alloy-tracing-ringtail.yaml b/argocd/apps/alloy-tracing-ringtail.yaml
new file mode 100644
index 0000000..78d02e3
--- /dev/null
+++ b/argocd/apps/alloy-tracing-ringtail.yaml
@@ -0,0 +1,17 @@
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name: alloy-tracing-ringtail
+  namespace: argocd
+spec:
+  project: default
+  source:
+    repoURL: ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git
+    targetRevision: main
+    path: argocd/manifests/alloy-tracing-ringtail
+  destination:
+    server: https://ringtail.tail8d86e.ts.net:6443
+    namespace: alloy
+  syncPolicy:
+    syncOptions:
+      - CreateNamespace=true
diff --git a/argocd/manifests/alloy-tracing-ringtail/config.alloy b/argocd/manifests/alloy-tracing-ringtail/config.alloy
new file mode 100644
index 0000000..d3f0445
--- /dev/null
+++ b/argocd/manifests/alloy-tracing-ringtail/config.alloy
@@ -0,0 +1,93 @@
+// Alloy tracing configuration for ringtail
+// Uses Beyla eBPF to auto-instrument HTTP services and export traces to Tempo on indri
+
+// ============== BEYLA eBPF AUTO-INSTRUMENTATION ==============
+
+beyla.ebpf "http_services" {
+  discovery {
+    // Instrument HTTP services on common ports
+    instrument {
+      open_ports = "80-9999"
+    }
+
+    // Exclude infrastructure pods
+    exclude_instrument {
+      kubernetes {
+        namespace = "kube-system"
+      }
+    }
+    exclude_instrument {
+      kubernetes {
+        namespace = "tailscale"
+      }
+    }
+    exclude_instrument {
+      kubernetes {
+        pod_labels = { app = "alloy" }
+      }
+    }
+    exclude_instrument {
+      kubernetes {
+        pod_labels = { app = "alloy-tracing" }
+      }
+    }
+    exclude_instrument {
+      kubernetes {
+        pod_labels = { app = "kube-state-metrics" }
+      }
+    }
+    exclude_instrument {
+      kubernetes {
+        pod_labels = { "app.kubernetes.io/name" = "nvidia-device-plugin" }
+      }
+    }
+  }
+
+  attributes {
+    kubernetes {
+      enable = "true"
+      cluster_name = "ringtail"
+    }
+  }
+
+  traces {
+    instrumentations = ["http"]
+  }
+
+  output {
+    traces = [otelcol.processor.batch.default.input]
+  }
+}
+
+// ============== OTEL TRACE PIPELINE ==============
+
+// Batch traces before export
+otelcol.processor.batch "default" {
+  output {
+    traces = [otelcol.processor.attributes.add_cluster.input]
+  }
+}
+
+// Add cluster label to all spans
+otelcol.processor.attributes "add_cluster" {
+  action {
+    key = "cluster"
+    value = "ringtail"
+    action = "upsert"
+  }
+
+  output {
+    traces = [otelcol.exporter.otlphttp.tempo.input]
+  }
+}
+
+// Export traces to Tempo on indri via Tailscale
+otelcol.exporter.otlphttp "tempo" {
+  client {
+    endpoint = "https://tempo-otlp.tail8d86e.ts.net"
+
+    tls {
+      insecure_skip_verify = true
+    }
+  }
+}
diff --git a/argocd/manifests/alloy-tracing-ringtail/daemonset.yaml b/argocd/manifests/alloy-tracing-ringtail/daemonset.yaml
new file mode 100644
index 0000000..93c7d3e
--- /dev/null
+++ b/argocd/manifests/alloy-tracing-ringtail/daemonset.yaml
@@ -0,0 +1,56 @@
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: alloy-tracing
+  namespace: alloy
+  labels:
+    app: alloy-tracing
+spec:
+  selector:
+    matchLabels:
+      app: alloy-tracing
+  template:
+    metadata:
+      labels:
+        app: alloy-tracing
+    spec:
+      serviceAccountName: alloy-tracing
+      hostPID: true
+      containers:
+        - name: alloy
+          image: grafana/alloy
+          args:
+            - run
+            - --server.http.listen-addr=0.0.0.0:12346
+            - --storage.path=/var/lib/alloy/data
+            - /etc/alloy/config.alloy
+          ports:
+            - containerPort: 12346
+              name: http
+          env:
+            - name: HOSTNAME
+              valueFrom:
+                fieldRef:
+                  fieldPath: spec.nodeName
+          resources:
+            requests:
+              cpu: 100m
+              memory: 256Mi
+            limits:
+              cpu: "1"
+              memory: 1Gi
+          volumeMounts:
+            - name: config
+              mountPath: /etc/alloy
+            - name: data
+              mountPath: /var/lib/alloy/data
+          securityContext:
+            privileged: true
+      tolerations:
+        - operator: Exists
+      volumes:
+        - name: config
+          configMap:
+            name: alloy-tracing-config
+        - name: data
+          emptyDir: {}
diff --git a/argocd/manifests/alloy-tracing-ringtail/kustomization.yaml b/argocd/manifests/alloy-tracing-ringtail/kustomization.yaml
new file mode 100644
index 0000000..6956f14
--- /dev/null
+++ b/argocd/manifests/alloy-tracing-ringtail/kustomization.yaml
@@ -0,0 +1,17 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+namespace: alloy
+
+resources:
+  - rbac.yaml
+  - daemonset.yaml
+
+images:
+  - name: grafana/alloy
+    newTag: v1.13.1
+
+configMapGenerator:
+  - name: alloy-tracing-config
+    files:
+      - config.alloy
diff --git a/argocd/manifests/alloy-tracing-ringtail/rbac.yaml b/argocd/manifests/alloy-tracing-ringtail/rbac.yaml
new file mode 100644
index 0000000..656c5f6
--- /dev/null
+++ b/argocd/manifests/alloy-tracing-ringtail/rbac.yaml
@@ -0,0 +1,30 @@
+apiVersion: v1
+kind: ServiceAccount
+metadata:
+  name: alloy-tracing
+  namespace: alloy
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRole
+metadata:
+  name: alloy-tracing
+rules:
+  - apiGroups: [""]
+    resources: ["pods", "services", "endpoints", "nodes", "namespaces"]
+    verbs: ["get", "list", "watch"]
+  - apiGroups: ["apps"]
+    resources: ["deployments", "replicasets", "statefulsets", "daemonsets"]
+    verbs: ["get", "list", "watch"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: alloy-tracing
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: alloy-tracing
+subjects:
+  - kind: ServiceAccount
+    name: alloy-tracing
+    namespace: alloy
-- 
2.50.1 (Apple Git-155)

From 309119f7a46fdf5c1e7bf9c579c68cde95022b42 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 5 Mar 2026 10:09:09 -0800
Subject: [PATCH 5/9] Add Tempo and alloy-tracing-ringtail to service registry and docs

Updates service-versions.yaml, Grafana datasources table, ArgoCD apps
registry, and Tempo image version to 2.10.1.

Co-Authored-By: Claude Opus 4.6
---
 docs/reference/kubernetes/apps.md | 2 ++
 docs/reference/services/grafana.md | 2 ++
 docs/reference/services/tempo.md | 2 +-
 service-versions.yaml | 13 +++++++++++++
 4 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/docs/reference/kubernetes/apps.md b/docs/reference/kubernetes/apps.md
index 919907e..f584d5a 100644
--- a/docs/reference/kubernetes/apps.md
+++ b/docs/reference/kubernetes/apps.md
@@ -27,7 +27,9 @@ Registry of all applications deployed via [[argocd]].
 | `grafana` | monitoring | Helm chart (forge mirror) | [[grafana]] |
 | `grafana-config` | monitoring | `argocd/manifests/grafana-config/` | [[grafana]] |
 | `immich` | immich | Helm chart | [[immich]] |
+| `tempo` | monitoring | `argocd/manifests/tempo/` | [[tempo]] |
 | `alloy-k8s` | alloy | `argocd/manifests/alloy-k8s/` | [[alloy|Alloy]] |
+| `alloy-tracing-ringtail` | alloy | `argocd/manifests/alloy-tracing-ringtail/` | [[alloy|Alloy]] (eBPF tracing) |
 | `kube-state-metrics` | monitoring | `argocd/manifests/kube-state-metrics/` | K8s metrics |
 | `miniflux` | miniflux | `argocd/manifests/miniflux/` | [[miniflux]] |
 | `kiwix` | kiwix | `argocd/manifests/kiwix/` | [[kiwix]] |
diff --git a/docs/reference/services/grafana.md b/docs/reference/services/grafana.md
index 3cd5ff1..0c62515 100644
--- a/docs/reference/services/grafana.md
+++ b/docs/reference/services/grafana.md
@@ -36,6 +36,7 @@ The OIDC client secret is injected via [[external-secrets]] (`grafana-authentik-
 |------|------|--------|
 | Prometheus | prometheus | `prometheus.monitoring.svc.cluster.local:9090` |
 | Loki | loki | `loki.monitoring.svc.cluster.local:3100` |
+| Tempo | tempo | `tempo.monitoring.svc.cluster.local:3200` |
 | TeslaMate | postgres | `blumeops-pg-rw.databases.svc.cluster.local:5432` |
 
 ## Dashboard Provisioning
@@ -64,4 +65,5 @@ Optional annotation: `grafana_folder: "FolderName"`
 - [[authentik]] - OIDC identity provider for SSO
 - [[prometheus]] - Metrics datasource
 - [[loki]] - Logs datasource
+- [[tempo]] - Traces datasource
 - [[alloy|Alloy]] - Data collector
diff --git a/docs/reference/services/tempo.md b/docs/reference/services/tempo.md
index 3aea029..0050714 100644
--- a/docs/reference/services/tempo.md
+++ b/docs/reference/services/tempo.md
@@ -18,7 +18,7 @@ Distributed tracing backend for BlumeOps infrastructure. Receives traces via OTL
 | **Tailscale URL** | https://tempo.tail8d86e.ts.net |
 | **OTLP Endpoint** | https://tempo-otlp.tail8d86e.ts.net |
 | **Namespace** | `monitoring` |
-| **Image** | `grafana/tempo:2.7.2` |
+| **Image** | `grafana/tempo:2.10.1` |
 | **Storage** | 10Gi PVC (local filesystem) |
 | **Retention** | 7 days |
 
diff --git a/service-versions.yaml b/service-versions.yaml
index 280fb11..3f7d3fe 100644
--- a/service-versions.yaml
+++ b/service-versions.yaml
@@ -68,6 +68,19 @@ services:
     current-version: "v0.5.4"
     upstream-source: https://github.com/0x2142/frigate-notify/releases
 
+  - name: tempo
+    type: argocd
+    last-reviewed: 2026-03-05
+    current-version: "2.10.1"
+    upstream-source: https://github.com/grafana/tempo/releases
+
+  - name: alloy-tracing-ringtail
+    type: argocd
+    last-reviewed: 2026-03-05
+    current-version: "v1.13.1"
+    upstream-source: https://github.com/grafana/alloy/releases
+    notes: Privileged DaemonSet with Beyla eBPF for HTTP tracing on ringtail
+
   - name: alloy-k8s
     type: argocd
     last-reviewed: 2026-02-16
-- 
2.50.1 (Apple Git-155)

From 3eb1bb70a0a821e17b8fd582eb38a43adc60abfa Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 5 Mar 2026 10:16:58 -0800
Subject: [PATCH 6/9] Document Tempo storage monitoring query

Add PromQL query for checking Tempo storage utilization against PVC
capacity using tempodb_backend_bytes_total.

Co-Authored-By: Claude Opus 4.6
---
 docs/reference/services/tempo.md | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/docs/reference/services/tempo.md b/docs/reference/services/tempo.md
index 0050714..567168a 100644
--- a/docs/reference/services/tempo.md
+++ b/docs/reference/services/tempo.md
@@ -46,6 +46,16 @@ Beyla auto-instruments HTTP services via eBPF kernel hooks — no code changes n
 **Future: SDK instrumentation**
 Services with OTel SDK support (e.g., Hermes) can send traces directly to the OTLP endpoint for deeper internal spans (DB queries, business logic) alongside eBPF envelope traces.
 
+## Storage Monitoring
+
+Tempo exposes `tempodb_backend_bytes_total` via its `/metrics` endpoint (scraped by [[prometheus]]). To check storage utilization against the 10Gi PVC:
+
+```promql
+tempodb_backend_bytes_total / 10737418240 * 100
+```
+
+Full PVC-level monitoring (via kubelet volume stats) is not yet available — see backlog.
+
 ## Grafana Integration
 
 - **Tempo datasource** with trace-to-log and trace-to-metrics correlation
-- 
2.50.1 (Apple Git-155)

From dfef1272332dc90da1eda20b6b916b9ca9170c69 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 5 Mar 2026 10:19:03 -0800
Subject: [PATCH 7/9] Add Tempo health dashboard to Grafana

Panels: Storage Used, PVC Utilization (% of 10Gi), Total Blocks, Heap
Usage, Storage Over Time, Span Ingestion Rate, Ingestion Throughput, and
Query Latency (p50/p95).

Co-Authored-By: Claude Opus 4.6
---
 .../dashboards/configmap-tempo.yaml | 491 ++++++++++++++++++
 .../grafana-config/kustomization.yaml | 1 +
 2 files changed, 492 insertions(+)
 create mode 100644 argocd/manifests/grafana-config/dashboards/configmap-tempo.yaml

diff --git a/argocd/manifests/grafana-config/dashboards/configmap-tempo.yaml b/argocd/manifests/grafana-config/dashboards/configmap-tempo.yaml
new file mode 100644
index 0000000..f7d5629
--- /dev/null
+++ b/argocd/manifests/grafana-config/dashboards/configmap-tempo.yaml
@@ -0,0 +1,491 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: grafana-dashboard-tempo
+  namespace: monitoring
+  labels:
+    grafana_dashboard: "1"
+data:
+  tempo.json: |
+    {
+      "annotations": {
+        "list": []
+      },
+      "editable": true,
+      "fiscalYearStartMonth": 0,
+      "graphTooltip": 0,
+      "id": null,
+      "links": [],
+      "panels": [
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  { "color": "green", "value": null },
+                  { "color": "yellow", "value": 5368709120 },
+                  { "color": "red", "value": 8589934592 }
+                ]
+              },
+              "unit": "bytes"
+            },
+            "overrides": []
+          },
+          "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
+          "id": 1,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "none",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": {
+              "calcs": ["lastNotNull"],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "auto"
+          },
+          "pluginVersion": "10.0.0",
+          "targets": [
+            {
+              "datasource": { "type": "prometheus", "uid": "prometheus" },
+              "expr": "sum(tempodb_backend_bytes_total{job=\"tempo\"})",
+              "refId": "A"
+            }
+          ],
+          "title": "Storage Used",
+          "type": "stat"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  { "color": "green", "value": null },
+                  { "color": "yellow", "value": 50 },
+                  { "color": "red", "value": 80 }
+                ]
+              },
+              "unit": "percent",
+              "max": 100,
+              "min": 0
+            },
+            "overrides": []
+          },
+          "gridPos": { "h": 4, "w": 6, "x": 6, "y": 0 },
+          "id": 2,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "none",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": {
+              "calcs": ["lastNotNull"],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "auto"
+          },
+          "pluginVersion": "10.0.0",
+          "targets": [
+            {
+              "datasource": { "type": "prometheus", "uid": "prometheus" },
+              "expr": "sum(tempodb_backend_bytes_total{job=\"tempo\"}) / 10737418240 * 100",
+              "refId": "A"
+            }
+          ],
+          "title": "PVC Utilization (of 10Gi)",
+          "type": "stat"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [{ "color": "green", "value": null }]
+              },
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": { "h": 4, "w": 6, "x": 12, "y": 0 },
+          "id": 3,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "none",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": {
+              "calcs": ["lastNotNull"],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "auto"
+          },
+          "pluginVersion": "10.0.0",
+          "targets": [
+            {
+              "datasource": { "type": "prometheus", "uid": "prometheus" },
+              "expr": "sum(tempodb_blocklist_length{job=\"tempo\"})",
+              "refId": "A"
+            }
+          ],
+          "title": "Total Blocks",
+          "type": "stat"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "thresholds"
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [
+                  { "color": "green", "value": null },
+                  { "color": "yellow", "value": 0.5 },
+                  { "color": "red", "value": 0.9 }
+                ]
+              },
+              "unit": "percentunit"
+            },
+            "overrides": []
+          },
+          "gridPos": { "h": 4, "w": 6, "x": 18, "y": 0 },
+          "id": 4,
+          "options": {
+            "colorMode": "value",
+            "graphMode": "none",
+            "justifyMode": "auto",
+            "orientation": "auto",
+            "reduceOptions": {
+              "calcs": ["lastNotNull"],
+              "fields": "",
+              "values": false
+            },
+            "textMode": "auto"
+          },
+          "pluginVersion": "10.0.0",
+          "targets": [
+            {
+              "datasource": { "type": "prometheus", "uid": "prometheus" },
+              "expr": "1 - (go_memstats_heap_idle_bytes{job=\"tempo\"} / go_memstats_heap_sys_bytes{job=\"tempo\"})",
+              "refId": "A"
+            }
+          ],
+          "title": "Heap Usage",
+          "type": "stat"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "palette-classic"
+              },
+              "custom": {
+                "axisBorderShow": false,
+                "axisCenteredZero": false,
+                "axisColorMode": "text",
+                "axisLabel": "",
+                "axisPlacement": "auto",
+                "barAlignment": 0,
+                "drawStyle": "line",
+                "fillOpacity": 10,
+                "gradientMode": "none",
+                "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+                "insertNulls": false,
+                "lineInterpolation": "linear",
+                "lineWidth": 1,
+                "pointSize": 5,
+                "scaleDistribution": { "type": "linear" },
+                "showPoints": "never",
+                "spanNulls": false,
+                "stacking": { "group": "A", "mode": "none" },
+                "thresholdsStyle": { "mode": "off" }
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [{ "color": "green", "value": null }]
+              },
+              "unit": "bytes"
+            },
+            "overrides": []
+          },
+          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
+          "id": 5,
+          "options": {
+            "legend": {
+              "calcs": ["lastNotNull"],
+              "displayMode": "table",
+              "placement": "bottom",
+              "showLegend": true
+            },
+            "tooltip": { "mode": "multi", "sort": "desc" }
+          },
+          "pluginVersion": "10.0.0",
+          "targets": [
+            {
+              "datasource": { "type": "prometheus", "uid": "prometheus" },
+              "expr": "sum(tempodb_backend_bytes_total{job=\"tempo\"})",
+              "legendFormat": "Backend Storage",
+              "refId": "A"
+            },
+            {
+              "datasource": { "type": "prometheus", "uid": "prometheus" },
+              "expr": "go_memstats_heap_inuse_bytes{job=\"tempo\"}",
+              "legendFormat": "Heap In Use",
+              "refId": "B"
+            }
+          ],
+          "title": "Storage Over Time",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "palette-classic"
+              },
+              "custom": {
+                "axisBorderShow": false,
+                "axisCenteredZero": false,
+                "axisColorMode": "text",
+                "axisLabel": "",
+                "axisPlacement": "auto",
+                "barAlignment": 0,
+                "drawStyle": "line",
+                "fillOpacity": 10,
+                "gradientMode": "none",
+                "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+                "insertNulls": false,
+                "lineInterpolation": "linear",
+                "lineWidth": 1,
+                "pointSize": 5,
+                "scaleDistribution": { "type": "linear" },
+                "showPoints": "never",
+                "spanNulls": false,
+                "stacking": { "group": "A", "mode": "none" },
+                "thresholdsStyle": { "mode": "off" }
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [{ "color": "green", "value": null }]
+              },
+              "unit": "short"
+            },
+            "overrides": []
+          },
+          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
+          "id": 6,
+          "options": {
+            "legend": {
+              "calcs": ["mean", "max"],
+              "displayMode": "table",
+              "placement": "bottom",
+              "showLegend": true
+            },
+            "tooltip": { "mode": "multi", "sort": "desc" }
+          },
+          "pluginVersion": "10.0.0",
+          "targets": [
+            {
+              "datasource": { "type": "prometheus", "uid": "prometheus" },
+              "expr": "rate(tempo_distributor_spans_received_total{job=\"tempo\"}[5m])",
+              "legendFormat": "Spans/sec",
+              "refId": "A"
+            }
+          ],
+          "title": "Span Ingestion Rate",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "palette-classic"
+              },
+              "custom": {
+                "axisBorderShow": false,
+                "axisCenteredZero": false,
+                "axisColorMode": "text",
+                "axisLabel": "",
+                "axisPlacement": "auto",
+                "barAlignment": 0,
+                "drawStyle": "line",
+                "fillOpacity": 10,
+                "gradientMode": "none",
+                "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+                "insertNulls": false,
+                "lineInterpolation": "linear",
+                "lineWidth": 1,
+                "pointSize": 5,
+                "scaleDistribution": { "type": "linear" },
+                "showPoints": "never",
+                "spanNulls": false,
+                "stacking": { "group": "A", "mode": "none" },
+                "thresholdsStyle": { "mode": "off" }
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [{ "color": "green", "value": null }]
+              },
+              "unit": "Bps"
+            },
+            "overrides": []
+          },
+          "gridPos": { "h": 8, "w": 12, "x": 0, "y": 12 },
+          "id": 7,
+          "options": {
+            "legend": {
+              "calcs": ["mean", "max"],
+              "displayMode": "table",
+              "placement": "bottom",
+              "showLegend": true
+            },
+            "tooltip": { "mode": "multi", "sort": "desc" }
+          },
+          "pluginVersion": "10.0.0",
+          "targets": [
+            {
+              "datasource": { "type": "prometheus", "uid": "prometheus" },
+              "expr": "rate(tempo_distributor_bytes_received_total{job=\"tempo\"}[5m])",
+              "legendFormat": "Bytes Received",
+              "refId": "A"
+            }
+          ],
+          "title": "Ingestion Throughput",
+          "type": "timeseries"
+        },
+        {
+          "datasource": {
+            "type": "prometheus",
+            "uid": "prometheus"
+          },
+          "fieldConfig": {
+            "defaults": {
+              "color": {
+                "mode": "palette-classic"
+              },
+              "custom": {
+                "axisBorderShow": false,
+                "axisCenteredZero": false,
+                "axisColorMode": "text",
+                "axisLabel": "",
+                "axisPlacement": "auto",
+                "barAlignment": 0,
+                "drawStyle": "line",
+                "fillOpacity": 10,
+                "gradientMode": "none",
+                "hideFrom": { "legend": false, "tooltip": false, "viz": false },
+                "insertNulls": false,
+                "lineInterpolation": "linear",
+                "lineWidth": 1,
+                "pointSize": 5,
+                "scaleDistribution": { "type": "linear" },
+                "showPoints": "never",
+                "spanNulls": false,
+                "stacking": { "group": "A", "mode": "none" },
+                "thresholdsStyle": { "mode": "off" }
+              },
+              "mappings": [],
+              "thresholds": {
+                "mode": "absolute",
+                "steps": [{ "color": "green", "value": null }]
+              },
+              "unit": "s"
+            },
+            "overrides": []
+          },
+          "gridPos": { "h": 8, "w": 12, "x": 12, "y": 12 },
+          "id": 8,
+          "options": {
+            "legend": {
+              "calcs": ["mean", "max"],
+              "displayMode": "table",
+              "placement": "bottom",
+              "showLegend": true
+            },
+            "tooltip": { "mode": "multi", "sort": "desc" }
+          },
+          "pluginVersion": "10.0.0",
+          "targets": [
+            {
+              "datasource": { "type": "prometheus", "uid": "prometheus" },
+              "expr": "histogram_quantile(0.95, sum(rate(tempo_query_frontend_result_metrics_duration_seconds_bucket{job=\"tempo\"}[5m])) by (le))",
+              "legendFormat": "p95",
+              "refId": "A"
+            },
+            {
+              "datasource": { "type": "prometheus", "uid": "prometheus" },
+              "expr": "histogram_quantile(0.50, sum(rate(tempo_query_frontend_result_metrics_duration_seconds_bucket{job=\"tempo\"}[5m])) by (le))",
+              "legendFormat": "p50",
+              "refId": "B"
+            }
+          ],
+          "title": "Query Latency",
+          "type": "timeseries"
+        }
+      ],
+      "refresh": "1m",
+      "schemaVersion": 38,
+      "tags": ["tempo", "tracing"],
+      "templating": {
+        "list": []
+      },
+      "time": {
+        "from": "now-24h",
+        "to": "now"
+      },
+      "timepicker": {},
+      "timezone": "",
+      "title": "Tempo",
+      "uid": "tempo-homelab",
+      "version": 1,
+      "weekStart": ""
+    }
diff --git
a/argocd/manifests/grafana-config/kustomization.yaml b/argocd/manifests/grafana-config/kustomization.yaml
index e565bf7..45a8380 100644
--- a/argocd/manifests/grafana-config/kustomization.yaml
+++ b/argocd/manifests/grafana-config/kustomization.yaml
@@ -25,6 +25,7 @@ resources:
   - dashboards/configmap-flyio.yaml
   - dashboards/configmap-sifaka-disks.yaml
   - dashboards/configmap-forgejo.yaml
+  - dashboards/configmap-tempo.yaml
   # TeslaMate dashboards
   - dashboards/configmap-teslamate-overview.yaml
   - dashboards/configmap-teslamate-charges.yaml
-- 
2.50.1 (Apple Git-155)

From 7b8c50a1077e150a3948ac4c7bdda388e3d3474b Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 5 Mar 2026 10:27:00 -0800
Subject: [PATCH 8/9] Enable local-blocks processor for Grafana Traces Drilldown

Required by Grafana's TraceQL metrics queries. Keeps recent traces in
memory for query-time aggregation without duplicating data to storage.

Co-Authored-By: Claude Opus 4.6
---
 argocd/manifests/tempo/tempo.yaml | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/argocd/manifests/tempo/tempo.yaml b/argocd/manifests/tempo/tempo.yaml
index da26cbe..0a8c2c7 100644
--- a/argocd/manifests/tempo/tempo.yaml
+++ b/argocd/manifests/tempo/tempo.yaml
@@ -44,6 +44,8 @@ metrics_generator:
     service_graphs:
       dimensions:
         - service.name
+    local_blocks:
+      flush_to_storage: false
 
 overrides:
   defaults:
@@ -51,3 +53,4 @@ overrides:
       processors:
         - span-metrics
         - service-graphs
+        - local-blocks
-- 
2.50.1 (Apple Git-155)

From 5bc8b7ed8c4232219a968f271d9e4841101bd6e8 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 5 Mar 2026 10:42:43 -0800
Subject: [PATCH 9/9] Fix local-blocks processor: add traces_storage path

The local-blocks processor requires its own dedicated traces WAL
(traces_storage.path), separate from the ingester WAL and the metrics
generator WAL. Without it, the processor fails with "local blocks
processor requires traces wal".

Co-Authored-By: Claude Opus 4.6
---
 argocd/manifests/tempo/tempo.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/argocd/manifests/tempo/tempo.yaml b/argocd/manifests/tempo/tempo.yaml
index 0a8c2c7..c145d59 100644
--- a/argocd/manifests/tempo/tempo.yaml
+++ b/argocd/manifests/tempo/tempo.yaml
@@ -34,6 +34,8 @@ metrics_generator:
     remote_write:
       - url: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
         send_exemplars: true
+  traces_storage:
+    path: /var/tempo/generator/traces
   processor:
     span_metrics:
       dimensions:
-- 
2.50.1 (Apple Git-155)
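
Note: taken together, patches 8/9 and 9/9 should leave the metrics-generator portion of tempo.yaml looking roughly like the sketch below. This is a reconstruction from the hunks above, not a copy of the file. The indentation, the `storage:` parent of `remote_write`, and the `metrics_generator:` level under `overrides.defaults` are assumptions based on upstream Tempo's config layout, and settings not visible in the hunks are left out.

```yaml
# Sketch only: merged result of patches 8/9 and 9/9, reconstructed from the hunks.
metrics_generator:
  storage:                  # assumed parent key; not visible in the hunks
    remote_write:
      - url: http://prometheus.monitoring.svc.cluster.local:9090/api/v1/write
        send_exemplars: true
  traces_storage:           # added in 9/9: dedicated traces WAL for local-blocks
    path: /var/tempo/generator/traces
  processor:
    # span_metrics settings are unchanged by these patches and omitted here
    service_graphs:
      dimensions:
        - service.name
    local_blocks:           # added in 8/9
      flush_to_storage: false

overrides:
  defaults:
    metrics_generator:      # assumed nesting; only "processors:" appears in the hunk
      processors:
        - span-metrics
        - service-graphs
        - local-blocks      # added in 8/9
```

The ordering matters for anyone applying these by hand: enabling `local-blocks` under `processors` without `traces_storage.path` reproduces the "local blocks processor requires traces wal" failure that patch 9/9 fixes.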