
title: Observability
modified: 2026-03-26
tags: operations

Observability

Metrics, logs, traces, and dashboards for BlumeOps infrastructure.

Components

  • prometheus - Metrics storage and querying
  • loki - Log aggregation
  • tempo - Distributed tracing
  • alloy - Metrics, log, and trace collection
  • grafana - Dashboards and visualization

Future: Continuous Profiling (Pyroscope)

A full implementation exists on branch preserve/pyroscope-profiling/pr-313 (PR #313, closed). It includes a Pyroscope server (StatefulSet on ringtail), an Alloy profiling DaemonSet (pyroscope.ebpf), a Grafana datasource with traces-to-profiles linking, a Nix container build with an embedded frontend, and documentation.

Deployment is blocked on ringtail's kernel sysctl settings. The pyroscope.ebpf Alloy component requires:

  • kernel.kptr_restrict = 0 (currently 1 — kallsyms addresses are zeroed)
  • kernel.perf_event_paranoid ≤ 1 (currently 2 — eBPF perf events restricted)

These must be set in ringtail's NixOS configuration (boot.kernel.sysctl). Once applied, the branch can be rebased onto main and deployed.
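
To confirm whether a host meets these preconditions before rebasing, a check along the following lines could be run on ringtail. This is a minimal sketch, not part of the preserved branch: the names `check_sysctls` and `read_sysctl` are hypothetical, and it assumes a Linux host with Python 3.9+ where the values are readable under /proc/sys/kernel.

```python
from pathlib import Path

# Standard Linux location for kernel.* sysctls.
PROC_SYS_KERNEL = Path("/proc/sys/kernel")


def check_sysctls(kptr_restrict: int, perf_event_paranoid: int) -> list[str]:
    """Return a list of unmet requirements; an empty list means ready.

    Thresholds are the ones this page lists for pyroscope.ebpf:
    kptr_restrict must be 0 and perf_event_paranoid must be <= 1.
    """
    problems = []
    if kptr_restrict != 0:
        problems.append(
            f"kernel.kptr_restrict is {kptr_restrict}, need 0"
        )
    if perf_event_paranoid > 1:
        problems.append(
            f"kernel.perf_event_paranoid is {perf_event_paranoid}, need <= 1"
        )
    return problems


def read_sysctl(name: str) -> int:
    """Read one integer sysctl from /proc/sys/kernel (hypothetical helper)."""
    return int((PROC_SYS_KERNEL / name).read_text().strip())


if __name__ == "__main__":
    problems = check_sysctls(
        read_sysctl("kptr_restrict"),
        read_sysctl("perf_event_paranoid"),
    )
    for problem in problems:
        print("BLOCKED:", problem)
    if not problems:
        print("OK: sysctl preconditions for pyroscope.ebpf are met")
```

With the values currently reported for ringtail (1 and 2), `check_sysctls` returns two entries. After the NixOS change is applied it should return an empty list.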

Future: Frontend Monitoring (RUM)

Grafana Faro is a Real User Monitoring SDK that captures page loads, web vitals, errors, and network timings from the browser, feeding into Loki (logs) and Tempo (traces) via Alloy's faro.receiver component. This would add an "outside-in" view of service health from the user's perspective.

Faro is not currently deployed. RUM captures browsing behavior from visitors to public services, which creates a data retention liability, so the collected data would need careful sanitization before any deployment.

Alerting