Add Fly.io proxy observability via embedded Alloy #123

Merged
eblume merged 7 commits from feature/flyio-observability into main 2026-02-08 10:05:38 -08:00
Owner

Summary

  • Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
  • Add nginx stub_status endpoint for connection-level metrics (active/reading/writing/waiting)
  • Create two Grafana dashboards: Docs APM (per-service view filtered by host="docs.eblu.me") and Fly.io Proxy Health (aggregate proxy health across all upstream services)

Changed Files

| File | Change |
|------|--------|
| fly/nginx.conf | Add JSON log_format + access_log, add stub_status endpoint |
| fly/Dockerfile | COPY Alloy binary from grafana/alloy:v1.5.1, COPY alloy.river config |
| fly/alloy.river | New: Alloy config for log tailing, metric extraction, remote_write |
| fly/start.sh | Start Alloy after Tailscale, before nginx |
| argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml | New: Docs APM dashboard |
| argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml | New: Fly.io Proxy Health dashboard |
| argocd/manifests/grafana-config/kustomization.yaml | Register new dashboard configmaps |
| docs/reference/services/flyio-proxy.md | Document observability setup |

Deployment and Testing

  • mise run fly-deploy — rebuild container with Alloy
  • curl https://docs.eblu.me/ — generate traffic
  • fly logs -a blumeops-proxy — verify Alloy startup
  • Query Prometheus: flyio_nginx_http_requests_total{instance="flyio-proxy"}
  • Query Loki: {instance="flyio-proxy", job="flyio-nginx"}
  • argocd app sync grafana-config — deploy dashboards
  • Verify dashboards show data in Grafana
  • mise run services-check — no regressions
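
The two query steps in the checklist can also be run from a terminal rather than the Grafana UI. The following is a minimal sketch, not part of the PR: it assumes the standard Prometheus and Loki HTTP query APIs are reachable, and the base URLs and the helper names `prom_query` and `loki_query` are placeholders introduced here for illustration.

```shell
# Assumption: these base URLs are placeholders. Export PROM_URL and LOKI_URL
# with the real endpoints before running.
PROM_URL="${PROM_URL:-http://localhost:9090}"
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Hypothetical helper: instant PromQL query via the Prometheus HTTP API.
prom_query() {
  curl -fsS -G "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Hypothetical helper: LogQL range query via the Loki HTTP API.
# With no start/end given, Loki defaults to the last hour.
loki_query() {
  curl -fsS -G "$LOKI_URL/loki/api/v1/query_range" --data-urlencode "query=$1"
}
```

With those defined, `prom_query 'flyio_nginx_http_requests_total{instance="flyio-proxy"}'` and `loki_query '{instance="flyio-proxy", job="flyio-nginx"}'` issue the same selectors listed above. A non-empty `result` array in the JSON response suggests data is flowing.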
Instrument the flyio-proxy container with Grafana Alloy to collect
nginx JSON access logs (→ Loki) and derive request/latency/cache
metrics (→ Prometheus). Adds stub_status for connection-level metrics.
Includes two Grafana dashboards: Docs APM (per-service) and Fly.io
Proxy Health (aggregate).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
fly/alloy.river Outdated
@@ -0,0 +111,4 @@
url = "https://loki.tail8d86e.ts.net/loki/api/v1/push"
tls_config {
insecure_skip_verify = true
Author
Owner

Is this needed? Can we route this through Caddy instead and keep TLS verification on? The threat model here, I suppose, is accidentally shipping our access logs to a MITM attacker, which is arguably not a very large risk. If this is a huge lift (e.g. if Loki is not yet proxied via Caddy), then just tell me and I will make it a follow-on project.

eblume marked this conversation as resolved
fly/alloy.river Outdated
@@ -0,0 +141,4 @@
url = "https://prometheus.tail8d86e.ts.net/api/v1/write"
tls_config {
insecure_skip_verify = true
Author
Owner

ditto here as above

eblume marked this conversation as resolved
@@ -10,1 +10,4 @@
# JSON access log for Alloy to tail Loki + metric extraction
log_format json_log escape=json
'{'
Author
Owner

this quoting is uuuuugly, can it be avoided? no big deal if it can't, I just hate the nested quoted json, although if the alternative is backslash-city then this is fine

eblume marked this conversation as resolved
Switch Alloy endpoints from *.tail8d86e.ts.net (with insecure_skip_verify)
to *.ops.eblu.me via Caddy reverse proxy with valid TLS certificates.
Add tag:homelab:443 to flyio-proxy ACL grant so the proxy can reach Caddy.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The new ACL grant lets the Fly.io proxy reach all Caddy-proxied
services, not just Loki/Prometheus. Document the expanded attack
surface and trust boundary (requires RCE on gilbert or 1Password
access) in both the flyio-proxy and caddy reference cards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The grafana/alloy image is Ubuntu-based (glibc), but our container
uses nginx:alpine (musl). The binary exists but fails with "not found"
because the glibc dynamic linker is missing. libc6-compat provides
the compatibility shim.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
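
The "not found" failure described in this commit can be confirmed by checking which dynamic loader the binary requests. The following is a rough sketch, not taken from the PR: it assumes `readelf` (from binutils) is available in the image, and the helper names and the example binary path are hypothetical.

```shell
# Hypothetical helper: read `readelf -l` output on stdin and print the
# dynamic loader (ELF interpreter) path that the binary requests.
elf_interpreter() {
  sed -n 's/.*Requesting program interpreter: \(.*\)\]$/\1/p'
}

# Hypothetical check: report whether the requested loader exists here.
# Example (the path is an assumption): check_loader /usr/local/bin/alloy
check_loader() {
  loader=$(readelf -l "$1" | elf_interpreter)
  if [ -z "$loader" ]; then
    echo "no interpreter requested (possibly a static binary)"
  elif [ -e "$loader" ]; then
    echo "loader present: $loader"
  else
    echo "loader missing: $loader"
  fi
}
```

On a musl-based image without a compatibility package, a glibc-linked binary would be expected to report a missing loader, which matches the symptom the commit message describes.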
Alloy has no built-in prometheus.exporter.nginx component. Remove
the stub_status scraping and connection panels from the Fly.io
dashboard. Replace with error rate and cache hit ratio stats.
All key signals are still covered by log-derived metrics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Alloy's stage.metrics prefixes all metric names with
loki_process_custom_. Add a relabel rule to strip the prefix so
dashboards can query clean names (flyio_nginx_http_requests_total
etc). Also drop component_id/component_path/filename labels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
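
The effect of the prefix-stripping rule from this commit can be illustrated outside Alloy. This is a minimal sketch under an assumption: it treats the rule as an anchored regex replace on the metric name, which may differ from the exact rule in fly/alloy.river. The function name `strip_metric_prefix` is hypothetical.

```shell
# Hypothetical helper: mimic a relabel rule that rewrites the metric name by
# dropping the loki_process_custom_ prefix. Names without the prefix pass
# through unchanged, as a non-matching anchored regex would leave them.
strip_metric_prefix() {
  printf '%s\n' "$1" | sed -E 's/^loki_process_custom_(.+)$/\1/'
}

strip_metric_prefix "loki_process_custom_flyio_nginx_http_requests_total"
strip_metric_prefix "up"
```

The first call prints the clean name the dashboards query, `flyio_nginx_http_requests_total`, and the second leaves `up` untouched.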
Add Fly.io proxy as a third Alloy deployment, document the new
remote_write source in Prometheus, new log source in Loki, and
two new dashboards in Grafana.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eblume merged commit cc54b4f565 into main 2026-02-08 10:05:38 -08:00