blumeops/docs/reference/services/flyio-proxy.md
Erich Blume cc54b4f565
Add Fly.io proxy observability via embedded Alloy (#123)
## Summary

- Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
- Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting)
- Create two Grafana dashboards: **Docs APM** (per-service view filtered by `host="docs.eblu.me"`) and **Fly.io Proxy Health** (aggregate proxy health across all upstream services)

## Changed Files

| File | Change |
|------|--------|
| `fly/nginx.conf` | Add JSON `log_format` + `access_log`, add `stub_status` endpoint |
| `fly/Dockerfile` | COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config |
| `fly/alloy.river` | **New** — Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Start Alloy after Tailscale, before nginx |
| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` | **New** — Docs APM dashboard |
| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` | **New** — Fly.io Proxy Health dashboard |
| `argocd/manifests/grafana-config/kustomization.yaml` | Register new dashboard configmaps |
| `docs/reference/services/flyio-proxy.md` | Document observability setup |

## Deployment and Testing

- [ ] `mise run fly-deploy` — rebuild container with Alloy
- [ ] `curl https://docs.eblu.me/` — generate traffic
- [ ] `fly logs -a blumeops-proxy` — verify Alloy startup
- [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}`
- [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}`
- [ ] `argocd app sync grafana-config` — deploy dashboards
- [ ] Verify dashboards show data in Grafana
- [ ] `mise run services-check` — no regressions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123
2026-02-08 10:05:38 -08:00

| title | tags |
|-------|------|
| Fly.io Proxy | service, networking, fly-io |

Fly.io Proxy

Public reverse proxy on Fly.io that exposes selected BlumeOps services to the internet via a Tailscale tunnel back to the homelab.

Quick Reference

| Property | Value |
|----------|-------|
| App | `blumeops-proxy` |
| Region | `sjc` (San Jose) |
| Fly.io URL | `blumeops-proxy.fly.dev` |
| Config | `fly/` directory in repo |
| IaC | `fly/fly.toml` (app), Pulumi (DNS + auth key) |

Exposed Services

| Public domain | Backend | Service |
|---------------|---------|---------|
| `docs.eblu.me` | `docs.tail8d86e.ts.net` | docs |

Architecture

Internet traffic hits Fly.io's Anycast edge, terminates TLS with a Let's Encrypt certificate, and is proxied by nginx to the backend service over a Tailscale WireGuard tunnel. See expose-service-publicly for the full architecture diagram.

Key Files

| File | Purpose |
|------|---------|
| `fly/fly.toml` | App configuration |
| `fly/Dockerfile` | nginx + Tailscale + Alloy container |
| `fly/nginx.conf` | Reverse proxy, caching, rate limiting, JSON logging |
| `fly/alloy.river` | Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Entrypoint: start Tailscale, Alloy, then nginx |
| `pulumi/tailscale/__main__.py` | Auth key (`tag:flyio-proxy`) |
| `pulumi/tailscale/policy.hujson` | ACL grants for proxy |
| `pulumi/gandi/__main__.py` | DNS CNAMEs |

Networking

Fly.io runs Firecracker microVMs, which support TUN devices natively. Tailscale therefore runs with a real TUN interface (not userspace networking), so MagicDNS and direct Tailscale IP routing work normally.

The Tailscale auth key is created with `preauthorized=True` so that the container does not hang waiting for device approval when it restarts.

Observability

Alloy runs inside the container alongside nginx and Tailscale, providing:

- Logs: nginx JSON access logs are tailed and pushed to Loki (`{instance="flyio-proxy", job="flyio-nginx"}`)
- Metrics: derived from the access logs and pushed to Prometheus via `remote_write`
  - `flyio_nginx_http_requests_total`: request rate by status/method/host
  - `flyio_nginx_http_request_duration_seconds`: latency histogram
  - `flyio_nginx_http_response_bytes_total`: response bandwidth
  - `flyio_nginx_cache_requests_total`: cache HIT/MISS/EXPIRED counts
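As a rough illustration of how counters and a histogram can be derived from JSON access-log lines, here is a minimal Python sketch. It is not the Alloy pipeline in `fly/alloy.river`. The JSON keys (`status`, `request_method`, `host`, `request_time`, `body_bytes_sent`, `upstream_cache_status`) and the bucket bounds are assumptions for the example; the real `log_format` and histogram buckets may differ.

```python
import json
from collections import Counter

# Hypothetical histogram bucket upper bounds, in seconds.
BUCKETS = (0.05, 0.1, 0.5, 1.0, 5.0)


def derive_metrics(lines):
    """Aggregate JSON access-log lines into counter- and histogram-like totals."""
    requests = Counter()          # (status, method, host) -> request count
    response_bytes = Counter()    # host -> total response bytes
    cache = Counter()             # cache status (HIT/MISS/EXPIRED) -> count
    duration_buckets = Counter()  # bucket upper bound -> cumulative count

    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue  # skip JSON values that are not objects

        host = entry.get("host", "")
        key = (str(entry.get("status")), entry.get("request_method"), host)
        requests[key] += 1
        response_bytes[host] += int(entry.get("body_bytes_sent", 0))

        cache_status = entry.get("upstream_cache_status")
        if cache_status:
            cache[cache_status] += 1

        # Prometheus-style cumulative buckets: a sample counts toward
        # every bucket whose upper bound is >= the observed value.
        seconds = float(entry.get("request_time", 0))
        for bound in BUCKETS:
            if seconds <= bound:
                duration_buckets[bound] += 1
        duration_buckets[float("inf")] += 1

    return {
        "requests": requests,
        "response_bytes": response_bytes,
        "cache": cache,
        "duration_buckets": duration_buckets,
    }
```

Each returned counter corresponds loosely to one of the `flyio_nginx_*` metrics listed above; in the deployed container, Alloy performs the equivalent aggregation and ships the results via `remote_write`.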

Dashboards

| Dashboard | Purpose |
|-----------|---------|
| Docs APM | Per-service view for `docs.eblu.me`: request rate, latency percentiles, cache hit ratio, error rate, bandwidth, access logs |
| Fly.io Proxy Health | Aggregate proxy health: connections, total request rate by host, cache performance, upstream latency, Alloy health |

Alloy listens on 127.0.0.1:12345 for self-scraping its /metrics endpoint. All metrics carry instance="flyio-proxy".
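To spot-check Alloy's own metrics endpoint from inside the container, something like the following Python sketch could be used. It assumes the endpoint serves the standard Prometheus text exposition format. The helper names `metric_names` and `fetch_metric_names` are illustrative and not part of the repo.

```python
from urllib.request import urlopen

# Address taken from the text above; only reachable from inside the container.
ALLOY_METRICS_URL = "http://127.0.0.1:12345/metrics"


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
    return names


def fetch_metric_names(url=ALLOY_METRICS_URL, timeout=2.0):
    """Fetch the metrics endpoint and return the metric names it exposes."""
    with urlopen(url, timeout=timeout) as response:
        return metric_names(response.read().decode("utf-8"))
```

An empty result, or a connection error from `fetch_metric_names()`, would suggest that Alloy did not start or is not listening on the expected address.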

Security Considerations

The `tag:flyio-proxy` ACL grants access to both `tag:k8s:443` (for proxying public services) and `tag:homelab:443` (for pushing metrics and logs to Caddy-proxied Loki and Prometheus). This means a compromised nginx config could route traffic to any Caddy-proxied service, not just the intended backends. Some of those services (Loki, Prometheus) have no auth; others (Forgejo, Navidrome, Immich) do.

Exploitation requires either pushing a malicious image to Fly.io or modifying the nginx config. Both require RCE on gilbert (where `fly` is authenticated) or access to 1Password (which holds the deploy token). This is an acceptable boundary given that 1Password is already the trust root for the entire infrastructure.

If this surface area becomes a concern, an alternative would be to add dedicated Tailscale Ingress tags for Loki/Prometheus write endpoints and restrict tag:flyio-proxy to only those.
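The difference between the current grants and the narrower alternative can be sketched with a toy model in Python. This is a deliberate simplification and not how Tailscale evaluates `policy.hujson`. The tag names `tag:loki-write` and `tag:prometheus-write` are hypothetical, and keeping the `tag:k8s:443` grant in the narrowed policy is an assumption.

```python
# Simplified model of the grants described above (not the real policy file).
CURRENT_RULES = [
    {"src": "tag:flyio-proxy", "dst": "tag:k8s", "port": 443},
    {"src": "tag:flyio-proxy", "dst": "tag:homelab", "port": 443},
]

# Hypothetical narrowed policy: the broad tag:homelab grant is replaced by
# dedicated (invented) tags that cover only the Loki/Prometheus write endpoints.
NARROWED_RULES = [
    {"src": "tag:flyio-proxy", "dst": "tag:k8s", "port": 443},
    {"src": "tag:flyio-proxy", "dst": "tag:loki-write", "port": 443},
    {"src": "tag:flyio-proxy", "dst": "tag:prometheus-write", "port": 443},
]


def allowed(rules, src, dst, port):
    """Return True if any rule permits src to reach dst on the given port."""
    return any(
        rule["src"] == src and rule["dst"] == dst and rule["port"] == port
        for rule in rules
    )
```

Under this model, `allowed(CURRENT_RULES, "tag:flyio-proxy", "tag:homelab", 443)` is true while the same call against `NARROWED_RULES` is false, which is the reduction in reachable surface that the alternative aims for.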

Secrets

| Secret | Source | Description |
|--------|--------|-------------|
| `TS_AUTHKEY` | Pulumi state → fly secrets | Tailscale auth key for joining tailnet |
| `FLY_DEPLOY_TOKEN` | Fly.io → 1Password | Deploy token for CI |