Add Fly.io proxy observability via embedded Alloy (#123)

## Summary

- Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
- Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting)
- Create two Grafana dashboards: **Docs APM** (per-service view filtered by `host="docs.eblu.me"`) and **Fly.io Proxy Health** (aggregate proxy health across all upstream services)

## Changed Files

| File | Change |
|------|--------|
| `fly/nginx.conf` | Add JSON `log_format` + `access_log`, add `stub_status` endpoint |
| `fly/Dockerfile` | COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config |
| `fly/alloy.river` | **New** — Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Start Alloy after Tailscale, before nginx |
| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` | **New** — Docs APM dashboard |
| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` | **New** — Fly.io Proxy Health dashboard |
| `argocd/manifests/grafana-config/kustomization.yaml` | Register new dashboard configmaps |
| `docs/reference/services/flyio-proxy.md` | Document observability setup |

## Deployment and Testing

- [ ] `mise run fly-deploy` — rebuild container with Alloy
- [ ] `curl https://docs.eblu.me/` — generate traffic
- [ ] `fly logs -a blumeops-proxy` — verify Alloy startup
- [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}`
- [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}`
- [ ] `argocd app sync grafana-config` — deploy dashboards
- [ ] Verify dashboards show data in Grafana
- [ ] `mise run services-check` — no regressions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123

2026-02-08 10:05:38 -08:00

4.3 KiB


title: Fly.io Proxy

Public reverse proxy on Fly.io that exposes selected BlumeOps services to the internet via a Tailscale tunnel back to the homelab.

Quick Reference

| Property | Value |
|----------|-------|
| App | `blumeops-proxy` |
| Region | `sjc` (San Jose) |
| Fly.io URL | `blumeops-proxy.fly.dev` |
| Config | `fly/` directory in repo |
| IaC | `fly/fly.toml` (app), Pulumi (DNS + auth key) |

Exposed Services

| Public domain | Backend | Service |
|---------------|---------|---------|
| `docs.eblu.me` | `docs.tail8d86e.ts.net` | docs |

Architecture

Internet traffic hits Fly.io's Anycast edge, terminates TLS with a Let's Encrypt certificate, and is proxied by nginx to the backend service over a Tailscale WireGuard tunnel. See expose-service-publicly for the full architecture diagram.

Key Files

| File | Purpose |
|------|---------|
| `fly/fly.toml` | App configuration |
| `fly/Dockerfile` | nginx + Tailscale + Alloy container |
| `fly/nginx.conf` | Reverse proxy, caching, rate limiting, JSON logging |
| `fly/alloy.river` | Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Entrypoint: start Tailscale, Alloy, then nginx |
| `pulumi/tailscale/__main__.py` | Auth key (`tag:flyio-proxy`) |
| `pulumi/tailscale/policy.hujson` | ACL grants for proxy |
| `pulumi/gandi/__main__.py` | DNS CNAMEs |

Networking

Fly.io runs Firecracker microVMs which support TUN devices natively. Tailscale runs with a real TUN interface (not userspace networking), so MagicDNS and direct Tailscale IP routing work normally.

The Tailscale auth key is created with `preauthorized=True` so that container restarts do not hang waiting for device approval.

Observability

Alloy runs inside the container alongside nginx and Tailscale, providing:

- **Logs:** nginx JSON access logs, tailed and pushed to Loki (`{instance="flyio-proxy", job="flyio-nginx"}`)
- **Metrics:** derived from the access logs and pushed to Prometheus via `remote_write`
  - `flyio_nginx_http_requests_total`: request rate by status/method/host
  - `flyio_nginx_http_request_duration_seconds`: latency histogram
  - `flyio_nginx_http_response_bytes_total`: response bandwidth
  - `flyio_nginx_cache_requests_total`: cache HIT/MISS/EXPIRED counts
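To make "derived from access logs" concrete, here is a minimal Python sketch of the idea. It is not the Alloy pipeline itself. The JSON field names (`host`, `method`, `status`, `bytes_sent`, `cache_status`) and the sample lines are assumptions for illustration; the actual `log_format` in `fly/nginx.conf` may use different names.

```python
import json
from collections import Counter

# Hypothetical JSON access-log lines. The field names are assumptions,
# not the actual log_format defined in fly/nginx.conf.
SAMPLE_LINES = [
    '{"host": "docs.eblu.me", "method": "GET", "status": 200, "bytes_sent": 5120, "cache_status": "HIT"}',
    '{"host": "docs.eblu.me", "method": "GET", "status": 200, "bytes_sent": 20480, "cache_status": "MISS"}',
    '{"host": "docs.eblu.me", "method": "GET", "status": 404, "bytes_sent": 512, "cache_status": "MISS"}',
]


def derive_metrics(lines):
    """Fold parsed log lines into counters, one per metric family."""
    requests = Counter()        # keyed by (host, method, status)
    cache = Counter()           # keyed by cache status
    response_bytes = Counter()  # keyed by host
    for line in lines:
        entry = json.loads(line)
        requests[(entry["host"], entry["method"], str(entry["status"]))] += 1
        cache[entry["cache_status"]] += 1
        response_bytes[entry["host"]] += entry["bytes_sent"]
    return requests, cache, response_bytes


requests, cache, response_bytes = derive_metrics(SAMPLE_LINES)
```

Each counter only ever increases, which is why the corresponding metrics carry a `_total` suffix and dashboards graph them as rates rather than raw values.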

Dashboards

| Dashboard | Purpose |
|-----------|---------|
| Docs APM | Per-service view for `docs.eblu.me`: request rate, latency percentiles, cache hit ratio, error rate, bandwidth, access logs |
| Fly.io Proxy Health | Aggregate proxy health: connections, total request rate by host, cache performance, upstream latency, Alloy health |
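The cache hit ratio and latency percentile panels reduce to simple arithmetic over the counters and histogram buckets. The Python sketch below shows that arithmetic under stated assumptions: the bucket boundaries and counts are made up, and the quantile estimate uses linear interpolation within a bucket, similar in spirit to PromQL's `histogram_quantile`. The real dashboard queries may differ.

```python
import math


def cache_hit_ratio(cache_counts):
    """Fraction of cached-path requests that were served as HIT."""
    total = sum(cache_counts.values())
    return cache_counts.get("HIT", 0) / total if total else 0.0


def estimate_quantile(q, buckets):
    """Estimate a quantile from cumulative histogram buckets.

    buckets is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound and ending with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(bound) or in_bucket == 0:
                # No finite upper edge (or an empty bucket) to interpolate in.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical values, for illustration only.
cache_counts = {"HIT": 90, "MISS": 8, "EXPIRED": 2}
buckets = [(0.05, 60), (0.1, 80), (0.5, 95), (1.0, 100), (math.inf, 100)]

hit_ratio = cache_hit_ratio(cache_counts)
p50 = estimate_quantile(0.5, buckets)
p90 = estimate_quantile(0.9, buckets)
```

Because the estimate interpolates inside a bucket, its precision is bounded by the bucket width, so a percentile that falls in a wide bucket is only a rough figure.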

Alloy listens on `127.0.0.1:12345` for self-scraping its `/metrics` endpoint. All metrics carry `instance="flyio-proxy"`.

Security Considerations

The `tag:flyio-proxy` ACL grants access to both `tag:k8s:443` (for proxying public services) and `tag:homelab:443` (for pushing metrics and logs to Caddy-proxied Loki and Prometheus). As a result, a compromised nginx config could route traffic to any Caddy-proxied service, not just the intended backends. Some of those services (Loki, Prometheus) have no auth; others (forgejo, navidrome, immich) do.

Exploitation requires either pushing a malicious image to Fly.io or modifying the nginx config. Both require RCE on gilbert (where `fly` is authenticated) or access to 1Password (which holds the deploy token). This is an acceptable boundary given that 1Password is already the trust root for the entire infrastructure.

If this surface area becomes a concern, one alternative is to add dedicated Tailscale Ingress tags for the Loki and Prometheus write endpoints and restrict `tag:flyio-proxy` to only those.

Secrets

| Secret | Source | Description |
|--------|--------|-------------|
| `TS_AUTHKEY` | Pulumi state → `fly secrets` | Tailscale auth key for joining tailnet |
| `FLY_DEPLOY_TOKEN` | Fly.io → 1Password | Deploy token for CI |

- expose-service-publicly - Setup guide for adding new public services
- manage-flyio-proxy - Operational tasks (deploy, shutoff, troubleshoot)
- caddy - Private reverse proxy for `*.ops.eblu.me` (separate system)
- tailscale - WireGuard mesh network
- gandi - DNS hosting
