## Summary

- Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
- Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting)
- Create two Grafana dashboards: **Docs APM** (per-service view filtered by `host="docs.eblu.me"`) and **Fly.io Proxy Health** (aggregate proxy health across all upstream services)

## Changed Files

| File | Change |
|------|--------|
| `fly/nginx.conf` | Add JSON `log_format` + `access_log`, add `stub_status` endpoint |
| `fly/Dockerfile` | COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config |
| `fly/alloy.river` | **New**: Alloy config for log tailing, metric extraction, remote_write |
| `fly/start.sh` | Start Alloy after Tailscale, before nginx |
| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` | **New**: Docs APM dashboard |
| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` | **New**: Fly.io Proxy Health dashboard |
| `argocd/manifests/grafana-config/kustomization.yaml` | Register new dashboard configmaps |
| `docs/reference/services/flyio-proxy.md` | Document observability setup |

## Deployment and Testing

- [ ] `mise run fly-deploy`: rebuild container with Alloy
- [ ] `curl https://docs.eblu.me/`: generate traffic
- [ ] `fly logs -a blumeops-proxy`: verify Alloy startup
- [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}`
- [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}`
- [ ] `argocd app sync grafana-config`: deploy dashboards
- [ ] Verify dashboards show data in Grafana
- [ ] `mise run services-check`: no regressions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123
| title | tags |
|---|---|
| Fly.io Proxy | |
Fly.io Proxy
Public reverse proxy on Fly.io that exposes selected BlumeOps services to the internet via a Tailscale tunnel back to the homelab.
Quick Reference
| Property | Value |
|---|---|
| App | blumeops-proxy |
| Region | sjc (San Jose) |
| Fly.io URL | blumeops-proxy.fly.dev |
| Config | fly/ directory in repo |
| IaC | fly/fly.toml (app), Pulumi (DNS + auth key) |
Exposed Services
| Public domain | Backend | Service |
|---|---|---|
| docs.eblu.me | docs.tail8d86e.ts.net | docs |
Architecture
Internet traffic hits Fly.io's Anycast edge, terminates TLS with a Let's Encrypt certificate, and is proxied by nginx to the backend service over a Tailscale WireGuard tunnel. See expose-service-publicly for the full architecture diagram.
Key Files
| File | Purpose |
|---|---|
| fly/fly.toml | App configuration |
| fly/Dockerfile | nginx + Tailscale + Alloy container |
| fly/nginx.conf | Reverse proxy, caching, rate limiting, JSON logging |
| fly/alloy.river | Alloy config: log tailing, metric extraction, remote_write |
| fly/start.sh | Entrypoint: start Tailscale, Alloy, then nginx |
| pulumi/tailscale/__main__.py | Auth key (tag:flyio-proxy) |
| pulumi/tailscale/policy.hujson | ACL grants for proxy |
| pulumi/gandi/__main__.py | DNS CNAMEs |
Networking
Fly.io runs Firecracker microVMs which support TUN devices natively. Tailscale runs with a real TUN interface (not userspace networking), so MagicDNS and direct Tailscale IP routing work normally.
The Tailscale auth key is created with preauthorized=True so that the device joins the tailnet without manual approval, which would otherwise hang container restarts.
Observability
alloy runs inside the container alongside nginx and Tailscale, providing:
- Logs: nginx JSON access logs tailed and pushed to loki ({instance="flyio-proxy", job="flyio-nginx"})
- Metrics: Derived from access logs, pushed to prometheus via remote_write
  - flyio_nginx_http_requests_total: request rate by status/method/host
  - flyio_nginx_http_request_duration_seconds: latency histogram
  - flyio_nginx_http_response_bytes_total: response bandwidth
  - flyio_nginx_cache_requests_total: cache HIT/MISS/EXPIRED counts
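To make the log-to-metric derivation concrete, here is a minimal Python model of what such a pipeline computes from each access-log line. It is not the Alloy configuration itself. The JSON field names (`status`, `method`, `host`, `request_time`, `bytes_sent`, `cache_status`), the per-host grouping of bytes and cache counts, and the histogram bucket bounds are all assumptions, since the actual `log_format` and `alloy.river` contents are not shown here.

```python
import json
from collections import Counter

# Hypothetical histogram bucket upper bounds in seconds (not from the source).
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0)


def new_metrics():
    """Return an empty in-memory store for the derived metrics."""
    return {
        "requests": Counter(),          # models flyio_nginx_http_requests_total
        "bytes": Counter(),             # models flyio_nginx_http_response_bytes_total
        "cache": Counter(),             # models flyio_nginx_cache_requests_total
        "duration_buckets": Counter(),  # cumulative latency histogram buckets
        "duration_sum": 0.0,
        "duration_count": 0,
    }


def observe(metrics, line):
    """Update the metrics from one JSON access-log line.

    Field names are assumed; the real log_format may use different keys.
    """
    entry = json.loads(line)
    host = entry["host"]

    # Request counter labelled by status, method and host.
    metrics["requests"][(str(entry["status"]), entry["method"], host)] += 1

    # Response bandwidth in bytes.
    metrics["bytes"][host] += int(entry["bytes_sent"])

    # Cache status (HIT/MISS/EXPIRED); skipped when the field is empty.
    cache = entry.get("cache_status") or ""
    if cache:
        metrics["cache"][(host, cache)] += 1

    # Cumulative histogram: every bucket whose bound covers the value counts it.
    seconds = float(entry["request_time"])
    for bound in BUCKETS:
        if seconds <= bound:
            metrics["duration_buckets"][bound] += 1
    metrics["duration_buckets"][float("inf")] += 1
    metrics["duration_sum"] += seconds
    metrics["duration_count"] += 1
```

Calling `observe(metrics, line)` once per tailed line keeps the counters monotonic, which is the shape Prometheus expects for `_total` series and for cumulative histogram buckets.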
Dashboards
| Dashboard | Purpose |
|---|---|
| Docs APM | Per-service view for docs.eblu.me: request rate, latency percentiles, cache hit ratio, error rate, bandwidth, access logs |
| Fly.io Proxy Health | Aggregate proxy health: connections, total request rate by host, cache performance, upstream latency, Alloy health |
Alloy serves its own /metrics endpoint on 127.0.0.1:12345 and scrapes it for self-monitoring. All metrics carry instance="flyio-proxy".
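To check that these series and log streams exist outside Grafana, a small sketch like the following builds query URLs for the standard Prometheus and Loki HTTP APIs. The base URLs are placeholders, because the real endpoints are not given on this page. The `_bucket` suffix and the `host` label on the latency histogram are also assumptions, based on usual Prometheus histogram naming.

```python
from urllib.parse import urlencode

# Placeholder base URLs; substitute the real Prometheus and Loki endpoints.
PROMETHEUS_URL = "https://prometheus.example.invalid"
LOKI_URL = "https://loki.example.invalid"


def request_rate_query(window="5m"):
    """PromQL for per-host request rate across the whole proxy."""
    return (
        "sum by (host) (rate("
        f'flyio_nginx_http_requests_total{{instance="flyio-proxy"}}[{window}]))'
    )


def p95_latency_query(host, window="5m"):
    """PromQL for p95 latency of one host (assumes a `host` label and `_bucket` series)."""
    return (
        "histogram_quantile(0.95, sum by (le) (rate("
        f'flyio_nginx_http_request_duration_seconds_bucket{{host="{host}"}}[{window}])))'
    )


def prometheus_query_url(promql):
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{PROMETHEUS_URL}/api/v1/query?" + urlencode({"query": promql})


def loki_query_url(logql, limit=100):
    """URL for a range query against the Loki HTTP API."""
    return f"{LOKI_URL}/loki/api/v1/query_range?" + urlencode(
        {"query": logql, "limit": limit}
    )
```

For example, `loki_query_url('{instance="flyio-proxy", job="flyio-nginx"}')` yields a URL that can be fetched with any HTTP client to confirm that access logs are arriving.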
Security Considerations
The tag:flyio-proxy ACL grants access to both tag:k8s:443 (for proxying public services) and tag:homelab:443 (for pushing metrics/logs to Caddy-proxied Loki and Prometheus). This means a compromised nginx config could route traffic to any Caddy-proxied service, not just the intended backends. Some of those services (Loki, Prometheus) have no auth; others (forgejo, navidrome, immich) do.
Exploitation requires either pushing a malicious image to Fly.io or modifying the nginx config. Both require RCE on gilbert (where fly is authenticated) or access to 1Password (which holds the deploy token). This is an acceptable boundary given that 1Password is already the trust root for the entire infrastructure.
If this surface area becomes a concern, an alternative would be to add dedicated Tailscale Ingress tags for Loki/Prometheus write endpoints and restrict tag:flyio-proxy to only those.
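The difference between the current grants and the narrower alternative can be sketched as a toy reachability check. This models the reasoning only; it is not the `policy.hujson` syntax. The tag names `tag:loki-write` and `tag:prometheus-write` are invented for illustration, and keeping `tag:k8s:443` in the restricted set is one reading of the proposal.

```python
# Destinations tag:flyio-proxy can reach today, as described above.
CURRENT_GRANTS = {
    "tag:flyio-proxy": {("tag:k8s", 443), ("tag:homelab", 443)},
}

# Hypothetical narrower policy: the telemetry tag names are illustrative only.
RESTRICTED_GRANTS = {
    "tag:flyio-proxy": {
        ("tag:k8s", 443),
        ("tag:loki-write", 443),
        ("tag:prometheus-write", 443),
    },
}


def can_reach(grants, src, dst, port):
    """Return True if the grants allow src to connect to dst on port."""
    return (dst, port) in grants.get(src, set())
```

Under the restricted set, `can_reach(RESTRICTED_GRANTS, "tag:flyio-proxy", "tag:homelab", 443)` is false, so a compromised nginx config could no longer reach the other Caddy-proxied services through that tag.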
Secrets
| Secret | Source | Description |
|---|---|---|
| TS_AUTHKEY | Pulumi state → fly secrets | Tailscale auth key for joining tailnet |
| FLY_DEPLOY_TOKEN | Fly.io → 1Password | Deploy token for CI |
Related
- expose-service-publicly - Setup guide for adding new public services
- manage-flyio-proxy - Operational tasks (deploy, shutoff, troubleshoot)
- caddy - Private reverse proxy for *.ops.eblu.me (separate system)
- tailscale - WireGuard mesh network
- gandi - DNS hosting