## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuses TLS connections through the Tailscale tunnel instead of handshaking per request
- Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS)

## Trade-off

DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #337
| title | modified | last-reviewed | tags |
|---|---|---|---|
| Manage Fly.io Proxy | 2026-04-17 | 2026-04-17 | |
Manage Fly.io Proxy
Operational tasks for the flyio-proxy public reverse proxy.
Deploy Changes
After modifying files in fly/:
mise run fly-deploy
Pushes to fly/ on main also trigger automatic deployment via the Forgejo CI workflow.
Reload Nginx (Re-resolve Upstream DNS)
Nginx uses upstream blocks with keepalive connection pools. DNS is resolved at config load. If Tailscale Ingress pods get new IPs (restart, reschedule, minikube restart), reload nginx to re-resolve without a full redeploy:
mise run fly-reload
A Grafana alert fires when upstreams are unreachable, prompting this action. A full fly-deploy also re-resolves DNS (it replaces the container).
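As a rough sketch of what a reload amounts to, the helper below validates the config and then signals nginx inside the running machine. It assumes nginx is on the PATH in the container and that `fly ssh console -C` works for this app. The function name `reload_proxy_nginx` is hypothetical, and the actual `mise run fly-reload` task may be implemented differently.

```shell
# Hypothetical helper: reload nginx inside the Fly.io machine so that it
# re-reads its config and re-resolves the upstream hostnames.
# Assumption: this approximates what `mise run fly-reload` does.
reload_proxy_nginx() {
  local app="${1:-blumeops-proxy}"
  # Validate the config first; if the test fails, the running nginx is left alone.
  fly ssh console -a "$app" -C "nginx -t" || return 1
  # Graceful reload: nginx re-parses the config and starts fresh workers.
  fly ssh console -a "$app" -C "nginx -s reload"
}
```

Called with no arguments it targets blumeops-proxy. Passing another app name as the first argument is only there to make the sketch reusable.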
Add a New Public Service
See expose-service-publicly#Per-service setup for the full walkthrough. In short:
- Add a server block to fly/nginx.conf
- Add a Fly.io certificate: fly certs add <domain> -a blumeops-proxy
- Deploy: mise run fly-deploy
- Verify against blumeops-proxy.fly.dev with a Host header
- Add DNS CNAME via Pulumi: mise run dns-preview then mise run dns-up
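The verification step can be sketched as a small curl wrapper that requests the Fly.io hostname while presenting the new domain in the Host header, before any DNS record exists. The function name `verify_public_service` and the default path are assumptions for illustration, not part of the repository.

```shell
# Hypothetical helper: hit the proxy via its fly.dev hostname while sending
# the custom domain as the Host header, and print the HTTP status code.
verify_public_service() {
  local domain="$1"
  local path="${2:-/}"
  # -f: non-zero exit on HTTP errors; -sS: quiet, but still report failures;
  # -o /dev/null: discard the body; -w: print only the status code.
  curl -fsS -o /dev/null -w '%{http_code}\n' \
    -H "Host: ${domain}" "https://blumeops-proxy.fly.dev${path}"
}
```

Call it with the new domain as the first argument and, optionally, a path as the second. A non-zero exit status suggests the server block or the upstream behind it is not answering yet.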
Emergency Shutoff
If the proxy is causing issues (DDoS, unexpected traffic, bandwidth consumption on the home network):
Level 1 — Stop the container (seconds, reversible):
mise run fly-shutoff
# or: fly scale count 0 -a blumeops-proxy --yes
All public services go offline immediately. Tailscale tunnel drops. Zero traffic reaches indri. Restore with fly scale count 1 -a blumeops-proxy.
Level 2 — Revoke Tailscale access (seconds):
Remove the flyio-proxy node in the Tailscale admin console. Even if the container is running, it cannot reach the tailnet. Use this if the container itself may be compromised.
Level 3 — Remove DNS (minutes to hours): Delete the CNAME records at Gandi. Takes time for DNS propagation but is the permanent shutoff.
Level 1 is the primary response. It is a single command, takes effect in seconds, and is trivially reversible. Keep mise run fly-shutoff somewhere easily accessible (e.g., pinned in a notes app) so it can be run quickly under stress.
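One way to keep Level 1 within reach is a pair of shell functions in a profile file. This is a sketch only: the function names are hypothetical, the commands are the ones listed above, and the down-check assumes the public health endpoint stops answering successfully once the machine count is zero.

```shell
# Level 1 shutoff: scale the proxy to zero machines.
proxy_shutoff() {
  fly scale count 0 -a blumeops-proxy --yes
}

# Restore: bring one machine back.
proxy_restore() {
  fly scale count 1 -a blumeops-proxy
}

# Succeeds (exit 0) when the public health check no longer answers,
# i.e. when the shutoff appears to have taken effect.
proxy_is_down() {
  ! curl -sf --max-time 5 -o /dev/null https://blumeops-proxy.fly.dev/healthz
}
```

Running `proxy_shutoff && proxy_is_down` gives a quick confirmation that traffic has stopped.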
Check Status
# App and machine status
fly status -a blumeops-proxy
# Live logs
fly logs -a blumeops-proxy
# Health check
curl -sf https://blumeops-proxy.fly.dev/healthz
# Certificate status
fly certs list -a blumeops-proxy
Rotate Tailscale Auth Key
The auth key expires every 90 days. To rotate:
- Re-apply Pulumi to generate a new key: mise run tailnet-up
- Re-run setup to stage the new secret: mise run fly-setup
- Deploy to pick up the new secret: mise run fly-deploy
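Since each rotation step depends on the previous one, the three commands can be chained so that a failure stops the sequence. The wrapper name `rotate_tailscale_key` is hypothetical, and the sketch assumes the three mise tasks run non-interactively.

```shell
# Hypothetical wrapper: run the rotation steps in order and stop at the
# first failure, so a broken key is never deployed.
rotate_tailscale_key() {
  mise run tailnet-up &&   # generate a new auth key via Pulumi
    mise run fly-setup &&  # stage the new key as a Fly.io secret
    mise run fly-deploy    # redeploy so the container picks it up
}
```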
Troubleshooting
502 Bad Gateway after Tailscale Ingress restart: Upstream DNS is stale. Run mise run fly-reload to re-resolve. This is the most common cause of 502s.
502 Bad Gateway on fresh deploy: MagicDNS may not be ready when nginx starts. The start.sh script polls nslookup before launching nginx, but if it still fails, check that tailscale status is healthy inside the container.
Health check failing: Run fly ssh console -a blumeops-proxy, then curl localhost:8080/healthz inside the container to test the endpoint locally.
TLS errors on custom domain: Check cert status with fly certs show <domain> -a blumeops-proxy. Certs auto-provision via Let's Encrypt and may take a few minutes.
High latency (>1s p50): The keepalive connection pool has probably been lost, so requests are paying for a new TLS handshake each time. Redeploy with mise run fly-deploy. Before the keepalive change (April 2026), per-request TLS handshakes through the WireGuard tunnel caused 35s+ p50 at >1 req/s.
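To put a number on a suspected latency problem, a rough p50 can be sampled from a workstation with curl. This is a minimal sketch: `median` and `sample_p50` are hypothetical helpers, and the URL must be supplied by the caller. A URL that is actually proxied to an upstream is the better choice, on the assumption that /healthz is answered by nginx itself. Each curl run also includes the client's own connection setup, so treat the result as a relative indicator.

```shell
# Print the median (p50) of newline-separated numbers read from stdin.
median() {
  LC_ALL=C sort -n | awk '{ v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Time N sequential requests to URL and print the p50 total time in seconds.
sample_p50() {
  local url="$1" n="${2:-20}" i=0
  while [ "$i" -lt "$n" ]; do
    # -w '%{time_total}' prints the full request duration for each sample.
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
    i=$((i + 1))
  done | median
}
```

Comparing the value before and after mise run fly-deploy gives a quick sense of whether the redeploy restored connection reuse.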
Related
- flyio-proxy - Service reference card
- expose-service-publicly - Full setup guide and architecture