blumeops/docs/how-to/operations/manage-flyio-proxy.md
Erich Blume fe0e913963
Switch Fly proxy to upstream keepalive pools (#337)
## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuses TLS connections through the Tailscale tunnel instead of handshaking per request
- Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS)

## Trade-off

DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

Reviewed-on: #337
2026-04-17 16:39:52 -07:00

title: Manage Fly.io Proxy
modified: 2026-04-17
last-reviewed: 2026-04-17
tags: how-to, fly-io, networking, operations

Manage Fly.io Proxy

Operational tasks for the flyio-proxy public reverse proxy.

Deploy Changes

After modifying files in fly/:

mise run fly-deploy

Pushes to main that change files under fly/ also trigger automatic deployment via the Forgejo CI workflow.

Reload Nginx (Re-resolve Upstream DNS)

Nginx uses upstream blocks with keepalive connection pools. DNS is resolved at config load. If Tailscale Ingress pods get new IPs (restart, reschedule, minikube restart), reload nginx to re-resolve without a full redeploy:

mise run fly-reload

A Grafana alert fires when upstreams are unreachable, prompting this action. A full fly-deploy also re-resolves DNS (it replaces the container).
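For orientation, the upstream-plus-keepalive pattern described in this section can be sketched as an nginx fragment. This is not the contents of fly/nginx.conf: the upstream name, both hostnames, and the pool size are hypothetical placeholders, and the port is only borrowed from the health-check URL used elsewhere in this guide. The shell wrapper just writes the sketch to a temporary file and prints it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write a hypothetical nginx fragment to a temp file. Nothing here is copied
# from fly/nginx.conf; names, hosts, and the pool size are placeholders.
sketch="$(mktemp)"
cat > "$sketch" <<'EOF'
# The hostname below is resolved once, when nginx loads or reloads its config.
upstream example_backend {
    server example.tailnet.example:443;   # placeholder Tailscale Ingress host
    keepalive 8;                          # idle connections cached per worker
}

server {
    listen 8080;
    server_name app.example.com;          # placeholder public domain

    location / {
        proxy_pass https://example_backend;
        proxy_http_version 1.1;           # upstream keepalive needs HTTP/1.1
        proxy_set_header Connection "";   # do not forward "Connection: close"
        proxy_set_header Host example.tailnet.example;
        proxy_ssl_server_name on;         # send SNI on the upstream handshake
        proxy_ssl_name example.tailnet.example;
    }
}
EOF

cat "$sketch"
```

Because the `server` line inside `upstream` names a host rather than using a variable in `proxy_pass`, nginx looks it up only at load time, which is why a reload is the lightweight way to pick up new pod IPs.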

Add a New Public Service

See expose-service-publicly#Per-service setup for the full walkthrough. In short:

  1. Add a server block to fly/nginx.conf
  2. Add a Fly.io certificate: fly certs add <domain> -a blumeops-proxy
  3. Deploy: mise run fly-deploy
  4. Verify against blumeops-proxy.fly.dev with a Host header
  5. Add DNS CNAME via Pulumi: mise run dns-preview then mise run dns-up
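Step 4 above can be sketched with curl. The function name `verify_host` and the example domain are hypothetical, and the sketch assumes the new server block answers on the root path.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the HTTP status the proxy returns for a given Host header.
# $1 is the public domain being added (caller-supplied).
verify_host() {
    local domain="$1"
    curl -sS -o /dev/null -w '%{http_code}' \
        -H "Host: ${domain}" \
        "https://blumeops-proxy.fly.dev/"
}
```

Calling `verify_host app.example.com` prints only the status code. A 502 at this stage would suggest the server block matched but the upstream is unreachable.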

Emergency Shutoff

If the proxy is causing issues (DDoS, unexpected traffic, bandwidth consumption on the home network):

Level 1 — Stop the container (seconds, reversible):

mise run fly-shutoff
# or: fly scale count 0 -a blumeops-proxy --yes

All public services go offline immediately. Tailscale tunnel drops. Zero traffic reaches indri. Restore with fly scale count 1 -a blumeops-proxy.

Level 2 — Revoke Tailscale access (seconds): Remove the flyio-proxy node in the Tailscale admin console. Even if the container is running, it cannot reach the tailnet. Use this if the container itself may be compromised.

Level 3 — Remove DNS (minutes to hours): Delete the CNAME records at Gandi. Takes time for DNS propagation but is the permanent shutoff.

Level 1 is the primary response. It is a single command, takes effect in seconds, and is trivially reversible. Keep mise run fly-shutoff somewhere easily accessible (e.g., pinned in a notes app) so it can be run quickly under stress.
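One way to keep Level 1 and its reversal within reach is a pair of shell functions. This is only a sketch: the function names are hypothetical, and adding `--yes` to the restore command is an assumption made so that neither direction stops at a confirmation prompt.

```shell
#!/usr/bin/env bash
set -euo pipefail

APP="blumeops-proxy"

# Level 1: scale to zero machines so no traffic reaches the tailnet.
proxy_off() {
    fly scale count 0 -a "$APP" --yes
}

# Undo Level 1 by scaling back to a single machine.
# (--yes is added here as an assumption, to skip the prompt.)
proxy_on() {
    fly scale count 1 -a "$APP" --yes
}
```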

Check Status

# App and machine status
fly status -a blumeops-proxy

# Live logs
fly logs -a blumeops-proxy

# Health check
curl -sf https://blumeops-proxy.fly.dev/healthz

# Certificate status
fly certs list -a blumeops-proxy

Rotate Tailscale Auth Key

The auth key expires every 90 days. To rotate:

  1. Re-apply Pulumi to generate a new key: mise run tailnet-up
  2. Re-run setup to stage the new secret: mise run fly-setup
  3. Deploy to pick up the new secret: mise run fly-deploy
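The three rotation steps can be chained so that a failure stops the sequence. A minimal sketch, where the function name `rotate_auth_key` is hypothetical and the mise tasks are the ones listed above:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Run the rotation steps in order: generate a new key via Pulumi, stage the
# new secret, then deploy. The && chain means a later step does not run if
# an earlier one fails.
rotate_auth_key() {
    mise run tailnet-up &&
        mise run fly-setup &&
        mise run fly-deploy
}
```

The explicit `&&` chain is used instead of relying on `set -e` because `set -e` is ignored inside a function that is called as part of a condition or an `||` list.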

Troubleshooting

502 Bad Gateway after Tailscale Ingress restart: Upstream DNS is stale. Run mise run fly-reload to re-resolve. This is the most common cause of 502s.
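The check-then-reload step could be scripted roughly as follows. The function name `reload_if_502` is hypothetical, and the URL must be a public service that is actually proxied to an upstream, since a path answered by nginx itself would not show a stale upstream.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reload nginx only when the probed URL answers 502 Bad Gateway.
# $1 is the public URL of a proxied service (caller-supplied).
reload_if_502() {
    local url="$1" status
    # -w '%{http_code}' prints just the status; "000" means no HTTP response.
    status="$(curl -s -o /dev/null -w '%{http_code}' "$url" || true)"
    if [ "$status" = "502" ]; then
        echo "got 502 from ${url}; reloading nginx"
        mise run fly-reload
    else
        echo "got ${status} from ${url}; no reload"
    fi
}
```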

502 Bad Gateway on fresh deploy: MagicDNS may not be ready when nginx starts. The start.sh script polls nslookup before launching nginx, but if it still fails, check that tailscale status is healthy inside the container.
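A wait loop of the kind described can be sketched as below. This is not the actual start.sh: the function name, the default attempt count, and the delay are assumptions for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Poll nslookup until the name resolves or the attempts run out.
# $1 = hostname, $2 = max attempts (default 30), $3 = seconds between tries.
wait_for_dns() {
    local host="$1" attempts="${2:-30}" delay="${3:-1}" i
    for ((i = 1; i <= attempts; i++)); do
        if nslookup "$host" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
    done
    echo "gave up resolving ${host} after ${attempts} attempts" >&2
    return 1
}
```

A startup script would call something like `wait_for_dns <upstream-host>` before starting nginx, so that the load-time lookup of each upstream has a chance to succeed.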

Health check failing: Open a shell with fly ssh console -a blumeops-proxy, then run curl localhost:8080/healthz inside the container to test locally.

TLS errors on custom domain: Check cert status with fly certs show <domain> -a blumeops-proxy. Certs auto-provision via Let's Encrypt and may take a few minutes.

High latency (>1s p50): Keepalive to the upstreams has likely been lost. Redeploy with mise run fly-deploy. Before the keepalive change (April 2026), per-request TLS handshakes through the WireGuard tunnel caused 35s+ p50 at >1 req/s.
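To get a rough p50 figure before and after a redeploy, request times can be sampled with curl. This is a sketch only: the function names are hypothetical, the URL is supplied by the caller, and the requests are sequential, so it will not reproduce load above 1 req/s.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the median of newline-separated numbers read from stdin
# (the upper middle value when the count is even).
median() {
    LC_ALL=C sort -n | awk '{ v[NR] = $1 } END { if (NR) print v[int(NR / 2) + 1] }'
}

# Time N sequential requests and print the median total time in seconds.
# $1 = URL to probe (caller-supplied), $2 = request count (default 20).
p50_latency() {
    local url="$1" n="${2:-20}" i
    for ((i = 0; i < n; i++)); do
        curl -s -o /dev/null -w '%{time_total}\n' "$url"
    done | median
}
```

For example, `p50_latency https://<domain>/ 30` prints one number. A value well above 1 would match the symptom described above.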