Switch Fly proxy to upstream keepalive pools (#337)

## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuses TLS connections through the Tailscale tunnel instead of handshaking per request
- Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS)

## Trade-off

DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #337
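
For orientation, the change described in the summary can be sketched roughly as follows. This is a minimal illustration and not the configuration from this commit: the upstream name `app_backend`, the hostname `app.example-tailnet.ts.net`, and the pool size are hypothetical, and the fragment is assumed to sit inside the `http` context.

```nginx
# Before (per-request DNS): a variable in proxy_pass makes nginx resolve the
# name at request time via `resolver`, with no upstream connection pool.
#   resolver 100.100.100.100 valid=30s;          # MagicDNS resolver address
#   set $backend "app.example-tailnet.ts.net";   # hypothetical hostname
#   proxy_pass https://$backend;

# After (static upstream + keepalive pool): the name is resolved once, when
# the config is loaded or reloaded.
upstream app_backend {                      # hypothetical upstream name
    server app.example-tailnet.ts.net:443;  # hypothetical Tailscale Ingress host
    keepalive 16;                           # idle connections cached per worker (illustrative value)
}

server {
    listen 8080;

    location / {
        proxy_pass https://app_backend;

        # Upstream keepalive needs HTTP/1.1 and a cleared Connection header.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # proxy_pass now names the upstream group, so set Host and SNI explicitly.
        proxy_set_header Host app.example-tailnet.ts.net;
        proxy_ssl_server_name on;
        proxy_ssl_name app.example-tailnet.ts.net;
    }
}
```

Because the `server` address inside `upstream` is resolved only at load time, a reload is what picks up new Tailscale Ingress IPs, which is the trade-off noted above.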
2026-04-17 16:39:52 -07:00
commit fe0e913963
parent 54b1cee950
12 changed files with 229 additions and 102 deletions
--- a/docs/how-to/operations/manage-flyio-proxy.md
+++ b/docs/how-to/operations/manage-flyio-proxy.md
@@ -1,7 +1,7 @@
 ---
 title: Manage Fly.io Proxy
-modified: 2026-02-08
-last-reviewed: 2026-03-07
+modified: 2026-04-17
+last-reviewed: 2026-04-17
 tags:
  - how-to
  - fly-io
@@ -23,6 +23,16 @@ mise run fly-deploy

 Pushes to `fly/` on main also trigger automatic deployment via the Forgejo CI workflow.

+## Reload Nginx (Re-resolve Upstream DNS)
+
+Nginx uses `upstream` blocks with keepalive connection pools. DNS is resolved at config load. If Tailscale Ingress pods get new IPs (restart, reschedule, minikube restart), reload nginx to re-resolve without a full redeploy:
+
+```bash
+mise run fly-reload
+```
+
+A Grafana alert fires when upstreams are unreachable, prompting this action. A full `fly-deploy` also re-resolves DNS (it replaces the container).
+
 ## Add a New Public Service

 See [[expose-service-publicly#Per-service setup]] for the full walkthrough. In short:
@@ -78,12 +88,16 @@ The auth key expires every 90 days. To rotate:

 ## Troubleshooting

-**502 Bad Gateway**: Check `fly logs` for nginx upstream errors. Verify the backend Tailscale service is running (`tailscale status` from inside the container via `fly ssh console`).
+**502 Bad Gateway after Tailscale Ingress restart**: Upstream DNS is stale. Run `mise run fly-reload` to re-resolve. This is the most common cause of 502s.
+
+**502 Bad Gateway on fresh deploy**: MagicDNS may not be ready when nginx starts. The `start.sh` script polls `nslookup` before launching nginx, but if it still fails, check that `tailscale status` is healthy inside the container.

 **Health check failing**: `fly ssh console -a blumeops-proxy` then `curl localhost:8080/healthz` to test locally.

 **TLS errors on custom domain**: Check cert status with `fly certs show <domain> -a blumeops-proxy`. Certs auto-provision via Let's Encrypt and may take a few minutes.

+**High latency (>1s p50)**: Likely lost keepalive — redeploy with `mise run fly-deploy`. Before the keepalive change (April 2026), per-request TLS handshakes through the WireGuard tunnel caused 35s+ p50 at >1 req/s.
+
 ## Related

 - [[flyio-proxy]] - Service reference card