Switch Fly proxy to upstream keepalive pools (#337)

## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuse TLS connections through the Tailscale tunnel instead of handshaking per request
- Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS)
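
For readers unfamiliar with the pattern, the first two bullets can be sketched roughly as follows. This is a minimal illustration, not the config in this PR: the upstream name `svc_backend`, the hostname `svc.example.ts.net`, and the pool size are placeholders, and the blocks are assumed to sit inside the `http` context.

```nginx
# Hypothetical names throughout: `svc_backend` and `svc.example.ts.net`
# are placeholders, not the values used in this repository.

# Before (sketch): a variable in proxy_pass makes nginx resolve the name
# at request time via `resolver`, and each request opens a new connection.
#
#   resolver 100.100.100.100 valid=30s;   # Tailscale MagicDNS resolver
#   set $backend "svc.example.ts.net";
#   proxy_pass https://$backend;

# After (sketch): a static upstream is resolved once, at config load,
# and idle connections are cached for reuse.
upstream svc_backend {
    server svc.example.ts.net:443;
    keepalive 16;   # max idle connections kept open per worker process
}

server {
    listen 8080;

    location / {
        proxy_pass https://svc_backend;

        # The nginx docs call for both of these when using upstream keepalive.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # proxy_pass now names the upstream group, so Host and SNI are set explicitly.
        proxy_set_header Host svc.example.ts.net;
        proxy_ssl_server_name on;
        proxy_ssl_name svc.example.ts.net;
    }
}
```

Because the `server` line inside `upstream` is resolved only when the config is loaded, this sketch has the same stale-DNS behavior described under Trade-off.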

## Trade-off

DNS is now resolved when nginx loads its config, not per request. If Tailscale Ingress pods get new IPs (restart, reschedule), nginx keeps using the stale addresses until `mise run fly-reload` is run. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #337
Erich Blume 2026-04-17 16:39:52 -07:00
commit fe0e913963
12 changed files with 229 additions and 102 deletions

@@ -1,7 +1,7 @@
---
title: Manage Fly.io Proxy
modified: 2026-02-08
last-reviewed: 2026-03-07
modified: 2026-04-17
last-reviewed: 2026-04-17
tags:
- how-to
- fly-io
@@ -23,6 +23,16 @@ mise run fly-deploy
Pushes to `fly/` on main also trigger automatic deployment via the Forgejo CI workflow.
## Reload Nginx (Re-resolve Upstream DNS)
Nginx uses `upstream` blocks with keepalive connection pools. DNS is resolved at config load. If Tailscale Ingress pods get new IPs (restart, reschedule, minikube restart), reload nginx to re-resolve without a full redeploy:
```bash
mise run fly-reload
```
A Grafana alert fires when upstreams are unreachable, prompting this action. A full `fly-deploy` also re-resolves DNS (it replaces the container).
## Add a New Public Service
See [[expose-service-publicly#Per-service setup]] for the full walkthrough. In short:
@@ -78,12 +88,16 @@ The auth key expires every 90 days. To rotate:
## Troubleshooting
**502 Bad Gateway**: Check `fly logs` for nginx upstream errors. Verify the backend Tailscale service is running (`tailscale status` from inside the container via `fly ssh console`).
**502 Bad Gateway after Tailscale Ingress restart**: Upstream DNS is stale. Run `mise run fly-reload` to re-resolve. This is the most common cause of 502s.
**502 Bad Gateway on fresh deploy**: MagicDNS may not be ready when nginx starts. The `start.sh` script polls `nslookup` before launching nginx, but if it still fails, check that `tailscale status` is healthy inside the container.
**Health check failing**: `fly ssh console -a blumeops-proxy` then `curl localhost:8080/healthz` to test locally.
**TLS errors on custom domain**: Check cert status with `fly certs show <domain> -a blumeops-proxy`. Certs auto-provision via Let's Encrypt and may take a few minutes.
**High latency (>1s p50)**: Keepalive connections have likely been lost; redeploy with `mise run fly-deploy`. Before the keepalive change (April 2026), per-request TLS handshakes through the WireGuard tunnel caused 35s+ p50 at >1 req/s.
## Related
- [[flyio-proxy]] - Service reference card