Switch Fly proxy to upstream keepalive pools (#337)

## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuse TLS connections through the Tailscale tunnel instead of performing a new handshake for every request
- Add `mise run fly-reload` to reload the nginx config without a full redeploy (the reload re-resolves upstream DNS)
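
As a rough illustration of the change described above, the fragment below contrasts the two `proxy_pass` styles. It is a minimal sketch rather than the configuration from this PR: the hostname `app.example.ts.net`, the upstream name `app_backend`, the resolver address, the listen port, and the pool size of 16 are all hypothetical placeholders.

```nginx
# Fragment for the http {} context. All names and numbers are placeholders.

# Before (variable-based): a variable in proxy_pass makes nginx resolve the
# name at request time through `resolver`, and without an upstream keepalive
# pool each request opens a new TLS connection to the backend.
#
#   resolver 100.100.100.100 valid=10s;
#   location / {
#       set $backend "https://app.example.ts.net";
#       proxy_pass $backend;
#   }

# After (static upstream): the name is resolved once, when the config loads.
upstream app_backend {
    server app.example.ts.net:443;
    keepalive 16;                    # idle connections cached per worker process
}

server {
    listen 8080;

    location / {
        proxy_pass https://app_backend;

        # Both settings are required for upstream keepalive to take effect.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Host and SNI otherwise default to the upstream block's name.
        proxy_set_header Host app.example.ts.net;
        proxy_ssl_server_name on;
        proxy_ssl_name app.example.ts.net;
    }
}
```

With a layout like this, `nginx -s reload` (or whatever `mise run fly-reload` wraps) starts new worker processes with the new configuration, which is when the upstream hostname is resolved again.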

## Trade-off

DNS is now resolved when nginx loads its config, not per request. If the Tailscale Ingress pods get new IPs (for example after a restart or reschedule), the proxy keeps using the stale addresses until `mise run fly-reload` is run. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #337
Erich Blume 2026-04-17 16:39:52 -07:00
commit fe0e913963
12 changed files with 229 additions and 102 deletions

@@ -373,6 +373,66 @@ groups:
                    type: and
              refId: C
  - orgId: 1
    name: flyio-proxy-health
    folder: Infrastructure Alerts
    interval: 30s
    rules:
      - uid: flyio-upstream-unreachable
        title: FlyioUpstreamUnreachable
        condition: C
        for: 3m
        noDataState: OK
        execErrState: Alerting
        annotations:
          summary: >-
            Fly.io proxy returning elevated 502s — upstream DNS may be stale. Run: mise run fly-reload
          runbook_url: https://docs.eblu.me/how-to/operations/manage-flyio-proxy
        labels:
          severity: warning
          service: flyio-proxy
        data:
          - refId: A
            datasourceUid: prometheus
            relativeTimeRange:
              from: 300
              to: 0
            model:
              expr: >-
                sum(rate(flyio_nginx_http_requests_total{instance="flyio-proxy",status="502"}[5m]))
                / sum(rate(flyio_nginx_http_requests_total{instance="flyio-proxy"}[5m]))
                > 0.5
              interval: ""
              refId: A
          - refId: B
            datasourceUid: "__expr__"
            relativeTimeRange:
              from: 0
              to: 0
            model:
              type: reduce
              expression: A
              reducer: last
              settings:
                mode: dropNN
              refId: B
          - refId: C
            datasourceUid: "__expr__"
            relativeTimeRange:
              from: 0
              to: 0
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params:
                      - 0
                  operator:
                    type: and
              refId: C
templates:
  - orgId: 1
    name: ntfy-infra

@@ -21,5 +21,9 @@ spec:
    pod:
      tailscaleContainer:
        image: docker.io/tailscale/tailscale:v1.94.2
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
      tailscaleInitContainer:
        image: docker.io/tailscale/tailscale:v1.94.2