Fix 502 errors during Fly.io proxy deploys (#131)

## Summary

- Health check (`/healthz`) now returns 503 until Tailscale is connected
- `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds
- Fly.io keeps the old machine serving traffic during the ~7s startup window

Previously, nginx passed the health check immediately and Fly.io routed traffic to the new machine before MagicDNS was available, causing upstream DNS timeouts and 502s on every request until Tailscale connected.

## Deployment and Testing

- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard)
- [ ] Confirm health check transitions from 503 → 200 in `fly logs`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131
2026-02-09 11:07:36 -08:00
commit bd61da4f85
parent 3415cad38c
3 changed files with 9 additions and 4 deletions
--- a/docs/changelog.d/fix-deploy-healthcheck-race.bugfix.md
+++ b/docs/changelog.d/fix-deploy-healthcheck-race.bugfix.md
@@ -0,0 +1 @@
+Fix 502 errors during Fly.io proxy deploys by deferring health check until Tailscale is connected.
--- a/fly/nginx.conf
+++ b/fly/nginx.conf
@@ -76,6 +76,9 @@ http {
        listen 8080 default_server;
        location /healthz {
+            if (!-f /tmp/tailscale-ready) {
+                return 503 "starting\n";
+            }
            return 200 "ok\n";
        }
--- a/fly/start.sh
+++ b/fly/start.sh
@@ -1,9 +1,9 @@
 #!/bin/sh
 set -e
-# Start nginx immediately so port 8080 is bound before Fly's deploy checks.
+# Start nginx immediately so port 8080 is bound (avoids connection refused).
-# Upstream DNS resolution is deferred via resolver + variable in nginx.conf,
+# Health check returns 503 until /tmp/tailscale-ready exists, so Fly.io
-# so nginx starts cleanly even before Tailscale connects.
+# keeps the old machine serving traffic until Tailscale connects.
 nginx -g "daemon off;" &
 NGINX_PID=$!
 echo "Nginx started (waiting for Tailscale before proxying)"
@@ -16,8 +16,9 @@ sleep 2
 # Authenticate and join tailnet
 tailscale up --authkey="${TS_AUTHKEY}" --hostname=flyio-proxy
-# Wait for tailscale to be ready
+# Wait for tailscale to be ready, then signal nginx health check
 until tailscale status > /dev/null 2>&1; do sleep 1; done
+touch /tmp/tailscale-ready
 echo "Tailscale connected"
 # Start Alloy for observability (logs → Loki, metrics → Prometheus)
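
For reviewers who want to reason about the sentinel gate without deploying, the following is a minimal local sketch of the same decision nginx makes in `location /healthz`: report 503 while the sentinel file is absent and 200 once it exists. The `healthz_status` function and the temporary directory are hypothetical and are not part of this PR; only the sentinel-file idea and the two status codes come from the diff.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): mirrors the nginx /healthz logic,
# which returns 503 until the sentinel file exists and 200 afterwards.
healthz_status() {
    sentinel="$1"
    if [ -f "$sentinel" ]; then
        echo 200
    else
        echo 503
    fi
}

# Demo against a throwaway directory instead of the real /tmp/tailscale-ready.
demo_dir="$(mktemp -d)"
healthz_status "$demo_dir/tailscale-ready"   # sentinel absent
touch "$demo_dir/tailscale-ready"            # what start.sh does once Tailscale is up
healthz_status "$demo_dir/tailscale-ready"   # sentinel present
rm -rf "$demo_dir"
```

Against a running machine, the equivalent manual check would be something like `curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/healthz`, assuming the command is run from inside the machine where nginx listens on port 8080.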