Zero-downtime Fly.io deploys: bluegreen + startup reorder

Three changes to eliminate 502s during proxy deploys:

1. Start nginx after Tailscale connects (not before) so MagicDNS is always available when the first request arrives. This is the community-recommended pattern for Tailscale sidecars on Fly.io.

2. Switch deploy strategy to bluegreen: the old machine keeps serving traffic until the new one passes health checks, then Fly.io cuts over. Rolling deploys with a single machine always cause downtime.

3. Replace top-level [checks] with [[http_service.checks]]. Top-level checks only monitor; they don't gate traffic routing. Service-level checks tell the Fly Proxy to hold traffic until the app is ready.

The sentinel file (/tmp/tailscale-ready) and nginx if-check are removed since nginx no longer starts before Tailscale.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
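
As a rough illustration of changes 2 and 3 in the message above, a fly.toml fragment along the following lines would select the bluegreen strategy and attach a check to the HTTP service rather than declaring it at the top level. The actual fly.toml from this commit is not shown here, so the internal port, check path, and timing values are assumptions chosen only for the sketch.

```toml
# Sketch only: values below are illustrative assumptions, not the committed config.

[deploy]
  # Keep the old machine serving until the new one passes its checks.
  strategy = "bluegreen"

[http_service]
  internal_port = 8080        # assumed nginx listen port
  force_https = true

  # Service-level check: the Fly Proxy uses it to decide when to route traffic.
  [[http_service.checks]]
    method = "GET"
    path = "/health"          # hypothetical health endpoint
    interval = "10s"
    timeout = "2s"
    grace_period = "10s"      # assumed allowance for Tailscale to connect first
```

With the check under `http_service`, a new machine that has not yet brought up Tailscale and nginx would fail the check and should not receive traffic until it passes.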
2026-02-09 11:31:52 -08:00 · 4bbe4e7c20
commit 4bbe4e7c20
parent bd61da4f85
4 changed files with 20 additions and 24 deletions
--- a/docs/changelog.d/fix-zero-downtime-deploy.infra.md
+++ b/docs/changelog.d/fix-zero-downtime-deploy.infra.md
@@ -0,0 +1 @@
+Eliminate 502 errors during Fly.io proxy deploys by starting nginx after Tailscale, switching to bluegreen deploys, and using service-level health checks for traffic gating.