Zero-downtime Fly.io deploys #132

Merged
eblume merged 1 commit from fix/zero-downtime-deploy into main 2026-02-09 11:34:20 -08:00

Summary

  • Start nginx after Tailscale connects (community best practice for Tailscale sidecars)
  • Switch to bluegreen deploy strategy — old machine serves until new one is healthy
  • Replace top-level [checks] with [[http_service.checks]]; only service-level checks gate traffic routing (confirmed by Fly.io staff: https://community.fly.io/t/clarifying-the-types-of-health-checks/20379)
  • Remove sentinel file and nginx if-check (no longer needed)

Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable.
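
For illustration, the `fly.toml` side of this change might look roughly like the fragment below. This is a sketch, not the actual diff from this PR: the `[deploy]` strategy and the `[[http_service.checks]]` table come from the summary above, while the port, health path, and timing values are assumptions chosen only to make the example complete.

```toml
# Sketch only: internal_port, path, and all timing values are assumed,
# not taken from this PR.

[deploy]
  # Old machine keeps serving until the new one passes its checks.
  strategy = "bluegreen"

[http_service]
  internal_port = 8080        # assumed nginx listen port
  force_https = true

  # Service-level check: per the summary, this is the kind of check
  # that gates traffic routing, unlike a top-level [checks] table.
  [[http_service.checks]]
    method = "GET"
    path = "/healthz"         # hypothetical health endpoint
    interval = "15s"
    timeout = "5s"
    grace_period = "10s"      # assumed allowance for Tailscale to connect
```

The grace period matters here because nginx now starts only after Tailscale connects, so the check should not begin failing the machine before that has had time to happen.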

Deployment and Testing

  • Merge and fly deploy from fly/ directory
  • Verify deploy completes with zero 502s (watch fly logs and Grafana docs-apm)
  • Confirm fly checks list shows the new service-level check passing
Three changes to eliminate 502s during proxy deploys:

1. Start nginx after Tailscale connects (not before) so MagicDNS is
   always available when the first request arrives. This is the
   community-recommended pattern for Tailscale sidecars on Fly.io.

2. Switch deploy strategy to bluegreen — the old machine keeps serving
   traffic until the new one passes health checks, then Fly.io cuts
   over. Rolling deploys with a single machine always cause downtime.

3. Replace top-level [checks] with [[http_service.checks]]. Top-level
   checks only monitor; they don't gate traffic routing. Service-level
   checks tell the Fly Proxy to hold traffic until the app is ready.

The sentinel file (/tmp/tailscale-ready) and nginx if-check are removed
since nginx no longer starts before Tailscale.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eblume merged commit 959b6842bc into main 2026-02-09 11:34:20 -08:00
eblume referenced this pull request from a commit 2026-02-09 11:34:21 -08:00
Reference
eblume/blumeops!132