Fix macOS heph daemon restart bootout→bootstrap race (5: Input/output error) #13

Merged
eblume merged 1 commit from feature/daemon-restart-race into main 2026-06-08 13:44:03 -07:00
Owner

Summary

Follow-up to #12. heph daemon restart on macOS intermittently failed with launchctl bootstrap failed: 5: Input/output error. The cause: restart bootstrapped immediately after bootout, but launchctl bootout is asynchronous — launchd may still be killing/reaping the job and removing its label from the gui/$uid domain when the command returns. Bootstrapping into that transitional domain returns a generic EIO. Whether it races depends on how fast hephd (sync client + SQLite store/lock + a supervised heph-quickadd child) shuts down, so it surfaced intermittently.

Fix (launchd path only — systemd's restart is already a synchronous transaction):

  • Wait for the label to actually clear (wait_until_unloaded: poll launchctl print, bounded to 5s) before re-bootstrapping.
  • Retry the bootstrap (launchd_bootstrap: up to 5 attempts, 200ms apart) to cover the residual settle window. start shares this helper too.
  • Skip the dance when the plist is unchanged — the common binary-upgrade restart now uses launchctl kickstart -k to restart the loaded job atomically, with no bootout/bootstrap and no race. Full reload (bootout+bootstrap) is reserved for genuine config changes, where launchd must re-read the plist.
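The wait-then-retry steps above can be sketched roughly as follows. This is an illustrative outline, not the code from this PR: the closure-based signatures, the 100ms poll interval, and the helper names `retry`, `label_is_loaded`, `bootstrap_once`, and `reload_after_bootout` are assumptions made for the example. Only the 5s bound, the 5 attempts, and the 200ms spacing come from the description. The two timing helpers take closures so they can be exercised without calling `launchctl`.

```rust
use std::process::Command;
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Poll `is_loaded` until it returns false or `timeout` elapses.
/// Returns true if the job was seen unloaded before the deadline.
fn wait_until_unloaded(
    mut is_loaded: impl FnMut() -> bool,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if !is_loaded() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        sleep(interval);
    }
}

/// Run `op` up to `attempts` times, sleeping `delay` after each failure.
/// Returns the first success, or the last error if every attempt fails.
fn retry<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    attempts: u32,
    delay: Duration,
) -> Result<T, E> {
    let mut result = op();
    for _ in 1..attempts {
        if result.is_ok() {
            break;
        }
        sleep(delay);
        result = op();
    }
    result
}

/// `launchctl print gui/<uid>/<label>` exits non-zero once the label is gone.
fn label_is_loaded(uid: u32, label: &str) -> bool {
    let target = format!("gui/{uid}/{label}");
    Command::new("launchctl")
        .args(["print", target.as_str()])
        .output()
        .map(|out| out.status.success())
        .unwrap_or(false)
}

/// One `launchctl bootstrap gui/<uid> <plist>` attempt; Err carries stderr.
fn bootstrap_once(uid: u32, plist: &str) -> Result<(), String> {
    let domain = format!("gui/{uid}");
    let out = Command::new("launchctl")
        .args(["bootstrap", domain.as_str(), plist])
        .output()
        .map_err(|e| e.to_string())?;
    if out.status.success() {
        Ok(())
    } else {
        Err(String::from_utf8_lossy(&out.stderr).trim().to_string())
    }
}

/// Reload path after `bootout`: wait (5s bound), then bootstrap with retries.
fn reload_after_bootout(uid: u32, label: &str, plist: &str) -> Result<(), String> {
    // Assumption: proceed even if the wait times out; the retries cover it.
    let _ = wait_until_unloaded(
        || label_is_loaded(uid, label),
        Duration::from_secs(5),
        Duration::from_millis(100), // poll interval is a guess, not from the PR
    );
    retry(|| bootstrap_once(uid, plist), 5, Duration::from_millis(200))
}
```

Keeping the wait and the retry as separate steps mirrors the description: the poll handles the common case where the label clears within the bound, and the retry absorbs whatever settle window remains afterwards.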

Testing

  • [x] cargo test -p heph (existing 12 service tests green; clippy/fmt clean)
  • [ ] Manual on macOS: repeated heph daemon restart — kickstart path (plist unchanged) and reload path (config change). The launchctl-calling helpers aren't unit-testable.

Notes

  • kickstart -k restarts the loaded job definition, so it intentionally does not pick up an edited plist — which is exactly why it's gated on "plist unchanged."
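The "plist unchanged" gate could look something like the sketch below. How the PR actually detects an unchanged plist is not stated here, so this assumes a byte-for-byte comparison between the freshly rendered plist and the file already on disk, and it treats a missing file as a reload. `RestartPlan`, `plan_restart`, and `kickstart_args` are hypothetical names for illustration.

```rust
use std::fs;
use std::io;
use std::path::Path;

#[derive(Debug, PartialEq, Eq)]
enum RestartPlan {
    /// Plist identical: restart the loaded job with `launchctl kickstart -k`.
    Kickstart,
    /// Plist differs or is missing: bootout, rewrite the plist, bootstrap.
    Reload,
}

/// Compare the plist on disk with the bytes that would be written now.
/// Assumption: a byte-for-byte match means launchd need not re-read it.
fn plan_restart(installed_plist: &Path, rendered: &[u8]) -> io::Result<RestartPlan> {
    match fs::read(installed_plist) {
        Ok(current) if current.as_slice() == rendered => Ok(RestartPlan::Kickstart),
        Ok(_) => Ok(RestartPlan::Reload),
        // Assumption: no plist on disk means there is nothing to kickstart.
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(RestartPlan::Reload),
        Err(e) => Err(e),
    }
}

/// Arguments for `launchctl kickstart -k gui/<uid>/<label>`.
fn kickstart_args(uid: u32, label: &str) -> Vec<String> {
    vec!["kickstart".into(), "-k".into(), format!("gui/{uid}/{label}")]
}
```

Because `kickstart -k` reuses the loaded job definition, any difference in the rendered plist has to fall through to `Reload`; erring toward the reload path is the safe default when the comparison cannot be made.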

🤖 Generated with Claude Code

fix(heph): make macOS heph daemon restart race-free
All checks were successful
Build / validate (pull_request) Successful in 8m39s
f6b27414a8
`restart` bootstrapped immediately after `bootout`, but `launchctl bootout` is
asynchronous: launchd may still be killing/reaping the job and removing its
label when the command returns. Bootstrapping into that transitional domain
fails with a generic `5: Input/output error`, intermittently — the odds depend
on how fast hephd (sync client + SQLite + a heph-quickadd child) shuts down.

- Wait for the label to actually clear (poll `launchctl print`, bounded) before
  re-bootstrapping, and retry the bootstrap to cover the residual settle window.
- When the plist is unchanged (the common binary-upgrade restart), use
  `launchctl kickstart -k` to restart the loaded job atomically — no
  bootout/bootstrap, no race. The full reload path is reserved for genuine
  config changes, where launchd must re-read the plist.

Start's bootstrap shares the same retry helper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Author
Owner

Verified live on gilbert (macOS launchd), built from this branch (hephd stays the installed 1.2.3):

  • Kickstart path (bare heph daemon restart, plist unchanged): 3 consecutive restarts, all rc=0, fresh pid each time, no 5: Input/output error.
  • Reload path (the one that raced): restart --self-update-interval-secs 300 then --self-update-interval-secs 600 — both rc=0, plist re-read (interval 600→300→600), daemon running each time.
  • Post-restart: socket responds (heph next), daemon args retain all spoke flags (--hub-url/--oidc-*/--self-update --self-update-interval-secs 600), log shows clean self-update enabled interval_secs=600 … listening.

A single run of the old code reproduced the EIO; the new code restarted cleanly 5/5 times across both paths (3 kickstart, 2 reload).

eblume merged commit b82264892f into main 2026-06-08 13:44:03 -07:00
Reference
eblume/hephaestus!13