Erich Blume 14ca0160ba Migrate devpi from minikube to indri (launchd) (#341)

## Summary

Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). This change moves devpi to indri, where it runs as a native LaunchAgent.

## What changed

- **New ansible role** `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run using the root password from 1Password, and runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. The venv bootstraps from upstream PyPI (so devpi can install itself on a fresh box).
- **Caddy**: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`.
- **Playbook**: `indri.yml` gains pre_tasks for the root password and the new role.
- **service-versions.yaml**: devpi flipped from `type: argocd` to `type: ansible`.
- **ArgoCD**: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted.
- **Docs**: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list.

## Already deployed (live on indri)

- Service running: `launchctl list mcquack.eblume.devpi` → PID 53888
- `curl https://pypi.ops.eblu.me/+api` returns 200 ✅
- `mise run docs-mikado` works again ✅
- 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/`
- Minikube namespace and PVC fully reclaimed

## Test plan

- [ ] `mise run services-check` (after merge)
- [ ] CI workflows that use devpi succeed
- [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines)

## Context

This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain.

Reviewed-on: #341

2026-04-29 13:38:36 -07:00

Restart Indri

How to safely shut down and restart indri, the primary BlumeOps server.

Prerequisites

SSH access to indri
Tailscale connected

Shutdown Procedure

1. Stop Kubernetes Gracefully

Minikube runs on the Docker driver, so stopping it cleanly ensures pods terminate gracefully and persistent volumes are properly unmounted.

ssh indri 'minikube stop'

This may take a minute as pods receive termination signals. You can verify it stopped:

ssh indri 'minikube status'
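
The status check above can be turned into a small predicate for scripting. This is a minimal sketch, assuming that `minikube status` prints a line of the form `host: Stopped` once the cluster is down; the function name `minikube_is_stopped` is hypothetical.

```shell
# Returns 0 when the status text on stdin reports the minikube host as stopped.
# Assumption: `minikube status` prints a line of the form "host: Stopped".
minikube_is_stopped() {
  grep -q '^host: Stopped'
}

# Hypothetical usage against indri (not executed here):
#   ssh indri 'minikube status' | minikube_is_stopped && echo "minikube is stopped"
```

Reading the status text from stdin keeps the check independent of how the status is fetched, so it can be tried against saved output before pointing it at indri.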

2. Stop Native Services (Optional)

Native services managed by launchd will stop automatically during macOS shutdown. However, if you want to stop them explicitly first:

# LaunchAgent services
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist'
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.caddy.plist'
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.zot.plist'
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.devpi.plist'  # see [[devpi-on-indri]]
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.jellyfin.plist'
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.alloy.plist'
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.borgmatic.plist'
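
If typing each command is tedious, the same list can be generated with a loop. The sketch below only prints the commands so they can be reviewed first; the variable `AGENTS` and the function `print_unload_commands` are hypothetical names, and the plist names are copied from the list above.

```shell
# Plist basenames copied from the list above; note jellyfin has no "eblume" segment.
AGENTS="mcquack.eblume.forgejo mcquack.eblume.caddy mcquack.eblume.zot mcquack.eblume.devpi mcquack.jellyfin mcquack.eblume.alloy mcquack.eblume.borgmatic"

# Print one unload command per agent (dry run; nothing is executed).
print_unload_commands() {
  for agent in $AGENTS; do
    printf "ssh indri 'launchctl unload ~/Library/LaunchAgents/%s.plist'\n" "$agent"
  done
}

# Hypothetical usage:
#   print_unload_commands        # review the commands
#   print_unload_commands | sh   # run them
```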

3. Quit GUI Applications

These apps don't autostart and should be quit cleanly before reboot:

Docker Desktop - Quit from menubar or: ssh indri 'osascript -e "quit app \"Docker\""'
Amphetamine - Quit from menubar (prevents sleep; will need restart)
AutoMounter - Quit from menubar (mounts sifaka SMB shares)

4. Reboot

ssh indri 'sudo shutdown -r now'

Or if you're at the console, use the Apple menu.

Startup Procedure

After indri boots, most services recover automatically. Only a few things need manual attention.

What autostarts: Docker Desktop and all mcquack LaunchAgent services (Forgejo, Caddy, Zot, Devpi, Jellyfin, Alloy, Borgmatic, metrics collectors).

What needs manual action: Amphetamine, AutoMounter, and minikube (including its Tailscale serve port).

Warning: Do NOT run minikube delete: it destroys all PersistentVolumes and etcd state, and requires a full DR rebuild. Use minikube stop / minikube start instead. If minikube is stuck, see Troubleshooting: CNI Conflict After Unclean Shutdown below. For a full cluster rebuild, see rebuild-minikube-cluster.

0. Dismiss macOS Permission Dialogs

After a cold boot, the first inbound Tailscale SSH connection to indri triggers a macOS GUI permission dialog from tailscaled. This blocks the SSH session (and anything downstream like ansible) until dismissed at the console. You must be logged in to indri (via Screen Sharing or physically) to approve it before running any remote commands.

1. Log In and Start GUI Apps

App	Purpose	Launch Method
Amphetamine	Prevents sleep	Spotlight or App Store apps
AutoMounter	Mounts sifaka SMB shares to `/Volumes/`	Spotlight or App Store apps

Docker Desktop autostarts on login. Wait for it to finish starting (whale icon in menubar stops animating) before proceeding.

2. Verify Sifaka Mounts

AutoMounter should automatically mount the sifaka shares. Verify:

ssh indri 'ls /Volumes/'

You should see: allisonflix, backups, music, photos, torrents (or similar).

If mounts are missing, open AutoMounter and trigger a reconnect.
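
The mount check can also be scripted. This is a rough sketch, assuming the share names listed above are the complete expected set (the text says "or similar", so adjust as needed); `EXPECTED_SHARES` and `missing_shares` are hypothetical names.

```shell
# Share names taken from the expected listing above; adjust if yours differ.
EXPECTED_SHARES="allisonflix backups music photos torrents"

# Reads a directory listing on stdin (one name per line) and prints
# each expected share that does not appear in it.
missing_shares() {
  listing=$(cat)
  for share in $EXPECTED_SHARES; do
    printf '%s\n' "$listing" | grep -qxF "$share" || printf '%s\n' "$share"
  done
}

# Hypothetical usage (not executed here):
#   ssh indri 'ls /Volumes/' | missing_shares
```

Empty output means every expected share is mounted.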

3. Fix Minikube Remote Access

Minikube uses the Docker driver, which assigns a random API server port on each start. After a reboot, the Tailscale serve proxy (k8s.tail8d86e.ts.net) will still point to the old port, breaking remote kubectl access.

Run the minikube ansible role to detect the new port and update Tailscale serve:

mise run provision-indri -- --tags minikube

Note: Do NOT run the full mise run provision-indri without tags during startup: the forgejo_actions_secrets role will time out because the Forgejo API routes through Caddy → k8s, which isn't up yet. Use --tags minikube (or --tags minikube,minikube_metrics) to target just the minikube role.

This will:

Start minikube if it hasn't started yet
Detect the current API server port
Update tailscale serve to forward to the correct port
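
How the role detects the port is not shown here. One plausible way to inspect it by hand, assuming the default container name `minikube` and the default in-container API server port 8443 (both assumptions, as is the helper name `host_port_from_mapping`), is to parse the output of `docker port`:

```shell
# Extracts the host port from `docker port`-style output such as "127.0.0.1:32771".
# Only the first line is used, since Docker may print one line per address family.
host_port_from_mapping() {
  head -n 1 | sed 's/.*://'
}

# Hypothetical usage on indri (container name and port 8443 are assumptions):
#   docker port minikube 8443 | host_port_from_mapping
```

This is only for manual inspection; the ansible role remains the supported way to update Tailscale serve.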

You can verify remote access works:

kubectl --context=minikube-indri get nodes

4. Run Health Check

Once everything is up, verify all services:

mise run services-check

All checks should pass. If any fail, see troubleshooting.

Troubleshooting: CNI Conflict After Unclean Shutdown

After a power loss or unclean reboot, minikube may come up with broken pod networking. The symptom is that new pods cannot reach CoreDNS — services crash-loop with DNS errors (EAI_AGAIN, connection timed out; no servers could be reached) or fail liveness probes because their event loops hang on blocked network calls.

Existing pods that were restarted (not recreated) may appear healthy because the kubelet reuses their cached network namespaces.

Cause

During minikube recovery from a bad state, the CRI-O / Docker networking bootstrap can regenerate a default CNI config file (1-k8s.conflist) that conflicts with kindnet's config (10-kindnet.conflist). Since 1- sorts before 10-, the stale bridge+firewall config takes precedence, and new pods get attached to a different network topology than existing pods.
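
The ordering claim is easy to confirm locally. The sketch below sorts the two file names byte-wise, which is the assumption made here about how the runtime orders its config files; `cni_load_order` is a hypothetical helper name.

```shell
# Sort the two CNI config names in byte-wise lexicographic order.
# LC_ALL=C matters: some locales ignore punctuation and would order these differently.
cni_load_order() {
  printf '%s\n' 10-kindnet.conflist 1-k8s.conflist | LC_ALL=C sort
}

cni_load_order | head -n 1   # prints: 1-k8s.conflist
```

The hyphen (0x2D) compares lower than the digit `0` (0x30), which is why `1-` wins over `10-`.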

Diagnosis

1. Check if new pods can resolve DNS:

kubectl --context=minikube-indri run dns-test --image=alpine:3.21 --restart=Never \
  --command -- sh -c 'nslookup kubernetes.default.svc.cluster.local'
sleep 10
kubectl --context=minikube-indri logs dns-test
kubectl --context=minikube-indri delete pod dns-test

If this shows connection timed out; no servers could be reached, pod networking is broken.

2. Check for conflicting CNI configs:

ssh indri 'minikube ssh "ls -la /etc/cni/net.d/"'

You should see only 10-kindnet.conflist (plus 200-loopback.conf and disabled .mk_disabled files). If 1-k8s.conflist or any other active config exists alongside 10-kindnet.conflist, that's the conflict.
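
To spot a conflict without reading the listing by eye, the expected entries can be filtered out. This is a minimal sketch, assuming the listing arrives one file name per line (for example from `ls -1`); `unexpected_cni_configs` is a hypothetical name.

```shell
# Reads file names on stdin and prints any active CNI config other than the
# expected 10-kindnet.conflist, 200-loopback.conf, and *.mk_disabled files.
# Carriage returns are stripped in case the remote shell emits CRLF line endings.
unexpected_cni_configs() {
  tr -d '\r' | grep -v -e '^$' -e '\.mk_disabled$' \
    -e '^10-kindnet\.conflist$' -e '^200-loopback\.conf$'
}

# Hypothetical usage (not executed here):
#   ssh indri 'minikube ssh "ls -1 /etc/cni/net.d/"' | unexpected_cni_configs
```

Any output is a candidate for the conflict described above; no output means the directory looks clean.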

3. Confirm the conflict by inspecting the stale config:

ssh indri 'minikube ssh "cat /etc/cni/net.d/1-k8s.conflist"'

If it uses a bridge plugin with a firewall plugin (instead of kindnet's ptp plugin), it's the culprit.

Fix

1. Remove the stale CNI config:

ssh indri 'minikube ssh "sudo rm /etc/cni/net.d/1-k8s.conflist"'

2. Delete all pods that were created while the bad config was active. The simplest approach is to restart all deployments:

kubectl --context=minikube-indri get deployments -A --no-headers | \
  awk '{print "-n " $1 " " $2}' | \
  xargs -L1 kubectl --context=minikube-indri rollout restart deployment

StatefulSets managed by operators (CNPG, Tailscale) generally survive because the kubelet restarts their containers in-place rather than creating new pods.

3. Verify with the DNS test above, then run mise run services-check.

indri - Server specifications
troubleshooting - Diagnose issues
cluster - Kubernetes details
sifaka - NAS storage
