blumeops/docs/how-to/operations/rebuild-minikube-cluster.md
title: Rebuild Minikube Cluster (DR)
modified: 2026-04-13
last-reviewed: 2026-04-13
tags: how-to, operations, disaster-recovery

Rebuild Minikube Cluster (DR)

How to rebuild the minikube cluster from scratch after data loss (e.g., accidental minikube delete). This is a DR procedure — for normal restarts, see restart-indri.

This procedure was validated during a real DR event on 2026-04-13 after a power loss and accidental minikube delete destroyed all cluster state.

Prerequisites

Before You Start

Clean Stale Tailscale Devices

Before bringing up the Tailscale operator, delete stale service devices from the Tailscale admin console (admin.tailscale.com). Old devices from the destroyed cluster will cause name collisions (new devices get -1, -2 suffixes).

Look for offline tagged devices like: pg, immich-pg, cnpg-metrics, ingress-0, ingress-1, and any other tag:k8s devices that show "last seen" timestamps from before the rebuild.

If you miss this step, you'll need to: delete stale devices from the console, delete the Tailscale state secrets in k8s (kubectl delete secret -n tailscale <name>), and restart the affected pods.

Watch out for cross-cluster name collisions. Both indri (minikube) and ringtail (k3s) use a ProxyGroup named ingress, producing pods named ingress-0, ingress-1. Deleting the wrong device can break the other cluster. Check which IPs are active before deleting. This is tech debt — the ProxyGroups should eventually be renamed to indri-ingress / ringtail-ingress.
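
One way to spot collision-renamed devices is to compare the device names shown in the admin console against the names you expect. The sketch below is only an illustration and is not part of the repo: `find_collisions` is a hypothetical helper, and it assumes a collision always shows up as an expected name plus a numeric suffix, as described above.

```bash
# Print observed device names that look like collision-renamed copies of an
# expected name (expected "pg" flags "pg-1", "pg-2", ...).
# $1: space-separated expected names; stdin: one observed name per line.
find_collisions() {
  expected=$1
  while IFS= read -r name; do
    [ -n "$name" ] || continue
    # Names that are themselves expected (e.g. ingress-1) are never flagged.
    case " $expected " in *" $name "*) continue ;; esac
    base=${name%-*}      # name without its final "-<suffix>"
    suffix=${name##*-}   # text after the final dash
    [ "$base" != "$name" ] || continue                 # no dash at all
    case "$suffix" in ''|*[!0-9]*) continue ;; esac    # suffix must be numeric
    case " $expected " in *" $base "*) printf '%s\n' "$name" ;; esac
  done
}
```

Feed it the observed names on stdin, for example pasted from the console into a file, with the expected names as a single quoted argument. It only reports candidates. Deciding which device is stale, and which cluster it belongs to, still needs the IP check described above.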

Phase 1: Start Minikube

minikube start --driver=docker --container-runtime=docker \
  --cpus=6 --memory=11264 --disk-size=200g \
  --apiserver-names=k8s.tail8d86e.ts.net --apiserver-names=indri \
  --apiserver-port=6443 --listen-address=0.0.0.0

Then run the ansible minikube role to configure Tailscale serve and registry mirrors:

mise run provision-indri -- --tags minikube
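
The remaining phases shell out to several CLIs, and discovering a missing one halfway through a bootstrap is awkward. A preflight check along these lines may help. `missing_tools` is a hypothetical helper, and the tool list in the usage comment is simply read off the commands that appear in this runbook.

```bash
# Print each named command that is not on PATH.
# Returns non-zero if at least one is missing.
missing_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf 'missing: %s\n' "$tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (tools invoked locally by this runbook):
# missing_tools minikube kubectl mise op argocd curl ssh ssh-keyscan scp fly
```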

Phase 2: Bootstrap Tailscale Operator

The Tailscale operator must be deployed before ArgoCD (ArgoCD uses Tailscale Ingress).

# 1. Create namespace
kubectl --context=minikube-indri create namespace tailscale

# 2. Create OAuth secret manually (ExternalSecrets isn't available yet)
CLIENT_ID=$(op read "op://blumeops/Tailscale K8s Operator OAuth/client-id")
CLIENT_SECRET=$(op read "op://blumeops/Tailscale K8s Operator OAuth/client-secret")
kubectl --context=minikube-indri create secret generic operator-oauth -n tailscale \
  --from-literal=client_id="$CLIENT_ID" \
  --from-literal=client_secret="$CLIENT_SECRET"

# 3. Apply operator manifests
#    NOTE: The kustomization fetches from forge.eblu.me which routes through
#    Fly → Tailscale → k8s (not yet up). Use forge.ops.eblu.me or github.com/eblume/blumeops.
#    Fetch the upstream manifest locally and build a temp kustomization:
curl -s "https://forge.ops.eblu.me/mirrors/tailscale/raw/tag/v1.94.2/cmd/k8s-operator/deploy/manifests/operator.yaml" \
  -o /tmp/ts-operator.yaml
# (create temp kustomization referencing local file — see memory/project_dr_lessons_2026_04.md for details)
kubectl --context=minikube-indri apply -k /tmp/ts-bootstrap/

# 4. Apply ProxyGroup for ingress
kubectl --context=minikube-indri apply -f argocd/manifests/tailscale-operator/proxygroup-ingress.yaml
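
Step 3 leaves the temp kustomization to the referenced notes. As a rough sketch only, the directory could be built like this, assuming the bootstrap needs nothing beyond the upstream manifest itself. `make_ts_bootstrap` is a hypothetical helper, and the repo's real kustomization may carry patches or extra resources that this omits.

```bash
# Build a throwaway kustomization directory around a locally fetched manifest.
# $1: path to the downloaded operator manifest; $2: output directory.
make_ts_bootstrap() {
  manifest=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  # kustomize only loads files under the kustomization root by default,
  # so copy the manifest in instead of referencing a path outside it.
  cp "$manifest" "$outdir/operator.yaml" || return 1
  cat > "$outdir/kustomization.yaml" <<'EOF'
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - operator.yaml
EOF
}

# Usage (paths from step 3):
# make_ts_bootstrap /tmp/ts-operator.yaml /tmp/ts-bootstrap
```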

Phase 3: Bootstrap ArgoCD

# 1. Create namespace
kubectl --context=minikube-indri create namespace argocd

# 2. Apply ArgoCD (skip ExternalSecret resources — not available yet)
#    Create a temp kustomization without external-secret-*.yaml resources.
#    Use --server-side --force-conflicts for large CRDs (applicationsets).
kubectl --context=minikube-indri apply -k /tmp/argocd-bootstrap/ --server-side --force-conflicts

# 3. Wait for ArgoCD
kubectl --context=minikube-indri wait --for=condition=available deployment/argocd-server -n argocd --timeout=300s

# 4. Create forge SSH repo credentials
PRIV_KEY=$(op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/csjncynh6htjvnh2l2da65y32q/private key?ssh-format=openssh")$'\n'
KNOWN_HOSTS=$(ssh-keyscan -p 2222 forge.ops.eblu.me 2>/dev/null | grep ssh-rsa)
kubectl --context=minikube-indri create secret generic repo-creds-forge -n argocd \
  --from-literal=type=git \
  --from-literal=url='ssh://forgejo@forge.ops.eblu.me:2222/' \
  --from-literal=insecure=false \
  --from-literal=sshPrivateKey="$PRIV_KEY" \
  --from-literal=sshKnownHosts="$KNOWN_HOSTS"
kubectl --context=minikube-indri label secret repo-creds-forge -n argocd argocd.argoproj.io/secret-type=repo-creds

# 5. Apply app-of-apps
kubectl --context=minikube-indri apply -f argocd/apps/argocd.yaml
kubectl --context=minikube-indri apply -f argocd/apps/apps.yaml

# 6. Login and sync apps
argocd login argocd.tail8d86e.ts.net --username admin \
  --password "$(kubectl --context=minikube-indri -n argocd get secret argocd-initial-admin-secret -o jsonpath='{.data.password}' | base64 -d)"
argocd app sync apps
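
Step 2 above calls for a temp kustomization with the `external-secret-*.yaml` resources removed. One possible way to derive it from an existing kustomization directory is sketched below. `make_argocd_bootstrap` is a hypothetical helper, and the source directory in the usage comment is an assumed path, not one confirmed by this document.

```bash
# Copy a kustomization directory and drop external-secret-*.yaml entries
# from its resources list.
# $1: source manifest directory; $2: output directory.
make_argocd_bootstrap() {
  src=$1
  outdir=$2
  mkdir -p "$outdir" || return 1
  cp -R "$src"/. "$outdir"/ || return 1
  # Remove list entries such as "  - external-secret-foo.yaml".
  grep -v -E '^[[:space:]]*-[[:space:]]*external-secret-[^[:space:]]*\.yaml[[:space:]]*$' \
    "$src/kustomization.yaml" > "$outdir/kustomization.yaml"
  # The files are harmless once unreferenced; delete them for clarity.
  rm -f "$outdir"/external-secret-*.yaml
}

# Usage (source path is an assumption):
# make_argocd_bootstrap argocd/manifests/argocd /tmp/argocd-bootstrap
```

The filter works line by line, so it only handles resources listed one per line in block style. Check the generated `kustomization.yaml` by eye before applying it.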

Phase 4: Bootstrap 1Password Connect + External Secrets

# 1. Sync foundation
argocd app sync external-secrets-crds
argocd app sync external-secrets
argocd app sync 1password-connect

# 2. Create 1Password Connect secrets manually
CREDS_RAW=$(op read "op://blumeops/1Password Connect/credentials-file")
echo "$CREDS_RAW" | kubectl --context=minikube-indri create secret generic op-credentials -n 1password \
  --from-file=1password-credentials.json=/dev/stdin
TOKEN=$(op read "op://blumeops/1Password Connect/token")
kubectl --context=minikube-indri create secret generic onepassword-token -n 1password \
  --from-literal=token="$TOKEN"

# 3. Wait for 1Password Connect to start, then restart External Secrets
kubectl --context=minikube-indri wait --for=condition=available deployment/onepassword-connect -n 1password --timeout=120s
kubectl --context=minikube-indri rollout restart deployment -n external-secrets external-secrets

# 4. Verify ClusterSecretStore becomes Valid
kubectl --context=minikube-indri get clustersecretstores
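
The store may take a little while to turn Valid after the restart, so step 4 can mean re-running the `get` by hand. A small polling loop is one alternative. In this sketch `retry_until` and `stores_ready` are hypothetical helpers, and `stores_ready` assumes the ClusterSecretStore reports a `Ready` status condition that `kubectl wait` can observe.

```bash
# Run a command until it succeeds: at most $1 attempts, $2 seconds apart.
retry_until() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Succeeds once every ClusterSecretStore reports the Ready condition
# (assumed condition name; verify against your External Secrets version).
stores_ready() {
  kubectl --context=minikube-indri wait --for=condition=Ready \
    clustersecretstore --all --timeout=10s >/dev/null 2>&1
}

# Usage: retry_until 12 5 stores_ready
```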

Phase 5: Sync Services (Dependency Order)

# Foundation (CRDs, operators)
argocd app sync cloudnative-pg kube-state-metrics
# Databases
argocd app sync blumeops-pg
# Observability
argocd app sync loki prometheus tempo grafana grafana-config
# Register ringtail cluster (for authentik, ntfy, ollama, frigate)
ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' | \
  sed 's|127.0.0.1|ringtail.tail8d86e.ts.net|' > /tmp/k3s-ringtail.yaml
KUBECONFIG=/tmp/k3s-ringtail.yaml argocd cluster add default --name k3s-ringtail --grpc-web -y

# Authentik (critical — Zot OIDC depends on it, most image pulls depend on Zot)
argocd app sync authentik
# Everything else
argocd app sync tailscale-operator alloy-k8s
# ... remaining apps
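
Because later tiers depend on earlier ones, it can help to stop at the first failed sync rather than pressing on. A minimal sketch of that idea follows. `sync_in_order` is a hypothetical wrapper around the same `argocd app sync` calls, and it does not cover the ringtail cluster registration step.

```bash
# Sync ArgoCD apps one dependency tier at a time, stopping at the first failure.
# Each argument is one tier: a space-separated list of app names.
sync_in_order() {
  for tier in "$@"; do
    echo "syncing: $tier"
    # Intentionally unquoted so the tier splits into separate app names.
    # shellcheck disable=SC2086
    argocd app sync $tier || {
      echo "sync failed for tier: $tier" >&2
      return 1
    }
  done
}

# Usage, mirroring the first tiers above:
# sync_in_order "cloudnative-pg kube-state-metrics" "blumeops-pg" \
#   "loki prometheus tempo grafana grafana-config"
```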

Phase 6: Restore Databases from Borgmatic

Databases come up empty. Restore from the latest borgmatic backup.

# Extract dumps
ssh indri 'mkdir -p /tmp/borg-restore && borgmatic extract --repository /Volumes/backups/borg --archive latest --destination /tmp/borg-restore --path borgmatic/postgresql_databases'

# Create databases that don't exist yet
kubectl --context=minikube-indri exec -n databases blumeops-pg-1 -c postgres -- \
  psql -U postgres -c "CREATE DATABASE teslamate OWNER teslamate;"
kubectl --context=minikube-indri exec -n databases blumeops-pg-1 -c postgres -- \
  psql -U postgres -c "CREATE DATABASE authentik OWNER authentik;"
# (repeat for other DBs as needed)

# For teslamate: create extensions BEFORE restoring
kubectl --context=minikube-indri exec -n databases blumeops-pg-1 -c postgres -- \
  psql -U postgres -d teslamate -c "CREATE EXTENSION IF NOT EXISTS cube CASCADE; CREATE EXTENSION IF NOT EXISTS earthdistance CASCADE;"

# For immich: create extensions BEFORE restoring
kubectl --context=minikube-indri exec -n databases immich-pg-1 -c postgres -- \
  psql -U postgres -d immich -c "CREATE EXTENSION IF NOT EXISTS vector; CREATE EXTENSION IF NOT EXISTS vchord CASCADE; CREATE EXTENSION IF NOT EXISTS cube CASCADE; CREATE EXTENSION IF NOT EXISTS earthdistance CASCADE; CREATE EXTENSION IF NOT EXISTS pg_trgm; CREATE EXTENSION IF NOT EXISTS unaccent; CREATE EXTENSION IF NOT EXISTS \"uuid-ossp\";"

# Restore (dumps are in custom format — use pg_restore, not psql)
scp indri:/tmp/borg-restore/borgmatic/postgresql_databases/pg.ops.eblu.me:5432/miniflux /tmp/miniflux.sql
kubectl --context=minikube-indri exec -i -n databases blumeops-pg-1 -c postgres -- \
  pg_restore -U postgres -d miniflux --no-owner --role=miniflux < /tmp/miniflux.sql
# (repeat for teslamate, authentik, immich)

# Reset passwords to match current ExternalSecrets/CNPG-generated credentials
# The restored dumps contain OLD password hashes
PASS=$(kubectl --context=minikube-indri -n databases get secret blumeops-pg-app -o jsonpath='{.data.password}' | base64 -d)
kubectl --context=minikube-indri exec -n databases blumeops-pg-1 -c postgres -- \
  psql -U postgres -c "ALTER USER miniflux WITH PASSWORD '${PASS}';"
# (repeat for each user with the appropriate secret source)

# Create manually-managed DB secrets
kubectl --context=minikube-indri create secret generic miniflux-db -n miniflux \
  --from-literal=url="$(kubectl --context=minikube-indri -n databases get secret blumeops-pg-app -o jsonpath='{.data.uri}' | base64 -d)"
kubectl --context=minikube-indri create secret generic immich-db -n immich \
  --from-literal=password="$(kubectl --context=minikube-indri -n databases get secret immich-pg-app -o jsonpath='{.data.password}' | base64 -d)"
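
The "repeat for ..." steps above are easy to mistype under pressure. A small wrapper around the same `pg_restore` invocation is one way to cut the repetition. `restore_db` is a hypothetical helper, and the role names in the usage comment (other than miniflux) are assumptions that each role matches its database name.

```bash
# Restore one custom-format dump via pg_restore inside the postgres pod.
# $1: pod name, $2: database name, $3: role to restore as, $4: local dump file.
restore_db() {
  pod=$1
  db=$2
  role=$3
  dump=$4
  [ -r "$dump" ] || { echo "dump not readable: $dump" >&2; return 1; }
  kubectl --context=minikube-indri exec -i -n databases "$pod" -c postgres -- \
    pg_restore -U postgres -d "$db" --no-owner --role="$role" < "$dump"
}

# Usage (role names are assumptions):
# restore_db blumeops-pg-1 teslamate teslamate /tmp/teslamate.sql
# restore_db immich-pg-1   immich    immich    /tmp/immich.sql
```

The databases and any required extensions must already exist, as in the steps above. The helper only wraps the restore itself.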

Phase 7: Manual Fixups

Forge Tailscale Ingress + Endpoints

The forge-external Endpoints must be applied manually (ArgoCD excludes Endpoints resources):

kubectl --context=minikube-indri apply -f argocd/manifests/tailscale-operator/svc-forge-external.yaml
kubectl --context=minikube-indri apply -f argocd/manifests/tailscale-operator/ingress-forge.yaml
kubectl --context=minikube-indri apply -f argocd/manifests/tailscale-operator/endpoints-forge.yaml

Restart Fly.io Proxy

After the Tailscale ingress ProxyGroup gets new VIPs, the Fly.io proxy's MagicDNS cache may be stale:

FLY_API_TOKEN=$(op read "op://blumeops/fly.io admin/deploy-token") fly machine restart <machine-id> --app blumeops-proxy

Grafana SQLite

If Grafana crashes with migration errors (no such column: help_flags1), delete its PVC and resync — Grafana is fully stateless (all config provisioned via ConfigMaps).

Phase 8: Verify

mise run services-check

Known Circular Dependencies

| Dependency | Breaks | Workaround |
| --- | --- | --- |
| forge.eblu.me → Fly → Tailscale → k8s | tailscale-operator kustomization fetch | Fetch manifests from forge.ops.eblu.me or github.com/eblume/blumeops |
| Forgejo Actions secrets → Forgejo API → Caddy → k8s | Full ansible playbook | Use --tags minikube during bootstrap |
| Zot → Authentik OIDC | All container image pulls from Zot | Sync authentik early; Zot will crash-loop until OIDC is reachable |
| ArgoCD Endpoints exclusion → forge-external | Forge Tailscale ingress has no backend | Manual kubectl apply for Endpoints |

Post-Rebuild: Cold Cache Failures

Devpi runs natively on indri (see devpi-on-indri) and is unaffected by minikube rebuilds, so the historical "devpi cold cache after rebuild" failure mode no longer applies. If devpi itself goes cold (fresh server-dir), the same lazy-cache race can still cause a 404 on the first Dagger build under concurrent load. To recover, re-run the build to warm the cache, or pre-warm with uv pip install --dry-run --index-url https://pypi.ops.eblu.me/root/pypi/+simple/ dagger-io.