C2: migrate immich from minikube to ringtail (mikado chain) #356
Summary
C2 Mikado chain to move the entire Immich stack (server, ML, valkey,
postgres) off minikube-indri and onto k3s-ringtail. Immich is the
largest single tenant on minikube (~1.5 GiB resident) and minikube is
currently memory-saturated (97% RAM, swapping). This is the first
concrete chain in the broader indri-k8s decommission effort.
This PR contains the planning layer only — 7 cards (1 goal + 6
prerequisites). Implementation cycles follow per the Mikado Branch
Invariant.
Goal end-state
- server, machine-learning, valkey on ringtail.
- CPU-only).
- immich-pg (PG17 + VectorChord) runs on ringtail.
- photos.ops.eblu.me reroutes through Caddy → ringtail ingress.
- immich and immich-pg are removed.
Cards
- migrate-immich-to-ringtail (goal)
- cnpg-on-ringtail
- immich-pg-on-ringtail
- immich-pg-data-migration
- sifaka-nfs-from-ringtail
- immich-app-on-ringtail
- immich-cutover-and-decommission
Key constraints
- surfaces matter: postgres (ML embeddings, face data; slow to
  re-derive) and the library files (don't move, but NFS access from
  ringtail must be verified).
- externalCluster basebackup → promote. Option B is
  pg_dump/pg_restore as a documented fallback. Either way, dry-run
  against a scratch cluster first.
- would block the whole decommission, and Immich is chatty with pg
  so tailnet round-trips would hurt.
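One way to make the dry run concrete is to compare per-table row counts between the source and the scratch cluster before trusting either option. The sketch below is illustrative only: `diff_row_counts` is a hypothetical helper, not part of this repo, and it assumes the counts were already collected on each side (for example with one `SELECT count(*)` per table). The sample numbers are the counts reported for the real migration further down in this PR.

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[str, Optional[int], Optional[int]]


def diff_row_counts(source: Dict[str, int], target: Dict[str, int]) -> List[Mismatch]:
    """Return (table, source_count, target_count) for every table that differs.

    A table missing on one side is reported with None for that side.
    """
    mismatches: List[Mismatch] = []
    for table in sorted(set(source) | set(target)):
        src = source.get(table)
        dst = target.get(table)
        if src != dst:
            mismatches.append((table, src, dst))
    return mismatches


# Counts taken from the migration notes in this PR (source side).
source_counts = {
    "asset": 12681,
    "user": 1,
    "album": 28,
    "smart_search": 9624,
    "activity": 0,
    "asset_face": 3917,
}
# Hypothetical: pretend the scratch cluster reported identical counts.
target_counts = dict(source_counts)

print(diff_row_counts(source_counts, target_counts))  # prints []
```

An empty list means the two sides agree on every table checked. Matching counts are a cheap sanity check, not proof of identical data.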
Test plan
- mise run docs-mikado migrate-immich-to-ringtail shows the chain correctly.
- convention enforced by hook).
29da047441 to 17ceb58125
17ceb58125 to 59f862821a

Migration via CNPG pg_basebackup (Option A) completed cleanly. Sequence:
1. Stopped immich-server + immich-machine-learning on minikube (scaled to 0). valkey + source pg kept running.
2. Copied minikube's immich-pg-ca + immich-pg-replication secrets to ringtail as source-immich-pg-{ca,replication}.
3. Recreated the ringtail immich-pg Cluster with bootstrap.pg_basebackup, replica.enabled=true, externalClusters pointing at immich-pg.tail8d86e.ts.net via the streaming_replica TLS cert.
4. Basebackup completed in ~50s. Replica caught up streaming.
5. Verified row counts identical between source and replica: asset=12681, user=1, album=28, smart_search=9624, activity=0, asset_face=3917.
6. Promoted via replica.enabled=false. pg_is_in_recovery → false. Write test passed. All 7 expected extensions present in immich db (vector, vchord, cube, earthdistance, pg_trgm, unaccent, uuid-ossp).
7. Pruned bootstrap + externalClusters blocks; deleted out-of-band replication secrets.

Source minikube immich-pg is intact and untouched; the recovery path remains available until immich-cutover-and-decommission completes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3b66829e8a to 674ca2ced9

- argocd/manifests/immich-ringtail/: full port of the immich stack (server, ML, valkey, services, ingress, pvc-ml-cache) from argocd/manifests/immich/, with ringtail-specific tweaks:
  - deployment-ml: runtimeClassName=nvidia, nvidia.com/gpu:1 limit, -cuda image tag
  - deployment-valkey + kustomization: drop the registry.ops.eblu.me/blumeops/valkey mirror (arm64-only), use upstream docker.io/valkey/valkey:8.1.6 (multi-arch)
  - ingress-tailscale: tls.hosts=[photos-ringtail] for staging
- argocd/apps/immich-ringtail.yaml: new ArgoCD app (manual sync, ringtail destination)
- argocd/manifests/nvidia-device-plugin/time-slicing-config.yaml: bump replicas 2 -> 4 so the ringtail GPU can be shared by frigate + ollama + immich-ml

The immich-db Secret in the immich namespace is created manually (matching minikube pattern); see the argocd/apps/immich-ringtail.yaml header for the procedure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

All three pods Running, 1/1 Ready:
- immich-server: v2.6.3, connected to ringtail pg + valkey ("/api/server/ping" returns 200, "/api/server/version" returns v2.6.3)
- immich-machine-learning: CUDA variant, RTX 4080 attached (nvidia-smi shows 8 GiB used / 16 GiB total, shared with frigate via time-slicing), gunicorn workers booted
- immich-valkey: upstream multi-arch docker.io/valkey/valkey:8.1.6

immich-db Secret in the immich namespace created manually with source's immich-pg-app password (matches minikube pattern). Tailscale ingress staging hostname: photos-ringtail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

GitOps decommission of immich + immich-pg on minikube:
- Delete argocd/apps/immich.yaml
- Delete argocd/manifests/immich/ entirely
- Delete argocd/manifests/databases/{immich-pg,external-secret-immich-borgmatic,service-immich-pg-tailscale}.yaml
- Remove those entries from databases/kustomization.yaml

Add ringtail-side immich-pg Tailscale LoadBalancer Service (hostname "immich-pg") so borgmatic can keep using the same FQDN for nightly backups. This claims the device name freed by deleting the minikube service.

The ringtail manifest path stays as argocd/manifests/immich-ringtail/ and the ArgoCD app stays as immich-ringtail; renaming would force a cascading delete + recreate, with a window where live resources disappear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Sequence executed:
1. Quiesced source: immich-server + immich-machine-learning on minikube scaled to 0 (done in immich-pg-data-migration).
2. Deleted minikube immich-tailscale Ingress; waited for "photos" Tailscale device to deregister.
3. (Promote of ringtail pg was done in immich-pg-data-migration.)
4. Renamed ringtail ingress tls.hosts photos-ringtail -> photos.
5. Caddy was already pointing photos.ops.eblu.me -> photos.tail8d86e.ts.net so no Ansible change needed.
6. Smoke test: photos.ops.eblu.me/api/server/ping -> 200, /api/server/version -> {"major":2,"minor":6,"patch":3}.
7. Borgmatic continuity: added a ringtail immich-pg-tailscale Service (same FQDN as before, immich-pg.tail8d86e.ts.net). Verified borgmatic role can SELECT count(*) FROM asset over the tailnet (returned 12681, matches source).

Decommission:
- Deleted argocd Application "immich" with --cascade (clears Deployments, Services, etc. on minikube).
- Pruned blumeops-pg Application against the branch, which removed the Cluster immich-pg, its ExternalSecret, and the old immich-pg-tailscale Service from minikube.
- Deleted leftover Released PVs on minikube.
- Deleted the empty immich namespace on minikube.

Did not verify minikube host memory drop directly (tailscale-ssh re-auth was prompting at the time). Caller should confirm via "docker stats minikube" once SSH is re-authenticated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
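The smoke test in step 6 of the cutover sequence could be turned into a repeatable check along the following lines. This is a minimal sketch under stated assumptions: `version_string` and `smoke_ok` are hypothetical helpers, the HTTP calls themselves are left out (the status code and response body are assumed to have been fetched already, for example with curl), and the response shape is the one quoted in the commit notes above.

```python
import json


def version_string(body: str) -> str:
    """Turn a {"major":..,"minor":..,"patch":..} JSON body into 'vX.Y.Z'."""
    data = json.loads(body)
    return "v{major}.{minor}.{patch}".format(**data)


def smoke_ok(ping_status: int, version_body: str, expected: str) -> bool:
    """True when the ping returned 200 and the reported version matches."""
    return ping_status == 200 and version_string(version_body) == expected


# Values quoted in the cutover notes: ping -> 200, version body as below.
print(smoke_ok(200, '{"major":2,"minor":6,"patch":3}', "v2.6.3"))  # prints True
```

Comparing against the version the ringtail deployment was verified at (v2.6.3) catches the case where the hostname still resolves to a different backend.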