Erich Blume 947e4310c3 C2: migrate immich from minikube to ringtail (mikado chain) (#356)

## Summary

C2 Mikado chain to move the entire Immich stack (server, ML, valkey,
postgres) off `minikube-indri` and onto `k3s-ringtail`. Immich is the
largest single tenant on minikube (~1.5 GiB resident) and minikube is
currently memory-saturated (97% RAM, swapping). This is the first
concrete chain in the broader indri-k8s decommission effort.

This PR contains the planning layer only — 7 cards (1 goal + 6
prerequisites). Implementation cycles follow per the Mikado Branch
Invariant.

## Goal end-state

- Immich `server`, `machine-learning`, `valkey` on ringtail.
- ML pod uses ringtail's RTX 4080 (performance win — currently
  CPU-only).
- CNPG `immich-pg` (PG17 + VectorChord) runs on ringtail.
- Library still on sifaka NFS — ringtail mounts the same path.
- `photos.ops.eblu.me` reroutes through Caddy → ringtail ingress.
- Minikube `immich` and `immich-pg` are removed.

## Cards

| Card | Depends on |
|---|---|
| `migrate-immich-to-ringtail` (goal) | all six below |
| `cnpg-on-ringtail` | — |
| `immich-pg-on-ringtail` | cnpg-on-ringtail |
| `immich-pg-data-migration` | immich-pg-on-ringtail |
| `sifaka-nfs-from-ringtail` | — |
| `immich-app-on-ringtail` | immich-pg-on-ringtail, sifaka-nfs-from-ringtail |
| `immich-cutover-and-decommission` | immich-pg-data-migration, immich-app-on-ringtail |

## Key constraints

- **No data loss.** Downtime is acceptable; data loss is not. Two
  surfaces matter: postgres (ML embeddings and face data, which are
  slow to re-derive) and the library files (which stay in place, but
  NFS access from ringtail must be verified).
- **Migration method:** Option A is a CNPG `externalCluster`
  basebackup → promote. Option B is `pg_dump`/`pg_restore` as a
  documented fallback. Either way, dry-run against a scratch
  cluster first.
- **Why pg moves too** (not cross-cluster): keeping pg on minikube
  would block the whole decommission, and Immich is chatty with pg
  so tailnet round-trips would hurt.
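
As a rough illustration of Option A, a CNPG `Cluster` bootstrapped from an
external source could look like the sketch below. It follows the CNPG
`bootstrap.pg_basebackup` / `externalClusters` API as I understand it; the
external cluster name, host, Secret, namespace, and storage size are all
hypothetical placeholders, not values from this repo.

```yaml
# Sketch only: every name, host, and size here is an assumed placeholder.
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: immich-pg
  namespace: immich                  # assumed namespace
spec:
  instances: 1
  # imageName must match the source's PG17 + VectorChord image (not shown).
  storage:
    size: 20Gi                       # placeholder
  bootstrap:
    pg_basebackup:
      source: immich-pg-minikube     # refers to the externalClusters entry below
  replica:
    enabled: true                    # keep following the source until cutover
    source: immich-pg-minikube
  externalClusters:
    - name: immich-pg-minikube       # hypothetical label for the minikube cluster
      connectionParameters:
        host: immich-pg-minikube.example.invalid   # placeholder address
        user: streaming_replica
        dbname: postgres
        sslmode: require
      password:
        name: immich-pg-minikube-replication       # hypothetical Secret
        key: password
```

Under this sketch, "promote" would mean setting `replica.enabled: false` once
the source has been quiesced. The dry run against a scratch cluster is where
that assumption should be checked.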

## Test plan

- [ ] Plan review — does the dependency graph make sense?
- [ ] `mise run docs-mikado migrate-immich-to-ringtail` shows the
      chain correctly.
- [ ] Per-card implementation cycles land separately (commit
      convention enforced by hook).

Reviewed-on: #356

2026-05-13 16:46:17 -07:00

3.8 KiB


Immich App on Ringtail

Bring up immich-server, immich-machine-learning, and immich-valkey on ringtail. This card stands the stack up against the new pg cluster — it does not move user traffic. Cutover lives in immich-cutover-and-decommission.

What to do

New manifest dir argocd/manifests/immich-ringtail/ (the suffix matches the -ringtail convention used by other apps). Port from argocd/manifests/immich/:
- deployment-server.yaml: point DB_HOSTNAME at the ringtail pg service.
- deployment-ml.yaml: set runtimeClassName: nvidia and add resources.limits with nvidia.com/gpu: 1. Use the -cuda tag of the immich-ml image (set in kustomization). Ringtail is single-node, so no node selector is needed. See argocd/manifests/frigate/ for the existing GPU pod pattern.

  GPU contention discovery: ringtail's nvidia-device-plugin is configured with timeSlicing.replicas: 2, and Frigate + Ollama already consume both virtual slices. Adding immich-ml requires raising the count to at least 3. Edit argocd/manifests/nvidia-device-plugin/configmap.yaml (or wherever the device-plugin config lives) and re-sync the nvidia-device-plugin ArgoCD app. The plugin pod restarts and the new advertised count appears as the node's allocatable nvidia.com/gpu.
- deployment-valkey.yaml: straight port, but use the upstream multi-arch docker.io/valkey/valkey:<version> image. Do NOT apply the registry.ops.eblu.me/blumeops/valkey rewrite in the kustomization. That mirror was built on indri (arm64) and is single-arch; pulling it on ringtail (amd64) fails with exec format error and leaves the pod in CrashLoopBackOff. Once the mirror carries a multi-arch tag, the rewrite can return.
- service*.yaml: straight port.
- pvc-ml-cache.yaml: straight port (empty local-path PVC).
- pv-nfs.yaml + pvc.yaml: already covered by sifaka-nfs-from-ringtail (may live in this dir or theirs).
- ingress-tailscale.yaml: ProxyGroup ingress. It must not set an explicit host: (or use host: *) per the lesson on ProxyGroup VIP routing.

  Hostname collision warning: the minikube ingress claims the Tailscale device name photos (tls.hosts: [photos]), and two devices on the tailnet cannot share that name. While the ringtail deployment is being staged, it must use a different tls.hosts value (e.g. photos-ringtail) so it can coexist with the running minikube one. The flip to photos happens at cutover time, after the minikube ingress has been removed. See immich-cutover-and-decommission#Cutover sequence.
- kustomization.yaml: same images: block (server, ML, valkey), except for the valkey mirror rewrite noted above.
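
To make the GPU-related pieces above concrete, here is a minimal sketch of the two fragments involved. The Deployment name, labels, and container name are assumptions for illustration (the real values come from argocd/manifests/immich/), and the second document uses the upstream nvidia-device-plugin time-slicing config format, which may be wrapped differently in this repo's ConfigMap.

```yaml
# deployment-ml.yaml (GPU-relevant fields only; names are assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: immich-machine-learning
  namespace: immich
spec:
  selector:
    matchLabels:
      app: immich-machine-learning
  template:
    metadata:
      labels:
        app: immich-machine-learning
    spec:
      runtimeClassName: nvidia          # run under the NVIDIA container runtime
      containers:
        - name: machine-learning
          # The -cuda tag is applied through the kustomization images: block.
          image: ghcr.io/immich-app/immich-machine-learning
          resources:
            limits:
              nvidia.com/gpu: 1         # one time-sliced GPU replica
---
# nvidia-device-plugin config (upstream time-slicing format, as an assumption
# about how timeSlicing.replicas is expressed here)
version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 3                     # Frigate + Ollama + immich-ml
```

With three replicas advertised, each of the three GPU tenants can hold one nvidia.com/gpu slice. Time-slicing shares the card without memory isolation, so the count only bounds scheduling, not VRAM use.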
New ArgoCD app argocd/apps/immich-ringtail.yaml targeting ringtail, namespace immich. Manual sync only until the cutover.
Existing argocd/apps/immich.yaml (minikube) stays untouched during this card — both apps exist briefly.
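
The staging ingress and the manual-sync ArgoCD app described above might look roughly like this. This is a sketch under assumptions: the ProxyGroup name, service name, port 2283, repoURL, targetRevision, project, and the ArgoCD cluster name k3s-ringtail as a destination are placeholders or guesses, not values confirmed by this repo.

```yaml
# ingress-tailscale.yaml (staging form; service name and port are assumed)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: immich
  namespace: immich
  annotations:
    tailscale.com/proxy-group: ingress-proxies   # hypothetical ProxyGroup name
spec:
  ingressClassName: tailscale
  rules:
    - http:                        # note: no host: field on the rule
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: immich-server
                port:
                  number: 2283
  tls:
    - hosts:
        - photos-ringtail          # becomes photos only at cutover
---
# argocd/apps/immich-ringtail.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: immich-ringtail
  namespace: argocd
spec:
  project: default                 # assumed project
  source:
    repoURL: https://example.invalid/blumeops.git   # placeholder
    targetRevision: main                             # placeholder
    path: argocd/manifests/immich-ringtail
  destination:
    name: k3s-ringtail             # assumed registered cluster name
    namespace: immich
  # No syncPolicy.automated block, so the app syncs only when triggered manually.
```

Leaving out `syncPolicy.automated` is what keeps the app on manual sync; adding it later is a one-block change once the cutover is done.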

Bring it up against a copy of the DB

Use the throwaway/test path from [[immich-pg-data-migration#Dry run before real cutover]]: point the ringtail immich at the test pg cluster first, then verify that the pod boots, the web UI loads (via kubectl port-forward), assets list, and ML embedding queries work. Then tear it down.
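
One way to point the stack at the test cluster without editing the real manifest is a small strategic-merge patch on the server Deployment. The sketch below assumes the Deployment and container are both named immich-server and that the scratch CNPG cluster is called immich-pg-test in the immich namespace (CNPG exposes a read-write service named after the cluster with an -rw suffix); all of these names are hypothetical.

```yaml
# Strategic-merge patch for the dry run only (names are assumed).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: immich-server
  namespace: immich
spec:
  template:
    spec:
      containers:
        - name: immich-server
          env:
            # env entries merge by name, so only DB_HOSTNAME is overridden.
            - name: DB_HOSTNAME
              value: immich-pg-test-rw.immich.svc.cluster.local
```

Dropping the patch after the dry run returns DB_HOSTNAME to the value in deployment-server.yaml.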

Verification

- All three pods Ready.
- ML pod has a GPU attached: nvidia-smi inside the container shows the 4080.
- immich-server connects to pg and valkey (no ECONNREFUSED in logs).
- A kubectl port-forward to the server service shows the Immich web UI.

Out of scope

- Public/tailnet routing flip. Caddy still points at the minikube Tailscale ingress until immich-cutover-and-decommission.
- Removing the minikube immich. That also waits for immich-cutover-and-decommission.
