C2: migrate immich from minikube to ringtail (mikado chain) #356

Merged
eblume merged 20 commits from mikado/migrate-immich-to-ringtail into main 2026-05-13 16:46:20 -07:00

Summary

C2 Mikado chain to move the entire Immich stack (server, ML, valkey,
postgres) off minikube-indri and onto k3s-ringtail. Immich is the
largest single tenant on minikube (~1.5 GiB resident) and minikube is
currently memory-saturated (97% RAM, swapping). This is the first
concrete chain in the broader indri-k8s decommission effort.

This PR contains the planning layer only — 7 cards (1 goal + 6
prerequisites). Implementation cycles follow per the Mikado Branch
Invariant.

Goal end-state

  • Immich server, machine-learning, valkey on ringtail.
  • ML pod uses ringtail's RTX 4080 (performance win — currently
    CPU-only).
  • CNPG immich-pg (PG17 + VectorChord) runs on ringtail.
  • Library still on sifaka NFS — ringtail mounts the same path.
  • photos.ops.eblu.me reroutes through Caddy → ringtail ingress.
  • Minikube immich and immich-pg are removed.

Cards

| Card | Depends on |
|---|---|
| migrate-immich-to-ringtail (goal) | all six below |
| cnpg-on-ringtail | (none) |
| immich-pg-on-ringtail | cnpg-on-ringtail |
| immich-pg-data-migration | immich-pg-on-ringtail |
| sifaka-nfs-from-ringtail | (none) |
| immich-app-on-ringtail | immich-pg-on-ringtail, sifaka-nfs-from-ringtail |
| immich-cutover-and-decommission | immich-pg-data-migration, immich-app-on-ringtail |
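
One way to sanity-check the dependency graph is to derive an implementation order from it. The following is a minimal sketch using Python's standard-library `graphlib`; the `CARDS` mapping is transcribed from the table, while the function name and dict layout are illustrative assumptions and not part of the `docs-mikado` tooling.

```python
from graphlib import TopologicalSorter

# Card -> set of prerequisite cards, transcribed from the Cards table.
# The goal card is omitted because it simply depends on all six.
CARDS = {
    "cnpg-on-ringtail": set(),
    "immich-pg-on-ringtail": {"cnpg-on-ringtail"},
    "immich-pg-data-migration": {"immich-pg-on-ringtail"},
    "sifaka-nfs-from-ringtail": set(),
    "immich-app-on-ringtail": {
        "immich-pg-on-ringtail",
        "sifaka-nfs-from-ringtail",
    },
    "immich-cutover-and-decommission": {
        "immich-pg-data-migration",
        "immich-app-on-ringtail",
    },
}


def implementation_order(cards):
    """Return one valid order in which every card follows its prerequisites.

    Raises graphlib.CycleError if the graph contains a cycle.
    """
    return list(TopologicalSorter(cards).static_order())


order = implementation_order(CARDS)
```

Because every other card is a direct or transitive prerequisite of `immich-cutover-and-decommission`, any valid order ends with that card.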

Key constraints

  • No data loss. Downtime is acceptable; data loss is not. Two
    surfaces matter: postgres (ML embeddings and face data, which are
    slow to re-derive) and the library files (these stay in place on
    sifaka, but NFS access from ringtail must be verified).
  • Migration method: Option A is a CNPG externalCluster
    basebackup → promote. Option B is pg_dump/pg_restore as a
    documented fallback. Either way, dry-run against a scratch
    cluster first.
  • Why pg moves too (not cross-cluster): keeping pg on minikube
    would block the whole decommission, and Immich is chatty with pg
    so tailnet round-trips would hurt.

Test plan

  • Plan review — does the dependency graph make sense?
  • mise run docs-mikado migrate-immich-to-ringtail shows the
    chain correctly.
  • Per-card implementation cycles land separately (commit
    convention enforced by hook).
Goal: move immich (server, ML, valkey, postgres) off minikube-indri
onto k3s-ringtail. Immich is the largest single tenant on minikube
(~1.5 GiB resident) and minikube is memory-saturated.

Prerequisite cards:
- cnpg-on-ringtail
- immich-pg-on-ringtail (requires cnpg-on-ringtail)
- immich-pg-data-migration (requires immich-pg-on-ringtail)
- sifaka-nfs-from-ringtail
- immich-app-on-ringtail (requires immich-pg-on-ringtail, sifaka-nfs-from-ringtail)
- immich-cutover-and-decommission (requires immich-pg-data-migration, immich-app-on-ringtail)

Data loss is a critical failure; downtime is acceptable. The cutover
plan favors a CNPG externalCluster basebackup (Option A) with pg_dump
as the documented fallback (Option B).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sibling of cloudnative-pg.yaml targeting k3s-ringtail. Same mirror
(mirrors/cloudnative-pg) and release (v1.27.1), same sync options.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified: cnpg-controller-manager pod Ready on k3s-ringtail; CRDs
clusters.postgresql.cnpg.io etc. installed; ArgoCD app Synced/Healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror of argocd/manifests/databases/immich-pg.yaml on ringtail:
- Same VectorChord image (PG17 + VectorChord 0.5.0)
- Same extensions (vector, vchord, cube, earthdistance) via postInitSQL
- Same managed borgmatic role with pg_read_all_data
- 10 GiB local-path storage (matches minikube source)
- shared_preload_libraries: vchord.so
- Empty initdb today; bootstrap block will be rewritten when
  immich-pg-data-migration picks its import method.

ArgoCD app databases-ringtail targets ringtail/databases.
ExternalSecret reuses the onepassword-blumeops ClusterSecretStore that
already exists on ringtail via external-secrets-ringtail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eblume force-pushed mikado/migrate-immich-to-ringtail from 29da047441 to 17ceb58125 2026-05-13 12:22:27 -07:00
eblume force-pushed mikado/migrate-immich-to-ringtail from 17ceb58125 to 59f862821a 2026-05-13 12:25:53 -07:00
Verified on k3s-ringtail:
- Cluster immich-pg reached "Cluster in healthy state" (1/1 instance)
- borgmatic role: rolcanlogin=t, member of pg_read_all_data
- ExternalSecret immich-pg-borgmatic: Ready=True, username=borgmatic
- Extensions vchord, vector, cube, earthdistance installed in postgres db
  (immich db extensions deferred to app startup per the card)

10 GiB local-path storage; same VectorChord image as minikube source.
Bootstrap is empty initdb today; will be rewritten when
immich-pg-data-migration picks its import method.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors argocd/manifests/immich/pv-nfs.yaml + pvc.yaml. PV renamed
to immich-library-nfs-pv-ringtail to avoid confusion with the
minikube side (PVs are cluster-scoped; both can coexist).

Initial kustomization.yaml in argocd/manifests/immich-ringtail/
holds just the storage bits today; deployments/services/ingress
will be added in immich-app-on-ringtail.

Verified: PVC binds to PV on k3s-ringtail; a mount test from a
busybox pod read existing photo library dirs, then wrote and deleted
a test file. DNS resolves sifaka to 192.168.1.203, so NFS traffic
stays on the LAN, off the tailnet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verified on k3s-ringtail:
- Sifaka NFS export /volume1/photos covers 192.168.1.0/24 +
  100.64.0.0/10. Ringtail at 192.168.1.21 is in scope; no DSM rule
  changes needed.
- nfs-test pod mounted the share, read existing library/ thumbs/
  backups/ encoded-video/ profile/, wrote a temp file, deleted it.
- DNS resolution: sifaka → 192.168.1.203 (LAN). NFS traffic stays
  off tailnet, avoiding the sifaka-tailscale-userspace concern.
- Committed PV + PVC bind on first apply (RWX, 2Ti).
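
The export-scope claim above can be double-checked with a few lines of Python. This is only a sketch: the CIDR ranges and the ringtail address are taken from the notes above, while the helper name `in_export_scope` is a hypothetical introduced for illustration.

```python
import ipaddress

# Export ranges as reported for /volume1/photos in the notes above.
EXPORT_RANGES = ["192.168.1.0/24", "100.64.0.0/10"]


def in_export_scope(client_ip, ranges=EXPORT_RANGES):
    """Return True if client_ip falls inside any of the exported CIDR ranges."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in ipaddress.ip_network(r) for r in ranges)


# Ringtail's LAN address, as stated in the verification notes.
ringtail_ok = in_export_scope("192.168.1.21")
```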

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the initdb bootstrap with a pg_basebackup from the minikube
source over the tailnet (immich-pg.tail8d86e.ts.net). The ringtail
cluster starts in replica mode (replica.enabled=true), streaming WAL
from the source. Promotion happens by flipping replica.enabled=false
after the replica catches up and the source is quiesced.

Uses the source's streaming_replica TLS cert + CA, copied to ringtail
as out-of-band secrets (source-immich-pg-replication,
source-immich-pg-ca) — the standard CNPG-to-CNPG migration auth path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Row counts verified equal between source (minikube) and replica
(ringtail) across asset (12681), user (1), album (28),
smart_search (9624), activity (0), asset_face (3917). Source immich
is scaled to 0 — no writes since the basebackup completed.

Flipping replica.enabled=false to promote. The externalClusters and
bootstrap.pg_basebackup blocks are left in place as documentation
(CNPG ignores them after initialization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migration done, cluster promoted. Pruning the externalClusters block
and bootstrap.pg_basebackup reference eliminates the footgun where a
future replica.enabled=true would demote this primary against the
stale minikube source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Migration via CNPG pg_basebackup (Option A) completed cleanly.

Sequence:
1. Stopped immich-server + immich-machine-learning on minikube
   (scaled to 0). valkey + source pg kept running.
2. Copied minikube's immich-pg-ca + immich-pg-replication secrets
   to ringtail as source-immich-pg-{ca,replication}.
3. Recreated the ringtail immich-pg Cluster with
   bootstrap.pg_basebackup, replica.enabled=true, externalClusters
   pointing at immich-pg.tail8d86e.ts.net via the streaming_replica
   TLS cert.
4. Basebackup completed in ~50s. Replica caught up streaming.
5. Verified row counts identical between source and replica:
   asset=12681, user=1, album=28, smart_search=9624,
   activity=0, asset_face=3917.
6. Promoted via replica.enabled=false. pg_is_in_recovery → false.
   Write test passed. All 7 expected extensions present in immich
   db (vector, vchord, cube, earthdistance, pg_trgm, unaccent,
   uuid-ossp).
7. Pruned bootstrap + externalClusters blocks; deleted out-of-band
   replication secrets.

Source minikube immich-pg is intact and untouched — recovery path
remains available until immich-cutover-and-decommission completes.
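
The row-count comparison in step 5 could be scripted along these lines. This is a minimal sketch under stated assumptions: the counts are the ones reported above, but the dict layout and the `diff_row_counts` helper are hypothetical, and how the counts were actually gathered (for example, via `SELECT count(*)` per table) is not shown here.

```python
def diff_row_counts(source, replica):
    """Return {table: (source_count, replica_count)} for every mismatch.

    A table missing on one side is reported with None for that side.
    """
    tables = source.keys() | replica.keys()
    return {
        table: (source.get(table), replica.get(table))
        for table in sorted(tables)
        if source.get(table) != replica.get(table)
    }


# Counts reported in step 5; the replica matched the source exactly.
SOURCE = {
    "asset": 12681,
    "user": 1,
    "album": 28,
    "smart_search": 9624,
    "activity": 0,
    "asset_face": 3917,
}
REPLICA = dict(SOURCE)

mismatches = diff_row_counts(SOURCE, REPLICA)  # empty dict when counts agree
```

Matching counts are a coarse check rather than proof of identical contents, which is one reason keeping the source cluster intact until decommission is a useful safety net.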

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports the immich stack from argocd/manifests/immich/ to
argocd/manifests/immich-ringtail/ with these ringtail-specific
adjustments:

- deployment-ml: runtimeClassName=nvidia + nvidia.com/gpu:1 limit
  to use the RTX 4080. Image tag bumped to v2.6.3-cuda in
  kustomization.yaml.
- ingress-tailscale: tls.hosts=[photos-ringtail] for staging — the
  minikube ingress still owns "photos" until cutover.
- New ArgoCD app immich-ringtail.yaml targeting ringtail, manual
  sync only.
- The immich-db Secret in the immich namespace is created manually
  on first deploy from the source immich-pg-app Secret's password
  (mirrors the minikube pattern documented in argocd/apps/immich.yaml
  README block).

The minikube immich.yaml app is untouched and continues to exist
during the staging window.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eblume force-pushed mikado/migrate-immich-to-ringtail from 3b66829e8a to 674ca2ced9 2026-05-13 13:12:39 -07:00
- argocd/manifests/immich-ringtail/: full port of the immich stack
  (server, ML, valkey, services, ingress, pvc-ml-cache) from
  argocd/manifests/immich/, with ringtail-specific tweaks:
  - deployment-ml: runtimeClassName=nvidia, nvidia.com/gpu:1 limit,
    -cuda image tag
  - deployment-valkey + kustomization: drop the
    registry.ops.eblu.me/blumeops/valkey mirror (arm64-only), use
    upstream docker.io/valkey/valkey:8.1.6 (multi-arch)
  - ingress-tailscale: tls.hosts=[photos-ringtail] for staging
- argocd/apps/immich-ringtail.yaml: new ArgoCD app (manual sync,
  ringtail destination)
- argocd/manifests/nvidia-device-plugin/time-slicing-config.yaml:
  bump replicas 2 -> 4 so the ringtail GPU can be shared by
  frigate + ollama + immich-ml

The immich-db Secret in the immich namespace is created manually
(matching minikube pattern) — see argocd/apps/immich-ringtail.yaml
header for the procedure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three pods Running, 1/1 Ready:
- immich-server: v2.6.3, connected to ringtail pg + valkey
  ("/api/server/ping" returns 200, "/api/server/version" returns
  v2.6.3)
- immich-machine-learning: CUDA variant, RTX 4080 attached
  (nvidia-smi shows 8 GiB used / 16 GiB total — shared with
  frigate via time-slicing), gunicorn workers booted
- immich-valkey: upstream multi-arch docker.io/valkey/valkey:8.1.6

immich-db Secret in the immich namespace created manually with
source's immich-pg-app password (matches minikube pattern).
Tailscale ingress staging hostname: photos-ringtail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minikube immich-tailscale Ingress was deleted; the "photos" Tailscale
device name is now free. Renaming the ringtail ingress claims it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
GitOps decommission of immich + immich-pg on minikube:
- Delete argocd/apps/immich.yaml
- Delete argocd/manifests/immich/ entirely
- Delete argocd/manifests/databases/{immich-pg,external-secret-immich-borgmatic,service-immich-pg-tailscale}.yaml
- Remove those entries from databases/kustomization.yaml

Add ringtail-side immich-pg Tailscale LoadBalancer Service (hostname
"immich-pg") so borgmatic can keep using the same FQDN for nightly
backups. This claims the device name freed by deleting the minikube
service.

The ringtail manifest path stays as argocd/manifests/immich-ringtail/
and the ArgoCD app stays as immich-ringtail — renaming would force a
cascading delete + recreate, with a window where live resources
disappear.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sequence executed:
1. Quiesced source: immich-server + immich-machine-learning on
   minikube scaled to 0 (done in immich-pg-data-migration).
2. Deleted minikube immich-tailscale Ingress; waited for "photos"
   Tailscale device to deregister.
3. (Promote of ringtail pg was done in immich-pg-data-migration.)
4. Renamed ringtail ingress tls.hosts photos-ringtail -> photos.
5. Caddy was already pointing photos.ops.eblu.me ->
   photos.tail8d86e.ts.net so no Ansible change needed.
6. Smoke test: photos.ops.eblu.me/api/server/ping -> 200,
   /api/server/version -> {"major":2,"minor":6,"patch":3}.
7. Borgmatic continuity: added a ringtail immich-pg-tailscale
   Service (same FQDN as before, immich-pg.tail8d86e.ts.net).
   Verified borgmatic role can SELECT count(*) FROM asset over the
   tailnet (returned 12681, matches source).
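
The version payload from the smoke test in step 6 maps onto the `v2.6.3` tag like so. This is a small illustrative sketch; the `version_tag` helper is hypothetical, and it assumes the response body has exactly the `major`/`minor`/`patch` shape shown above.

```python
import json


def version_tag(payload):
    """Format a {"major", "minor", "patch"} JSON payload as a vX.Y.Z tag."""
    version = json.loads(payload)
    return f"v{version['major']}.{version['minor']}.{version['patch']}"


# Response body recorded in the smoke test above.
tag = version_tag('{"major":2,"minor":6,"patch":3}')
```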

Decommission:
- Deleted argocd Application "immich" with --cascade (clears
  Deployments, Services, etc. on minikube).
- Pruned the blumeops-pg Application against the branch, which removed
  the Cluster immich-pg, its ExternalSecret, and the old
  immich-pg-tailscale Service from minikube.
- Deleted leftover Released PVs on minikube.
- Deleted the empty immich namespace on minikube.

Did not verify minikube host memory drop directly (tailscale-ssh
re-auth was prompting at the time). Caller should confirm via
"docker stats minikube" once SSH is re-authenticated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Immich is fully migrated off minikube-indri onto k3s-ringtail. All
six prerequisite cards plus the goal card were converted to historical
documentation by removing the status/branch/requires Mikado frontmatter.

Changelog fragment added at docs/changelog.d/migrate-immich-to-ringtail.infra.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eblume merged commit 947e4310c3 into main 2026-05-13 16:46:20 -07:00