From bca5c4066343cdf68650a987617d88b51ec1f687 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Wed, 13 May 2026 13:12:09 -0700
Subject: [PATCH] C2(migrate-immich-to-ringtail): plan capture GPU contention +
 valkey arch on immich-app-on-ringtail

Two prereqs discovered while bringing the immich stack up on ringtail:

1. nvidia-device-plugin time-slicing on ringtail advertises only 2
   virtual GPUs. Frigate + Ollama consume both. immich-ml's
   nvidia.com/gpu:1 cannot schedule until replicas is bumped to >= 3.
2. The registry.ops.eblu.me/blumeops/valkey image was built on indri
   (arm64) and is single-arch. On ringtail (amd64) the container
   crashloops with "exec format error". Use the upstream multi-arch
   docker.io/valkey/valkey image directly until the mirror gets a
   multi-arch tag.

Card body updated to capture both. Next impl incorporates the fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/how-to/immich/immich-app-on-ringtail.md | 28 +++++++++++++++-----
 1 file changed, 22 insertions(+), 6 deletions(-)
diff --git a/docs/how-to/immich/immich-app-on-ringtail.md b/docs/how-to/immich/immich-app-on-ringtail.md
index 41266ca..2d23c1d 100644
--- a/docs/how-to/immich/immich-app-on-ringtail.md
+++ b/docs/how-to/immich/immich-app-on-ringtail.md
@@ -26,12 +26,28 @@ in [[immich-cutover-and-decommission]].
   `argocd/manifests/immich/`:
   - `deployment-server.yaml` — point `DB_HOSTNAME` at the ringtail
     pg service.
-  - `deployment-ml.yaml` — add a node selector / toleration so it
-    schedules where the GPU is, and a `resources.limits` for
-    `nvidia.com/gpu: 1`. Verify the immich-ml image actually wants
-    CUDA (it has CPU and CUDA variants — check the upstream chart).
-    See `argocd/manifests/frigate/` for the existing GPU pod pattern.
-  - `deployment-valkey.yaml` — straight port.
+  - `deployment-ml.yaml` — use `runtimeClassName: nvidia` + a
+    `resources.limits` for `nvidia.com/gpu: 1`. Use the `-cuda` tag
+    of the immich-ml image (set in the kustomization). Ringtail is
+    single-node, so no node selector is needed. See
+    `argocd/manifests/frigate/` for the existing GPU pod pattern.
+
+    **GPU contention discovery:** ringtail's `nvidia-device-plugin`
+    is configured with `timeSlicing.replicas: 2`. Frigate + Ollama
+    already consume both virtual slices. Adding immich-ml requires
+    raising `replicas` to at least 3. Edit
+    `argocd/manifests/nvidia-device-plugin/configmap.yaml` (or
+    wherever the device-plugin config lives) and re-sync the
+    `nvidia-device-plugin` ArgoCD app. The plugin pod restarts and
+    the new advertised count appears as the node's
+    `nvidia.com/gpu` allocatable.
+  - `deployment-valkey.yaml` — straight port, BUT use the upstream
+    multi-arch `docker.io/valkey/valkey:<version>` image. Do NOT
+    use the `registry.ops.eblu.me/blumeops/valkey` rewrite in the
+    kustomization. That mirror was built on indri (arm64) and is
+    single-arch; on ringtail (amd64) the container fails with
+    `exec format error` and the pod enters CrashLoopBackOff. Once
+    the mirror carries a multi-arch tag, the rewrite can return.
   - `service*.yaml` — straight port.
   - `pvc-ml-cache.yaml` — straight port (empty `local-path` PVC).
   - `pv-nfs.yaml` + `pvc.yaml` — already covered by