C2(migrate-immich-to-ringtail): plan capture GPU contention + valkey arch on immich-app-on-ringtail
Two discovered prereqs while bringing the immich stack up on ringtail:

1. nvidia-device-plugin time-slicing on ringtail advertises only 2
   virtual GPUs. Frigate + Ollama consume both. immich-ml's
   nvidia.com/gpu:1 cannot schedule until replicas is bumped to >= 3.

2. The registry.ops.eblu.me/blumeops/valkey image was built on indri
   (arm64) and is single-arch. Pulling on ringtail (amd64) crashloops
   with "exec format error". Use the upstream multi-arch
   docker.io/valkey/valkey image directly until the mirror gets a
   multi-arch tag.

Card body updated to capture both. Next impl incorporates the fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
parent 355be3fbc4
commit bca5c40663
1 changed file with 22 additions and 6 deletions
@@ -26,12 +26,28 @@ in [[immich-cutover-and-decommission]].
 `argocd/manifests/immich/`:
 - `deployment-server.yaml` — point `DB_HOSTNAME` at the ringtail
   pg service.
-- `deployment-ml.yaml` — add a node selector / toleration so it
-  schedules where the GPU is, and a `resources.limits` for
-  `nvidia.com/gpu: 1`. Verify the immich-ml image actually wants
-  CUDA (it has CPU and CUDA variants — check the upstream chart).
-  See `argocd/manifests/frigate/` for the existing GPU pod pattern.
-- `deployment-valkey.yaml` — straight port.
+- `deployment-ml.yaml` — use `runtimeClassName: nvidia` + a
+  `resources.limits` for `nvidia.com/gpu: 1`. Use the `-cuda` tag
+  of the immich-ml image (set in kustomization). Ringtail is
+  single-node, so no node selector needed. See
+  `argocd/manifests/frigate/` for the existing GPU pod pattern.
+
+  **GPU contention discovery:** ringtail's `nvidia-device-plugin`
+  is configured with `timeSlicing.replicas: 2`. Frigate + Ollama
+  already consume both virtual slices. Adding immich-ml requires
+  bumping the count to >= 3. Edit
+  `argocd/manifests/nvidia-device-plugin/configmap.yaml` (or
+  wherever the device-plugin config lives) and re-sync the
+  `nvidia-device-plugin` ArgoCD app. The plugin pod restarts and
+  the new advertised count appears as the node's
+  `nvidia.com/gpu` allocatable.
+- `deployment-valkey.yaml` — straight port, BUT use the upstream
+  multi-arch `docker.io/valkey/valkey:<version>` image — do NOT
+  use the `registry.ops.eblu.me/blumeops/valkey` rewrite in the
+  kustomization. That mirror was built on indri (arm64) and is
+  single-arch; pulling it on ringtail (amd64) gets `exec format
+  error` in CrashLoopBackOff. The mirror should eventually carry
+  a multi-arch tag, at which point the rewrite can return.
 - `service*.yaml` — straight port.
 - `pvc-ml-cache.yaml` — straight port (empty `local-path` PVC).
 - `pv-nfs.yaml` + `pvc.yaml` — already covered by
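
For reference, the time-slicing bump the card describes might look roughly like the fragment below. This is only a sketch: the ConfigMap name, namespace, and data key are assumptions (the card itself says the config may live somewhere other than `configmap.yaml`). Only the `nvidia.com/gpu` resource name and the replica counts come from the card.

```yaml
# Sketch of argocd/manifests/nvidia-device-plugin/configmap.yaml.
# metadata.name, metadata.namespace, and the "config.yaml" key are
# hypothetical; check the real manifest before copying anything.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: nvidia-device-plugin
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            # Was 2 (Frigate + Ollama). immich-ml needs a third slice.
            replicas: 3
```

After the `nvidia-device-plugin` app is re-synced and the plugin pod restarts, the node's `nvidia.com/gpu` allocatable should report the new count, as the card notes.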
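
The image handling in the card (no valkey mirror rewrite, `-cuda` tag for immich-ml) could be expressed in the kustomization along these lines. This is a hedged sketch, not the repo's actual file: the immich-ml image name is assumed to be the upstream `ghcr.io/immich-app/immich-machine-learning`, and `<version>` is left as a placeholder exactly as in the card.

```yaml
# Sketch of the images section in argocd/manifests/immich/kustomization.yaml.
images:
  # Assumed upstream image name; the card only says to use the -cuda tag.
  - name: ghcr.io/immich-app/immich-machine-learning
    newTag: "<version>-cuda"
  # Deliberately no rewrite for valkey on ringtail (amd64). The mirror is
  # arm64-only, so an entry like this one would crashloop with
  # "exec format error":
  # - name: docker.io/valkey/valkey
  #   newName: registry.ops.eblu.me/blumeops/valkey
```

Leaving the rewrite commented out rather than deleting it keeps a visible reminder to restore it once the mirror carries a multi-arch tag.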