Port Frigate NVR to ringtail k3s with GPU acceleration #217

Merged
eblume merged 23 commits from feature/frigate-ringtail-gpu into main 2026-02-19 14:27:04 -08:00

Summary

  • Enable NVIDIA container toolkit on ringtail NixOS and configure k3s containerd with nvidia runtime
  • Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet) to expose nvidia.com/gpu resources
  • Re-target Frigate from indri minikube (arm64, ZMQ detector) to ringtail k3s (x86_64, TensorRT/ONNX)
  • Switch Frigate image to -tensorrt variant with GPU resource limits and increased shared memory
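The GPU-limit and shared-memory changes can be sketched as a deployment fragment (a sketch only: the image tag, container name, and volume name are illustrative, not the exact manifest in this PR):

```yaml
# Sketch of the Frigate deployment changes: GPU resource limit plus a
# memory-backed /dev/shm volume. Names/tags are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frigate
spec:
  template:
    spec:
      containers:
        - name: frigate
          image: ghcr.io/blakeblackshear/frigate:stable-tensorrt
          resources:
            limits:
              nvidia.com/gpu: 1   # claim the GPU exposed by the device plugin
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 512Mi      # the "increased shared memory"
```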

Manual Prerequisites

  1. NFS access: Verify ringtail can mount sifaka:/volume1/frigate
    ssh ringtail 'sudo mount -t nfs sifaka:/volume1/frigate /mnt/storage1 && ls /mnt/storage1 && sudo umount /mnt/storage1'
    
  2. YOLO model: Verify /volume1/frigate/models/yolov9m.onnx exists on sifaka

Deployment Steps

  1. Provision ringtail: mise run provision-ringtail
  2. Sync ArgoCD apps: argocd app sync apps --prune
  3. Deploy NVIDIA device plugin: argocd app sync nvidia-device-plugin
  4. Verify GPU: kubectl --context=k3s-ringtail get nodes -o json | jq '.items[].status.capacity'
  5. Deploy Frigate: argocd app sync frigate

Verification

  • nvidia.com/gpu: 1 visible in node capacity
  • Frigate pod running with GPU allocated
  • Frigate UI loads at https://nvr.ops.eblu.me
  • Detector shows ONNX/TensorRT on System page
  • Camera feed with bounding boxes in live view
  • TensorRT engine build completes (watch logs on first start)

🤖 Generated with Claude Code

Migrate Frigate from indri's minikube (arm64, ZMQ detector) to ringtail's
k3s cluster to leverage the RTX 4080 for TensorRT-accelerated ONNX inference.

- Enable nvidia-container-toolkit and configure k3s containerd nvidia runtime
- Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet)
- Re-target Frigate ArgoCD app to ringtail k3s cluster
- Switch image to x86_64 tensorrt variant with runtimeClassName: nvidia
- Add GPU resource limit (nvidia.com/gpu: 1) and increase shm to 512Mi
- Replace ZMQ detector with ONNX (auto-selects TensorRT execution provider)
- Update NFS PV and database PVC comments for ringtail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -0,0 +22,4 @@
      priorityClassName: system-node-critical
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.17.0

can you please check this is the latest release

eblume marked this conversation as resolved
mount.nfs was missing, preventing NFS PersistentVolume mounts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
K3s ships containerd 2.0+ which uses config v3 format. The plugin key
path is 'io.containerd.cri.v1.runtime' not 'io.containerd.grpc.v1.cri'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
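The key-path change can be illustrated with a containerd config fragment (a sketch; surrounding options and k3s config-template boilerplate omitted):

```toml
# containerd 1.x (config v2): CRI runtime options lived under the grpc plugin:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]

# containerd 2.0+ (config v3), as shipped by k3s: the same options move here:
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
```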
The device plugin needs access to NVIDIA libraries (NVML) to discover
GPUs. Running with the nvidia runtime makes device files visible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NixOS splits nvidia-container-toolkit into separate derivations, making
the nvidia-container-runtime binary path unreliable in containerd config.
CDI (Container Device Interface) is the modern approach:

- Enable CDI in k3s containerd config (cdi_spec_dirs: /var/run/cdi)
- Device plugin uses CDI annotations to inject GPU devices
- Remove RuntimeClass (not needed with CDI)
- Remove runtimeClassName from Frigate deployment
- Mount CDI specs into device plugin pod

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The device plugin needs libnvidia-ml.so to discover GPUs even when using
CDI annotations. Mount /run/opengl-driver/lib (NixOS NVIDIA lib path)
into the pod and set LD_LIBRARY_PATH.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
go-nvml uses dl.Open which looks in standard library paths.
Mount to /usr/lib/x86_64-linux-gnu for reliable discovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NVML needs both libnvidia-ml.so and /dev/nvidia* device nodes.
Mount libs to a non-clobbering path and run privileged (matching
NVIDIA's official deployment) for device file access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CDI annotations require NVML validation that fails on NixOS. Use the
default envvar strategy for the device plugin — CDI device injection
still works at the containerd level via enable_cdi=true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NixOS /run/opengl-driver/lib contains symlinks to /nix/store paths.
Without mounting the nix store, the symlinks are dangling inside the
container and libnvidia-ml.so can't be loaded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use CDI-based device injection instead of nvidia-container-runtime.
The NixOS nvidia-container-toolkit module generates CDI specs with all
the correct nix store paths, so containerd's native CDI support handles
GPU device and library injection without a custom runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
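On the NixOS side this is roughly one option (a sketch; `hardware.nvidia-container-toolkit.enable` is the upstream NixOS option that generates CDI specs under `/var/run/cdi` with resolved `/nix/store` paths):

```nix
{
  # Generates CDI specs for the installed NVIDIA driver so containerd's
  # native CDI support can inject GPU devices and driver libraries.
  hardware.nvidia-container-toolkit.enable = true;
}
```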
Replace the CDI device-list-strategy approach (which fails because the
device plugin generates its own CDI specs and can't find libs on NixOS)
with the nvidia-container-runtime.cdi runtime handler approach:

- Add wrapper script at /etc/nvidia-container-runtime/ that provides
  runc in PATH for nvidia-container-runtime.cdi
- Register nvidia runtime handler in k3s containerd config
- Create RuntimeClass for GPU workloads
- Revert device plugin to default envvar strategy (already working)
- Add runtimeClassName: nvidia to Frigate deployment

The nvidia-container-runtime.cdi binary reads the NixOS-generated CDI
specs from /var/run/cdi/ and injects GPU devices and driver libraries
into containers at create time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
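The RuntimeClass half of this is small; a sketch (the handler name must match the runtime key registered in the k3s containerd config):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# Must match the runtime handler registered in containerd;
# GPU pods opt in with `runtimeClassName: nvidia`.
handler: nvidia
```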
The CDI spec generated by NixOS uses index-based device names (0, all)
not UUIDs. The device plugin must match by using --device-id-strategy=index,
otherwise nvidia-container-runtime.cdi fails to resolve CDI devices.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
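In the device plugin DaemonSet this is a single flag (a sketch showing only the relevant container spec; `--device-id-strategy` is a real k8s-device-plugin option with `uuid` and `index` values):

```yaml
containers:
  - name: nvidia-device-plugin
    image: nvcr.io/nvidia/k8s-device-plugin:v0.17.0
    args:
      # Advertise devices by index ("0") to match the NixOS-generated
      # CDI spec, which names devices by index rather than GPU UUID.
      - --device-id-strategy=index
```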
GPU resources can't be shared during rolling updates — the old pod
holds nvidia.com/gpu, preventing the new pod from scheduling. The
Recreate strategy ensures the old pod is terminated before the new
one starts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
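A sketch of the strategy change:

```yaml
spec:
  # With a single GPU, RollingUpdate deadlocks: the old pod holds
  # nvidia.com/gpu until the new pod is Ready, but the new pod can't
  # schedule without it. Recreate terminates the old pod first.
  strategy:
    type: Recreate
```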
The YOLOv9m ONNX model contains operations that cannot be fully
partitioned to the CUDA execution provider, causing CUDA graph capture
to fail on the -tensorrt image. Use the default model that ships with
the image and is tested for GPU inference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The YOLOv9m model fails with CUDA graph capture on the tensorrt image.
Try YOLO-NAS-S which has a different architecture that may be fully
partitionable to the CUDA execution provider.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
YOLO-NAS expects uint8 input tensors, not float32.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
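Frigate's model config can declare this; a sketch, assuming Frigate's `input_dtype` model option and an illustrative model path:

```yaml
model:
  model_type: yolonas
  path: /config/model_cache/yolo_nas_s.onnx   # illustrative path
  width: 320
  height: 320
  input_tensor: nchw
  input_dtype: int   # YOLO-NAS takes uint8 input, not normalized float32
```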
These services moved to ringtail k3s and are no longer autodiscovered
by homepage (which runs on indri's minikube). Add them as static
service entries in the Infrastructure group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create a stable symlink at /etc/nvidia-driver/lib pointing to the
nvidia driver package's lib directory. The device plugin now mounts
only the driver libs it needs instead of the entire nix store.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
frigate-notify was firing on every MQTT detection event regardless of
zone, causing notification spam. Add filters to match the Frigate
review config: only alert for person/car in the driveway_entrance
zone, and drop all unzoned events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
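A sketch of the frigate-notify filter config, assuming its zone/label allow-list options (the zone and label names come from this PR; the exact key layout should be checked against the frigate-notify docs):

```yaml
alerts:
  zones:
    unzoned: drop          # drop all events with no zone
    allow:
      - driveway_entrance  # match the Frigate review config
  labels:
    allow:
      - person
      - car
```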
Revert webapi change — polling latency is too high for alerts.
MQTT with zone/label filters gives sub-second delivery. Add
TZ=America/Los_Angeles to frigate-notify for local timestamps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eblume merged commit d5d32fe91f into main 2026-02-19 14:27:04 -08:00