Port Frigate NVR to ringtail k3s with GPU acceleration #217

Merged
eblume merged 23 commits from feature/frigate-ringtail-gpu into main 2026-02-19 14:27:04 -08:00

Summary

  • Enable NVIDIA container toolkit on ringtail NixOS and configure k3s containerd with nvidia runtime
  • Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet) to expose nvidia.com/gpu resources
  • Re-target Frigate from indri minikube (arm64, ZMQ detector) to ringtail k3s (x86_64, TensorRT/ONNX)
  • Switch Frigate image to -tensorrt variant with GPU resource limits and increased shared memory
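The GPU-limit and shared-memory changes can be sketched as a deployment fragment (a sketch only: the image tag, container name, and volume name are illustrative, not the exact manifest in this PR):

```yaml
# Sketch of the Frigate deployment changes: GPU resource limit plus a
# memory-backed /dev/shm volume. Names/tags are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frigate
spec:
  template:
    spec:
      containers:
        - name: frigate
          image: ghcr.io/blakeblackshear/frigate:stable-tensorrt
          resources:
            limits:
              nvidia.com/gpu: 1   # claim the GPU exposed by the device plugin
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 512Mi      # the "increased shared memory"
```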

Manual Prerequisites

  1. NFS access: Verify ringtail can mount sifaka:/volume1/frigate
    ssh ringtail 'sudo mount -t nfs sifaka:/volume1/frigate /mnt/storage1 && ls /mnt/storage1 && sudo umount /mnt/storage1'
    
  2. YOLO model: Verify /volume1/frigate/models/yolov9m.onnx exists on sifaka

Deployment Steps

  1. Provision ringtail: mise run provision-ringtail
  2. Sync ArgoCD apps: argocd app sync apps --prune
  3. Deploy NVIDIA device plugin: argocd app sync nvidia-device-plugin
  4. Verify GPU: kubectl --context=k3s-ringtail get nodes -o json | jq '.items[].status.capacity'
  5. Deploy Frigate: argocd app sync frigate

Verification

  • nvidia.com/gpu: 1 visible in node capacity
  • Frigate pod running with GPU allocated
  • Frigate UI loads at https://nvr.ops.eblu.me
  • Detector shows ONNX/TensorRT on System page
  • Camera feed with bounding boxes in live view
  • TensorRT engine build completes (watch logs on first start)

🤖 Generated with Claude Code

Migrate Frigate from indri's minikube (arm64, ZMQ detector) to ringtail's
k3s cluster to leverage the RTX 4080 for TensorRT-accelerated ONNX inference.

- Enable nvidia-container-toolkit and configure k3s containerd nvidia runtime
- Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet)
- Re-target Frigate ArgoCD app to ringtail k3s cluster
- Switch image to x86_64 tensorrt variant with runtimeClassName: nvidia
- Add GPU resource limit (nvidia.com/gpu: 1) and increase shm to 512Mi
- Replace ZMQ detector with ONNX (auto-selects TensorRT execution provider)
- Update NFS PV and database PVC comments for ringtail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@@ -0,0 +22,4 @@
      priorityClassName: system-node-critical
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.17.0

can you please check this is the latest release

eblume marked this conversation as resolved
mount.nfs was missing, preventing NFS PersistentVolume mounts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
K3s ships containerd 2.0+ which uses config v3 format. The plugin key
path is 'io.containerd.cri.v1.runtime' not 'io.containerd.grpc.v1.cri'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
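The key-path change can be illustrated with a containerd config fragment (a sketch; surrounding options and k3s config-template boilerplate omitted):

```toml
# containerd 1.x (config v2): CRI runtime options lived under the grpc plugin:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]

# containerd 2.0+ (config v3), as shipped by k3s: the same options move here:
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
```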
The device plugin needs access to NVIDIA libraries (NVML) to discover
GPUs. Running with the nvidia runtime makes device files visible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NixOS splits nvidia-container-toolkit into separate derivations, making
the nvidia-container-runtime binary path unreliable in containerd config.
CDI (Container Device Interface) is the modern approach:

- Enable CDI in k3s containerd config (cdi_spec_dirs: /var/run/cdi)
- Device plugin uses CDI annotations to inject GPU devices
- Remove RuntimeClass (not needed with CDI)
- Remove runtimeClassName from Frigate deployment
- Mount CDI specs into device plugin pod

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The device plugin needs libnvidia-ml.so to discover GPUs even when using
CDI annotations. Mount /run/opengl-driver/lib (NixOS NVIDIA lib path)
into the pod and set LD_LIBRARY_PATH.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
go-nvml uses dl.Open which looks in standard library paths.
Mount to /usr/lib/x86_64-linux-gnu for reliable discovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NVML needs both libnvidia-ml.so and /dev/nvidia* device nodes.
Mount libs to a non-clobbering path and run privileged (matching
NVIDIA's official deployment) for device file access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
CDI annotations require NVML validation that fails on NixOS. Use the
default envvar strategy for the device plugin — CDI device injection
still works at the containerd level via enable_cdi=true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
NixOS /run/opengl-driver/lib contains symlinks to /nix/store paths.
Without mounting the nix store, the symlinks are dangling inside the
container and libnvidia-ml.so can't be loaded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use CDI-based device injection instead of nvidia-container-runtime.
The NixOS nvidia-container-toolkit module generates CDI specs with all
the correct nix store paths, so containerd's native CDI support handles
GPU device and library injection without a custom runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
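On the NixOS side this is roughly one option (a sketch; `hardware.nvidia-container-toolkit.enable` is the upstream NixOS option that generates CDI specs under `/var/run/cdi` with resolved `/nix/store` paths):

```nix
{
  # Generates CDI specs for the installed NVIDIA driver so containerd's
  # native CDI support can inject GPU devices and driver libraries.
  hardware.nvidia-container-toolkit.enable = true;
}
```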
Replace the CDI device-list-strategy approach (which fails because the
device plugin generates its own CDI specs and can't find libs on NixOS)
with the nvidia-container-runtime.cdi runtime handler approach:

- Add wrapper script at /etc/nvidia-container-runtime/ that provides
  runc in PATH for nvidia-container-runtime.cdi
- Register nvidia runtime handler in k3s containerd config
- Create RuntimeClass for GPU workloads
- Revert device plugin to default envvar strategy (already working)
- Add runtimeClassName: nvidia to Frigate deployment

The nvidia-container-runtime.cdi binary reads the NixOS-generated CDI
specs from /var/run/cdi/ and injects GPU devices and driver libraries
into containers at create time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
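The RuntimeClass half of this is small; a sketch (the handler name must match the runtime key registered in the k3s containerd config):

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
# Must match the runtime handler registered in containerd;
# GPU pods opt in with `runtimeClassName: nvidia`.
handler: nvidia
```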
The CDI spec generated by NixOS uses index-based device names (0, all)
not UUIDs. The device plugin must match by using --device-id-strategy=index,
otherwise nvidia-container-runtime.cdi fails to resolve CDI devices.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
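In the device plugin DaemonSet this is a single flag (a sketch showing only the relevant container spec; `--device-id-strategy` is a real k8s-device-plugin option with `uuid` and `index` values):

```yaml
containers:
  - name: nvidia-device-plugin
    image: nvcr.io/nvidia/k8s-device-plugin:v0.17.0
    args:
      # Advertise devices by index ("0") to match the NixOS-generated
      # CDI spec, which names devices by index rather than GPU UUID.
      - --device-id-strategy=index
```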
GPU resources can't be shared during rolling updates — the old pod
holds nvidia.com/gpu, preventing the new pod from scheduling. The
Recreate strategy ensures the old pod is terminated before the new
one starts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
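A sketch of the strategy change:

```yaml
spec:
  # With a single GPU, RollingUpdate deadlocks: the old pod holds
  # nvidia.com/gpu until the new pod is Ready, but the new pod can't
  # schedule without it. Recreate terminates the old pod first.
  strategy:
    type: Recreate
```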
The YOLOv9m ONNX model contains operations that cannot be fully
partitioned to the CUDA execution provider, causing CUDA graph capture
to fail on the -tensorrt image. Use the default model that ships with
the image and is tested for GPU inference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The YOLOv9m model fails with CUDA graph capture on the tensorrt image.
Try YOLO-NAS-S which has a different architecture that may be fully
partitionable to the CUDA execution provider.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
YOLO-NAS expects uint8 input tensors, not float32.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
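Frigate's model config can declare this; a sketch, assuming Frigate's `input_dtype` model option and an illustrative model path:

```yaml
model:
  model_type: yolonas
  path: /config/model_cache/yolo_nas_s.onnx   # illustrative path
  width: 320
  height: 320
  input_tensor: nchw
  input_dtype: int   # YOLO-NAS takes uint8 input, not normalized float32
```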
These services moved to ringtail k3s and are no longer autodiscovered
by homepage (which runs on indri's minikube). Add them as static
service entries in the Infrastructure group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Create a stable symlink at /etc/nvidia-driver/lib pointing to the
nvidia driver package's lib directory. The device plugin now mounts
only the driver libs it needs instead of the entire nix store.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
frigate-notify was firing on every MQTT detection event regardless of
zone, causing notification spam. Add filters to match the Frigate
review config: only alert for person/car in the driveway_entrance
zone, and drop all unzoned events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
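A sketch of the frigate-notify filter config, assuming its zone/label allow-list options (the zone and label names come from this PR; the exact key layout should be checked against the frigate-notify docs):

```yaml
alerts:
  zones:
    unzoned: drop          # drop all events with no zone
    allow:
      - driveway_entrance  # match the Frigate review config
  labels:
    allow:
      - person
      - car
```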
Revert webapi change — polling latency is too high for alerts.
MQTT with zone/label filters gives sub-second delivery. Add
TZ=America/Los_Angeles to frigate-notify for local timestamps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eblume merged commit d5d32fe91f into main 2026-02-19 14:27:04 -08:00