Port Frigate NVR to ringtail k3s with GPU acceleration #217

Merged

eblume merged 23 commits from feature/frigate-ringtail-gpu into main

2026-02-19 14:27:04 -08:00

Author	SHA1	Message	Date
Erich Blume	03dc4a5235	Keep MQTT for real-time alerts, add Pacific timezone Revert webapi change — polling latency is too high for alerts. MQTT with zone/label filters gives sub-second delivery. Add TZ=America/Los_Angeles to frigate-notify for local timestamps. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 14:03:07 -08:00
Erich Blume	5d44213017	Filter frigate-notify alerts by zone and label frigate-notify was firing on every MQTT detection event regardless of zone, causing notification spam. Add filters to match the Frigate review config: only alert for person/car in the driveway_entrance zone, and drop all unzoned events. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:59:29 -08:00
Erich Blume	333950d3ba	Replace /nix/store mount with targeted nvidia driver lib path Create a stable symlink at /etc/nvidia-driver/lib pointing to the nvidia driver package's lib directory. The device plugin now mounts only the driver libs it needs instead of the entire nix store. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:47:23 -08:00
Erich Blume	a5949f228d	Add Frigate and Ntfy as static homepage services These services moved to ringtail k3s and are no longer autodiscovered by homepage (which runs on indri's minikube). Add them as static service entries in the Infrastructure group. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:39:34 -08:00
Erich Blume	95873bcca2	Fix YOLO-NAS input dtype: use int (uint8) not float YOLO-NAS expects uint8 input tensors, not float32. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:34:13 -08:00
Erich Blume	4b12e7f7fa	Use YOLO-NAS model for TensorRT-compatible ONNX inference The YOLOv9m model fails with CUDA graph capture on the tensorrt image. Try YOLO-NAS-S which has a different architecture that may be fully partitionable to the CUDA execution provider. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:31:35 -08:00
Erich Blume	870d602019	Use Frigate default model instead of custom YOLOv9m The YOLOv9m ONNX model has ops not fully partitionable to CUDA EP, causing CUDA graph capture to fail on the -tensorrt image. Use the default model that ships with the image and is tested for GPU inference. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:28:35 -08:00
Erich Blume	27353792ed	Use Recreate strategy for Frigate deployment GPU resources can't be shared during rolling updates — the old pod holds nvidia.com/gpu preventing the new pod from scheduling. Recreate strategy ensures the old pod is terminated before the new one starts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:26:11 -08:00
Erich Blume	bb1e1e5af9	Use index-based device IDs in nvidia device plugin The CDI spec generated by NixOS uses index-based device names (0, all) not UUIDs. The device plugin must match by using --device-id-strategy=index, otherwise nvidia-container-runtime.cdi fails to resolve CDI devices. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:24:31 -08:00
Erich Blume	9192a31204	Use nvidia-container-runtime.cdi for GPU workload injection Replace the CDI device-list-strategy approach (which fails because the device plugin generates its own CDI specs and can't find libs on NixOS) with the nvidia-container-runtime.cdi runtime handler approach: - Add wrapper script at /etc/nvidia-container-runtime/ that provides runc in PATH for nvidia-container-runtime.cdi - Register nvidia runtime handler in k3s containerd config - Create RuntimeClass for GPU workloads - Revert device plugin to default envvar strategy (already working) - Add runtimeClassName: nvidia to Frigate deployment The nvidia-container-runtime.cdi binary reads the NixOS-generated CDI specs from /var/run/cdi/ and injects GPU devices and driver libraries into containers at create time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:20:01 -08:00
Erich Blume	37f625b1fa	Switch nvidia device plugin to CDI device list strategy Use CDI-based device injection instead of nvidia-container-runtime. The NixOS nvidia-container-toolkit module generates CDI specs with all the correct nix store paths, so containerd's native CDI support handles GPU device and library injection without a custom runtime. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:04:38 -08:00
Erich Blume	1556eaa5e4	Mount /nix/store to resolve NVIDIA library symlinks in device plugin NixOS /run/opengl-driver/lib contains symlinks to /nix/store paths. Without mounting the nix store, the symlinks are dangling inside the container and libnvidia-ml.so can't be loaded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:42:46 -08:00
Erich Blume	7b7358225c	Remove CDI device-list-strategy from device plugin CDI annotations require NVML validation that fails on NixOS. Use the default envvar strategy for the device plugin — CDI device injection still works at the containerd level via enable_cdi=true. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:39:36 -08:00
Erich Blume	2cd32108bd	Run device plugin as privileged for GPU device node access NVML needs both libnvidia-ml.so and /dev/nvidia* device nodes. Mount libs to a non-clobbering path and run privileged (matching NVIDIA's official deployment) for device file access. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:38:11 -08:00
Erich Blume	4427eb77f2	Mount NVIDIA libs to standard lib path for NVML discovery go-nvml uses dl.Open which looks in standard library paths. Mount to /usr/lib/x86_64-linux-gnu for reliable discovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:36:15 -08:00
Erich Blume	5194de13b9	Mount host NVIDIA libraries into device plugin for NVML access The device plugin needs libnvidia-ml.so to discover GPUs even when using CDI annotations. Mount /run/opengl-driver/lib (NixOS NVIDIA lib path) into the pod and set LD_LIBRARY_PATH. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:34:20 -08:00
Erich Blume	912dfcab10	Switch to CDI for GPU device injection instead of nvidia-container-runtime NixOS splits nvidia-container-toolkit into separate derivations, making the nvidia-container-runtime binary path unreliable in containerd config. CDI (Container Device Interface) is the modern approach: - Enable CDI in k3s containerd config (cdi_spec_dirs: /var/run/cdi) - Device plugin uses CDI annotations to inject GPU devices - Remove RuntimeClass (not needed with CDI) - Remove runtimeClassName from Frigate deployment - Mount CDI specs into device plugin pod Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:28:16 -08:00
Erich Blume	7e498c5a34	Add nvidia runtimeClass to device plugin DaemonSet The device plugin needs access to NVIDIA libraries (NVML) to discover GPUs. Running with the nvidia runtime makes device files visible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:23:18 -08:00
Erich Blume	57e5aeccc2	Fix containerd nvidia runtime config for v3 format K3s ships containerd 2.0+ which uses config v3 format. The plugin key path is 'io.containerd.cri.v1.runtime' not 'io.containerd.grpc.v1.cri'. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:05:46 -08:00
Erich Blume	986505c7ef	Enable NFS client support on ringtail for k3s NFS volumes mount.nfs was missing, preventing NFS PersistentVolume mounts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 11:50:50 -08:00
Erich Blume	cf5194c138	Add nvidia-device-plugin to service version tracking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 11:45:53 -08:00
Erich Blume	3e6d997c29	Bump NVIDIA k8s-device-plugin to v0.18.2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 11:44:10 -08:00
Erich Blume	4e16116c4f	Port Frigate NVR to ringtail k3s with GPU acceleration Migrate Frigate from indri's minikube (arm64, ZMQ detector) to ringtail's k3s cluster to leverage the RTX 4080 for TensorRT-accelerated ONNX inference. - Enable nvidia-container-toolkit and configure k3s containerd nvidia runtime - Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet) - Re-target Frigate ArgoCD app to ringtail k3s cluster - Switch image to x86_64 tensorrt variant with runtimeClassName: nvidia - Add GPU resource limit (nvidia.com/gpu: 1) and increase shm to 512Mi - Replace ZMQ detector with ONNX (auto-selects TensorRT execution provider) - Update NFS PV and database PVC comments for ringtail Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 11:41:47 -08:00
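
Note on commit 5d44213017 (zone and label filtering): the rule described there (alert only for person/car in the driveway_entrance zone, and drop all unzoned events) can be sketched as a small predicate. This is a hypothetical illustration, not frigate-notify's actual code or configuration. The payload shape (an `after` object carrying `label` and `entered_zones`) is an assumption about Frigate's MQTT event messages, and the names `ALLOWED_LABELS`, `ALLOWED_ZONES`, and `should_alert` are made up for this example.

```python
import json

# Values taken from the commit message; the constant names are hypothetical.
ALLOWED_LABELS = {"person", "car"}
ALLOWED_ZONES = {"driveway_entrance"}


def should_alert(payload: str) -> bool:
    """Return True if an MQTT event payload passes the zone/label filter.

    Assumes the payload is JSON with an "after" object that carries
    "label" (str) and "entered_zones" (list of str). This shape is an
    assumption, not something confirmed by the PR.
    """
    event = json.loads(payload)
    after = event.get("after") or {}
    label = after.get("label")
    zones = set(after.get("entered_zones") or [])

    # Drop unzoned events outright, matching "drop all unzoned events".
    if not zones:
        return False

    # Require both an allowed label and at least one allowed zone.
    return label in ALLOWED_LABELS and bool(zones & ALLOWED_ZONES)


if __name__ == "__main__":
    sample = json.dumps(
        {"type": "new", "after": {"label": "person", "entered_zones": ["driveway_entrance"]}}
    )
    print(should_alert(sample))
```

Filtering at the subscriber keeps the MQTT path (and its low delivery latency, per commit 03dc4a5235) while cutting the notification spam that motivated the change.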