Port Frigate NVR to ringtail k3s with GPU acceleration #217

Merged
eblume merged 23 commits from feature/frigate-ringtail-gpu into main 2026-02-19 14:27:04 -08:00

Author SHA1 Message Date
03dc4a5235 Keep MQTT for real-time alerts, add Pacific timezone
Revert webapi change — polling latency is too high for alerts.
MQTT with zone/label filters gives sub-second delivery. Add
TZ=America/Los_Angeles to frigate-notify for local timestamps.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 14:03:07 -08:00
5d44213017 Filter frigate-notify alerts by zone and label
frigate-notify was firing on every MQTT detection event regardless of
zone, causing notification spam. Add filters to match the Frigate
review config: only alert for person/car in the driveway_entrance
zone, and drop all unzoned events.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:59:29 -08:00
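
For illustration, the zone and label filter described in 5d44213017 might look roughly like the fragment below. This is a sketch, not the PR's actual file: the key names (`alerts.zones.unzoned`, `alerts.zones.allow`, `alerts.labels.allow`) are an assumption based on frigate-notify's documented config layout and should be checked against the deployed version.

```yaml
# Hypothetical frigate-notify config fragment (key names assumed, not from this PR).
alerts:
  zones:
    unzoned: drop          # discard events that never entered a zone
    allow:
      - driveway_entrance  # zone named in the commit message
  labels:
    allow:
      - person
      - car
```
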
333950d3ba Replace /nix/store mount with targeted nvidia driver lib path
Create a stable symlink at /etc/nvidia-driver/lib pointing to the
nvidia driver package's lib directory. The device plugin now mounts
only the driver libs it needs instead of the entire nix store.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:47:23 -08:00
a5949f228d Add Frigate and Ntfy as static homepage services
These services moved to ringtail k3s and are no longer autodiscovered
by homepage (which runs on indri's minikube). Add them as static
service entries in the Infrastructure group.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:39:34 -08:00
95873bcca2 Fix YOLO-NAS input dtype: use int (uint8) not float
YOLO-NAS expects uint8 input tensors, not float32.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:34:13 -08:00
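
As a rough sketch of the dtype fix in 95873bcca2, a Frigate config for YOLO-NAS on the ONNX detector could resemble the fragment below. The model path is hypothetical and other model settings (dimensions, pixel format, labelmap) are omitted; only `model_type: yolonas` and `input_dtype: int` reflect what the commits describe.

```yaml
# Sketch of a Frigate config fragment; the path is a placeholder.
detectors:
  onnx:
    type: onnx
model:
  model_type: yolonas
  input_dtype: int                  # uint8 input tensors, not float32
  path: /config/yolo_nas_s.onnx     # hypothetical location of the exported model
```
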
4b12e7f7fa Use YOLO-NAS model for TensorRT-compatible ONNX inference
The YOLOv9m model fails with CUDA graph capture on the tensorrt image.
Try YOLO-NAS-S which has a different architecture that may be fully
partitionable to the CUDA execution provider.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:31:35 -08:00
870d602019 Use Frigate default model instead of custom YOLOv9m
The YOLOv9m ONNX model has ops not fully partitionable to CUDA EP,
causing CUDA graph capture to fail on the -tensorrt image. Use the
default model that ships with the image and is tested for GPU inference.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:28:35 -08:00
27353792ed Use Recreate strategy for Frigate deployment
GPU resources can't be shared during rolling updates — the old pod
holds nvidia.com/gpu preventing the new pod from scheduling. Recreate
strategy ensures the old pod is terminated before the new one starts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:26:11 -08:00
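
The Recreate change in 27353792ed corresponds to a Deployment fragment along these lines. The `metadata.name` is assumed; the rest of the Deployment (selector, pod template) is left out.

```yaml
# Deployment fragment: terminate the old pod before starting the new one,
# so the single nvidia.com/gpu resource is released first.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frigate   # assumed name
spec:
  strategy:
    type: Recreate
```
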
bb1e1e5af9 Use index-based device IDs in nvidia device plugin
The CDI spec generated by NixOS uses index-based device names (0, all),
not UUIDs. The device plugin must match it with --device-id-strategy=index;
otherwise nvidia-container-runtime.cdi fails to resolve CDI devices.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:24:31 -08:00
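
A minimal sketch of how the flag from bb1e1e5af9 might be passed in the device plugin DaemonSet. The container name and the use of `args` (rather than an environment variable) are assumptions; the image tag follows the v0.18.2 bump in this PR.

```yaml
# DaemonSet pod template fragment; container name is hypothetical.
spec:
  template:
    spec:
      containers:
        - name: nvidia-device-plugin
          image: nvcr.io/nvidia/k8s-device-plugin:v0.18.2
          args:
            - --device-id-strategy=index   # match NixOS CDI names (0, all)
```
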
9192a31204 Use nvidia-container-runtime.cdi for GPU workload injection
Replace the CDI device-list-strategy approach (which fails because the
device plugin generates its own CDI specs and can't find libs on NixOS)
with the nvidia-container-runtime.cdi runtime handler approach:

- Add wrapper script at /etc/nvidia-container-runtime/ that provides
  runc in PATH for nvidia-container-runtime.cdi
- Register nvidia runtime handler in k3s containerd config
- Create RuntimeClass for GPU workloads
- Revert device plugin to default envvar strategy (already working)
- Add runtimeClassName: nvidia to Frigate deployment

The nvidia-container-runtime.cdi binary reads the NixOS-generated CDI
specs from /var/run/cdi/ and injects GPU devices and driver libraries
into containers at create time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:20:01 -08:00
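
The RuntimeClass and `runtimeClassName` pieces listed in 9192a31204 can be sketched as below. The handler name `nvidia` is assumed to match the runtime handler registered in the k3s containerd config; the Deployment is shown only as a fragment.

```yaml
# RuntimeClass that routes pods to the nvidia runtime handler.
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia   # must equal the handler name in the containerd config
---
# Frigate Deployment fragment that opts in to that RuntimeClass.
spec:
  template:
    spec:
      runtimeClassName: nvidia
```
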
37f625b1fa Switch nvidia device plugin to CDI device list strategy
Use CDI-based device injection instead of nvidia-container-runtime.
The NixOS nvidia-container-toolkit module generates CDI specs with all
the correct nix store paths, so containerd's native CDI support handles
GPU device and library injection without a custom runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:04:38 -08:00
1556eaa5e4 Mount /nix/store to resolve NVIDIA library symlinks in device plugin
NixOS /run/opengl-driver/lib contains symlinks to /nix/store paths.
Without mounting the nix store, the symlinks are dangling inside the
container and libnvidia-ml.so can't be loaded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:42:46 -08:00
7b7358225c Remove CDI device-list-strategy from device plugin
CDI annotations require NVML validation that fails on NixOS. Use the
default envvar strategy for the device plugin — CDI device injection
still works at the containerd level via enable_cdi=true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:39:36 -08:00
2cd32108bd Run device plugin as privileged for GPU device node access
NVML needs both libnvidia-ml.so and /dev/nvidia* device nodes.
Mount libs to a non-clobbering path and run privileged (matching
NVIDIA's official deployment) for device file access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:38:11 -08:00
4427eb77f2 Mount NVIDIA libs to standard lib path for NVML discovery
go-nvml uses dl.Open which looks in standard library paths.
Mount to /usr/lib/x86_64-linux-gnu for reliable discovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:36:15 -08:00
5194de13b9 Mount host NVIDIA libraries into device plugin for NVML access
The device plugin needs libnvidia-ml.so to discover GPUs even when using
CDI annotations. Mount /run/opengl-driver/lib (NixOS NVIDIA lib path)
into the pod and set LD_LIBRARY_PATH.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:34:20 -08:00
912dfcab10 Switch to CDI for GPU device injection instead of nvidia-container-runtime
NixOS splits nvidia-container-toolkit into separate derivations, making
the nvidia-container-runtime binary path unreliable in containerd config.
CDI (Container Device Interface) is the modern approach:

- Enable CDI in k3s containerd config (cdi_spec_dirs: /var/run/cdi)
- Device plugin uses CDI annotations to inject GPU devices
- Remove RuntimeClass (not needed with CDI)
- Remove runtimeClassName from Frigate deployment
- Mount CDI specs into device plugin pod

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:28:16 -08:00
7e498c5a34 Add nvidia runtimeClass to device plugin DaemonSet
The device plugin needs access to NVIDIA libraries (NVML) to discover
GPUs. Running with the nvidia runtime makes device files visible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:23:18 -08:00
57e5aeccc2 Fix containerd nvidia runtime config for v3 format
K3s ships containerd 2.0+ which uses config v3 format. The plugin key
path is 'io.containerd.cri.v1.runtime' not 'io.containerd.grpc.v1.cri'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:05:46 -08:00
986505c7ef Enable NFS client support on ringtail for k3s NFS volumes
mount.nfs was missing, preventing NFS PersistentVolume mounts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 11:50:50 -08:00
cf5194c138 Add nvidia-device-plugin to service version tracking
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 11:45:53 -08:00
3e6d997c29 Bump NVIDIA k8s-device-plugin to v0.18.2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 11:44:10 -08:00
4e16116c4f Port Frigate NVR to ringtail k3s with GPU acceleration
Migrate Frigate from indri's minikube (arm64, ZMQ detector) to ringtail's
k3s cluster to leverage the RTX 4080 for TensorRT-accelerated ONNX inference.

- Enable nvidia-container-toolkit and configure k3s containerd nvidia runtime
- Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet)
- Re-target Frigate ArgoCD app to ringtail k3s cluster
- Switch image to x86_64 tensorrt variant with runtimeClassName: nvidia
- Add GPU resource limit (nvidia.com/gpu: 1) and increase shm to 512Mi
- Replace ZMQ detector with ONNX (auto-selects TensorRT execution provider)
- Update NFS PV and database PVC comments for ringtail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 11:41:47 -08:00
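
For reference, the GPU limit and shared-memory changes in 4e16116c4f would typically appear in the pod template roughly as follows. The container and volume names are hypothetical; the `nvidia.com/gpu: 1` limit and the 512Mi size come from the commit message.

```yaml
# Pod template fragment; container and volume names are placeholders.
spec:
  template:
    spec:
      runtimeClassName: nvidia
      containers:
        - name: frigate
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
      volumes:
        - name: dshm
          emptyDir:
            medium: Memory
            sizeLimit: 512Mi
```
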