blumeops

Author	SHA1	Message	Date
Erich Blume	bb1e1e5af9	Use index-based device IDs in nvidia device plugin The CDI spec generated by NixOS uses index-based device names (0, all) not UUIDs. The device plugin must match by using --device-id-strategy=index, otherwise nvidia-container-runtime.cdi fails to resolve CDI devices. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:24:31 -08:00
Erich Blume	9192a31204	Use nvidia-container-runtime.cdi for GPU workload injection Replace the CDI device-list-strategy approach (which fails because the device plugin generates its own CDI specs and can't find libs on NixOS) with the nvidia-container-runtime.cdi runtime handler approach: - Add wrapper script at /etc/nvidia-container-runtime/ that provides runc in PATH for nvidia-container-runtime.cdi - Register nvidia runtime handler in k3s containerd config - Create RuntimeClass for GPU workloads - Revert device plugin to default envvar strategy (already working) - Add runtimeClassName: nvidia to Frigate deployment The nvidia-container-runtime.cdi binary reads the NixOS-generated CDI specs from /var/run/cdi/ and injects GPU devices and driver libraries into containers at create time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:20:01 -08:00
Erich Blume	37f625b1fa	Switch nvidia device plugin to CDI device list strategy Use CDI-based device injection instead of nvidia-container-runtime. The NixOS nvidia-container-toolkit module generates CDI specs with all the correct nix store paths, so containerd's native CDI support handles GPU device and library injection without a custom runtime. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 13:04:38 -08:00
Erich Blume	1556eaa5e4	Mount /nix/store to resolve NVIDIA library symlinks in device plugin NixOS /run/opengl-driver/lib contains symlinks to /nix/store paths. Without mounting the nix store, the symlinks are dangling inside the container and libnvidia-ml.so can't be loaded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:42:46 -08:00
Erich Blume	7b7358225c	Remove CDI device-list-strategy from device plugin CDI annotations require NVML validation that fails on NixOS. Use the default envvar strategy for the device plugin — CDI device injection still works at the containerd level via enable_cdi=true. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:39:36 -08:00
Erich Blume	2cd32108bd	Run device plugin as privileged for GPU device node access NVML needs both libnvidia-ml.so and /dev/nvidia* device nodes. Mount libs to a non-clobbering path and run privileged (matching NVIDIA's official deployment) for device file access. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:38:11 -08:00
Erich Blume	4427eb77f2	Mount NVIDIA libs to standard lib path for NVML discovery go-nvml uses dl.Open which looks in standard library paths. Mount to /usr/lib/x86_64-linux-gnu for reliable discovery. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:36:15 -08:00
Erich Blume	5194de13b9	Mount host NVIDIA libraries into device plugin for NVML access The device plugin needs libnvidia-ml.so to discover GPUs even when using CDI annotations. Mount /run/opengl-driver/lib (NixOS NVIDIA lib path) into the pod and set LD_LIBRARY_PATH. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:34:20 -08:00
Erich Blume	912dfcab10	Switch to CDI for GPU device injection instead of nvidia-container-runtime NixOS splits nvidia-container-toolkit into separate derivations, making the nvidia-container-runtime binary path unreliable in containerd config. CDI (Container Device Interface) is the modern approach: - Enable CDI in k3s containerd config (cdi_spec_dirs: /var/run/cdi) - Device plugin uses CDI annotations to inject GPU devices - Remove RuntimeClass (not needed with CDI) - Remove runtimeClassName from Frigate deployment - Mount CDI specs into device plugin pod Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:28:16 -08:00
Erich Blume	7e498c5a34	Add nvidia runtimeClass to device plugin DaemonSet The device plugin needs access to NVIDIA libraries (NVML) to discover GPUs. Running with the nvidia runtime makes device files visible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:23:18 -08:00
Erich Blume	57e5aeccc2	Fix containerd nvidia runtime config for v3 format K3s ships containerd 2.0+ which uses config v3 format. The plugin key path is 'io.containerd.cri.v1.runtime' not 'io.containerd.grpc.v1.cri'. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 12:05:46 -08:00
Erich Blume	986505c7ef	Enable NFS client support on ringtail for k3s NFS volumes mount.nfs was missing, preventing NFS PersistentVolume mounts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 11:50:50 -08:00
Erich Blume	cf5194c138	Add nvidia-device-plugin to service version tracking Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 11:45:53 -08:00
Erich Blume	3e6d997c29	Bump NVIDIA k8s-device-plugin to v0.18.2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 11:44:10 -08:00
Erich Blume	4e16116c4f	Port Frigate NVR to ringtail k3s with GPU acceleration Migrate Frigate from indri's minikube (arm64, ZMQ detector) to ringtail's k3s cluster to leverage the RTX 4080 for TensorRT-accelerated ONNX inference. - Enable nvidia-container-toolkit and configure k3s containerd nvidia runtime - Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet) - Re-target Frigate ArgoCD app to ringtail k3s cluster - Switch image to x86_64 tensorrt variant with runtimeClassName: nvidia - Add GPU resource limit (nvidia.com/gpu: 1) and increase shm to 512Mi - Replace ZMQ detector with ONNX (auto-selects TensorRT execution provider) - Update NFS PV and database PVC comments for ringtail Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 11:41:47 -08:00
Erich Blume	16a4a9a616	Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216 ) ## Summary - Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired - Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s - Move ntfy ArgoCD app from indri minikube to ringtail k3s - Update Frigate docs to reflect detector removal and planned RTX 4080 migration - Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail ## Deployment After merge: 1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps: ``` argocd app sync apps --prune ``` 2. Sync new ringtail apps: ``` argocd app sync mosquitto-ringtail argocd app sync ntfy-ringtail ``` 3. Manually clean up the detector LaunchAgent on indri: ``` ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ``` ## Notes - Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns) - ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup - Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed - Frigate + frigate-notify will be ported to ringtail in a follow-up PR 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216	2026-02-19 11:22:44 -08:00
Erich Blume	61ca1ca305	Deploy Tailscale operator on ringtail k3s cluster (#215 ) ## Summary - Extract shared Tailscale operator resources (CRDs, RBAC, Deployment, ProxyClass, DNSConfig) into `tailscale-operator-base/` so both clusters reference the same manifests - Add `tailscale-operator-ringtail/` overlay with 1-replica ProxyGroup and ExternalSecret for the shared OAuth client - Add ArgoCD Application targeting `ringtail.tail8d86e.ts.net:6443` - Update `.yamllint.yaml` ignore path for the moved `operator.yaml` ## Deployment and Testing - [ ] Sync `apps` app to pick up the new Application definition - [ ] `argocd app sync tailscale-operator-ringtail` - [ ] Verify ExternalSecret syncs: `kubectl --context=k3s-ringtail -n tailscale get externalsecret` - [ ] Verify operator pod runs: `kubectl --context=k3s-ringtail -n tailscale get pods` - [ ] Verify ProxyGroup ready: `kubectl --context=k3s-ringtail -n tailscale get proxygroups` - [ ] Verify indri operator still works: `argocd app diff tailscale-operator` - [ ] Check Tailscale admin for new operator device with `tag:k8s-operator` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/215	2026-02-19 09:33:05 -08:00
Erich Blume	695089499e	Nix container build for nettest (#214 ) ## Summary - Add `containers/nettest/default.nix` using `dockerTools.buildLayeredImage` with curl, jq, dnsutils, cacert, and bash — equivalent to the existing Dockerfile - Update `container-tag-and-release` to require `--nix` or `--dockerfile` flag when both build types exist for a container - Update `container-list` to show `[dockerfile+nix]` label when both exist ## Deployment and Testing - [ ] SSH to ringtail, run `nix build -f containers/nettest/default.nix -o result` to verify the nix expression builds - [ ] Tag `nettest-nix-v1.0.0`, confirm `build-container-nix` workflow runs on `nix-container-builder` runner and pushes to registry - [ ] Smoke test on ringtail k3s: `kubectl run nettest --image=registry.ops.eblu.me/blumeops/nettest:v1.0.0 --restart=Never && kubectl logs nettest` - [ ] Verify `mise run container-list` shows `[dockerfile+nix]` for nettest - [ ] Verify `mise run container-tag-and-release nettest v1.1.0` prompts for build type Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/214	2026-02-19 08:42:58 -08:00
Erich Blume	b475a1fcd7	Fix 1Password secret tasks always reporting changed in ringtail playbook (#213 ) ## Summary - Replace `changed_when: true` with `register` + output inspection on the two 1Password secret tasks in `ringtail.yml` - Tasks now correctly report `ok` when the secret content hasn't changed, and `changed` only when `kubectl apply` outputs `configured` or `created` ## Test plan - [ ] Run `mise run provision-ringtail` twice — second run should show both tasks as `ok` not `changed` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/213	2026-02-19 07:25:24 -08:00
Erich Blume	8f89239c78	Inhibit idle lock for fullscreen windows on ringtail (#212 ) ## Summary - Adds `inhibit_idle fullscreen` window commands to sway config on ringtail - Covers both Wayland-native (`app_id`) and XWayland (`class`) windows - Prevents swayidle from locking the screen during gamepad-only gaming sessions where controller input isn't detected by the Wayland idle tracker ## Notes This is a blanket fullscreen inhibit. A more targeted approach (daemon monitoring `/dev/input` gamepad events) may be desired later to allow idle lock during long-running fullscreen apps like Factorio. ## Deployment and Testing - [ ] `mise run provision-ringtail` to deploy - [ ] Run a fullscreen app and verify swayidle doesn't lock after 15 minutes - [ ] Verify lock still activates when no fullscreen window is present Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/212	2026-02-19 07:20:05 -08:00
Erich Blume	9829a6f971	Add screen lock and idle management to ringtail (#211 ) ## Summary - Configure swayidle to lock screen (swaylock) after 15 minutes of inactivity - Turn off display (DPMS) after 60 minutes, auto-restore on activity - swaylock themed with Catppuccin Macchiato to match existing Sway config - Add `Mod4+l` keybinding for manual screen lock - Add PAM service for swaylock authentication - Disable system suspend/hibernate entirely (workstation should never sleep) ## What changes All changes in `nixos/ringtail/configuration.nix`: - `security.pam.services.swaylock` — required for swaylock to authenticate on NixOS - `systemd.sleep.extraConfig` — blocks all sleep/hibernate modes - `programs.swaylock` (home-manager) — lock screen appearance config - `services.swayidle` (home-manager) — idle timeout daemon with lock + DPMS events - New keybinding `Mod4+l` for manual lock ## Deployment and Testing - [ ] `mise run provision-ringtail` - [ ] Verify swayidle is running: `systemctl --user status swayidle` - [ ] Test manual lock with `Super+l` - [ ] Verify display DPMS off after idle (can lower timeout temporarily to test) - [ ] Confirm machine does not suspend: `systemctl status sleep.target` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/211	2026-02-19 06:46:37 -08:00
Erich Blume	630ebcd12d	Add ringtail DeviceTags and homelab-to-homelab SSH rule (#210 ) ## Summary - Add `ringtail` DeviceTags Pulumi resource with `tag:homelab` + `tag:blumeops` (matching indri/sifaka pattern) - Remove the bootstrap `ringtail_key` auth key — ringtail is already on the tailnet - Add SSH ACL rule allowing `tag:homelab` → `tag:homelab` SSH, unblocking cross-host management (e.g., ringtail running ansible against indri) ## Deployment and Testing - [ ] `mise run tailnet-preview` — dry run, confirm diff - [ ] `mise run tailnet-up` — apply - [ ] From ringtail: `ssh indri 'hostname'` — should succeed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/210	2026-02-18 21:48:11 -08:00
Erich Blume	aa04618829	Fix k3s health check to use explicit KUBECONFIG path k3s kubectl on ringtail needs KUBECONFIG set since the eblume user doesn't have it in their default environment. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 21:26:00 -08:00
Erich Blume	1f2134bf0a	Fix provision-ringtail ls-remote matching with mirror refs git ls-remote returns multiple lines when a mirror ref exists (e.g. refs/remotes/remote_mirror_*/main). Take only the first line to avoid a false mismatch. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 21:22:46 -08:00
Erich Blume	918df9e642	Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209 ) ## Summary Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions runner for building containers with Nix. ### K3s cluster - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint) - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`) - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access - `kubectl` added to system packages ### 1Password Connect + External Secrets Operator - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`, `external-secrets-config-ringtail` - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password` namespace in post_tasks ### Systemd Forgejo Actions runner - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env` - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access ### Nix container build workflow - New `.forgejo/workflows/build-container-nix.yaml` triggers on `-nix-v[0-9]` tags (e.g. `nettest-nix-v1.0.0`) - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy` - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering ### Mise task updates - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`) - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`) ## Post-merge 1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild 2. Register k3s cluster in ArgoCD (first time only): ```fish ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' | \ sed 's|127.0.0.1|ringtail.tail8d86e.ts.net|' > /tmp/k3s-ringtail.yaml set -x KUBECONFIG /tmp/k3s-ringtail.yaml argocd cluster add default --name k3s-ringtail ``` 3. Sync ArgoCD apps in order: 1password-connect-ringtail -> external-secrets-crds-ringtail -> external-secrets-ringtail -> external-secrets-config-ringtail 4. Verify runner: ssh ringtail 'systemctl status gitea-runner-nix-container-builder' 5. Check Forgejo admin panel for ringtail-nix-builder runner online 6. Test: create containers/<name>/default.nix, tag with <name>-nix-v0.1.0 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209	2026-02-18 21:15:30 -08:00
Erich Blume	535f897054	Polish ringtail NixOS config and add documentation (#208 ) ## Summary - Fix Super+Return keybinding to launch wezterm in sway - Set fish as default login shell - Remove `initialPassword` (real password already set) - Add 1Password CLI + GUI, chezmoi, and dev tool packages (neovim, eza, fd, fzf, zoxide, starship, atuin, bat, ripgrep) - Add ringtail reference card, update host inventory and reference index - Changelog fragment ## Post-merge deployment - `mise run provision-ringtail` to rebuild NixOS - On ringtail: launch 1Password GUI, enable CLI integration (Settings > Developer > CLI integration) - Chezmoi needs `.chezmoiignore` updates in the dotfiles repo (separate task) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/208	2026-02-18 17:53:47 -08:00
Erich Blume	b76f2314c2	Add force: true to ringtail git task nixos-rebuild can dirty the tree (e.g. flake.lock updates), which blocks the Ansible git module. Force ensures we always reset to the upstream state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:32:23 -08:00
Erich Blume	7bf46f4e28	Add flake.lock for ringtail NixOS config Prevents 'Git tree is dirty' warnings during nixos-rebuild. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:31:21 -08:00
Erich Blume	5a087c10df	Fix deprecated greetd.tuigreet package reference Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:30:01 -08:00
Erich Blume	4b7491c58f	Add python3 to ringtail for Ansible compatibility NixOS doesn't include Python by default. Ansible needs it on the managed host for module execution. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:29:09 -08:00
Erich Blume	b08ed98881	Enable passwordless sudo for wheel group on ringtail Required for Ansible unattended provisioning via become: true. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:25:32 -08:00
Erich Blume	8ee6c1271a	Add --accept-routes and --ssh to tailscale config Makes tailscale settings declarative so they persist across rebuilds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:24:17 -08:00
Erich Blume	aaf7e73c27	Fix sway on NVIDIA proprietary drivers Sway/wlroots refuses to start on proprietary NVIDIA by default. Add --unsupported-gpu flag and disable hardware cursors. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:08:26 -08:00
Erich Blume	104e49d337	Allow unfree packages for NVIDIA drivers and Steam Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 08:56:27 -08:00
Erich Blume	b9d813cde1	Add NixOS configuration for ringtail workstation (#207 ) ## Summary - NixOS flake for ringtail (gaming/compute workstation, RTX 4080) in `nixos/ringtail/` - Declarative disk partitioning via disko (GPT, 512M EFI + ext4 root on NVMe) - NVIDIA proprietary drivers, sway/Wayland desktop, greetd, PipeWire, Steam - Tailscale integration for tailnet connectivity - Ansible playbook + `mise run provision-ringtail` for ongoing management - Pulumi auth key (`tag:homelab`, `tag:blumeops`) for tailnet bootstrap ## Deployment Order 1. Merge PR 2. `pulumi up` in tailscale stack → creates auth key 3. Retrieve auth key: `pulumi stack output ringtail_authkey --show-secrets` 4. On ringtail NixOS installer: - `nix run github:nix-community/disko -- --mode disko /tmp/disk-config.nix` (or from cloned repo) - `nixos-install --flake github:eblume/blumeops?dir=nixos/ringtail#ringtail` 5. Reboot, `tailscale up --auth-key=<key>` 6. Verify: `tailscale status`, SSH from gilbert ## Test plan - [ ] Review NixOS configuration for completeness - [ ] Verify disko partition layout matches ringtail hardware - [ ] Run `pulumi preview` for tailscale stack - [ ] Install NixOS on ringtail - [ ] Confirm tailscale connectivity - [ ] Confirm sway desktop works - [ ] Test `mise run provision-ringtail` for ongoing management 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/207	2026-02-18 08:24:25 -08:00
Erich Blume	5f9b024b4a	Add Apple Silicon ZMQ detector for Frigate (#206 ) ## Summary - New `frigate_detector` ansible role deploys the [apple-silicon-detector](https://github.com/frigate-nvr/apple-silicon-detector) as a LaunchAgent on indri - Switches Frigate from ONNX CPU detector (~117ms) to ZMQ detector backed by CoreML/Neural Engine (~15ms) - Removes detect FPS cap (no longer needed with fast inference) - Updates Frigate docs and adds changelog fragment ## Deployment ### Phase 1: Deploy detector on indri (one-time setup + ansible) ```fish ssh indri 'git clone https://github.com/frigate-nvr/apple-silicon-detector.git ~/code/3rd/apple-silicon-detector' ssh indri 'cd ~/code/3rd/apple-silicon-detector && make install' mise run provision-indri -- --tags frigate_detector --check --diff # dry run mise run provision-indri -- --tags frigate_detector # apply ssh indri 'launchctl list mcquack.eblume.frigate-detector' # verify running ssh indri 'tail ~/Library/Logs/mcquack.frigate-detector.out.log' # verify bound ``` ### Phase 2: Test connectivity ```fish kubectl --context=minikube-indri -n frigate exec deploy/frigate -- nc -vz host.minikube.internal 5555 ``` ### Phase 3: Deploy Frigate config (branch workflow) ```fish argocd app set frigate --revision feature/frigate-zmq-detector && argocd app sync frigate ``` ### Phase 4: Post-deploy checks - [ ] Pod starts, no config errors - [ ] `/api/stats` shows detector type zmq, inference_speed ~15ms - [ ] detect_fps uncapped - [ ] Recordings and MQTT events flowing - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/206	2026-02-17 19:03:28 -08:00
Erich Blume	f45897b7c7	Upgrade Frigate 0.16.4 → 0.17.0-rc2 (#205 ) ## Summary - Bump Frigate image from `0.16.4-standard-arm64` to `0.17.0-rc2-standard-arm64` - Adapt `record` config to 0.17 schema: `retain.days`/`mode: all` → `continuous.days` - Update service docs and version tracker This is the first step toward the Apple Silicon ZMQ detector. The existing ONNX detector is kept so we can validate the upgrade independently. ## What is NOT changing - Detector config (still `type: onnx` with YOLO-NAS-s) - go2rtc streams, MQTT, cameras, zones, review rules - frigate-notify, storage PVs, Grafana dashboard ## Deployment and Testing - [ ] `argocd app set frigate --revision upgrade-frigate-0.17 && argocd app sync frigate` - [ ] Pod starts, `/api/version` returns `0.17.0-rc2` - [ ] No config errors in pod logs - [ ] Frigate web UI loads at `https://nvr.ops.eblu.me` - [ ] Live view works, detection running (`/api/stats` shows `detection_fps > 0`) - [ ] Recordings being created (`/api/recordings/summary`) - [ ] MQTT events flowing (check frigate-notify logs) - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/205	2026-02-17 16:56:12 -08:00
Erich Blume	acd213559e	Fix frigate live view by capping detect FPS (#204 ) ## Summary - Cap detect FPS to 2 to prevent recording segment backlog from ONNX inference bottleneck (~750ms/frame on ARM64 CPU) - Sync motion masks from live config (added second mask area) - Update driveway_entrance zone coordinates from live config - Add explicit alert labels `[person, car]` while keeping `required_zones: [driveway_entrance]` ## Context The "No frames have been received" error on the gablecam live view was caused by the detect stream falling behind — ONNX YOLO-NAS-s takes ~750ms per inference on ARM64 CPU, but the sub-stream sends 5 FPS. This caused recording segments to pile up and the ffmpeg watchdog to repeatedly kill/restart the process, creating gaps in the live view. ## Test plan - [ ] Sync ArgoCD `frigate` app to branch and verify pod restarts cleanly - [ ] Check `/api/stats` — `skipped_fps` should drop significantly, `process_fps` should be close to 2 - [ ] Verify live view at https://nvr.ops.eblu.me/#gablecam no longer shows "No frames" error - [ ] Verify detections and alerts still work in the driveway_entrance zone 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/204	2026-02-17 16:18:02 -08:00
Erich Blume	1e96866dd3	Grafana helm chart upgrade plan	2026-02-17 11:15:34 -08:00
Erich Blume	b9d1acaf3a	Service review for external-secrets	2026-02-17 10:48:09 -08:00
Erich Blume	105a2c8c08	Update External Secrets Helm chart 1.3.1 → 2.0.0 (#203 ) ## Summary - Bump External Secrets Operator Helm chart from `helm-chart-1.3.1` to `helm-chart-2.0.0` (operator v1.3.2) - Updates both the operator app and CRDs app `targetRevision` - No Helm values changes needed — `installCRDs`, `resources`, `webhook`, `certController` keys are unchanged ## Breaking changes in chart 2.0.0 - Removed providers: Alibaba and Device42 (unmaintained) — does not affect our 1Password setup - Templating engine v1 deprecated — our ExternalSecrets don't set `engineVersion`, so they use the default (v2) - Webhook `failurePolicy` for SecretStore is now dynamic ## Deployment 1. Sync CRDs first: `argocd app set external-secrets-crds --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets-crds` 2. Sync operator: `argocd app set external-secrets --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets` 3. Verify: `kubectl --context=minikube-indri -n external-secrets get pods` 4. After merge, set both apps back to `--revision main` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/203	2026-02-17 10:43:21 -08:00
Erich Blume	5fbe70d1ba	Port ntfy to locally built container image (#202 ) ntfy-v1.0.0 ## Summary - Add `containers/ntfy/Dockerfile` — three-stage build (Node web UI, Go+CGO server, Alpine runtime) pinned to commit SHA `a03a37fe` (v2.17.0), sourced from forge mirror - Update ntfy deployment image from `binwiederhier/ntfy:v2.17.0` to `registry.ops.eblu.me/blumeops/ntfy:v1.0.0` - Note fish shell in CLAUDE.md ## Deployment After merge, release the container image: ```fish mise run container-tag-and-release ntfy v1.0.0 ``` Then sync: ```fish argocd app sync ntfy ``` ## Test plan - [x] `docker build` succeeds - [x] `dagger call build --src=. --container-name=ntfy` succeeds (exit 0, container ID printed) - [x] `ntfy --help` works in built container - [ ] Tag and release `ntfy-v1.0.0` after merge - [ ] Verify ntfy pod starts with new image - [ ] Verify health endpoint responds at `ntfy.ops.eblu.me/v1/health` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/202	2026-02-17 10:18:20 -08:00
Erich Blume	3e604d8fdc	Review ntfy: upgrade to v2.17.0 and add reference docs (#201 ) ## Summary - Upgrade ntfy from v2.11.0 to v2.17.0 (6 minor releases, no breaking changes) - Add reference doc for ntfy service - Add reference doc for frigate service (ntfy's sole producer via frigate-notify) - Update reference index and service-versions.yaml tracking ## Notable upstream changes (v2.12.0–v2.17.0) - v2.14.0: Declarative users/ACL config in files - v2.15.0: `require-login` flag for topic-level auth - v2.16.0: Dead man's switch (heartbeat) notifications, notification update/delete - v2.17.0: Priority templating, crash fixes (nil pointer panics) ## Deployment and Testing - [ ] ArgoCD sync ntfy after merge - [ ] Verify ntfy pod healthy with new image - [ ] Send a test notification via `curl -d "test" https://ntfy.ops.eblu.me/test` - [ ] Verify frigate-notify still delivers alerts to ntfy Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/201	2026-02-17 09:51:40 -08:00
Erich Blume	54c3b0a5f3	Expanded some CLAUDE.md stuff manually	2026-02-17 07:54:34 -08:00
Erich Blume	2f599a15bd	Fix zk-docs broken path after how-to reorg Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 07:32:54 -08:00
Forgejo Actions	530460171a	Update docs release to v1.9.4 - Built changelog from towncrier fragments [skip ci]	2026-02-17 07:30:39 -08:00
Erich Blume	27d8f3cf1f	Review gandi-operations doc and reorganize how-to guides (#200 ) v1.9.4 ## Summary - Doc review: Reviewed `gandi-operations.md` — added `last-reviewed` frontmatter, verified all wiki-links, confirmed Pulumi state has no drift - Gandi reference fix: Added missing `cv.eblu.me` CNAME row to `gandi.md` DNS records table (was present in Pulumi but undocumented) - Pulumi comment fix: Updated stale `README.md` reference in `__main__.py` to point to `docs/how-to/gandi-operations.md` - How-to reorg: Moved 14 how-to guides into 3 subdirectories (`deployment/`, `configuration/`, `operations/`), collapsed the Documentation and Database index sections into Configuration and Operations respectively ## Verification - `docs-check-links` — all 180 wiki-links valid - `docs-check-filenames` — all 90 filenames unique - `dns-preview` — 5 resources unchanged, no drift - All pre-commit hooks pass ## Test plan - [ ] Verify docs site builds correctly with new paths - [ ] Spot-check a few wiki-links from other pages to moved how-to guides Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/200	2026-02-17 07:29:33 -08:00
Forgejo Actions	8a48171acf	Update docs release to v1.9.3 - Built changelog from towncrier fragments [skip ci]	2026-02-16 21:25:47 -08:00
Erich Blume	779b7d6709	Eliminate double towncrier run in release workflow (#199 ) v1.9.3 ## Summary - Added a new `build_quartz` Dagger function that builds the Quartz site from a pre-processed source tree (no towncrier) - Reordered the release workflow so towncrier runs once on the runner, then passes the updated working tree to `build-quartz` - `build_docs` and `build_changelog` are preserved for standalone use — `build_docs` now delegates to `build_quartz` internally ## Motivation Previously towncrier ran twice per release: once inside a Dagger container (via `build_docs` → `build_changelog`) and once on the runner to capture CHANGELOG.md changes for the git commit. This was wasteful and fragile — if towncrier behavior changed, the two runs could produce different results. ## Test plan - [ ] Review diff to confirm workflow step ordering is correct - [ ] Trigger a release and confirm towncrier runs only once - [ ] Verify the docs tarball contains the updated CHANGELOG.md - [ ] `dagger call build-quartz --src=. --version=vX.Y.Z` should work standalone Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/199	2026-02-16 21:24:34 -08:00
Erich Blume	627e2b7894	Add UniFi admin link to homepage dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 19:15:46 -08:00

427 commits