Commit graph

427 commits

Author SHA1 Message Date
bb1e1e5af9 Use index-based device IDs in nvidia device plugin
The CDI spec generated by NixOS uses index-based device names (0, all)
not UUIDs. The device plugin must match by using --device-id-strategy=index,
otherwise nvidia-container-runtime.cdi fails to resolve CDI devices.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:24:31 -08:00
9192a31204 Use nvidia-container-runtime.cdi for GPU workload injection
Replace the CDI device-list-strategy approach (which fails because the
device plugin generates its own CDI specs and can't find libs on NixOS)
with the nvidia-container-runtime.cdi runtime handler approach:

- Add wrapper script at /etc/nvidia-container-runtime/ that provides
  runc in PATH for nvidia-container-runtime.cdi
- Register nvidia runtime handler in k3s containerd config
- Create RuntimeClass for GPU workloads
- Revert device plugin to default envvar strategy (already working)
- Add runtimeClassName: nvidia to Frigate deployment

The nvidia-container-runtime.cdi binary reads the NixOS-generated CDI
specs from /var/run/cdi/ and injects GPU devices and driver libraries
into containers at create time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:20:01 -08:00
37f625b1fa Switch nvidia device plugin to CDI device list strategy
Use CDI-based device injection instead of nvidia-container-runtime.
The NixOS nvidia-container-toolkit module generates CDI specs with all
the correct nix store paths, so containerd's native CDI support handles
GPU device and library injection without a custom runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 13:04:38 -08:00
1556eaa5e4 Mount /nix/store to resolve NVIDIA library symlinks in device plugin
NixOS /run/opengl-driver/lib contains symlinks to /nix/store paths.
Without mounting the nix store, the symlinks are dangling inside the
container and libnvidia-ml.so can't be loaded.
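
The dangling-symlink condition described above can be reproduced with a short sketch. The paths here are hypothetical stand-ins created in a temporary directory; nothing touches the real /run/opengl-driver/lib or /nix/store.

```python
import os
import tempfile


def is_dangling(path: str) -> bool:
    """Return True if path is a symlink whose target does not exist."""
    # islink() inspects the link itself; exists() follows it to the target.
    return os.path.islink(path) and not os.path.exists(path)


def demo() -> bool:
    # The link plays the role of a library under /run/opengl-driver/lib.
    # The missing target plays the role of a /nix/store path that was not
    # mounted into the container (both names are illustrative only).
    with tempfile.TemporaryDirectory() as tmp:
        link = os.path.join(tmp, "libnvidia-ml.so")
        missing_target = os.path.join(tmp, "nix-store-not-mounted", "libnvidia-ml.so")
        os.symlink(missing_target, link)
        return is_dangling(link)
```

Running `is_dangling` over the mounted library directory inside the pod would be one way to confirm whether the store mount is in place.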

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:42:46 -08:00
7b7358225c Remove CDI device-list-strategy from device plugin
CDI annotations require NVML validation that fails on NixOS. Use the
default envvar strategy for the device plugin — CDI device injection
still works at the containerd level via enable_cdi=true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:39:36 -08:00
2cd32108bd Run device plugin as privileged for GPU device node access
NVML needs both libnvidia-ml.so and /dev/nvidia* device nodes.
Mount libs to a non-clobbering path and run privileged (matching
NVIDIA's official deployment) for device file access.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:38:11 -08:00
4427eb77f2 Mount NVIDIA libs to standard lib path for NVML discovery
go-nvml uses dl.Open which looks in standard library paths.
Mount to /usr/lib/x86_64-linux-gnu for reliable discovery.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:36:15 -08:00
5194de13b9 Mount host NVIDIA libraries into device plugin for NVML access
The device plugin needs libnvidia-ml.so to discover GPUs even when using
CDI annotations. Mount /run/opengl-driver/lib (NixOS NVIDIA lib path)
into the pod and set LD_LIBRARY_PATH.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:34:20 -08:00
912dfcab10 Switch to CDI for GPU device injection instead of nvidia-container-runtime
NixOS splits nvidia-container-toolkit into separate derivations, making
the nvidia-container-runtime binary path unreliable in containerd config.
CDI (Container Device Interface) is the modern approach:

- Enable CDI in k3s containerd config (cdi_spec_dirs: /var/run/cdi)
- Device plugin uses CDI annotations to inject GPU devices
- Remove RuntimeClass (not needed with CDI)
- Remove runtimeClassName from Frigate deployment
- Mount CDI specs into device plugin pod

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:28:16 -08:00
7e498c5a34 Add nvidia runtimeClass to device plugin DaemonSet
The device plugin needs access to NVIDIA libraries (NVML) to discover
GPUs. Running with the nvidia runtime makes device files visible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:23:18 -08:00
57e5aeccc2 Fix containerd nvidia runtime config for v3 format
K3s ships containerd 2.0+ which uses config v3 format. The plugin key
path is 'io.containerd.cri.v1.runtime' not 'io.containerd.grpc.v1.cri'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 12:05:46 -08:00
986505c7ef Enable NFS client support on ringtail for k3s NFS volumes
mount.nfs was missing, preventing NFS PersistentVolume mounts.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 11:50:50 -08:00
cf5194c138 Add nvidia-device-plugin to service version tracking
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 11:45:53 -08:00
3e6d997c29 Bump NVIDIA k8s-device-plugin to v0.18.2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 11:44:10 -08:00
4e16116c4f Port Frigate NVR to ringtail k3s with GPU acceleration
Migrate Frigate from indri's minikube (arm64, ZMQ detector) to ringtail's
k3s cluster to leverage the RTX 4080 for TensorRT-accelerated ONNX inference.

- Enable nvidia-container-toolkit and configure k3s containerd nvidia runtime
- Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet)
- Re-target Frigate ArgoCD app to ringtail k3s cluster
- Switch image to x86_64 tensorrt variant with runtimeClassName: nvidia
- Add GPU resource limit (nvidia.com/gpu: 1) and increase shm to 512Mi
- Replace ZMQ detector with ONNX (auto-selects TensorRT execution provider)
- Update NFS PV and database PVC comments for ringtail

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 11:41:47 -08:00
16a4a9a616 Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216)
## Summary
- Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired
- Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s
- Move ntfy ArgoCD app from indri minikube to ringtail k3s
- Update Frigate docs to reflect detector removal and planned RTX 4080 migration
- Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail

## Deployment

After merge:
1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps:
   ```
   argocd app sync apps --prune
   ```
2. Sync new ringtail apps:
   ```
   argocd app sync mosquitto-ringtail
   argocd app sync ntfy-ringtail
   ```
3. Manually clean up the detector LaunchAgent on indri:
   ```
   ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist'
   ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist'
   ```

## Notes
- Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns)
- ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup
- Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed
- Frigate + frigate-notify will be ported to ringtail in a follow-up PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216
2026-02-19 11:22:44 -08:00
61ca1ca305 Deploy Tailscale operator on ringtail k3s cluster (#215)
## Summary
- Extract shared Tailscale operator resources (CRDs, RBAC, Deployment, ProxyClass, DNSConfig) into `tailscale-operator-base/` so both clusters reference the same manifests
- Add `tailscale-operator-ringtail/` overlay with 1-replica ProxyGroup and ExternalSecret for the shared OAuth client
- Add ArgoCD Application targeting `ringtail.tail8d86e.ts.net:6443`
- Update `.yamllint.yaml` ignore path for the moved `operator.yaml`

## Deployment and Testing
- [ ] Sync `apps` app to pick up the new Application definition
- [ ] `argocd app sync tailscale-operator-ringtail`
- [ ] Verify ExternalSecret syncs: `kubectl --context=k3s-ringtail -n tailscale get externalsecret`
- [ ] Verify operator pod runs: `kubectl --context=k3s-ringtail -n tailscale get pods`
- [ ] Verify ProxyGroup ready: `kubectl --context=k3s-ringtail -n tailscale get proxygroups`
- [ ] Verify indri operator still works: `argocd app diff tailscale-operator`
- [ ] Check Tailscale admin for new operator device with `tag:k8s-operator`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/215
2026-02-19 09:33:05 -08:00
695089499e Nix container build for nettest (#214)
## Summary
- Add `containers/nettest/default.nix` using `dockerTools.buildLayeredImage` with curl, jq, dnsutils, cacert, and bash — equivalent to the existing Dockerfile
- Update `container-tag-and-release` to require `--nix` or `--dockerfile` flag when both build types exist for a container
- Update `container-list` to show `[dockerfile+nix]` label when both exist
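
A minimal sketch of the selection rule described above, assuming the task only needs to know which build definitions exist. The function names and error messages are hypothetical, not taken from the mise task.

```python
from typing import List, Optional


def available_build_types(has_dockerfile: bool, has_nix: bool) -> List[str]:
    """List the build types present for a container, dockerfile first."""
    pairs = (("dockerfile", has_dockerfile), ("nix", has_nix))
    return [name for name, present in pairs if present]


def list_label(has_dockerfile: bool, has_nix: bool) -> str:
    """Build the bracketed indicator shown by container-list."""
    return "[" + "+".join(available_build_types(has_dockerfile, has_nix)) + "]"


def choose_build_type(has_dockerfile: bool, has_nix: bool,
                      flag: Optional[str] = None) -> str:
    """Pick a build type, requiring an explicit flag when both exist."""
    available = available_build_types(has_dockerfile, has_nix)
    if not available:
        raise ValueError("no Dockerfile or default.nix found")
    if flag is not None:
        if flag not in available:
            raise ValueError(f"requested build type {flag!r} is not available")
        return flag
    if len(available) > 1:
        raise ValueError("both build types exist; pass --nix or --dockerfile")
    return available[0]
```

With both files present, `list_label(True, True)` gives `[dockerfile+nix]` and `choose_build_type(True, True)` raises until a flag is supplied.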

## Deployment and Testing
- [ ] SSH to ringtail, run `nix build -f containers/nettest/default.nix -o result` to verify the nix expression builds
- [ ] Tag `nettest-nix-v1.0.0`, confirm `build-container-nix` workflow runs on `nix-container-builder` runner and pushes to registry
- [ ] Smoke test on ringtail k3s: `kubectl run nettest --image=registry.ops.eblu.me/blumeops/nettest:v1.0.0 --restart=Never && kubectl logs nettest`
- [ ] Verify `mise run container-list` shows `[dockerfile+nix]` for nettest
- [ ] Verify `mise run container-tag-and-release nettest v1.1.0` prompts for build type

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/214
2026-02-19 08:42:58 -08:00
b475a1fcd7 Fix 1Password secret tasks always reporting changed in ringtail playbook (#213)
## Summary
- Replace `changed_when: true` with `register` + output inspection on the two 1Password secret tasks in `ringtail.yml`
- Tasks now correctly report `ok` when the secret content hasn't changed, and `changed` only when `kubectl apply` outputs `configured` or `created`
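
The output inspection can be sketched as follows. This is an illustration of the rule stated above, not the Ansible expression used in `ringtail.yml`; the sample resource names are hypothetical.

```python
def apply_changed(stdout: str) -> bool:
    """Return True only when kubectl apply reports a create or an update."""
    # kubectl apply prints one line per resource, ending in a status word
    # such as "created", "configured", or "unchanged".
    return any(
        line.rstrip().endswith((" created", " configured"))
        for line in stdout.splitlines()
    )
```

An `unchanged` line therefore maps to `ok`, which is what makes the second provisioning run quiet.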

## Test plan
- [ ] Run `mise run provision-ringtail` twice — second run should show both tasks as `ok` not `changed`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/213
2026-02-19 07:25:24 -08:00
8f89239c78 Inhibit idle lock for fullscreen windows on ringtail (#212)
## Summary
- Adds `inhibit_idle fullscreen` window commands to sway config on ringtail
- Covers both Wayland-native (`app_id`) and XWayland (`class`) windows
- Prevents swayidle from locking the screen during gamepad-only gaming sessions where controller input isn't detected by the Wayland idle tracker

## Notes
This is a blanket fullscreen inhibit. A more targeted approach (daemon monitoring `/dev/input` gamepad events) may be desired later to allow idle lock during long-running fullscreen apps like Factorio.

## Deployment and Testing
- [ ] `mise run provision-ringtail` to deploy
- [ ] Run a fullscreen app and verify swayidle doesn't lock after 15 minutes
- [ ] Verify lock still activates when no fullscreen window is present

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/212
2026-02-19 07:20:05 -08:00
9829a6f971 Add screen lock and idle management to ringtail (#211)
## Summary
- Configure **swayidle** to lock screen (swaylock) after 15 minutes of inactivity
- Turn off display (DPMS) after 60 minutes, auto-restore on activity
- **swaylock** themed with Catppuccin Macchiato to match existing Sway config
- Add `Mod4+l` keybinding for manual screen lock
- Add PAM service for swaylock authentication
- Disable system suspend/hibernate entirely (workstation should never sleep)

## What changes
All changes in `nixos/ringtail/configuration.nix`:
- `security.pam.services.swaylock` — required for swaylock to authenticate on NixOS
- `systemd.sleep.extraConfig` — blocks all sleep/hibernate modes
- `programs.swaylock` (home-manager) — lock screen appearance config
- `services.swayidle` (home-manager) — idle timeout daemon with lock + DPMS events
- New keybinding `Mod4+l` for manual lock

## Deployment and Testing
- [ ] `mise run provision-ringtail`
- [ ] Verify swayidle is running: `systemctl --user status swayidle`
- [ ] Test manual lock with `Super+l`
- [ ] Verify display DPMS off after idle (can lower timeout temporarily to test)
- [ ] Confirm machine does not suspend: `systemctl status sleep.target`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/211
2026-02-19 06:46:37 -08:00
630ebcd12d Add ringtail DeviceTags and homelab-to-homelab SSH rule (#210)
## Summary
- Add `ringtail` DeviceTags Pulumi resource with `tag:homelab` + `tag:blumeops` (matching indri/sifaka pattern)
- Remove the bootstrap `ringtail_key` auth key — ringtail is already on the tailnet
- Add SSH ACL rule allowing `tag:homelab` → `tag:homelab` SSH, unblocking cross-host management (e.g., ringtail running ansible against indri)
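
For orientation, a rule of this kind might look like the sketch below, written as a plain Python dict in the shape of the `ssh` section of a Tailscale policy file. The `users` value is an assumption; the actual Pulumi resource in this PR may be structured differently.

```python
import json

# Hypothetical rule: tagged homelab hosts may SSH to other homelab hosts.
# "users" is an illustrative assumption and is not stated in this PR.
homelab_ssh_rule = {
    "action": "accept",
    "src": ["tag:homelab"],
    "dst": ["tag:homelab"],
    "users": ["autogroup:nonroot"],
}


def render_ssh_section(rules: list) -> str:
    """Serialize SSH rules as the "ssh" section of a policy document."""
    return json.dumps({"ssh": rules}, indent=2)
```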

## Deployment and Testing
- [ ] `mise run tailnet-preview` — dry run, confirm diff
- [ ] `mise run tailnet-up` — apply
- [ ] From ringtail: `ssh indri 'hostname'` — should succeed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/210
2026-02-18 21:48:11 -08:00
aa04618829 Fix k3s health check to use explicit KUBECONFIG path
k3s kubectl on ringtail needs KUBECONFIG set since the eblume
user doesn't have it in their default environment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 21:26:00 -08:00
1f2134bf0a Fix provision-ringtail ls-remote matching with mirror refs
git ls-remote returns multiple lines when a mirror ref exists
(e.g. refs/remotes/remote_mirror_*/main). Take only the first
line to avoid a false mismatch.
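
A minimal sketch of the first-line rule, assuming the usual `<sha><TAB><ref>` output of `git ls-remote`. The function name is hypothetical and the provisioning task itself may do this in shell.

```python
def first_remote_sha(ls_remote_output: str) -> str:
    """Return the SHA from the first line of git ls-remote output."""
    lines = ls_remote_output.strip().splitlines()
    if not lines:
        raise ValueError("git ls-remote returned no refs")
    # Each line is "<sha>\t<ref>"; later lines may be mirror refs.
    return lines[0].split()[0]
```

Comparing this single value against the local HEAD avoids treating the extra mirror lines as a mismatch.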

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 21:22:46 -08:00
918df9e642 Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209)
## Summary

  Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions
  runner for building containers with Nix.

  ### K3s cluster
  - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint)
  - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale
  - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`)
  - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access
  - `kubectl` added to system packages

  ### 1Password Connect + External Secrets Operator
  - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`,
  `external-secrets-config-ringtail`
  - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server
  - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password`
  namespace in post_tasks

  ### Systemd Forgejo Actions runner
  - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host
  - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH
  - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env`
  - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access

  ### Nix container build workflow
  - New `.forgejo/workflows/build-container-nix.yaml` triggers on `*-nix-v[0-9]*` tags (e.g. `nettest-nix-v1.0.0`)
  - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy`
  - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering
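
The two trigger conditions can be approximated as below. This uses Python's `fnmatch` as a stand-in for the workflow's tag glob, which is an assumption; Forgejo's own glob matching may differ in edge cases.

```python
from fnmatch import fnmatchcase

# Tag pattern quoted from the workflow description above.
NIX_TAG_PATTERN = "*-nix-v[0-9]*"


def triggers_nix_workflow(ref_name: str) -> bool:
    """Approximate the tag filter on build-container-nix.yaml."""
    return fnmatchcase(ref_name, NIX_TAG_PATTERN)


def triggers_dockerfile_workflow(ref_name: str) -> bool:
    """Mirror the guard: skip any ref containing '-nix-v'."""
    return "-nix-v" not in ref_name
```

For a tag such as `nettest-nix-v1.0.0` only the Nix workflow passes, so the same tag does not start both builds.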

  ### Mise task updates
  - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`)
  - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`)
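
The tag formats above reduce to a small helper. This is a sketch with a hypothetical function name, not the code in the mise task.

```python
def release_tag(name: str, version: str, build_type: str) -> str:
    """Compose the release tag for a container and build type."""
    if build_type not in ("nix", "dockerfile"):
        raise ValueError(f"unknown build type: {build_type!r}")
    # Nix builds insert "-nix" so the tag matches the Nix workflow trigger.
    separator = "-nix-" if build_type == "nix" else "-"
    return f"{name}{separator}{version}"
```

`release_tag("nettest", "v1.0.0", "nix")` yields `nettest-nix-v1.0.0`, the example tag named in the workflow section.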

  ## Post-merge

  1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild
  2. Register k3s cluster in ArgoCD (first time only):
     ```fish
     ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' | \
       sed 's|127.0.0.1|ringtail.tail8d86e.ts.net|' > /tmp/k3s-ringtail.yaml
     set -x KUBECONFIG /tmp/k3s-ringtail.yaml
     argocd cluster add default --name k3s-ringtail
     ```
  3. Sync ArgoCD apps in order: `1password-connect-ringtail` -> `external-secrets-crds-ringtail` -> `external-secrets-ringtail` ->
  `external-secrets-config-ringtail`
  4. Verify runner: `ssh ringtail 'systemctl status gitea-runner-nix-container-builder'`
  5. Check Forgejo admin panel for `ringtail-nix-builder` runner online
  6. Test: create `containers/<name>/default.nix`, tag with `<name>-nix-v0.1.0`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209
2026-02-18 21:15:30 -08:00
535f897054 Polish ringtail NixOS config and add documentation (#208)
## Summary
- Fix Super+Return keybinding to launch wezterm in sway
- Set fish as default login shell
- Remove `initialPassword` (real password already set)
- Add 1Password CLI + GUI, chezmoi, and dev tool packages (neovim, eza, fd, fzf, zoxide, starship, atuin, bat, ripgrep)
- Add ringtail reference card, update host inventory and reference index
- Changelog fragment

## Post-merge deployment
- `mise run provision-ringtail` to rebuild NixOS
- On ringtail: launch 1Password GUI, enable CLI integration (Settings > Developer > CLI integration)
- Chezmoi needs `.chezmoiignore` updates in the dotfiles repo (separate task)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/208
2026-02-18 17:53:47 -08:00
b76f2314c2 Add force: true to ringtail git task
nixos-rebuild can dirty the tree (e.g. flake.lock updates), which
blocks the Ansible git module. Force ensures we always reset to the
upstream state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:32:23 -08:00
7bf46f4e28 Add flake.lock for ringtail NixOS config
Prevents 'Git tree is dirty' warnings during nixos-rebuild.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:31:21 -08:00
5a087c10df Fix deprecated greetd.tuigreet package reference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:30:01 -08:00
4b7491c58f Add python3 to ringtail for Ansible compatibility
NixOS doesn't include Python by default. Ansible needs it on the
managed host for module execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:29:09 -08:00
b08ed98881 Enable passwordless sudo for wheel group on ringtail
Required for Ansible unattended provisioning via become: true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:25:32 -08:00
8ee6c1271a Add --accept-routes and --ssh to tailscale config
Makes tailscale settings declarative so they persist across rebuilds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:24:17 -08:00
aaf7e73c27 Fix sway on NVIDIA proprietary drivers
Sway/wlroots refuses to start on proprietary NVIDIA by default.
Add --unsupported-gpu flag and disable hardware cursors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:08:26 -08:00
104e49d337 Allow unfree packages for NVIDIA drivers and Steam
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 08:56:27 -08:00
b9d813cde1 Add NixOS configuration for ringtail workstation (#207)
## Summary
- NixOS flake for ringtail (gaming/compute workstation, RTX 4080) in `nixos/ringtail/`
- Declarative disk partitioning via disko (GPT, 512M EFI + ext4 root on NVMe)
- NVIDIA proprietary drivers, sway/Wayland desktop, greetd, PipeWire, Steam
- Tailscale integration for tailnet connectivity
- Ansible playbook + `mise run provision-ringtail` for ongoing management
- Pulumi auth key (`tag:homelab`, `tag:blumeops`) for tailnet bootstrap

## Deployment Order
1. **Merge PR**
2. `pulumi up` in tailscale stack → creates auth key
3. Retrieve auth key: `pulumi stack output ringtail_authkey --show-secrets`
4. On ringtail NixOS installer:
   - `nix run github:nix-community/disko -- --mode disko /tmp/disk-config.nix` (or from cloned repo)
   - `nixos-install --flake github:eblume/blumeops?dir=nixos/ringtail#ringtail`
5. Reboot, `tailscale up --auth-key=<key>`
6. Verify: `tailscale status`, SSH from gilbert

## Test plan
- [ ] Review NixOS configuration for completeness
- [ ] Verify disko partition layout matches ringtail hardware
- [ ] Run `pulumi preview` for tailscale stack
- [ ] Install NixOS on ringtail
- [ ] Confirm tailscale connectivity
- [ ] Confirm sway desktop works
- [ ] Test `mise run provision-ringtail` for ongoing management

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/207
2026-02-18 08:24:25 -08:00
5f9b024b4a Add Apple Silicon ZMQ detector for Frigate (#206)
## Summary

- New `frigate_detector` ansible role deploys the [apple-silicon-detector](https://github.com/frigate-nvr/apple-silicon-detector) as a LaunchAgent on indri
- Switches Frigate from ONNX CPU detector (~117ms) to ZMQ detector backed by CoreML/Neural Engine (~15ms)
- Removes detect FPS cap (no longer needed with fast inference)
- Updates Frigate docs and adds changelog fragment

## Deployment

### Phase 1: Deploy detector on indri (one-time setup + ansible)
```fish
ssh indri 'git clone https://github.com/frigate-nvr/apple-silicon-detector.git ~/code/3rd/apple-silicon-detector'
ssh indri 'cd ~/code/3rd/apple-silicon-detector && make install'
mise run provision-indri -- --tags frigate_detector --check --diff  # dry run
mise run provision-indri -- --tags frigate_detector                 # apply
ssh indri 'launchctl list mcquack.eblume.frigate-detector'          # verify running
ssh indri 'tail ~/Library/Logs/mcquack.frigate-detector.out.log'    # verify bound
```

### Phase 2: Test connectivity
```fish
kubectl --context=minikube-indri -n frigate exec deploy/frigate -- nc -vz host.minikube.internal 5555
```

### Phase 3: Deploy Frigate config (branch workflow)
```fish
argocd app set frigate --revision feature/frigate-zmq-detector && argocd app sync frigate
```

### Phase 4: Post-deploy checks
- [ ] Pod starts, no config errors
- [ ] `/api/stats` shows detector type zmq, inference_speed ~15ms
- [ ] detect_fps uncapped
- [ ] Recordings and MQTT events flowing
- [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/206
2026-02-17 19:03:28 -08:00
f45897b7c7 Upgrade Frigate 0.16.4 → 0.17.0-rc2 (#205)
## Summary

- Bump Frigate image from `0.16.4-standard-arm64` to `0.17.0-rc2-standard-arm64`
- Adapt `record` config to 0.17 schema: `retain.days`/`mode: all` → `continuous.days`
- Update service docs and version tracker
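
The `record` change can be pictured as a dict transform. This sketch covers only the one mapping named above (`retain.days` with `mode: all` becoming `continuous.days`); the real 0.17 schema has more fields, and the function name is hypothetical.

```python
def migrate_record_config(record: dict) -> dict:
    """Rewrite a 0.16-style record.retain block into continuous.days."""
    retain = record.get("retain")
    # Leave anything other than the "mode: all" case untouched.
    if not isinstance(retain, dict) or retain.get("mode") != "all" or "days" not in retain:
        return dict(record)
    migrated = {key: value for key, value in record.items() if key != "retain"}
    migrated["continuous"] = {"days": retain["days"]}
    return migrated
```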

This is the first step toward the Apple Silicon ZMQ detector. The existing ONNX detector is kept so we can validate the upgrade independently.

## What is NOT changing

- Detector config (still `type: onnx` with YOLO-NAS-s)
- go2rtc streams, MQTT, cameras, zones, review rules
- frigate-notify, storage PVs, Grafana dashboard

## Deployment and Testing

- [ ] `argocd app set frigate --revision upgrade-frigate-0.17 && argocd app sync frigate`
- [ ] Pod starts, `/api/version` returns `0.17.0-rc2`
- [ ] No config errors in pod logs
- [ ] Frigate web UI loads at `https://nvr.ops.eblu.me`
- [ ] Live view works, detection running (`/api/stats` shows `detection_fps > 0`)
- [ ] Recordings being created (`/api/recordings/summary`)
- [ ] MQTT events flowing (check frigate-notify logs)
- [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/205
2026-02-17 16:56:12 -08:00
acd213559e Fix frigate live view by capping detect FPS (#204)
## Summary
- Cap detect FPS to 2 to prevent recording segment backlog from ONNX inference bottleneck (~750ms/frame on ARM64 CPU)
- Sync motion masks from live config (added second mask area)
- Update driveway_entrance zone coordinates from live config
- Add explicit alert labels `[person, car]` while keeping `required_zones: [driveway_entrance]`

## Context
The "No frames have been received" error on the gablecam live view was caused by the detect stream falling behind — ONNX YOLO-NAS-s takes ~750ms per inference on ARM64 CPU, but the sub-stream sends 5 FPS. This caused recording segments to pile up and the ffmpeg watchdog to repeatedly kill/restart the process, creating gaps in the live view.
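
The bottleneck follows from simple arithmetic on the figures quoted above (about 750 ms per inference against a 5 FPS sub-stream). The helper names are hypothetical.

```python
def max_inference_fps(ms_per_frame: float) -> float:
    """Frames per second a single detector can process."""
    return 1000.0 / ms_per_frame


def unprocessed_fps(stream_fps: float, ms_per_frame: float) -> float:
    """Frames per second arriving beyond what the detector can handle."""
    return max(0.0, stream_fps - max_inference_fps(ms_per_frame))
```

At 750 ms per frame the detector manages roughly 1.3 FPS, so a 5 FPS stream leaves well over three frames per second unprocessed.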

## Test plan
- [ ] Sync ArgoCD `frigate` app to branch and verify pod restarts cleanly
- [ ] Check `/api/stats` — `skipped_fps` should drop significantly, `process_fps` should be close to 2
- [ ] Verify live view at https://nvr.ops.eblu.me/#gablecam no longer shows "No frames" error
- [ ] Verify detections and alerts still work in the driveway_entrance zone

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/204
2026-02-17 16:18:02 -08:00
1e96866dd3 Grafana helm chart upgrade plan
2026-02-17 11:15:34 -08:00
b9d1acaf3a Service review for external-secrets
2026-02-17 10:48:09 -08:00
105a2c8c08 Update External Secrets Helm chart 1.3.1 → 2.0.0 (#203)
## Summary
- Bump External Secrets Operator Helm chart from `helm-chart-1.3.1` to `helm-chart-2.0.0` (operator v1.3.2)
- Updates both the operator app and CRDs app `targetRevision`
- No Helm values changes needed — `installCRDs`, `resources`, `webhook`, `certController` keys are unchanged

## Breaking changes in chart 2.0.0
- **Removed providers:** Alibaba and Device42 (unmaintained) — does not affect our 1Password setup
- **Templating engine v1 deprecated** — our ExternalSecrets don't set `engineVersion`, so they use the default (v2)
- **Webhook `failurePolicy`** for SecretStore is now dynamic

## Deployment
1. Sync CRDs first: `argocd app set external-secrets-crds --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets-crds`
2. Sync operator: `argocd app set external-secrets --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets`
3. Verify: `kubectl --context=minikube-indri -n external-secrets get pods`
4. After merge, set both apps back to `--revision main`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/203
2026-02-17 10:43:21 -08:00
5fbe70d1ba Port ntfy to locally built container image (#202)
ntfy-v1.0.0
## Summary
- Add `containers/ntfy/Dockerfile` — three-stage build (Node web UI, Go+CGO server, Alpine runtime) pinned to commit SHA `a03a37fe` (v2.17.0), sourced from forge mirror
- Update ntfy deployment image from `binwiederhier/ntfy:v2.17.0` to `registry.ops.eblu.me/blumeops/ntfy:v1.0.0`
- Note fish shell in CLAUDE.md

## Deployment
After merge, release the container image:
```fish
mise run container-tag-and-release ntfy v1.0.0
```
Then sync:
```fish
argocd app sync ntfy
```

## Test plan
- [x] `docker build` succeeds
- [x] `dagger call build --src=. --container-name=ntfy` succeeds (exit 0, container ID printed)
- [x] `ntfy --help` works in built container
- [ ] Tag and release `ntfy-v1.0.0` after merge
- [ ] Verify ntfy pod starts with new image
- [ ] Verify health endpoint responds at `ntfy.ops.eblu.me/v1/health`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/202
2026-02-17 10:18:20 -08:00
3e604d8fdc Review ntfy: upgrade to v2.17.0 and add reference docs (#201)
## Summary
- Upgrade ntfy from v2.11.0 to v2.17.0 (6 minor releases, no breaking changes)
- Add reference doc for ntfy service
- Add reference doc for frigate service (ntfy's sole producer via frigate-notify)
- Update reference index and service-versions.yaml tracking

## Notable upstream changes (v2.12.0–v2.17.0)
- **v2.14.0:** Declarative users/ACL config in files
- **v2.15.0:** `require-login` flag for topic-level auth
- **v2.16.0:** Dead man's switch (heartbeat) notifications, notification update/delete
- **v2.17.0:** Priority templating, crash fixes (nil pointer panics)

## Deployment and Testing
- [ ] ArgoCD sync ntfy after merge
- [ ] Verify ntfy pod healthy with new image
- [ ] Send a test notification via `curl -d "test" https://ntfy.ops.eblu.me/test`
- [ ] Verify frigate-notify still delivers alerts to ntfy

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/201
2026-02-17 09:51:40 -08:00
54c3b0a5f3 Expanded some CLAUDE.md stuff manually
2026-02-17 07:54:34 -08:00
2f599a15bd Fix zk-docs broken path after how-to reorg
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 07:32:54 -08:00
Forgejo Actions
530460171a Update docs release to v1.9.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-17 07:30:39 -08:00
27d8f3cf1f Review gandi-operations doc and reorganize how-to guides (#200) v1.9.4
## Summary
- **Doc review:** Reviewed `gandi-operations.md` — added `last-reviewed` frontmatter, verified all wiki-links, confirmed Pulumi state has no drift
- **Gandi reference fix:** Added missing `cv.eblu.me` CNAME row to `gandi.md` DNS records table (was present in Pulumi but undocumented)
- **Pulumi comment fix:** Updated stale `README.md` reference in `__main__.py` to point to `docs/how-to/gandi-operations.md`
- **How-to reorg:** Moved 14 how-to guides into 3 subdirectories (`deployment/`, `configuration/`, `operations/`), collapsed the Documentation and Database index sections into Configuration and Operations respectively

## Verification
- `docs-check-links` — all 180 wiki-links valid
- `docs-check-filenames` — all 90 filenames unique
- `dns-preview` — 5 resources unchanged, no drift
- All pre-commit hooks pass

## Test plan
- [ ] Verify docs site builds correctly with new paths
- [ ] Spot-check a few wiki-links from other pages to moved how-to guides

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/200
2026-02-17 07:29:33 -08:00
Forgejo Actions
8a48171acf Update docs release to v1.9.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-16 21:25:47 -08:00
779b7d6709 Eliminate double towncrier run in release workflow (#199) v1.9.3
## Summary

- Added a new `build_quartz` Dagger function that builds the Quartz site from a pre-processed source tree (no towncrier)
- Reordered the release workflow so towncrier runs **once** on the runner, then passes the updated working tree to `build-quartz`
- `build_docs` and `build_changelog` are preserved for standalone use — `build_docs` now delegates to `build_quartz` internally

## Motivation

Previously towncrier ran twice per release: once inside a Dagger container (via `build_docs` → `build_changelog`) and once on the runner to capture CHANGELOG.md changes for the git commit. This was wasteful and fragile — if towncrier behavior changed, the two runs could produce different results.

## Test plan

- [ ] Review diff to confirm workflow step ordering is correct
- [ ] Trigger a release and confirm towncrier runs only once
- [ ] Verify the docs tarball contains the updated CHANGELOG.md
- [ ] `dagger call build-quartz --src=. --version=vX.Y.Z` should work standalone

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/199
2026-02-16 21:24:34 -08:00
627e2b7894 Add UniFi admin link to homepage dashboard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 19:15:46 -08:00