blumeops

Author	SHA1	Message	Date
Erich Blume	e0c6b7df99	Add Authentik to homepage dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 13:03:48 -08:00
Erich Blume	71cb256527	Deploy Authentik identity provider (C2 Mikado) (#227 ) ## Summary C2 Mikado chain for deploying Authentik as the SSO identity provider, replacing Dex. This PR will evolve over multiple sessions. Each iteration adds documentation (prerequisite cards) and eventually code as leaf nodes are resolved. ## Current Mikado State - Goal: `deploy-authentik` (active) - Leaf prerequisites: - `build-authentik-container` — Build Nix container image - `provision-authentik-database` — Create PostgreSQL database on CNPG cluster - `create-authentik-secrets` — Create 1Password item with credentials ## Process refinements - Updated agent-change-process with lessons from first attempt: reset code before committing cards, open PRs early ## Test plan - [ ] `mise run docs-mikado` shows correct dependency chain - [ ] Leaf nodes can be worked independently - [ ] Container builds on ringtail - [ ] Authentik starts and reaches healthy state - [ ] Forgejo OAuth2 connector works Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/227	2026-02-20 12:55:59 -08:00
Forgejo Actions	18f1ac61fc	Update docs release to v1.10.0 - Built changelog from towncrier fragments [skip ci]	2026-02-19 20:45:43 -08:00
Erich Blume	0cdc143227	Deploy Dex OIDC identity provider with Grafana SSO (#222) ## Summary - Deploys Dex OIDC identity provider on ringtail k3s cluster as central authentication service - Integrates Grafana as first SSO client via `auth.generic_oauth` - Uses Kubernetes CRD storage backend (no PVC needed) - All secrets (bcrypt hash, client secrets) injected via ExternalSecrets from 1Password item "Dex (blumeops)" - NixOS-built container image via `containers/dex/default.nix` ## Pre-requisites (manual, before deployment) 1. Create 1Password item "Dex (blumeops)" in `blumeops` vault with fields: - `password`: strong generated password for Dex login - `static-password-hash`: bcrypt hash of above (`htpasswd -BnC 10 eblume`, copy hash after `eblume:`) - `grafana-client-secret`: random 32-char hex (`openssl rand -hex 16`) 2. Build container: `mise run container-tag-and-release dex v1.0.0` ## Deployment sequence 1. Build container: `mise run container-tag-and-release dex v1.0.0` 2. Deploy Caddy: `mise run provision-indri -- --tags caddy` 3. Sync ArgoCD: `argocd app sync apps` → `argocd app sync dex` 4. Verify Dex: `curl https://dex.ops.eblu.me/.well-known/openid-configuration` 5. Sync Grafana: `argocd app sync grafana-config` → `argocd app sync grafana` 6. Test SSO: Visit `https://grafana.ops.eblu.me/login`, click "Sign in with Dex" ## Verification - [ ] Container image exists: `mise run container-list` shows `dex:v1.0.0-nix` - [ ] `curl https://dex.ops.eblu.me/.well-known/openid-configuration` returns valid OIDC discovery - [ ] `curl https://dex.ops.eblu.me/healthz` returns healthy - [ ] Grafana login shows "Sign in with Dex" button alongside local login - [ ] OIDC flow: click Dex → enter credentials → redirect back → logged in as Admin - [ ] Break-glass: local admin login still works - [ ] `mise run services-check` passes ## Files changed | File | Action | Purpose | |------|--------|---------| | `containers/dex/default.nix` | Create | NixOS container build | | `argocd/apps/dex.yaml` | Create | ArgoCD app targeting ringtail | | `argocd/manifests/dex/*` (8 files) | Create | K8s manifests (RBAC, ExternalSecret, Deployment, Service, Ingress) | | `argocd/manifests/grafana-config/external-secret-dex-oauth.yaml` | Create | Grafana OIDC client secret | | `argocd/manifests/grafana-config/kustomization.yaml` | Modify | Add new ExternalSecret resource | | `argocd/manifests/grafana/values.yaml` | Modify | Add `auth.generic_oauth` config + envFromSecrets | | `ansible/roles/caddy/defaults/main.yml` | Modify | Add `dex.ops.eblu.me` reverse proxy entry | | `docs/changelog.d/feature-dex-oidc.feature.md` | Create | Changelog fragment | Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/222	2026-02-19 20:24:24 -08:00
Erich Blume	b876e39981	Replace Homepage Helm chart with kustomize manifests and custom Dockerfile (#221 ) ## Summary - Replace third-party Helm chart (jameswynn/homepage v2.1.0, pinned at app v1.2.0) with plain kustomize manifests and a custom Dockerfile building from forge mirror at v1.10.1 - Adds Dockerfile (`containers/homepage/`) with multi-stage build (node:22-slim builder, node:22-alpine runtime) - Creates kustomize manifests: Deployment, Service, ConfigMap (6 config files), ServiceAccount, ClusterRole, ClusterRoleBinding - Keeps existing ingress-tailscale.yaml and all 6 ExternalSecret resources unchanged - Updates ArgoCD app definition from multi-source Helm to single directory source ## Prerequisite - Homepage source mirrored at forge.ops.eblu.me/eblume/homepage.git ✅ - Container must be built and pushed before syncing: `mise run container-release homepage v1.10.1` ## Deployment and Testing - [ ] Build and push container image: `mise run container-release homepage v1.10.1` - [ ] Branch-test via ArgoCD: `argocd app set homepage --revision feature/homepage-kustomize && argocd app sync homepage` - [ ] Verify dashboard loads at go.ops.eblu.me / go.tail8d86e.ts.net - [ ] Verify k8s autodiscovery works (services appear on dashboard) - [ ] Verify widgets load (weather, Forgejo, Jellyfin, etc.) - [ ] After merge: `argocd app set homepage --revision main && argocd app sync homepage` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/221	2026-02-19 18:29:19 -08:00
Erich Blume	cabd0bc9cf	Update Frigate zone masks and expand alert notifications (#219 ) ## Summary - Synced driveway_entrance zone coordinates from live Frigate config (adjusted mask boundaries) - Added `inertia: 3` and `loitering_time: 0` to driveway_entrance zone - Expanded review alerts to require either `driveway_entrance` or `driveway` zone (was entrance only) - Updated frigate-notify config to allow alerts from both `driveway_entrance` and `driveway` zones ## Deployment and Testing - [ ] Merge and sync frigate ArgoCD app on ringtail - [ ] Sync frigate-notify (restart pod to pick up ConfigMap change) - [ ] Verify alerts fire for person/car in driveway zone Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/219	2026-02-19 17:32:02 -08:00
Erich Blume	d5d32fe91f	Port Frigate NVR to ringtail k3s with GPU acceleration (#217) ## Summary - Enable NVIDIA container toolkit on ringtail NixOS and configure k3s containerd with nvidia runtime - Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet) to expose `nvidia.com/gpu` resources - Re-target Frigate from indri minikube (arm64, ZMQ detector) to ringtail k3s (x86_64, TensorRT/ONNX) - Switch Frigate image to `-tensorrt` variant with GPU resource limits and increased shared memory ## Manual Prerequisites 1. NFS access: Verify ringtail can mount `sifaka:/volume1/frigate` ```fish ssh ringtail 'sudo mount -t nfs sifaka:/volume1/frigate /mnt/storage1 && ls /mnt/storage1 && sudo umount /mnt/storage1' ``` 2. YOLO model: Verify `/volume1/frigate/models/yolov9m.onnx` exists on sifaka ## Deployment Steps 1. Provision ringtail: `mise run provision-ringtail` 2. Sync ArgoCD apps: `argocd app sync apps --prune` 3. Deploy NVIDIA device plugin: `argocd app sync nvidia-device-plugin` 4. Verify GPU: `kubectl --context=k3s-ringtail get nodes -o json | jq '.items[].status.capacity'` 5. Deploy Frigate: `argocd app sync frigate` ## Verification - [ ] `nvidia.com/gpu: 1` visible in node capacity - [ ] Frigate pod running with GPU allocated - [ ] Frigate UI loads at `https://nvr.ops.eblu.me` - [ ] Detector shows ONNX/TensorRT on System page - [ ] Camera feed with bounding boxes in live view - [ ] TensorRT engine build completes (watch logs on first start) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/217	2026-02-19 14:27:04 -08:00
Erich Blume	16a4a9a616	Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216 ) ## Summary - Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired - Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s - Move ntfy ArgoCD app from indri minikube to ringtail k3s - Update Frigate docs to reflect detector removal and planned RTX 4080 migration - Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail ## Deployment After merge: 1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps: ``` argocd app sync apps --prune ``` 2. Sync new ringtail apps: ``` argocd app sync mosquitto-ringtail argocd app sync ntfy-ringtail ``` 3. Manually clean up the detector LaunchAgent on indri: ``` ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ``` ## Notes - Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns) - ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup - Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed - Frigate + frigate-notify will be ported to ringtail in a follow-up PR 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216	2026-02-19 11:22:44 -08:00
Erich Blume	61ca1ca305	Deploy Tailscale operator on ringtail k3s cluster (#215 ) ## Summary - Extract shared Tailscale operator resources (CRDs, RBAC, Deployment, ProxyClass, DNSConfig) into `tailscale-operator-base/` so both clusters reference the same manifests - Add `tailscale-operator-ringtail/` overlay with 1-replica ProxyGroup and ExternalSecret for the shared OAuth client - Add ArgoCD Application targeting `ringtail.tail8d86e.ts.net:6443` - Update `.yamllint.yaml` ignore path for the moved `operator.yaml` ## Deployment and Testing - [ ] Sync `apps` app to pick up the new Application definition - [ ] `argocd app sync tailscale-operator-ringtail` - [ ] Verify ExternalSecret syncs: `kubectl --context=k3s-ringtail -n tailscale get externalsecret` - [ ] Verify operator pod runs: `kubectl --context=k3s-ringtail -n tailscale get pods` - [ ] Verify ProxyGroup ready: `kubectl --context=k3s-ringtail -n tailscale get proxygroups` - [ ] Verify indri operator still works: `argocd app diff tailscale-operator` - [ ] Check Tailscale admin for new operator device with `tag:k8s-operator` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/215	2026-02-19 09:33:05 -08:00
Erich Blume	918df9e642	Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209) ## Summary Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions runner for building containers with Nix. ### K3s cluster - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint) - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`) - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access - `kubectl` added to system packages ### 1Password Connect + External Secrets Operator - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`, `external-secrets-config-ringtail` - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password` namespace in post_tasks ### Systemd Forgejo Actions runner - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env` - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access ### Nix container build workflow - New `.forgejo/workflows/build-container-nix.yaml` triggers on `-nix-v[0-9]` tags (e.g. `nettest-nix-v1.0.0`) - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy` - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering ### Mise task updates - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`) - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`) ## Post-merge 1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild 2. Register k3s cluster in ArgoCD (first time only): ```fish ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' | sed 's|127.0.0.1|ringtail.tail8d86e.ts.net|' > /tmp/k3s-ringtail.yaml; set -x KUBECONFIG /tmp/k3s-ringtail.yaml; argocd cluster add default --name k3s-ringtail ``` 3. Sync ArgoCD apps in order: 1password-connect-ringtail -> external-secrets-crds-ringtail -> external-secrets-ringtail -> external-secrets-config-ringtail 4. Verify runner: ssh ringtail 'systemctl status gitea-runner-nix-container-builder' 5. Check Forgejo admin panel for ringtail-nix-builder runner online 6. Test: create containers/<name>/default.nix, tag with <name>-nix-v0.1.0 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209	2026-02-18 21:15:30 -08:00
Erich Blume	5f9b024b4a	Add Apple Silicon ZMQ detector for Frigate (#206 ) ## Summary - New `frigate_detector` ansible role deploys the [apple-silicon-detector](https://github.com/frigate-nvr/apple-silicon-detector) as a LaunchAgent on indri - Switches Frigate from ONNX CPU detector (~117ms) to ZMQ detector backed by CoreML/Neural Engine (~15ms) - Removes detect FPS cap (no longer needed with fast inference) - Updates Frigate docs and adds changelog fragment ## Deployment ### Phase 1: Deploy detector on indri (one-time setup + ansible) ```fish ssh indri 'git clone https://github.com/frigate-nvr/apple-silicon-detector.git ~/code/3rd/apple-silicon-detector' ssh indri 'cd ~/code/3rd/apple-silicon-detector && make install' mise run provision-indri -- --tags frigate_detector --check --diff # dry run mise run provision-indri -- --tags frigate_detector # apply ssh indri 'launchctl list mcquack.eblume.frigate-detector' # verify running ssh indri 'tail ~/Library/Logs/mcquack.frigate-detector.out.log' # verify bound ``` ### Phase 2: Test connectivity ```fish kubectl --context=minikube-indri -n frigate exec deploy/frigate -- nc -vz host.minikube.internal 5555 ``` ### Phase 3: Deploy Frigate config (branch workflow) ```fish argocd app set frigate --revision feature/frigate-zmq-detector && argocd app sync frigate ``` ### Phase 4: Post-deploy checks - [ ] Pod starts, no config errors - [ ] `/api/stats` shows detector type zmq, inference_speed ~15ms - [ ] detect_fps uncapped - [ ] Recordings and MQTT events flowing - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/206	2026-02-17 19:03:28 -08:00
Erich Blume	f45897b7c7	Upgrade Frigate 0.16.4 → 0.17.0-rc2 (#205 ) ## Summary - Bump Frigate image from `0.16.4-standard-arm64` to `0.17.0-rc2-standard-arm64` - Adapt `record` config to 0.17 schema: `retain.days`/`mode: all` → `continuous.days` - Update service docs and version tracker This is the first step toward the Apple Silicon ZMQ detector. The existing ONNX detector is kept so we can validate the upgrade independently. ## What is NOT changing - Detector config (still `type: onnx` with YOLO-NAS-s) - go2rtc streams, MQTT, cameras, zones, review rules - frigate-notify, storage PVs, Grafana dashboard ## Deployment and Testing - [ ] `argocd app set frigate --revision upgrade-frigate-0.17 && argocd app sync frigate` - [ ] Pod starts, `/api/version` returns `0.17.0-rc2` - [ ] No config errors in pod logs - [ ] Frigate web UI loads at `https://nvr.ops.eblu.me` - [ ] Live view works, detection running (`/api/stats` shows `detection_fps > 0`) - [ ] Recordings being created (`/api/recordings/summary`) - [ ] MQTT events flowing (check frigate-notify logs) - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/205	2026-02-17 16:56:12 -08:00
Erich Blume	acd213559e	Fix frigate live view by capping detect FPS (#204 ) ## Summary - Cap detect FPS to 2 to prevent recording segment backlog from ONNX inference bottleneck (~750ms/frame on ARM64 CPU) - Sync motion masks from live config (added second mask area) - Update driveway_entrance zone coordinates from live config - Add explicit alert labels `[person, car]` while keeping `required_zones: [driveway_entrance]` ## Context The "No frames have been received" error on the gablecam live view was caused by the detect stream falling behind — ONNX YOLO-NAS-s takes ~750ms per inference on ARM64 CPU, but the sub-stream sends 5 FPS. This caused recording segments to pile up and the ffmpeg watchdog to repeatedly kill/restart the process, creating gaps in the live view. ## Test plan - [ ] Sync ArgoCD `frigate` app to branch and verify pod restarts cleanly - [ ] Check `/api/stats` — `skipped_fps` should drop significantly, `process_fps` should be close to 2 - [ ] Verify live view at https://nvr.ops.eblu.me/#gablecam no longer shows "No frames" error - [ ] Verify detections and alerts still work in the driveway_entrance zone 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/204	2026-02-17 16:18:02 -08:00
Erich Blume	105a2c8c08	Update External Secrets Helm chart 1.3.1 → 2.0.0 (#203 ) ## Summary - Bump External Secrets Operator Helm chart from `helm-chart-1.3.1` to `helm-chart-2.0.0` (operator v1.3.2) - Updates both the operator app and CRDs app `targetRevision` - No Helm values changes needed — `installCRDs`, `resources`, `webhook`, `certController` keys are unchanged ## Breaking changes in chart 2.0.0 - Removed providers: Alibaba and Device42 (unmaintained) — does not affect our 1Password setup - Templating engine v1 deprecated — our ExternalSecrets don't set `engineVersion`, so they use the default (v2) - Webhook `failurePolicy` for SecretStore is now dynamic ## Deployment 1. Sync CRDs first: `argocd app set external-secrets-crds --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets-crds` 2. Sync operator: `argocd app set external-secrets --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets` 3. Verify: `kubectl --context=minikube-indri -n external-secrets get pods` 4. After merge, set both apps back to `--revision main` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/203	2026-02-17 10:43:21 -08:00
Erich Blume	5fbe70d1ba	Port ntfy to locally built container image (#202) ## Summary - Add `containers/ntfy/Dockerfile` — three-stage build (Node web UI, Go+CGO server, Alpine runtime) pinned to commit SHA `a03a37fe` (v2.17.0), sourced from forge mirror - Update ntfy deployment image from `binwiederhier/ntfy:v2.17.0` to `registry.ops.eblu.me/blumeops/ntfy:v1.0.0` - Note fish shell in CLAUDE.md ## Deployment After merge, release the container image: ```fish mise run container-tag-and-release ntfy v1.0.0 ``` Then sync: ```fish argocd app sync ntfy ``` ## Test plan - [x] `docker build` succeeds - [x] `dagger call build --src=. --container-name=ntfy` succeeds (exit 0, container ID printed) - [x] `ntfy --help` works in built container - [ ] Tag and release `ntfy-v1.0.0` after merge - [ ] Verify ntfy pod starts with new image - [ ] Verify health endpoint responds at `ntfy.ops.eblu.me/v1/health` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/202	2026-02-17 10:18:20 -08:00
Erich Blume	3e604d8fdc	Review ntfy: upgrade to v2.17.0 and add reference docs (#201 ) ## Summary - Upgrade ntfy from v2.11.0 to v2.17.0 (6 minor releases, no breaking changes) - Add reference doc for ntfy service - Add reference doc for frigate service (ntfy's sole producer via frigate-notify) - Update reference index and service-versions.yaml tracking ## Notable upstream changes (v2.12.0–v2.17.0) - v2.14.0: Declarative users/ACL config in files - v2.15.0: `require-login` flag for topic-level auth - v2.16.0: Dead man's switch (heartbeat) notifications, notification update/delete - v2.17.0: Priority templating, crash fixes (nil pointer panics) ## Deployment and Testing - [ ] ArgoCD sync ntfy after merge - [ ] Verify ntfy pod healthy with new image - [ ] Send a test notification via `curl -d "test" https://ntfy.ops.eblu.me/test` - [ ] Verify frigate-notify still delivers alerts to ntfy Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/201	2026-02-17 09:51:40 -08:00
Forgejo Actions	530460171a	Update docs release to v1.9.4 - Built changelog from towncrier fragments [skip ci]	2026-02-17 07:30:39 -08:00
Forgejo Actions	8a48171acf	Update docs release to v1.9.3 - Built changelog from towncrier fragments [skip ci]	2026-02-16 21:25:47 -08:00
Erich Blume	627e2b7894	Add UniFi admin link to homepage dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 19:15:46 -08:00
Erich Blume	d35c26d2b0	Fix mosquitto image tag: use 2.0.22 instead of nonexistent 2.1.2 (#198 ) ## Summary - The `eclipse-mosquitto:2.1.2` tag doesn't exist on Docker Hub — the 2.1.x series only publishes `-alpine` variants - Corrects the pinned tag to `2.0.22`, the latest non-alpine version (matching what the old floating `:2` tag was resolving to) - Updates tracking file and changelog fragment accordingly ## Context The previous PR #197 pinned mosquitto from floating `:2` to `2.1.2`, but the new pod failed with `ErrImagePull` ("manifest unknown"). The old pod is still running on `:2`. ## Test plan - [ ] Verify `eclipse-mosquitto:2.0.22` pulls successfully - [ ] Verify mosquitto pod restarts and passes readiness/liveness probes - [ ] `mise run services-check` passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/198	2026-02-16 17:19:32 -08:00
Erich Blume	0aab73af40	Bump mosquitto to 2.1.2 and tailscale-operator to v1.94.2 (#197) ## Summary - Pin mosquitto from floating `:2` tag to `2.1.2` (latest upstream, released Feb 9 2026) - Bump tailscale k8s-operator and proxy images from `v1.94.1` to `v1.94.2` - Record 7 reviewed services in `service-versions.yaml` (first service review pass) ## Services reviewed (11 total) | Service | Deployed | Latest | Status | |---------|----------|--------|--------| | prometheus | v3.9.1 | v3.9.1 | Current | | loki | 3.6.5 | 3.6.5 | Current | | kube-state-metrics | v2.18.0 | v2.18.0 | Current | | mosquitto | :2 (floating) | 2.1.2 | Pinned in this PR | | frigate | 0.16.4 | 0.16.4 | Current | | alloy-k8s | v1.13.1 | v1.13.1 | Current | | tailscale-operator | v1.94.1 | v1.94.2 | Bumped in this PR | | ntfy | v2.11.0 | v2.17.0 | Stale (future PR) | | frigate-notify | v0.3.5 | v0.5.4 | Stale (future PR) | | homepage | chart 2.1.0 | app v1.10.1 | Stale (future PR) | | grafana | chart 8.8.2 | chart 10.5.15 | Stale (future PR) | ## Deployment and Testing - [ ] `argocd app sync apps` - [ ] `argocd app set mosquitto --revision service-review/mosquitto-tailscale-operator && argocd app sync mosquitto` - [ ] `argocd app set tailscale-operator --revision service-review/mosquitto-tailscale-operator && argocd app sync tailscale-operator` - [ ] Verify mosquitto pod restarts with pinned image - [ ] Verify tailscale operator and proxy pods update - [ ] `mise run services-check` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/197	2026-02-16 17:14:38 -08:00
Forgejo Actions	994bed0693	Update docs release to v1.9.2 - Built changelog from towncrier fragments [skip ci]	2026-02-16 15:51:12 -08:00
Erich Blume	74294094e3	Fix navidrome custom container image v1.0.2 (#194 ) ## Summary - Switch navidrome deployment from upstream `deluan/navidrome:0.60.3` back to custom image `registry.ops.eblu.me/blumeops/navidrome:v1.0.2` - The v1.0.1 image was tagged before the `USER 65534` removal commit, so it still ran as a non-root user that couldn't write to the SQLite data directory - v1.0.2 is built from current main which includes both the `zlib-dev` build fix and the non-root user removal ## Deployment and Testing - [ ] Wait for CI to build `navidrome:v1.0.2` image - [ ] Sync via ArgoCD and verify pod starts without CrashLoopBackOff - [ ] Verify navidrome UI accessible at https://navidrome.ops.eblu.me Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/194	2026-02-16 08:24:33 -08:00
Erich Blume	7ffbd12ac8	Fix Frigate parked car re-detection and enable writable config (#193) ## Summary - Remove car-specific `max_frames: 150` which was causing a forget-and-re-detect loop on parked cars (every ~30 seconds at 5fps) - Set `stationary.interval: 0` so Frigate never re-runs detection on stationary objects - Replace read-only configmap subPath mount with initContainer + emptyDir, so Frigate UI changes (zones, masks) persist at runtime ## Context Frigate was spamming notifications because `max_frames` for cars caused it to "forget" a parked car after 150 frames, then immediately re-detect it as a brand new object. The fix follows [Frigate's official parked cars guide](https://docs.frigate.video/guides/parked_cars/). The writable config change also unblocks using `required_zones` for car alerts — zones can now be drawn in the Frigate UI and will survive until pod reschedule (at which point they should be baked into the configmap via IaC). ## Test plan - [ ] Sync frigate app via ArgoCD and verify pod starts with initContainer - [ ] Confirm parked cars no longer trigger repeated alerts - [ ] Draw a zone/mask in Frigate UI, save, verify it persists after Frigate restart - [ ] Set up `driveway_entrance` required zone for car alerts 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/193	2026-02-15 17:48:14 -08:00
Erich Blume	6c41338b36	Revert navidrome to upstream image pending container fix Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 08:22:04 -08:00
Erich Blume	accbb80683	Update navidrome image to v1.0.1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-15 08:18:17 -08:00
Erich Blume	996441876d	Document container build pattern and port navidrome (#192) ## Summary - Add how-to guide (`docs/how-to/build-container-image.md`) covering the full container build workflow: directory layout, Dagger local builds, mise release task, and common patterns with links to existing containers - Port navidrome from upstream `deluan/navidrome:0.60.3` to a custom three-stage build (`containers/navidrome/Dockerfile`) using Node + Go + Alpine - Update navidrome deployment to use `registry.ops.eblu.me/blumeops/navidrome:v1.0.0` ## Deployment and Testing - [x] `dagger call build --src=. --container-name=navidrome` builds successfully - [ ] After merge: `mise run container-tag-and-release navidrome v1.0.0` - [ ] After image published: `argocd app sync navidrome` and verify pod starts Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/192	2026-02-15 08:05:11 -08:00
Forgejo Actions	26c1ff5ce6	Update docs release to v1.9.1 - Built changelog from towncrier fragments [skip ci]	2026-02-15 07:43:00 -08:00
Erich Blume	22f418d0dc	Doc review: connect-to-postgres, create-release-artifact-workflow, deploy-k8s-service (#191 ) ## Summary Review session covering 3 docs, plus a codebase-wide cleanup: ### Docs reviewed - connect-to-postgres — verified end-to-end (psql connection tested), stamped - create-release-artifact-workflow — clarified that `build-blumeops.yaml` is only a version bump example (not a packages API example) - deploy-k8s-service — fixed stale repoURL (`indri:2200` → `forge.ops.eblu.me:2222`), wrong Caddy config keys (`upstream` → `backend`, added missing `host`), updated Homepage group to "Services", added Tailscale tag documentation ### Codebase cleanup - Migrated all remaining `op item get --fields` calls to `op read` URI syntax across 7 files (docs, READMEs, YAML comments) - Simplified the `op read` vs `op item get` guidance in CLAUDE.md ## Side findings (not addressed) - New `immich-pg` CNPG cluster not yet documented in the postgresql reference card ## Test plan - [x] `psql` connection to `pg.ops.eblu.me` verified - [x] All pre-commit hooks pass - [x] `docs-check-links`, `docs-check-index`, `docs-check-frontmatter` pass Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/191	2026-02-15 07:42:01 -08:00
Forgejo Actions	b2b5879e3c	Update docs release to v1.9.0 - Built changelog from towncrier fragments [skip ci]	2026-02-14 21:32:27 -08:00
Erich Blume	04c7f3c45a	Deploy Frigate NVR stack with Mosquitto, Ntfy, and frigate-notify (#190 ) ## Summary Deploy a cloud-free NVR stack for the GableCam (ReoLink Elite Floodlight at 192.168.1.159): - Mosquitto — shared MQTT broker in `mqtt` namespace (cluster-internal, no auth) - Ntfy — self-hosted push notifications in `ntfy` namespace, exposed at `ntfy.tail8d86e.ts.net` / `ntfy.ops.eblu.me` - Frigate — NVR with GableCam via HTTP-FLV, ONNX CPU detection, NFS recordings on sifaka, exposed at `nvr.tail8d86e.ts.net` / `nvr.ops.eblu.me` - frigate-notify — bridges Frigate detection events (person, car, dog, cat) to Ntfy alerts via MQTT Also includes: - Prometheus scrape target for Frigate metrics - Grafana dashboard for Frigate (status, inference speed, FPS, CPU/memory, storage) - Caddy reverse proxy entries for `nvr.ops.eblu.me` and `ntfy.ops.eblu.me` ## Prerequisites - [ ] Create NFS share `frigate` on sifaka (`/volume1/frigate`, RW for indri) - [ ] Create 1Password item "Reolink Floodlight Camera" in `blumeops` vault with `username` and `password` fields ## Deployment (after merge) ```bash argocd app sync apps argocd app sync mosquitto argocd app sync ntfy argocd app sync frigate argocd app sync grafana-config argocd app sync prometheus mise run provision-indri -- --tags caddy mise run services-check ``` ## Verification - [ ] Mosquitto pod running, accepting connections on 1883 - [ ] Ntfy web UI accessible at `ntfy.ops.eblu.me` - [ ] Frigate web UI at `nvr.ops.eblu.me` showing GableCam live feed - [ ] Object detection working (ONNX, person/car/dog/cat) - [ ] Recordings appearing in NFS share on sifaka - [ ] frigate-notify sending detection alerts to Ntfy - [ ] Prometheus scraping Frigate metrics - [ ] Grafana dashboard showing Frigate data Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/190	2026-02-14 21:27:44 -08:00
Erich Blume	b77ae19f20	Fix 1Password Connect credentials for chart 2.3.0 Chart 2.3.0 mounts credentials as a file with standard k8s base64 encoding. The old double-encoding workaround (credentials-base64 in stringData) now produces invalid JSON. Use raw JSON (credentials-file) instead. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 17:30:45 -08:00
Erich Blume	8f4708e26f	Fix navidrome image tag: remove v prefix (0.60.3 not v0.60.3) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 17:23:12 -08:00
Erich Blume	b3747f6c95	Tier 1 version bumps (#186 ) ## Summary Audit and upgrade of all deployed images, helm charts, and custom container Dockerfiles to latest stable versions. This PR covers Tier 1 (low-risk minor/patch bumps only). ### Upstream images \| Image \| Old \| New \| \|-------\|-----\|-----\| \| kube-state-metrics \| v2.13.0 \| v2.18.0 \| \| prometheus \| v3.2.1 \| v3.9.1 \| \| loki \| 3.3.2 \| 3.6.5 \| \| alloy \| v1.5.1 \| v1.13.1 \| \| tailscale (proxy + operator) \| v1.92.5 \| v1.94.1 \| \| navidrome \| :latest \| v0.60.3 (pinned) \| ### Helm charts \| Chart \| Old \| New \| \|-------\|-----\|-----\| \| CloudNativePG \| v0.27.0 \| v0.27.1 \| \| 1Password Connect \| 2.2.1 \| 2.3.0 \| ### Custom containers (Dockerfiles updated, images not yet tagged) \| Container \| Changes \| New tag \| \|-----------\|---------\|---------\| \| miniflux \| 2.2.16→2.2.17 (security), alpine 3.22 \| v1.1.0 \| \| kubectl \| v1.34.1→v1.34.4, alpine 3.22 \| v1.1.0 \| \| kiwix-serve \| alpine 3.22 \| v1.1.0 \| \| nettest \| alpine 3.22 \| v0.14.0 \| \| transmission \| alpine 3.22, pkg 4.0.6-r4 \| v1.1.0 \| All custom containers verified with local `dagger call build`. ### Deferred to Tier 2 (separate PRs) - Forgejo runner 6→12 (major version scheme change) - Docker DinD 27→29 - Grafana chart 8→11 (repo migration) - External Secrets 1→2 (breaking changes) - Python 3.12→3.13, Elixir 1.18→1.19, Node 22→24 - Transmission 4.0.6→4.1.0 (not in Alpine yet) ## Deployment After merge: 1. Tag custom containers: `mise run container-tag-and-release <name> <version>` for each 2. Wait for CI builds to complete 3. `argocd app sync apps` then sync individual apps, or let ArgoCD auto-detect Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/186	2026-02-13 17:16:37 -08:00
Erich Blume	d5c00192d5	Configure DinD to use Zot as pull-through registry mirror (#183 ) ## Summary - Add `daemon.json` with `registry-mirrors` to the forgejo-runner ConfigMap, pointing DinD at `http://host.minikube.internal:5050` - Mount `daemon.json` into the DinD sidecar at `/etc/docker/daemon.json` via `subPath` - Docker Hub pulls during Dagger CI builds will now route through Zot's pull-through cache, reducing bandwidth and avoiding rate limits ## Deployment and Testing - [ ] `argocd app sync forgejo-runner` - [ ] Exec into DinD container: `docker info` should show the registry mirror - [ ] Trigger a workflow build and check Zot logs for cache hits Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/183	2026-02-13 12:36:03 -08:00
Erich Blume	ba9b251759	Update forgejo-runner image to v3.2.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 12:16:52 -08:00
Erich Blume	d0c18043b7	Revert forgejo-runner image to v3.1.0 v3.2.0 build failed (GitHub download timeout), rolling back to working image while it rebuilds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 12:07:51 -08:00
Erich Blume	fdd3f6483a	Update forgejo-runner image to v3.2.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 11:08:57 -08:00
Forgejo Actions	02b1397f1a	Update docs release to v1.8.2 - Built changelog from towncrier fragments [skip ci]	2026-02-13 10:36:04 -08:00
Erich Blume	0098ac37e0	Move non-secret runner env vars to deployment spec (#181 ) ## Summary - Move FORGEJO_URL, RUNNER_NAME, and RUNNER_LABELS from ExternalSecret template to deployment env vars - ExternalSecret now only contains the actual secret (RUNNER_TOKEN) - Image version changes in RUNNER_LABELS now trigger automatic pod rollouts ## Deployment 1. Merge this PR 2. `argocd app sync forgejo-runner` — the deployment spec change will auto-roll the pod No manual restart needed — that's the whole point :) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/181	2026-02-13 10:29:23 -08:00
Erich Blume	52bbf88aa6	Update forgejo-runner image to v3.1.0 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 10:21:43 -08:00
Erich Blume	4942dee182	Update homepage layout for new Content/Misc groups Replace old Apps/Observability/Infrastructure layout entries with Content and Misc to match the recategorized ingress annotations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 09:16:40 -08:00
Erich Blume	ca6a845604	Move ArgoCD to Misc homepage group and rename ingress file ArgoCD's tailscale ingress was missed in the recategorization (filed as service-tailscale.yaml instead of ingress-tailscale.yaml). Fix the group annotation and rename the file to match the convention used by all other services. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-13 09:13:32 -08:00
Erich Blume	48ce5b4120	Recategorize homepage into Content and Misc groups (#179 ) ## Summary - Replace the three homepage groups (Apps, Observability, Infrastructure) with two cleaner groups - Content: Immich, Kiwix, Miniflux, DJ, Grafana - Misc: CV, TeslaMate, Transmission, Docs, Prometheus, PyPI ## Deployment and Testing - [ ] Sync affected ingresses via ArgoCD (all 11 services) - [ ] Verify homepage shows the two new groups correctly Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/179	2026-02-13 09:09:22 -08:00
Forgejo Actions	e21277ae83	Update docs release to v1.8.0 - Built changelog from towncrier fragments [skip ci]	2026-02-12 19:20:27 -08:00
Erich Blume	9c789a1868	Fix cache hit rate on APM and Fly.io dashboards (#177 ) ## Summary - Remove `match_all = true` from `flyio_nginx_cache_requests_total` in Alloy so the metric only counts requests that go through the proxy cache (excludes health checks with empty `cache_status`) - Change dashboard queries from `rate(...[5m])` to `increase(...[$__range])` — aggregates over the full dashboard time window instead of a 5-minute sliding window, giving meaningful ratios for low-traffic static sites - Add null/NaN value mapping to show "No traffic" in neutral color instead of blank/red ## Root cause Health check requests from Fly.io hit the default nginx server block (no `proxy_cache`), producing entries with empty `upstream_cache_status`. With `match_all = true`, these were counted in the cache metric, diluting the Fly.io dashboard ratio. For APM dashboards, `rate()[5m]` on low-traffic sites with 24h cache validity almost always returns either all-HITs (100%) or no data (blank → red background). ## Deployment - Fly.io proxy redeploy needed for Alloy config change - ArgoCD sync for dashboard ConfigMap changes ## Test plan - [ ] Redeploy Fly.io proxy - [ ] Sync grafana-config in ArgoCD - [ ] Verify CV APM cache hit ratio shows a real percentage (not 100%) - [ ] Verify Docs APM shows "No traffic" in neutral color when idle, real ratio when visited - [ ] Verify Fly.io proxy dashboard cache ratio excludes health checks Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/177	2026-02-12 18:40:48 -08:00
Erich Blume	9717863f65	Update CV release to v1.0.3, add X-Clacks-Overhead header (#176 ) ## Summary - Update CV release URL from v1.0.2 to v1.0.3 - Add `X-Clacks-Overhead: GNU Terry Pratchett` header to both `docs.eblu.me` and `cv.eblu.me` server blocks in the Fly.io proxy nginx config ## Deployment and Testing - [ ] Sync CV app: `argocd app sync cv` - [ ] Verify CV is serving v1.0.3 content - [ ] Deploy fly proxy (workflow or `mise run fly-deploy`) - [ ] Verify header: `curl -sI https://docs.eblu.me | grep -i clacks` - [ ] Verify header: `curl -sI https://cv.eblu.me | grep -i clacks` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/176	2026-02-12 17:08:22 -08:00
Erich Blume	ed5c9c9b48	Update CV release to v1.0.2 (#175 ) ## Summary - Update `CV_RELEASE_URL` in cv deployment from v1.0.1 to v1.0.2 ## Deployment and Testing - [ ] `argocd app sync cv` after merge - [ ] Verify cv.eblu.me serves updated content Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/175	2026-02-12 16:18:55 -08:00
Forgejo Actions	70d8881959	Update docs release to v1.7.1 - Built changelog from towncrier fragments [skip ci]	2026-02-12 14:13:12 -08:00
Erich Blume	7dc03c0af1	Add CV to services-check, update homepage link (#174 ) ## Summary - Add CV to services-check (tailnet endpoint + public cv.eblu.me) - Update CV homepage annotation to point to cv.eblu.me instead of cv.ops.eblu.me ## Deployment and Testing - [ ] `argocd app sync cv` (homepage link change) - [ ] `mise run services-check` passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/174	2026-02-12 14:10:03 -08:00

176 commits