blumeops

Author	SHA1	Message	Date
Erich Blume	c8aec1e2a2	Add NixOS configuration for ringtail workstation Scaffolds a NixOS flake for ringtail (gaming/compute workstation, RTX 4080): - Declarative disk partitioning via disko (GPT, EFI + ext4 on NVMe) - NVIDIA proprietary drivers with CUDA support - Sway/Wayland desktop with greetd, PipeWire audio, Steam - Tailscale for tailnet connectivity - Ansible playbook + mise task for ongoing provisioning via nixos-rebuild - Pulumi auth key for tailnet bootstrap (tag:homelab, tag:blumeops) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 08:07:19 -08:00
Erich Blume	5f9b024b4a	Add Apple Silicon ZMQ detector for Frigate (#206 ) ## Summary - New `frigate_detector` ansible role deploys the [apple-silicon-detector](https://github.com/frigate-nvr/apple-silicon-detector) as a LaunchAgent on indri - Switches Frigate from ONNX CPU detector (~117ms) to ZMQ detector backed by CoreML/Neural Engine (~15ms) - Removes detect FPS cap (no longer needed with fast inference) - Updates Frigate docs and adds changelog fragment ## Deployment ### Phase 1: Deploy detector on indri (one-time setup + ansible) ```fish ssh indri 'git clone https://github.com/frigate-nvr/apple-silicon-detector.git ~/code/3rd/apple-silicon-detector' ssh indri 'cd ~/code/3rd/apple-silicon-detector && make install' mise run provision-indri -- --tags frigate_detector --check --diff # dry run mise run provision-indri -- --tags frigate_detector # apply ssh indri 'launchctl list mcquack.eblume.frigate-detector' # verify running ssh indri 'tail ~/Library/Logs/mcquack.frigate-detector.out.log' # verify bound ``` ### Phase 2: Test connectivity ```fish kubectl --context=minikube-indri -n frigate exec deploy/frigate -- nc -vz host.minikube.internal 5555 ``` ### Phase 3: Deploy Frigate config (branch workflow) ```fish argocd app set frigate --revision feature/frigate-zmq-detector && argocd app sync frigate ``` ### Phase 4: Post-deploy checks - [ ] Pod starts, no config errors - [ ] `/api/stats` shows detector type zmq, inference_speed ~15ms - [ ] detect_fps uncapped - [ ] Recordings and MQTT events flowing - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/206	2026-02-17 19:03:28 -08:00
Erich Blume	f45897b7c7	Upgrade Frigate 0.16.4 → 0.17.0-rc2 (#205 ) ## Summary - Bump Frigate image from `0.16.4-standard-arm64` to `0.17.0-rc2-standard-arm64` - Adapt `record` config to 0.17 schema: `retain.days`/`mode: all` → `continuous.days` - Update service docs and version tracker This is the first step toward the Apple Silicon ZMQ detector. The existing ONNX detector is kept so we can validate the upgrade independently. ## What is NOT changing - Detector config (still `type: onnx` with YOLO-NAS-s) - go2rtc streams, MQTT, cameras, zones, review rules - frigate-notify, storage PVs, Grafana dashboard ## Deployment and Testing - [ ] `argocd app set frigate --revision upgrade-frigate-0.17 && argocd app sync frigate` - [ ] Pod starts, `/api/version` returns `0.17.0-rc2` - [ ] No config errors in pod logs - [ ] Frigate web UI loads at `https://nvr.ops.eblu.me` - [ ] Live view works, detection running (`/api/stats` shows `detection_fps > 0`) - [ ] Recordings being created (`/api/recordings/summary`) - [ ] MQTT events flowing (check frigate-notify logs) - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/205	2026-02-17 16:56:12 -08:00
Erich Blume	acd213559e	Fix frigate live view by capping detect FPS (#204 ) ## Summary - Cap detect FPS to 2 to prevent recording segment backlog from ONNX inference bottleneck (~750ms/frame on ARM64 CPU) - Sync motion masks from live config (added second mask area) - Update driveway_entrance zone coordinates from live config - Add explicit alert labels `[person, car]` while keeping `required_zones: [driveway_entrance]` ## Context The "No frames have been received" error on the gablecam live view was caused by the detect stream falling behind — ONNX YOLO-NAS-s takes ~750ms per inference on ARM64 CPU, but the sub-stream sends 5 FPS. This caused recording segments to pile up and the ffmpeg watchdog to repeatedly kill/restart the process, creating gaps in the live view. ## Test plan - [ ] Sync ArgoCD `frigate` app to branch and verify pod restarts cleanly - [ ] Check `/api/stats` — `skipped_fps` should drop significantly, `process_fps` should be close to 2 - [ ] Verify live view at https://nvr.ops.eblu.me/#gablecam no longer shows "No frames" error - [ ] Verify detections and alerts still work in the driveway_entrance zone 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/204	2026-02-17 16:18:02 -08:00
Erich Blume	1e96866dd3	Grafana helm chart upgrade plan	2026-02-17 11:15:34 -08:00
Erich Blume	105a2c8c08	Update External Secrets Helm chart 1.3.1 → 2.0.0 (#203 ) ## Summary - Bump External Secrets Operator Helm chart from `helm-chart-1.3.1` to `helm-chart-2.0.0` (operator v1.3.2) - Updates both the operator app and CRDs app `targetRevision` - No Helm values changes needed — `installCRDs`, `resources`, `webhook`, `certController` keys are unchanged ## Breaking changes in chart 2.0.0 - Removed providers: Alibaba and Device42 (unmaintained) — does not affect our 1Password setup - Templating engine v1 deprecated — our ExternalSecrets don't set `engineVersion`, so they use the default (v2) - Webhook `failurePolicy` for SecretStore is now dynamic ## Deployment 1. Sync CRDs first: `argocd app set external-secrets-crds --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets-crds` 2. Sync operator: `argocd app set external-secrets --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets` 3. Verify: `kubectl --context=minikube-indri -n external-secrets get pods` 4. After merge, set both apps back to `--revision main` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/203	2026-02-17 10:43:21 -08:00
Erich Blume	5fbe70d1ba	Port ntfy to locally built container image (#202 ) ## Summary - Add `containers/ntfy/Dockerfile` — three-stage build (Node web UI, Go+CGO server, Alpine runtime) pinned to commit SHA `a03a37fe` (v2.17.0), sourced from forge mirror - Update ntfy deployment image from `binwiederhier/ntfy:v2.17.0` to `registry.ops.eblu.me/blumeops/ntfy:v1.0.0` - Note fish shell in CLAUDE.md ## Deployment After merge, release the container image: ```fish mise run container-tag-and-release ntfy v1.0.0 ``` Then sync: ```fish argocd app sync ntfy ``` ## Test plan - [x] `docker build` succeeds - [x] `dagger call build --src=. --container-name=ntfy` succeeds (exit 0, container ID printed) - [x] `ntfy --help` works in built container - [ ] Tag and release `ntfy-v1.0.0` after merge - [ ] Verify ntfy pod starts with new image - [ ] Verify health endpoint responds at `ntfy.ops.eblu.me/v1/health` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/202	2026-02-17 10:18:20 -08:00
Erich Blume	3e604d8fdc	Review ntfy: upgrade to v2.17.0 and add reference docs (#201 ) ## Summary - Upgrade ntfy from v2.11.0 to v2.17.0 (6 minor releases, no breaking changes) - Add reference doc for ntfy service - Add reference doc for frigate service (ntfy's sole producer via frigate-notify) - Update reference index and service-versions.yaml tracking ## Notable upstream changes (v2.12.0–v2.17.0) - v2.14.0: Declarative users/ACL config in files - v2.15.0: `require-login` flag for topic-level auth - v2.16.0: Dead man's switch (heartbeat) notifications, notification update/delete - v2.17.0: Priority templating, crash fixes (nil pointer panics) ## Deployment and Testing - [ ] ArgoCD sync ntfy after merge - [ ] Verify ntfy pod healthy with new image - [ ] Send a test notification via `curl -d "test" https://ntfy.ops.eblu.me/test` - [ ] Verify frigate-notify still delivers alerts to ntfy Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/201	2026-02-17 09:51:40 -08:00
Erich Blume	2f599a15bd	Fix zk-docs broken path after how-to reorg Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 07:32:54 -08:00
Forgejo Actions	530460171a	Update docs release to v1.9.4 - Built changelog from towncrier fragments [skip ci]	2026-02-17 07:30:39 -08:00
Erich Blume	27d8f3cf1f	Review gandi-operations doc and reorganize how-to guides (#200 ) ## Summary - Doc review: Reviewed `gandi-operations.md` — added `last-reviewed` frontmatter, verified all wiki-links, confirmed Pulumi state has no drift - Gandi reference fix: Added missing `cv.eblu.me` CNAME row to `gandi.md` DNS records table (was present in Pulumi but undocumented) - Pulumi comment fix: Updated stale `README.md` reference in `__main__.py` to point to `docs/how-to/gandi-operations.md` - How-to reorg: Moved 14 how-to guides into 3 subdirectories (`deployment/`, `configuration/`, `operations/`), collapsed the Documentation and Database index sections into Configuration and Operations respectively ## Verification - `docs-check-links` — all 180 wiki-links valid - `docs-check-filenames` — all 90 filenames unique - `dns-preview` — 5 resources unchanged, no drift - All pre-commit hooks pass ## Test plan - [ ] Verify docs site builds correctly with new paths - [ ] Spot-check a few wiki-links from other pages to moved how-to guides Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/200	2026-02-17 07:29:33 -08:00
Forgejo Actions	8a48171acf	Update docs release to v1.9.3 - Built changelog from towncrier fragments [skip ci]	2026-02-16 21:25:47 -08:00
Erich Blume	779b7d6709	Eliminate double towncrier run in release workflow (#199 ) ## Summary - Added a new `build_quartz` Dagger function that builds the Quartz site from a pre-processed source tree (no towncrier) - Reordered the release workflow so towncrier runs once on the runner, then passes the updated working tree to `build-quartz` - `build_docs` and `build_changelog` are preserved for standalone use — `build_docs` now delegates to `build_quartz` internally ## Motivation Previously towncrier ran twice per release: once inside a Dagger container (via `build_docs` → `build_changelog`) and once on the runner to capture CHANGELOG.md changes for the git commit. This was wasteful and fragile — if towncrier behavior changed, the two runs could produce different results. ## Test plan - [ ] Review diff to confirm workflow step ordering is correct - [ ] Trigger a release and confirm towncrier runs only once - [ ] Verify the docs tarball contains the updated CHANGELOG.md - [ ] `dagger call build-quartz --src=. --version=vX.Y.Z` should work standalone Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/199	2026-02-16 21:24:34 -08:00
Erich Blume	627e2b7894	Add UniFi admin link to homepage dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-16 19:15:46 -08:00
Erich Blume	d35c26d2b0	Fix mosquitto image tag: use 2.0.22 instead of nonexistent 2.1.2 (#198 ) ## Summary - The `eclipse-mosquitto:2.1.2` tag doesn't exist on Docker Hub — the 2.1.x series only publishes `-alpine` variants - Corrects the pinned tag to `2.0.22`, the latest non-alpine version (matching what the old floating `:2` tag was resolving to) - Updates tracking file and changelog fragment accordingly ## Context The previous PR #197 pinned mosquitto from floating `:2` to `2.1.2`, but the new pod failed with `ErrImagePull` ("manifest unknown"). The old pod is still running on `:2`. ## Test plan - [ ] Verify `eclipse-mosquitto:2.0.22` pulls successfully - [ ] Verify mosquitto pod restarts and passes readiness/liveness probes - [ ] `mise run services-check` passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/198	2026-02-16 17:19:32 -08:00
Erich Blume	0aab73af40	Bump mosquitto to 2.1.2 and tailscale-operator to v1.94.2 (#197 ) ## Summary - Pin mosquitto from floating `:2` tag to `2.1.2` (latest upstream, released Feb 9 2026) - Bump tailscale k8s-operator and proxy images from `v1.94.1` to `v1.94.2` - Record 7 reviewed services in `service-versions.yaml` (first service review pass) ## Services reviewed (11 total) \| Service \| Deployed \| Latest \| Status \| \|---------\|----------\|--------\|--------\| \| prometheus \| v3.9.1 \| v3.9.1 \| Current \| \| loki \| 3.6.5 \| 3.6.5 \| Current \| \| kube-state-metrics \| v2.18.0 \| v2.18.0 \| Current \| \| mosquitto \| :2 (floating) \| 2.1.2 \| Pinned in this PR \| \| frigate \| 0.16.4 \| 0.16.4 \| Current \| \| alloy-k8s \| v1.13.1 \| v1.13.1 \| Current \| \| tailscale-operator \| v1.94.1 \| v1.94.2 \| Bumped in this PR \| \| ntfy \| v2.11.0 \| v2.17.0 \| Stale (future PR) \| \| frigate-notify \| v0.3.5 \| v0.5.4 \| Stale (future PR) \| \| homepage \| chart 2.1.0 \| app v1.10.1 \| Stale (future PR) \| \| grafana \| chart 8.8.2 \| chart 10.5.15 \| Stale (future PR) \| ## Deployment and Testing - [ ] `argocd app sync apps` - [ ] `argocd app set mosquitto --revision service-review/mosquitto-tailscale-operator && argocd app sync mosquitto` - [ ] `argocd app set tailscale-operator --revision service-review/mosquitto-tailscale-operator && argocd app sync tailscale-operator` - [ ] Verify mosquitto pod restarts with pinned image - [ ] Verify tailscale operator and proxy pods update - [ ] `mise run services-check` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/197	2026-02-16 17:14:38 -08:00
Erich Blume	faf9682b55	Add service version review system (#196 ) ## Summary - Add `service-versions.yaml` tracking file with 33 services and upstream release URLs - Add `mise run service-review` task (Python uv script) mirroring the docs-review UX - Add `review-services` how-to article covering the review process by service type - Add `[[review-services]]` link to the how-to index Knowledge Base table ## Deployment and Testing - [x] `mise run service-review` displays 33 services, all "never reviewed" - [x] `mise run service-review -- --type ansible` filters to 7 Ansible services - [x] `mise run service-review -- --limit 5` shows 5 rows - [x] `mise run docs-check-links` — no broken wiki-links - [x] `mise run docs-check-frontmatter` — new doc passes validation - [x] All pre-commit hooks pass Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/196	2026-02-16 17:02:56 -08:00
Forgejo Actions	994bed0693	Update docs release to v1.9.2 - Built changelog from towncrier fragments [skip ci]	2026-02-16 15:51:12 -08:00
Erich Blume	2c55c2316e	Review expose-service-publicly doc (#195 ) ## Summary - Replace stale inline code listings (fly.toml, Dockerfile, start.sh, nginx.conf, mise tasks, CI workflow) with brief descriptions pointing readers to the actual `fly/` and `mise-tasks/` files — prevents future drift - Add observability sidecar section documenting the Alloy integration (logs → Loki, metrics → Prometheus) - Fix broken internal wiki-link (`[[#7. Update Tailscale ACLs if needed]]` → correct heading) - Update per-service nginx templates to current patterns (deferred DNS resolution via `set $upstream` variable, `proxy_intercept_errors`, error pages) - Add `cv.eblu.me` to verification steps (live service not previously documented) - Add `last-reviewed: 2026-02-16` frontmatter - Net -187 lines (58 added, 245 removed) ## Deployment and Testing - [x] All pre-commit hooks pass (link validation, frontmatter, filenames) - [ ] Docs site renders correctly after merge Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/195	2026-02-16 15:49:55 -08:00
Erich Blume	74294094e3	Fix navidrome custom container image v1.0.2 (#194 ) ## Summary - Switch navidrome deployment from upstream `deluan/navidrome:0.60.3` back to custom image `registry.ops.eblu.me/blumeops/navidrome:v1.0.2` - The v1.0.1 image was tagged before the `USER 65534` removal commit, so it still ran as a non-root user that couldn't write to the SQLite data directory - v1.0.2 is built from current main which includes both the `zlib-dev` build fix and the non-root user removal ## Deployment and Testing - [ ] Wait for CI to build `navidrome:v1.0.2` image - [ ] Sync via ArgoCD and verify pod starts without CrashLoopBackOff - [ ] Verify navidrome UI accessible at https://navidrome.ops.eblu.me Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/194	2026-02-16 08:24:33 -08:00
Erich Blume	7ffbd12ac8	Fix Frigate parked car re-detection and enable writable config (#193 ) ## Summary - Remove car-specific `max_frames: 150` which was causing a forget-and-re-detect loop on parked cars (every ~30 seconds at 5fps) - Set `stationary.interval: 0` so Frigate never re-runs detection on stationary objects - Replace read-only configmap subPath mount with initContainer + emptyDir, so Frigate UI changes (zones, masks) persist at runtime ## Context Frigate was spamming notifications because `max_frames` for cars caused it to "forget" a parked car after 150 frames, then immediately re-detect it as a brand new object. The fix follows [Frigate's official parked cars guide](https://docs.frigate.video/guides/parked_cars/). The writable config change also unblocks using `required_zones` for car alerts — zones can now be drawn in the Frigate UI and will survive until pod reschedule (at which point they should be baked into the configmap via IaC). ## Test plan - [ ] Sync frigate app via ArgoCD and verify pod starts with initContainer - [ ] Confirm parked cars no longer trigger repeated alerts - [ ] Draw a zone/mask in Frigate UI, save, verify it persists after Frigate restart - [ ] Set up `driveway_entrance` required zone for car alerts 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/193	2026-02-15 17:48:14 -08:00
Erich Blume	996441876d	Document container build pattern and port navidrome (#192 ) ## Summary - Add how-to guide (`docs/how-to/build-container-image.md`) covering the full container build workflow: directory layout, Dagger local builds, mise release task, and common patterns with links to existing containers - Port navidrome from upstream `deluan/navidrome:0.60.3` to a custom three-stage build (`containers/navidrome/Dockerfile`) using Node + Go + Alpine - Update navidrome deployment to use `registry.ops.eblu.me/blumeops/navidrome:v1.0.0` ## Deployment and Testing - [x] `dagger call build --src=. --container-name=navidrome` builds successfully - [ ] After merge: `mise run container-tag-and-release navidrome v1.0.0` - [ ] After image published: `argocd app sync navidrome` and verify pod starts Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/192	2026-02-15 08:05:11 -08:00
Forgejo Actions	26c1ff5ce6	Update docs release to v1.9.1 - Built changelog from towncrier fragments [skip ci]	2026-02-15 07:43:00 -08:00
Erich Blume	22f418d0dc	Doc review: connect-to-postgres, create-release-artifact-workflow, deploy-k8s-service (#191 ) ## Summary Review session covering 3 docs, plus a codebase-wide cleanup: ### Docs reviewed - connect-to-postgres — verified end-to-end (psql connection tested), stamped - create-release-artifact-workflow — clarified that `build-blumeops.yaml` is only a version bump example (not a packages API example) - deploy-k8s-service — fixed stale repoURL (`indri:2200` → `forge.ops.eblu.me:2222`), wrong Caddy config keys (`upstream` → `backend`, added missing `host`), updated Homepage group to "Services", added Tailscale tag documentation ### Codebase cleanup - Migrated all remaining `op item get --fields` calls to `op read` URI syntax across 7 files (docs, READMEs, YAML comments) - Simplified the `op read` vs `op item get` guidance in CLAUDE.md ## Side findings (not addressed) - New `immich-pg` CNPG cluster not yet documented in the postgresql reference card ## Test plan - [x] `psql` connection to `pg.ops.eblu.me` verified - [x] All pre-commit hooks pass - [x] `docs-check-links`, `docs-check-index`, `docs-check-frontmatter` pass Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/191	2026-02-15 07:42:01 -08:00
Forgejo Actions	b2b5879e3c	Update docs release to v1.9.0 - Built changelog from towncrier fragments [skip ci]	2026-02-14 21:32:27 -08:00
Erich Blume	04c7f3c45a	Deploy Frigate NVR stack with Mosquitto, Ntfy, and frigate-notify (#190 ) ## Summary Deploy a cloud-free NVR stack for the GableCam (ReoLink Elite Floodlight at 192.168.1.159): - Mosquitto — shared MQTT broker in `mqtt` namespace (cluster-internal, no auth) - Ntfy — self-hosted push notifications in `ntfy` namespace, exposed at `ntfy.tail8d86e.ts.net` / `ntfy.ops.eblu.me` - Frigate — NVR with GableCam via HTTP-FLV, ONNX CPU detection, NFS recordings on sifaka, exposed at `nvr.tail8d86e.ts.net` / `nvr.ops.eblu.me` - frigate-notify — bridges Frigate detection events (person, car, dog, cat) to Ntfy alerts via MQTT Also includes: - Prometheus scrape target for Frigate metrics - Grafana dashboard for Frigate (status, inference speed, FPS, CPU/memory, storage) - Caddy reverse proxy entries for `nvr.ops.eblu.me` and `ntfy.ops.eblu.me` ## Prerequisites - [ ] Create NFS share `frigate` on sifaka (`/volume1/frigate`, RW for indri) - [ ] Create 1Password item "Reolink Floodlight Camera" in `blumeops` vault with `username` and `password` fields ## Deployment (after merge) ```bash argocd app sync apps argocd app sync mosquitto argocd app sync ntfy argocd app sync frigate argocd app sync grafana-config argocd app sync prometheus mise run provision-indri -- --tags caddy mise run services-check ``` ## Verification - [ ] Mosquitto pod running, accepting connections on 1883 - [ ] Ntfy web UI accessible at `ntfy.ops.eblu.me` - [ ] Frigate web UI at `nvr.ops.eblu.me` showing GableCam live feed - [ ] Object detection working (ONNX, person/car/dog/cat) - [ ] Recordings appearing in NFS share on sifaka - [ ] frigate-notify sending detection alerts to Ntfy - [ ] Prometheus scraping Frigate metrics - [ ] Grafana dashboard showing Frigate data Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/190	2026-02-14 21:27:44 -08:00
Erich Blume	f376c02b76	Move segment-home-network to completed plans Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 10:48:32 -08:00
Erich Blume	2252e5e60d	Update segmentation plan: mark completed, fix firewall details Reflect actual UX7 zone-based firewall UI, correct streaming port (8096 not 443), note indri DHCP reservation, mark plan as completed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-14 10:43:17 -08:00
Erich Blume	657bb28fd1	Abandon UniFi IaC, add manual network segmentation plan (#189 ) ## Summary - Abandon the UniFi Pulumi IaC approach after provider bugs caused a network outage (no-op update reset undeclared properties on the default LAN network) - Remove untracked IaC artifacts (`pulumi/unifi/`, `mise-tasks/unifi-preview`, `mise-tasks/unifi-up`) locally - Mark `add-unifi-pulumi-stack` plan as Abandoned with explanation - Create new `segment-home-network` plan for manual three-network segmentation (Main/IoT/Guest) via UX7 web UI - Rewrite UniFi reference card to remove all Pulumi/IaC references - Update plan and how-to indexes ## Test plan - [x] `docs-check-links` passes - [x] `docs-check-index` passes - [x] Pre-commit hooks pass - [ ] Review segmentation plan for completeness before executing manually 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/189	2026-02-14 09:47:04 -08:00
Erich Blume	eec1edf43d	Add how-to guide for connecting to PostgreSQL via psql (#188 ) ## Summary - Add new how-to guide (`connect-to-postgres.md`) with the `psql` command using `op read` for 1Password credentials - Add "Database" section to the how-to index linking to the new guide - Link the new guide from the PostgreSQL reference card's Related section ## Test plan - [x] Verified `psql` connection works from gilbert using the documented command - [ ] Review doc formatting and content 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/188	2026-02-14 07:18:06 -08:00
Erich Blume	49ec05041c	Update UniFi Pulumi plan: switch to ubiquiti-community provider (#187 ) ## Summary - Switch provider from filipowm/unifi (inactive maintainer, showstopper bug #94 wiping firewall rules) to ubiquiti-community/unifi (actively maintained, API key auth) - Add UX7 config backup prerequisite before adopting IaC - Fix safety guard: check default route interface instead of hostname (runs from gilbert, not indri) - Update 1Password paths to match actual item (`op://blumeops/unifi/credential`) - Fix ringtail references: not a Raspberry Pi, stays on WiFi (removed from wired topology) - Update doc steps for already-existing reference files ## Test plan - [x] Pre-commit hooks pass - [x] `docs-check-links` pass - [x] `docs-check-index` pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/187	2026-02-13 20:02:16 -08:00
Erich Blume	b3747f6c95	Tier 1 version bumps (#186 ) ## Summary Audit and upgrade of all deployed images, helm charts, and custom container Dockerfiles to latest stable versions. This PR covers Tier 1 (low-risk minor/patch bumps only). ### Upstream images \| Image \| Old \| New \| \|-------\|-----\|-----\| \| kube-state-metrics \| v2.13.0 \| v2.18.0 \| \| prometheus \| v3.2.1 \| v3.9.1 \| \| loki \| 3.3.2 \| 3.6.5 \| \| alloy \| v1.5.1 \| v1.13.1 \| \| tailscale (proxy + operator) \| v1.92.5 \| v1.94.1 \| \| navidrome \| :latest \| v0.60.3 (pinned) \| ### Helm charts \| Chart \| Old \| New \| \|-------\|-----\|-----\| \| CloudNativePG \| v0.27.0 \| v0.27.1 \| \| 1Password Connect \| 2.2.1 \| 2.3.0 \| ### Custom containers (Dockerfiles updated, images not yet tagged) \| Container \| Changes \| New tag \| \|-----------\|---------\|---------\| \| miniflux \| 2.2.16→2.2.17 (security), alpine 3.22 \| v1.1.0 \| \| kubectl \| v1.34.1→v1.34.4, alpine 3.22 \| v1.1.0 \| \| kiwix-serve \| alpine 3.22 \| v1.1.0 \| \| nettest \| alpine 3.22 \| v0.14.0 \| \| transmission \| alpine 3.22, pkg 4.0.6-r4 \| v1.1.0 \| All custom containers verified with local `dagger call build`. ### Deferred to Tier 2 (separate PRs) - Forgejo runner 6→12 (major version scheme change) - Docker DinD 27→29 - Grafana chart 8→11 (repo migration) - External Secrets 1→2 (breaking changes) - Python 3.12→3.13, Elixir 1.18→1.19, Node 22→24 - Transmission 4.0.6→4.1.0 (not in Alpine yet) ## Deployment After merge: 1. Tag custom containers: `mise run container-tag-and-release <name> <version>` for each 2. Wait for CI builds to complete 3. `argocd app sync apps` then sync individual apps, or let ArgoCD auto-detect Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/186	2026-02-13 17:16:37 -08:00
Erich Blume	81690dae0f	Review add-ansible-role doc (#185 ) ## Summary - Replace `op item get --fields` with `op read` for secrets (matches playbook and CLAUDE.md guidance) - Change `tags: [<role>]` to `tags: <role>` to match actual playbook style - Remove redundant `listen:` from handler example, add `changed_when: true` - Name handler after specific service (e.g. `Restart <service>`) to match real roles - Add `last-reviewed: 2026-02-13` frontmatter ## Also noted (not fixed here) Two other docs still use the old `op item get` pattern: - `docs/how-to/troubleshooting.md:72` (ArgoCD login command) - `docs/how-to/gandi-operations.md:35` (Gandi token export) These can be fixed in their own review cycles. Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/185	2026-02-13 16:54:42 -08:00
Erich Blume	5b91a1c315	Review why-gitops doc (#184 ) ## Summary - Fix misleading `[[tailscale\|Pulumi]]` wiki-link → `[[pulumi]]` - Simplify `[[ansible\|Ansible]]` and `[[argocd\|ArgoCD]]` to plain wiki-links per convention - Rename "Tailnet" layer to "Network" to reflect Pulumi's full scope (Tailscale ACLs + Gandi DNS) - Fix `apt install` → `brew install` (indri is macOS) - Add `[[pulumi]]` to Related section - Add `last-reviewed: 2026-02-13` frontmatter Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/184	2026-02-13 16:48:06 -08:00
Erich Blume	d5c00192d5	Configure DinD to use Zot as pull-through registry mirror (#183 ) ## Summary - Add `daemon.json` with `registry-mirrors` to the forgejo-runner ConfigMap, pointing DinD at `http://host.minikube.internal:5050` - Mount `daemon.json` into the DinD sidecar at `/etc/docker/daemon.json` via `subPath` - Docker Hub pulls during Dagger CI builds will now route through Zot's pull-through cache, reducing bandwidth and avoiding rate limits ## Deployment and Testing - [ ] `argocd app sync forgejo-runner` - [ ] Exec into DinD container: `docker info` should show the registry mirror - [ ] Trigger a workflow build and check Zot logs for cache hits Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/183	2026-02-13 12:36:03 -08:00
Erich Blume	e364bdd238	Upgrade Node.js from 20 to 22 LTS (#182 ) ## Summary - Upgrade Dagger docs build image from `node:20-slim` to `node:22-slim` - Upgrade forgejo-runner container from Node 20 to Node 22 - Fixes Quartz 4.5.2 `EBADENGINE` warning (requires Node >= 22) - Node 20 EOL is 2026-04-30 Both builds verified locally via Dagger. ## Deployment 1. Merge this PR 2. Tag and release forgejo-runner v3.2.0: `mise run container-tag-and-release forgejo-runner v3.2.0` 3. Update RUNNER_LABELS version in `argocd/manifests/forgejo-runner/deployment.yaml` from `v3.1.0` to `v3.2.0` 4. `argocd app sync forgejo-runner` The Dagger docs build change takes effect immediately on merge (no container release needed). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/182	2026-02-13 11:07:41 -08:00
Forgejo Actions	02b1397f1a	Update docs release to v1.8.2 - Built changelog from towncrier fragments [skip ci]	2026-02-13 10:36:04 -08:00
Erich Blume	0098ac37e0	Move non-secret runner env vars to deployment spec (#181 ) ## Summary - Move FORGEJO_URL, RUNNER_NAME, and RUNNER_LABELS from ExternalSecret template to deployment env vars - ExternalSecret now only contains the actual secret (RUNNER_TOKEN) - Image version changes in RUNNER_LABELS now trigger automatic pod rollouts ## Deployment 1. Merge this PR 2. `argocd app sync forgejo-runner` — the deployment spec change will auto-roll the pod No manual restart needed — that's the whole point :) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/181	2026-02-13 10:29:23 -08:00
Erich Blume	2fad8db639	Add yq to forgejo-runner and replace sed YAML edits (#180 ) ## Summary - Install yq in the forgejo-runner container image for structured YAML editing - Replace fragile `sed` regex patterns with `yq` in `build-blumeops.yaml` and `cv-deploy.yaml` workflows ## Deployment 1. Merge this PR 2. Tag and release forgejo-runner v3.1.0: `mise run container-tag-and-release forgejo-runner v3.1.0` 3. Update runner label in `argocd/manifests/forgejo-runner/external-secret.yaml` from `v3.0.2` to `v3.1.0` 4. Sync the forgejo-runner app: `argocd app sync forgejo-runner` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/180	2026-02-13 10:20:27 -08:00
Erich Blume	48ce5b4120	Recategorize homepage into Content and Misc groups (#179) ## Summary - Replace the three homepage groups (Apps, Observability, Infrastructure) with two cleaner groups - Content: Immich, Kiwix, Miniflux, DJ, Grafana - Misc: CV, TeslaMate, Transmission, Docs, Prometheus, PyPI ## Deployment and Testing - [ ] Sync affected ingresses via ArgoCD (all 11 services) - [ ] Verify homepage shows the two new groups correctly Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/179	2026-02-13 09:09:22 -08:00
Forgejo Actions	e21277ae83	Update docs release to v1.8.0 - Built changelog from towncrier fragments [skip ci]	2026-02-12 19:20:27 -08:00
Erich Blume	517080aeab	Add reference/tools/ category with Dagger, ArgoCD CLI, Ansible, and Pulumi cards (#178) ## Summary - Create `docs/reference/tools/` with four reference cards: Dagger (build engine), ArgoCD CLI (deployment workflows), Ansible (config management), and Pulumi (DNS/Tailscale IaC) - Move `ansible/roles.md` → `tools/ansible.md`, broadened with CLI patterns and dry-run usage - Update `reference.md` index: add "Tools" section, remove old "Ansible" section - Update `update-documentation.md` to reflect Dagger build process (workflow steps, manual build recipe, runner environment) - Update `adopt-dagger-ci.md` plan to note how-to articles were handled via reference card + existing how-to updates - Fix all broken `[[roles]]` wiki-links across 5 files → `[[ansible]]` ## Verification - `docs-check-links` ✓ — no broken wiki-links - `docs-check-index` ✓ — all docs referenced in category index - `docs-check-filenames` ✓ — no duplicate filenames - All pre-commit hooks pass Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/178	2026-02-12 19:18:46 -08:00
Erich Blume	9c789a1868	Fix cache hit rate on APM and Fly.io dashboards (#177) ## Summary - Remove `match_all = true` from `flyio_nginx_cache_requests_total` in Alloy so the metric only counts requests that go through the proxy cache (excludes health checks with empty `cache_status`) - Change dashboard queries from `rate(...[5m])` to `increase(...[$__range])` — aggregates over the full dashboard time window instead of a 5-minute sliding window, giving meaningful ratios for low-traffic static sites - Add null/NaN value mapping to show "No traffic" in neutral color instead of blank/red ## Root cause Health check requests from Fly.io hit the default nginx server block (no `proxy_cache`), producing entries with empty `upstream_cache_status`. With `match_all = true`, these were counted in the cache metric, diluting the Fly.io dashboard ratio. For APM dashboards, `rate()[5m]` on low-traffic sites with 24h cache validity almost always returns either all-HITs (100%) or no data (blank → red background). ## Deployment - Fly.io proxy redeploy needed for Alloy config change - ArgoCD sync for dashboard ConfigMap changes ## Test plan - [ ] Redeploy Fly.io proxy - [ ] Sync grafana-config in ArgoCD - [ ] Verify CV APM cache hit ratio shows a real percentage (not 100%) - [ ] Verify Docs APM shows "No traffic" in neutral color when idle, real ratio when visited - [ ] Verify Fly.io proxy dashboard cache ratio excludes health checks Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/177	2026-02-12 18:40:48 -08:00
Erich Blume	9717863f65	Update CV release to v1.0.3, add X-Clacks-Overhead header (#176) ## Summary - Update CV release URL from v1.0.2 to v1.0.3 - Add `X-Clacks-Overhead: GNU Terry Pratchett` header to both `docs.eblu.me` and `cv.eblu.me` server blocks in the Fly.io proxy nginx config ## Deployment and Testing - [ ] Sync CV app: `argocd app sync cv` - [ ] Verify CV is serving v1.0.3 content - [ ] Deploy fly proxy (workflow or `mise run fly-deploy`) - [ ] Verify header: `curl -sI https://docs.eblu.me | grep -i clacks` - [ ] Verify header: `curl -sI https://cv.eblu.me | grep -i clacks` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/176	2026-02-12 17:08:22 -08:00
Erich Blume	ed5c9c9b48	Update CV release to v1.0.2 (#175) ## Summary - Update `CV_RELEASE_URL` in cv deployment from v1.0.1 to v1.0.2 ## Deployment and Testing - [ ] `argocd app sync cv` after merge - [ ] Verify cv.eblu.me serves updated content Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/175	2026-02-12 16:18:55 -08:00
Forgejo Actions	70d8881959	Update docs release to v1.7.1 - Built changelog from towncrier fragments [skip ci]	2026-02-12 14:13:12 -08:00
Erich Blume	7dc03c0af1	Add CV to services-check, update homepage link (#174) ## Summary - Add CV to services-check (tailnet endpoint + public cv.eblu.me) - Update CV homepage annotation to point to cv.eblu.me instead of cv.ops.eblu.me ## Deployment and Testing - [ ] `argocd app sync cv` (homepage link change) - [ ] `mise run services-check` passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/174	2026-02-12 14:10:03 -08:00
Erich Blume	df372fccb6	Expose CV publicly at cv.eblu.me (#173) ## Summary - Add nginx server block for `cv.eblu.me` (static site, same pattern as docs) - Add DNS CNAME record in Pulumi (`cv.eblu.me` → `blumeops-proxy.fly.dev`) - Add `cv.eblu.me` cert to `fly-setup` mise task - Tag CV Tailscale ingress with `tag:flyio-target` for ACL access - Remove `/_error` test endpoint from docs proxy ## Deployment and Testing - [ ] `argocd app set cv --revision cv/public-cv-eblu-me && argocd app sync cv` - [ ] `fly certs add cv.eblu.me -a blumeops-proxy` - [ ] `mise run fly-deploy` - [ ] Verify proxy: `curl -I -H "Host: cv.eblu.me" https://blumeops-proxy.fly.dev/` - [ ] `mise run dns-preview` then `mise run dns-up` - [ ] Verify live: `curl -I https://cv.eblu.me` - [ ] Merge, then `argocd app set cv --revision main && argocd app sync cv` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/173	2026-02-12 14:05:00 -08:00
Erich Blume	a68542a602	Update CV release to v1.0.1 (#172) ## Summary - Update `CV_RELEASE_URL` from v0.1.0 to v1.0.1 in the CV deployment manifest ## Deployment and Testing - [ ] `argocd app sync cv` after merge - [ ] Verify cv.ops.eblu.me serves updated resume Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/172	2026-02-12 13:38:05 -08:00
Forgejo Actions	200be39492	Update docs release to v1.7.0 - Built changelog from towncrier fragments [skip ci]	2026-02-12 11:46:38 -08:00

183 commits