blumeops

Author	SHA1	Message	Date
Forgejo Actions	fa223f8e3b	Update docs release to v1.11.5 - Built changelog from towncrier fragments [skip ci]	2026-02-26 07:56:02 -08:00
Erich Blume	be3cdad1cb	Add HA for CV and Docs: zero-downtime deploys (#273 ) ## Summary - Set `replicas: 2` with `maxUnavailable: 0` / `maxSurge: 1` on CV and Docs deployments so rolling updates never drop below 2 ready pods - Add PodDisruptionBudgets (`minAvailable: 1`) to protect against node drains and cluster maintenance - Add Fly.io cache purge step to `cv-deploy.yaml` workflow (docs already had this) so CV deploys don't serve stale cached content ## Deployment and Testing - [ ] `argocd app diff cv` / `argocd app diff docs` from branch - [ ] Deploy from branch: `argocd app set cv --revision feature/ha-cv-docs-zero-downtime && argocd app sync cv` - [ ] Verify 2 pods running: `kubectl get pods -n cv --context=minikube-indri` - [ ] Test rolling restart: `kubectl rollout restart deployment/cv -n cv --context=minikube-indri` - [ ] During rollout, confirm continuous availability via `curl -I https://cv.eblu.me` - [ ] After merge: reset ArgoCD to main, re-sync both apps Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/273	2026-02-26 07:53:21 -08:00
Erich Blume	fb83c5c577	Add explicit ExternalSecret defaults for SSA sync parity The external-secrets webhook injects conversionStrategy, decodingStrategy, and metadataPolicy defaults on admission. Declaring them explicitly prevents ArgoCD SSA from flagging the resource as OutOfSync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 07:02:54 -08:00
Erich Blume	db561c6b0e	Upgrade ArgoCD v3.2.6 → v3.3.2 with Server-Side Apply (#272 ) ## Summary - Upgrade ArgoCD from v3.2.6 to v3.3.2 - Enable `ServerSideApply=true` sync option (required by v3.3 — ApplicationSet CRD exceeds client-side apply annotation limit) - Update service-versions.yaml with review for argocd and 1password-connect ## Breaking changes reviewed - Server-Side Apply required: Added to syncOptions ✅ - Source Hydrator git notes: Not used — N/A - Application path cleaning removed: Not used — N/A - Settings API field restriction: Authenticated access only — N/A ## Deployment and Testing - [ ] Sync the `apps` app first (picks up SSA syncOption change) - [ ] `argocd app set argocd --revision feature/argocd-v3.3.2` - [ ] `argocd app sync argocd` - [ ] Verify all argocd pods running with v3.3.2 images - [ ] Verify other apps still sync correctly - [ ] After merge: `argocd app set argocd --revision main && argocd app sync argocd` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/272	2026-02-26 06:51:50 -08:00
Erich Blume	95c8424e62	Add Transmission metrics exporter and Grafana dashboard (#271 ) ## Summary - Add `metalmatze/transmission-exporter` as a sidecar container in the torrent deployment, exposing Prometheus metrics on port 19091 - Add metrics port to the torrent service for Prometheus scraping - Add Prometheus scrape job targeting the transmission exporter - Create Grafana dashboard with: - Overview stats (download/upload speed, active/total torrents) - Transfer speed timeseries (download + upload over time) - Transfer volume stats (total downloaded/uploaded in selected range) - Per-torrent download and upload rate timeseries - Per-torrent details table (ratio, uploaded, percent done) ## Deployment and Testing - [ ] Sync ArgoCD `torrent` app from branch — verify exporter sidecar starts - [ ] Verify exporter metrics: `kubectl exec` into pod, `curl localhost:19091/metrics` - [ ] Verify Prometheus scrapes it: check targets at prometheus.ops.eblu.me - [ ] Open Grafana, find "Transmission" dashboard, verify panels populate - [ ] Sync ArgoCD `prometheus` app from branch - [ ] Sync ArgoCD `grafana-config` app from branch Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/271	2026-02-25 22:23:33 -08:00
Erich Blume	03d71544ec	Add multi-cluster observability with ringtail metrics and dashboards (#270 ) ## Summary - Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs - Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests) - Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki - Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with: - Kubernetes Clusters dashboard — multi-cluster with `cluster` and `namespace` template variables - Ringtail (k3s) dashboard — dedicated ringtail view with GPU usage panels ## Deployment and Testing 1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`) 2. Sync `prometheus` → verify `cluster` label on scraped metrics 3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs 4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs 5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail 6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}` 7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values 8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods ## Notes - Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later - DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve - The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270	2026-02-25 22:01:00 -08:00
Erich Blume	2243f2e0a1	Filter driveway zone to person/dog/cat only in Frigate Parked car was being re-detected every few minutes at night due to IR illumination noise triggering motion detection. Restrict the driveway zone to [person, dog, cat] so cars and birds no longer create events there. Cars still alert via the driveway_entrance zone. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 20:45:07 -08:00
Erich Blume	de54b4e33d	Port CloudNative-PG off Helm to direct release manifest (#268 ) ## Summary - Point ArgoCD app directly at forge-mirrored upstream repo (`mirrors/cloudnative-pg`) instead of the Helm charts repo - Use `directory.include` to select the specific release manifest (`cnpg-1.27.1.yaml`) from the `releases/` directory - No vendored files, no Helm — upgrades are a two-line change (`targetRevision` + `directory.include`) - Delete unused `values.yaml` (was empty, all Helm defaults) ## Deployment and Testing - [ ] Register mirror repo in ArgoCD: `argocd repo add ssh://forgejo@forge.ops.eblu.me:2222/mirrors/cloudnative-pg.git --ssh-private-key-path <key>` - [ ] `argocd app set cloudnative-pg --revision feature/cnpg-direct-source && argocd app sync cloudnative-pg` - [ ] Verify operator pod running: `kubectl get pods -n cnpg-system --context=minikube-indri` - [ ] Verify CRDs exist: `kubectl get crd --context=minikube-indri \| grep cnpg` - [ ] Verify existing clusters healthy: `kubectl get clusters -A --context=minikube-indri` - [ ] After merge: `argocd app set cloudnative-pg --revision main && argocd app sync cloudnative-pg` ## Notes - The forge mirror was created via `mise run mirror-create` from `https://github.com/cloudnative-pg/cloudnative-pg.git` - ArgoCD may need the mirror repo added to its known repositories if the credential template doesn't already match `mirrors/*` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/268	2026-02-25 17:37:53 -08:00
Erich Blume	285ad4141f	Fix Frigate detection events rate metric name in Grafana dashboard The panel queried frigate_camera_events but the actual metric exposed by Frigate is frigate_camera_events_total with a "camera" label (not "camera_name"). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 16:51:57 -08:00
Forgejo Actions	4736c7e9bd	Update docs release to v1.11.4 - Built changelog from towncrier fragments [skip ci]	2026-02-25 07:04:23 -08:00
Erich Blume	5f9bc20345	Fix mirror org refs in ArgoCD apps and widen credential template (#266 ) ## Summary - Widen `repo-creds-forge` URL prefix from `/eblume/` to host-wide `/` so it matches repos in all forge orgs (fixes `mirrors/` repos not getting SSH credentials) - Update 8 ArgoCD app definitions from `eblume/<mirror>` → `mirrors/<mirror>` (immich-charts, cloudnative-pg-charts, external-secrets, connect-helm-charts) - Fix stale alloy clone comment in Ansible defaults - Bump immich v2.5.2 → v2.5.6 (bug-fix patches only) - Update ArgoCD README bootstrap command and credential docs ## Context Mirrors were migrated from `forge.ops.eblu.me/eblume/` to `forge.ops.eblu.me/mirrors/` in commit ``cd57814``. Container Dockerfiles and image tags were updated, but ArgoCD app definitions and the repo credential template were missed, causing `ComparisonError` on apps that source Helm charts from mirrored repos. ## Deployment 1. Sync the ArgoCD `argocd` app first (picks up the widened credential template) 2. Sync the `apps` app (picks up new repo URLs for all 8 apps) 3. Verify immich resolves its ComparisonError: `argocd app get immich` 4. Sync immich to deploy v2.5.6: `argocd app sync immich` 5. Spot-check: `argocd app get external-secrets`, `argocd app get cloudnative-pg`, `argocd app get 1password-connect` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/266	2026-02-25 06:55:53 -08:00
Erich Blume	4f8f2985c1	Update prometheus and teslamate image tags after mirror migration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 21:18:15 -08:00
Erich Blume	e0f9ebebdf	Update homepage, navidrome, ntfy, miniflux image tags after mirror migration Prometheus and teslamate builds still in progress — will update in a follow-up commit once their `33b7f0f` tags land. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 21:06:08 -08:00
Erich Blume	61ee6a4d38	Fix Grafana ConfigMap labels lost in configMapGenerator migration The hand-written configmap.yaml had app.kubernetes.io/name and app.kubernetes.io/instance labels; configMapGenerator dropped them. Add options.labels to both generator entries to restore parity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 14:46:20 -08:00
Erich Blume	9b44a8ec51	Add kustomize images: and configMapGenerator: across services (#264 ) ## Summary - Move hardcoded image tags to kustomization.yaml `images:` transformer across 22 services — image names in manifests become version-agnostic templates, with tags centralized in one place per service - Replace hand-written ConfigMap manifests with `configMapGenerator:` in 12 services — config data extracted to standalone files, generated ConfigMaps include content hashes that trigger automatic pod rollouts on changes - Create new `kustomization.yaml` for forgejo-runner and nvidia-device-plugin (switches ArgoCD from directory mode to kustomize mode, rendered output identical) ### Services modified Images only (8): cv, devpi, docs, kube-state-metrics, miniflux, navidrome, teslamate, torrent Images + configMapGenerator (10): alloy-k8s, forgejo-runner, frigate, grafana, homepage, kiwix, loki, mosquitto, ntfy, prometheus Images only, no configMapGenerator (4): authentik (skip blueprints — special YAML tags), tailscale-operator-base (Deployment only, CRD image fields left as-is) Skipped entirely (6): argocd (remote upstream), databases (no image fields), external-secrets, grafana-config (cross-kustomization dashboards), immich (Helm-managed), 1password-connect/cloudnative-pg (no kustomization.yaml) ### What changes at deploy time - images: — no functional diff, `kustomize build` produces identical output with tags - configMapGenerator: — ConfigMap names gain hash suffixes (e.g., `prometheus-config` → `prometheus-config-6f42fhctcb`) and all Deployment/StatefulSet/DaemonSet references are updated automatically. Pods will restart once per service on first sync due to the name change ## Test plan - [x] `kubectl kustomize` builds all 30 service directories successfully - [x] Image tags verified in rendered output for all modified services - [x] ConfigMap hash suffixes verified in rendered output - [x] ConfigMap references in Deployments/StatefulSets confirmed to use hashed names - [x] All pre-commit hooks pass (yamllint, shellcheck, prettier, etc.) - [ ] `argocd app diff` each service to confirm only expected ConfigMap name changes - [ ] Deploy from branch starting with a low-risk service (e.g., mosquitto) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/264	2026-02-24 14:25:19 -08:00
Erich Blume	86aeb60ec9	Fix TeslaMate dashboards: add database to PostgreSQL jsonData Grafana 12.x's grafana-postgresql-datasource plugin requires the database name in jsonData, not just the top-level database field. Without it, the frontend blocks all queries with "no default database configured", causing all TeslaMate panels to show "No Data." Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 13:49:07 -08:00
Erich Blume	495c3e8496	Fix Grafana OAuth role mapping from Authentik groups The INI parser was stripping outer single quotes from role_attribute_path = 'Admin', causing Grafana to evaluate 'Admin' as a JMESPath field identifier instead of a string literal. This resulted in all OAuth users getting the default Viewer role. Replaced with a proper group-based expression that checks for the 'admins' Authentik group and maps to Admin/Viewer accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 13:41:08 -08:00
Erich Blume	4acd2e58d4	Update prometheus and grafana to main-SHA container tags Prometheus: v3.9.1-74029e1 [branch] -> v3.9.1-2ba5d8a [main] Grafana: v12.3.3-09ac36b [branch] -> v12.3.3-d05d2fb [main] These images were built during PR development and referenced branch commits that won't survive branch cleanup. The [main] tags are identical rebuilds from the squash-merge commit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 09:58:09 -08:00
Erich Blume	2ba5d8a8aa	Port Prometheus to local container build (#262 ) ## Summary - Add three-stage Dockerfile for Prometheus v3.9.1 (Node UI → Go binaries → Alpine runtime) - Produces `prometheus` and `promtool` binaries with embedded web UI assets - Follows navidrome/ntfy pattern for supply chain control via Zot registry ## Deployment and Testing - [ ] `dagger call build --src=. --container-name=prometheus` succeeds - [ ] Container reports correct version via `prometheus --version` - [ ] `promtool --version` works - [ ] Update statefulset image reference after successful build - [ ] Deploy from branch: `argocd app set prometheus --revision <branch> && argocd app sync prometheus` - [ ] Health probes pass (`/-/healthy`, `/-/ready`) - [ ] Web UI loads, scrape targets work, remote write functions Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/262	2026-02-24 09:15:57 -08:00
Forgejo Actions	2f78d180e8	Update docs release to v1.11.3 - Built changelog from towncrier fragments [skip ci]	2026-02-23 21:04:33 -08:00
Erich Blume	d05d2fbaff	C2: Upgrade Grafana to 12.x with Nix container and Kustomize (#260 ) ## Summary Mikado chain to upgrade Grafana from 11.4.0 (Helm chart) to 12.x with: - Home-built Nix container image (`forge.ops.eblu.me/eblume/grafana`) - Kustomize manifests replacing the Helm chart - Single-source ArgoCD app ## Chain Goal: `upgrade-grafana` Leaves: `build-grafana-container`, `kustomize-grafana-deployment` Track with: `mise run docs-mikado upgrade-grafana` ## Test plan - [ ] Container builds successfully via Nix - [ ] Container pushed to registry - [ ] Kustomize manifests produce equivalent resources to current Helm - [ ] Pod runs, UI loads, OIDC works, datasources healthy - [ ] `mise run services-check` passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/260	2026-02-23 18:07:18 -08:00
Erich Blume	9b419abf24	Update RUNNER_LABELS to use runner-job-image:v0.19.11-4c5e0f0 Now that the image is built under the new name, point the forgejo runner at it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 17:47:14 -08:00
Erich Blume	75fde54355	Fix Grafana TeslaMate dashboard folder provisioning (#253 ) ## Summary - `foldersFromFilesStructure` was `false` in Grafana's sidecar provider config, causing Grafana to ignore the subdirectory structure the sidecar creates from `grafana_folder` annotations - All 18 TeslaMate dashboards were appearing in the root "Dashboards" folder despite having `grafana_folder: "TeslaMate"` annotations on their ConfigMaps - Flipping to `true` makes Grafana replicate the sidecar's directory structure as UI folders ## Deployment and Testing - [ ] Sync `grafana` app: `argocd app sync grafana` - [ ] Verify TeslaMate dashboards appear under a "TeslaMate" folder in Grafana's dashboard list - [ ] Verify other dashboards remain in the root "Dashboards" folder Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/253	2026-02-22 18:38:51 -08:00
Erich Blume	6871bf32a8	Remove unused GPU panel in Frigate dashboard	2026-02-22 18:26:47 -08:00
Erich Blume	2c6c6a244a	Fix Frigate Prometheus metrics & rebuild Grafana dashboard (#252 ) ## Summary - Prometheus scrape target: Changed from `frigate.frigate.svc.cluster.local:5000` (broken after ringtail migration) to `nvr.ops.eblu.me` via HTTPS through Caddy on indri - Grafana dashboard: Rebuilt for Frigate 0.17 metrics — 12 panels total: - Row 1 (stats): Uptime, Inference Speed, Camera FPS, Detection FPS, GPU Usage, GPU Temp - Row 2 (timeseries): CPU Usage, Memory Usage - Row 3 (timeseries): Camera FPS + Skipped FPS, GPU Usage + Memory over time - Row 4 (timeseries): Storage Usage, Detection Events (rate by camera/label) ## Deployment and Testing 1. Sync prometheus app on branch: ``` argocd app set prometheus --revision fix/frigate-metrics-dashboard && argocd app sync prometheus ``` 2. Check `prometheus.ops.eblu.me/targets` — frigate job should show UP 3. Sync grafana-config: ``` argocd app sync grafana-config ``` 4. Check `grafana.ops.eblu.me` — Frigate NVR dashboard should show live data 5. After merge: reset both apps to `--revision main` and sync Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/252	2026-02-22 18:14:17 -08:00
Forgejo Actions	dda7d719b3	Update docs release to v1.11.2 - Built changelog from towncrier fragments [skip ci]	2026-02-22 17:52:05 -08:00
Erich Blume	e655f4556e	Upgrade k8s forgejo-runner from v6.3.1 to v12.7.0 (#251 ) ## Summary Completes the `upgrade-k8s-runner` mikado chain. Both prerequisites (workflow validation in Dagger, config review against v12 defaults) were resolved in #250. - Bump runner image `code.forgejo.org/forgejo/runner:6.3.1` → `12.7.0` - Update `service-versions.yaml` to track new version - Mark goal card complete (remove `status: active`) ## Deployment and Testing After merge: 1. `argocd app sync forgejo-runner` 2. Verify runner registers in Forgejo admin → runners 3. Trigger a test workflow (e.g. `branch-cleanup.yaml` manual dispatch) Rollback: revert image tag to `6.3.1`, push, sync. Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/251	2026-02-22 17:43:39 -08:00
Erich Blume	0f6a1898f0	Prepare forgejo-runner v12 upgrade (leaf nodes) (#250 ) ## Summary - Review runner config against v12.7.0 defaults — added `shutdown_timeout: 3h`, no breaking changes found - Add `validate_workflows` Dagger function using `forgejo-runner validate --directory .` inside upstream container - All 6 workflows pass v12.7.0 schema validation - Wire `mise run validate-workflows` task and pre-commit hook on `.forgejo/workflows/` changes - Mark both leaf Mikado cards (`review-runner-config-v12`, `validate-workflows-against-v12`) complete ## Mikado State After merge, `upgrade-k8s-runner` goal card has no unmet dependencies — ready to execute the actual image bump in a follow-up PR. ## Test Plan - [x] `dagger call validate-workflows --src=.` passes (all 6 workflows OK) - [x] Pre-commit hooks pass - [ ] Reviewer: confirm `shutdown_timeout: 3h` addition to ConfigMap looks reasonable 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/250	2026-02-22 17:38:32 -08:00
Erich Blume	d51c180fe6	Switch Frigate detection model from YOLO-NAS-S to YOLOv9-c (#246 ) ## Summary - Replace abandoned YOLO-NAS-S (320x320, `yolonas`) with YOLOv9-c (640x640, `yolo-generic`) - YOLOv9-c benefits from CUDA Graphs in Frigate 0.17 on the RTX 4080 - Add `export_yolov9` Dagger pipeline and `frigate-export-model` mise task for reproducible model exports - Model already deployed to `sifaka:/volume1/frigate/models/yolov9-c-640.onnx` ## Config changes - `model_type: yolonas` → `yolo-generic` - `input_dtype: int` → `float` - `width/height: 320` → `640` - `path:` → `yolov9-c-640.onnx` ## Deployment and Testing - [ ] Merge and sync Frigate ArgoCD app: `argocd app sync frigate` - [ ] Verify Frigate starts and detects objects at https://nvr.ops.eblu.me - [ ] Confirm GPU inference via Frigate system metrics Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/246	2026-02-22 15:14:45 -08:00
Erich Blume	2c081eed28	Add Forgejo repository health metrics and Grafana dashboard (#245 ) ## Summary - New `forgejo_metrics` Ansible role that queries the Forgejo REST API every 60s and writes Prometheus textfile metrics (open PRs, issues, languages, releases, commits, Actions runs/duration/success) - Grafana dashboard "Forgejo Repository Health" with 12 panels across 4 rows: overview stats, CI/CD health, repository info, and staleness tracking - Deletes superseded `forgejo-actions-dashboard` plan doc (this implementation covers a broader scope) ## Deployment and Testing - [ ] `mise run provision-indri -- --tags forgejo_metrics` to deploy the collector - [ ] `ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/forgejo.prom'` to verify metrics - [ ] `argocd app sync grafana-config` to deploy the dashboard - [ ] Check Grafana dashboard "Forgejo Repository Health" loads with data - [ ] `mise run services-check` passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/245	2026-02-22 11:16:03 -08:00
Forgejo Actions	c21cf54847	Update docs release to v1.11.1 - Built changelog from towncrier fragments [skip ci]	2026-02-22 10:21:19 -08:00
Erich Blume	c897fc8e1f	Use Zot registry icon on homepage dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 09:19:46 -08:00
Forgejo Actions	627caeb61f	Update docs release to v1.11.0 - Built changelog from towncrier fragments [skip ci]	2026-02-22 09:16:00 -08:00
Erich Blume	529ba10939	Fix frigate-notify: webapi polling, dedup, hi-res snapshots (#242 ) ## Summary - Switch from MQTT to webapi polling (v0.5.4 requires only one method) - Poll every 15s for responsive alerts - `notify_once: true` — one notification per event instead of repeats as object changes zones - `nosnap: drop` — skip events without snapshots (was causing all events to be dropped on v0.3.5) - `snap_hires: true` — use recording stream for higher quality snapshot images ## Deployment and Testing - [ ] Sync: `argocd app set frigate --revision fix/frigate-notify-config && argocd app sync frigate` - [ ] Verify pod starts: `kubectl --context=k3s-ringtail -n frigate get pods -l app=frigate-notify` - [ ] Check logs for successful startup and event processing (no "No snapshot" drops) - [ ] Wait for a motion event and confirm single ntfy notification with hi-res snapshot - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/242	2026-02-22 09:05:45 -08:00
Erich Blume	7dcab826fa	Upgrade frigate-notify from v0.3.5 to v0.5.4 (#241 ) ## Summary - Service review: upgrade frigate-notify from v0.3.5 to v0.5.4 - No breaking changes for current MQTT + ntfy config - Notable additions: high-res snapshots, MQTT topic parsing fixes, env var parsing fixes ## Deployment and Testing - [ ] Sync frigate app on ringtail: `argocd app set frigate --revision review/frigate-notify-v0.5.4 && argocd app sync frigate` - [ ] Verify pod starts cleanly: `kubectl --context=k3s-ringtail -n frigate get pods` - [ ] Trigger a test alert (motion event) and confirm ntfy notification arrives - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/241	2026-02-22 08:42:47 -08:00
Erich Blume	07fb48626d	Add Authentik SSO integration for Jellyfin (#239 ) ## Summary - Add Authentik OIDC provider + application for Jellyfin via blueprint (all authenticated users allowed, no policy binding) - Wire `jellyfin-client-secret` through ExternalSecret and Authentik worker deployment - Install [jellyfin-plugin-sso](https://github.com/9p4/jellyfin-plugin-sso) v4.0.0.3 via Ansible, with OIDC config template - Authentik `admins` group maps to Jellyfin administrator role - Local login left enabled; SSO is additive ## Deployment and Testing - [ ] Sync ArgoCD `authentik` app on branch — verify provider + application appear in Authentik admin - [ ] `mise run provision-indri -- --tags jellyfin --check --diff` (dry run) - [ ] `mise run provision-indri -- --tags jellyfin` (deploy plugin + config) - [ ] Test SSO flow: `https://jellyfin.ops.eblu.me/sso/OID/start/authentik` - [ ] Verify `eblume` account auto-links via `preferred_username` match - [ ] Verify admins group → Jellyfin admin - [ ] Reset ArgoCD app revision to main after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/239	2026-02-21 20:05:44 -08:00
Erich Blume	e1c2892878	Fix container tags deleted during old-tag cleanup Five container manifests were removed when deleting old-style tags (shared digests). Rebuild on `a72a0d8` and update references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-21 16:26:29 -08:00
Erich Blume	a72a0d8e8e	Update all container images to new upstream-version tagging scheme (#238 ) ## Summary - Updates all 15 container image references across 14 ArgoCD manifest files - Migrates from old internal `vX.Y.Z` tags to new `v<upstream-version>-<sha>` format - Covers: authentik, cv, devpi, forgejo-runner, homepage, kiwix-serve, kubectl, miniflux, navidrome, ntfy, quartz, teslamate, transmission ## Deployment and Testing - [ ] Sync all ArgoCD apps on branch revision - [ ] Verify all services come up healthy - [ ] Merge and re-sync on main - [ ] Clean up old-style tags from zot registry 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/238	2026-02-21 15:58:11 -08:00
Erich Blume	ff63679efb	Enable zot registry auth + wire CI credentials (#237 ) ## Summary - Enable OIDC + API key authentication on zot registry with three-tier accessControl - `anonymousPolicy: ["read"]` — anyone can pull - `artifact-workloads` group: `["read", "create"]` — CI push, no overwrite/delete - `admins` group: `["read", "create", "update", "delete"]` — break-glass - Wire both CI push paths (Dagger and Nix/skopeo) with `ZOT_CI_API_KEY` credentials - Add `artifact-workloads` PolicyBinding in Authentik blueprint for zot app access - Add `ZOT_CI_API_KEY` to Forgejo Actions secrets via existing ansible role Completes the `wire-ci-registry-auth` and `harden-zot-registry` Mikado cards. ## Manual Deployment Steps (after merge) 1. Deploy Authentik blueprint: `argocd app sync authentik` 2. In Authentik admin UI: set a password for the `zot-ci` service account 3. Deploy zot config: `mise run provision-indri -- --tags zot` 4. Log in to `https://registry.ops.eblu.me` as `zot-ci` via OIDC → generate API key 5. Store API key in 1Password as `zot-ci-apikey` in blumeops vault 6. Sync Forgejo secrets: `mise run provision-indri -- --tags forgejo_actions_secrets` 7. Trigger a test container build to verify CI push 8. Verify anonymous pull: `curl -sf https://registry.ops.eblu.me/v2/_catalog` ## Uncertainties - Zot `accessControl` group matching with OIDC: Groups from Authentik's `profile` scope claim should map to zot policy groups, but the exact claim-to-group matching needs runtime verification - `http.auth.apikey: true`: This config key is documented but needs verification against the specific zot version built from source on indri - API key permissions: Need to confirm zot API keys inherit the generating user's group for accessControl evaluation ## Test Plan - [ ] `mise run provision-indri -- --check --diff --tags zot` shows expected config changes - [ ] Anonymous pull works after deploy - [ ] Unauthenticated push fails (401) - [ ] OIDC browser login redirects to Authentik and back - [ ] API key push works after key generation - [ ] CI push succeeds with both Dagger and skopeo paths - [ ] `mise run services-check` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/237	2026-02-21 12:20:29 -08:00
Erich Blume	21b6533aea	Register Zot as OIDC client in Authentik (#236 ) ## Summary - Add Authentik blueprint (`zot.yaml`) with OAuth2 provider, application, `artifact-workloads` group, and `zot-ci` service account - Wire `zot-client-secret` through ExternalSecret → worker Deployment env var → blueprint `!Env` - Add Ansible pre_task to fetch OIDC secret from 1Password (item ID `oor7os5kapczgpbwv7obkca4y4`) - Add `oidc-credentials.json.j2` template and deploy task in zot role (with `when` guard) ## Manual Steps Required Before Deploy 1. Generate client secret: `openssl rand -hex 32` 2. Store in 1Password: add field `zot-client-secret` to "Authentik (blumeops)" item in vault `blumeops` ## What This Does NOT Do - Does NOT modify `config.json.j2` (that's the root goal `harden-zot-registry`) - Does NOT wire CI auth (that's `wire-ci-registry-auth`) - Does NOT set service account password or API keys (manual post-deploy) ## Verification After ArgoCD sync: - [ ] Authentik admin UI shows "Zot Registry" application - [ ] OIDC discovery at `https://authentik.ops.eblu.me/application/o/zot/.well-known/openid-configuration` returns valid JSON - [ ] Blueprint status is `successful` - [ ] `artifact-workloads` group exists with `zot-ci` service account 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/236	2026-02-21 08:45:06 -08:00
Erich Blume	cd50c1454a	Integrate Forgejo with Authentik OIDC (#228 ) ## Summary - Refactor Authentik blueprints: extract shared `admins` group into `common.yaml`, add `groups` scope mapping to all providers for group-based admin propagation - Add Forgejo OAuth2 provider and application blueprint (`forgejo.yaml`) - Add `forgejo-client-secret` to ExternalSecret and worker deployment env - Configure Forgejo `[oauth2_client]` with `ACCOUNT_LINKING=login` to safely link existing accounts - Update documentation (forgejo.md, authentik.md, federated-login.md) ## Deployment and Testing After merge, deployment requires these steps in order: 1. Authentik (ArgoCD): - `argocd app set authentik --revision feature/forgejo-authentik-oidc && argocd app sync authentik` - Verify: Forgejo app/provider visible in Authentik admin UI - Verify: Grafana SSO still works (blueprint refactor) 2. Forgejo app.ini (Ansible): - `mise run provision-indri -- --tags forgejo --check --diff` (dry run) - `mise run provision-indri -- --tags forgejo` (apply, restarts Forgejo) 3. Create Forgejo auth source (CLI on indri): ``` ssh indri 'sudo -u forgejo /opt/homebrew/bin/forgejo admin auth add-oauth \ --name authentik \ --provider openidConnect \ --key forgejo \ --secret "$(op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/Authentik (blumeops)/forgejo-client-secret")" \ --auto-discover-url https://authentik.ops.eblu.me/application/o/forgejo/.well-known/openid-configuration \ --scopes "openid email profile groups" \ --group-claim-name groups \ --admin-group admins' ``` 4. Link eblume account: Sign in with Authentik on Forgejo, confirm link with local password 5. Verify: `tea repo list`, Forgejo Actions, local password break-glass After merge: `argocd app set authentik --revision main && argocd app sync authentik` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/228	2026-02-20 17:39:50 -08:00
Erich Blume	e0c6b7df99	Add Authentik to homepage dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 13:03:48 -08:00
Erich Blume	71cb256527	Deploy Authentik identity provider (C2 Mikado) (#227 ) ## Summary C2 Mikado chain for deploying Authentik as the SSO identity provider, replacing Dex. This PR will evolve over multiple sessions. Each iteration adds documentation (prerequisite cards) and eventually code as leaf nodes are resolved. ## Current Mikado State - Goal: `deploy-authentik` (active) - Leaf prerequisites: - `build-authentik-container` — Build Nix container image - `provision-authentik-database` — Create PostgreSQL database on CNPG cluster - `create-authentik-secrets` — Create 1Password item with credentials ## Process refinements - Updated agent-change-process with lessons from first attempt: reset code before committing cards, open PRs early ## Test plan - [ ] `mise run docs-mikado` shows correct dependency chain - [ ] Leaf nodes can be worked independently - [ ] Container builds on ringtail - [ ] Authentik starts and reaches healthy state - [ ] Forgejo OAuth2 connector works Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/227	2026-02-20 12:55:59 -08:00
Forgejo Actions	18f1ac61fc	Update docs release to v1.10.0 - Built changelog from towncrier fragments [skip ci]	2026-02-19 20:45:43 -08:00
Erich Blume	0cdc143227	Deploy Dex OIDC identity provider with Grafana SSO (#222 ) ## Summary - Deploys Dex OIDC identity provider on ringtail k3s cluster as central authentication service - Integrates Grafana as first SSO client via `auth.generic_oauth` - Uses Kubernetes CRD storage backend (no PVC needed) - All secrets (bcrypt hash, client secrets) injected via ExternalSecrets from 1Password item "Dex (blumeops)" - NixOS-built container image via `containers/dex/default.nix` ## Pre-requisites (manual, before deployment) 1. Create 1Password item "Dex (blumeops)" in `blumeops` vault with fields: - `password`: strong generated password for Dex login - `static-password-hash`: bcrypt hash of above (`htpasswd -BnC 10 eblume`, copy hash after `eblume:`) - `grafana-client-secret`: random 32-char hex (`openssl rand -hex 16`) 2. Build container: `mise run container-tag-and-release dex v1.0.0` ## Deployment sequence 1. Build container: `mise run container-tag-and-release dex v1.0.0` 2. Deploy Caddy: `mise run provision-indri -- --tags caddy` 3. Sync ArgoCD: `argocd app sync apps` → `argocd app sync dex` 4. Verify Dex: `curl https://dex.ops.eblu.me/.well-known/openid-configuration` 5. Sync Grafana: `argocd app sync grafana-config` → `argocd app sync grafana` 6. Test SSO: Visit `https://grafana.ops.eblu.me/login`, click "Sign in with Dex" ## Verification - [ ] Container image exists: `mise run container-list` shows `dex:v1.0.0-nix` - [ ] `curl https://dex.ops.eblu.me/.well-known/openid-configuration` returns valid OIDC discovery - [ ] `curl https://dex.ops.eblu.me/healthz` returns healthy - [ ] Grafana login shows "Sign in with Dex" button alongside local login - [ ] OIDC flow: click Dex → enter credentials → redirect back → logged in as Admin - [ ] Break-glass: local admin login still works - [ ] `mise run services-check` passes ## Files changed \| File \| Action \| Purpose \| \|------\|--------\|---------\| \| `containers/dex/default.nix` \| Create \| NixOS container build \| \| `argocd/apps/dex.yaml` \| Create \| ArgoCD app targeting ringtail \| \| `argocd/manifests/dex/*` (8 files) \| Create \| K8s manifests (RBAC, ExternalSecret, Deployment, Service, Ingress) \| \| `argocd/manifests/grafana-config/external-secret-dex-oauth.yaml` \| Create \| Grafana OIDC client secret \| \| `argocd/manifests/grafana-config/kustomization.yaml` \| Modify \| Add new ExternalSecret resource \| \| `argocd/manifests/grafana/values.yaml` \| Modify \| Add `auth.generic_oauth` config + envFromSecrets \| \| `ansible/roles/caddy/defaults/main.yml` \| Modify \| Add `dex.ops.eblu.me` reverse proxy entry \| \| `docs/changelog.d/feature-dex-oidc.feature.md` \| Create \| Changelog fragment \| Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/222	2026-02-19 20:24:24 -08:00
Erich Blume	b876e39981	Replace Homepage Helm chart with kustomize manifests and custom Dockerfile (#221 ) ## Summary - Replace third-party Helm chart (jameswynn/homepage v2.1.0, pinned at app v1.2.0) with plain kustomize manifests and a custom Dockerfile building from forge mirror at v1.10.1 - Adds Dockerfile (`containers/homepage/`) with multi-stage build (node:22-slim builder, node:22-alpine runtime) - Creates kustomize manifests: Deployment, Service, ConfigMap (6 config files), ServiceAccount, ClusterRole, ClusterRoleBinding - Keeps existing ingress-tailscale.yaml and all 6 ExternalSecret resources unchanged - Updates ArgoCD app definition from multi-source Helm to single directory source ## Prerequisite - Homepage source mirrored at forge.ops.eblu.me/eblume/homepage.git ✅ - Container must be built and pushed before syncing: `mise run container-release homepage v1.10.1` ## Deployment and Testing - [ ] Build and push container image: `mise run container-release homepage v1.10.1` - [ ] Branch-test via ArgoCD: `argocd app set homepage --revision feature/homepage-kustomize && argocd app sync homepage` - [ ] Verify dashboard loads at go.ops.eblu.me / go.tail8d86e.ts.net - [ ] Verify k8s autodiscovery works (services appear on dashboard) - [ ] Verify widgets load (weather, Forgejo, Jellyfin, etc.) - [ ] After merge: `argocd app set homepage --revision main && argocd app sync homepage` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/221	2026-02-19 18:29:19 -08:00
Erich Blume	cabd0bc9cf	Update Frigate zone masks and expand alert notifications (#219 ) ## Summary - Synced driveway_entrance zone coordinates from live Frigate config (adjusted mask boundaries) - Added `inertia: 3` and `loitering_time: 0` to driveway_entrance zone - Expanded review alerts to require either `driveway_entrance` or `driveway` zone (was entrance only) - Updated frigate-notify config to allow alerts from both `driveway_entrance` and `driveway` zones ## Deployment and Testing - [ ] Merge and sync frigate ArgoCD app on ringtail - [ ] Sync frigate-notify (restart pod to pick up ConfigMap change) - [ ] Verify alerts fire for person/car in driveway zone Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/219	2026-02-19 17:32:02 -08:00
Erich Blume	d5d32fe91f	Port Frigate NVR to ringtail k3s with GPU acceleration (#217 ) ## Summary - Enable NVIDIA container toolkit on ringtail NixOS and configure k3s containerd with nvidia runtime - Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet) to expose `nvidia.com/gpu` resources - Re-target Frigate from indri minikube (arm64, ZMQ detector) to ringtail k3s (x86_64, TensorRT/ONNX) - Switch Frigate image to `-tensorrt` variant with GPU resource limits and increased shared memory ## Manual Prerequisites 1. NFS access: Verify ringtail can mount `sifaka:/volume1/frigate` ```fish ssh ringtail 'sudo mount -t nfs sifaka:/volume1/frigate /mnt/storage1 && ls /mnt/storage1 && sudo umount /mnt/storage1' ``` 2. YOLO model: Verify `/volume1/frigate/models/yolov9m.onnx` exists on sifaka ## Deployment Steps 1. Provision ringtail: `mise run provision-ringtail` 2. Sync ArgoCD apps: `argocd app sync apps --prune` 3. Deploy NVIDIA device plugin: `argocd app sync nvidia-device-plugin` 4. Verify GPU: `kubectl --context=k3s-ringtail get nodes -o json \| jq '.items[].status.capacity'` 5. Deploy Frigate: `argocd app sync frigate` ## Verification - [ ] `nvidia.com/gpu: 1` visible in node capacity - [ ] Frigate pod running with GPU allocated - [ ] Frigate UI loads at `https://nvr.ops.eblu.me` - [ ] Detector shows ONNX/TensorRT on System page - [ ] Camera feed with bounding boxes in live view - [ ] TensorRT engine build completes (watch logs on first start) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/217	2026-02-19 14:27:04 -08:00
Erich Blume	16a4a9a616	Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216 ) ## Summary - Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired - Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s - Move ntfy ArgoCD app from indri minikube to ringtail k3s - Update Frigate docs to reflect detector removal and planned RTX 4080 migration - Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail ## Deployment After merge: 1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps: ``` argocd app sync apps --prune ``` 2. Sync new ringtail apps: ``` argocd app sync mosquitto-ringtail argocd app sync ntfy-ringtail ``` 3. Manually clean up the detector LaunchAgent on indri: ``` ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ``` ## Notes - Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns) - ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup - Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed - Frigate + frigate-notify will be ported to ringtail in a follow-up PR 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216	2026-02-19 11:22:44 -08:00
Erich Blume	61ca1ca305	Deploy Tailscale operator on ringtail k3s cluster (#215 ) ## Summary - Extract shared Tailscale operator resources (CRDs, RBAC, Deployment, ProxyClass, DNSConfig) into `tailscale-operator-base/` so both clusters reference the same manifests - Add `tailscale-operator-ringtail/` overlay with 1-replica ProxyGroup and ExternalSecret for the shared OAuth client - Add ArgoCD Application targeting `ringtail.tail8d86e.ts.net:6443` - Update `.yamllint.yaml` ignore path for the moved `operator.yaml` ## Deployment and Testing - [ ] Sync `apps` app to pick up the new Application definition - [ ] `argocd app sync tailscale-operator-ringtail` - [ ] Verify ExternalSecret syncs: `kubectl --context=k3s-ringtail -n tailscale get externalsecret` - [ ] Verify operator pod runs: `kubectl --context=k3s-ringtail -n tailscale get pods` - [ ] Verify ProxyGroup ready: `kubectl --context=k3s-ringtail -n tailscale get proxygroups` - [ ] Verify indri operator still works: `argocd app diff tailscale-operator` - [ ] Check Tailscale admin for new operator device with `tag:k8s-operator` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/215	2026-02-19 09:33:05 -08:00

217 commits