blumeops

Author	SHA1	Message	Date
Erich Blume	51626e6630	Update Loki to v3.6.5-3dc4ed7 container image Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:01:49 -08:00
Erich Blume	3dc4ed730b	Build Loki container image locally (#280 ) All checks were successful Build Container (Nix) / detect (push) Successful in 2s Details Build Container / detect (push) Successful in 2s Details Build Container (Nix) / build (loki) (push) Successful in 2s Details Build Container / build (loki) (push) Successful in 7s Details ## Summary - Add two-stage Dockerfile for Loki (Go build → Alpine runtime) in `containers/loki/` - Rewrite kustomize image to `registry.ops.eblu.me/blumeops/loki` - Tag is `v3.6.5-placeholder` until first CI build; will be updated post-build ## Details - UID 10001 matches existing StatefulSet `securityContext` (runAsUser/fsGroup) - CGO_ENABLED=0, ldflags embed version via `github.com/grafana/loki/v3/pkg/util/build` - Clones from `forge.ops.eblu.me/mirrors/loki` (mirror created this session) - Pattern follows miniflux (two-stage Go) + prometheus (ldflags) ## Deployment and Testing - [ ] Trigger container build: `mise run container-build-and-release loki` - [ ] Update kustomize tag to actual build tag - [ ] Deploy from branch: `argocd app set loki --revision feature/loki-container && argocd app sync loki` - [ ] Verify `/ready` endpoint and log ingestion - [ ] After merge: update to `[main]` tag (C0 follow-up) Reviewed-on: #280	2026-03-03 13:00:43 -08:00
Erich Blume	f914a14653	Update teslamate to v3.0.0-eb9bc57 container image Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 12:02:26 -08:00
Erich Blume	01d3b4d1c7	Switch forgejo-runner ArgoCD app to internal SSH repo URL Was the only app still using https://forge.eblu.me (public proxy) for git polling. All other apps already use the internal SSH endpoint at forge.ops.eblu.me. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:43:01 -08:00
Erich Blume	82884436df	Route runner polling through internal forge.ops.eblu.me The k8s and ringtail runners were hitting forge.eblu.me (fly.io proxy) for every FetchTask poll (~every 2s), round-tripping through the public internet unnecessarily. Use forge.ops.eblu.me (Caddy on indri, tailnet) for infrastructure workloads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:33:40 -08:00
Erich Blume	7b68be2e80	Add fly.io proxy observability and app logs to Forgejo dashboard Rename "Forgejo Repository Health" to "Forgejo" and add proxy metrics (request rate, error rate, RPS, latency, bandwidth), proxy access logs, and Forgejo application logs from Loki. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:24:53 -08:00
Erich Blume	86a0dee000	Remove ollama LAN NodePort service The sanctioned ingress is ollama.ops.eblu.me via tailnet. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:00:05 -08:00
Erich Blume	3af346f1cd	Move ollama LAN NodePort to port 80 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 09:37:50 -08:00
Erich Blume	a87c997ee1	Expose Forgejo publicly at forge.eblu.me (#278 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m28s Details ## Summary Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service. - Forgejo hardening: Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO) - Tailscale Ingress: ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint - Fly.io proxy: nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit - Authentik: OAuth callback updated to forge.eblu.me - DNS/TLS: CNAME record in Pulumi, cert in fly-setup - Rename: ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is) ## Deployment Order 1. `mise run provision-indri -- --tags forgejo` (config changes) 2. Verify forge.ops.eblu.me still works 3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator` 4. Verify `curl https://forge.tail8d86e.ts.net` 5. `cd fly && fly deploy` 6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/` 7. `fly certs add forge.eblu.me -a blumeops-proxy` 8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik` 9. `mise run dns-preview && mise run dns-up` 10. Full verification (see below) 11. Rehearse `mise run fly-shutoff` 12. After merge: reset ArgoCD revisions to main, re-sync ## Verification Checklist - [ ] forge.eblu.me loads, shows public repos - [ ] forge.ops.eblu.me still works from tailnet - [ ] SSH clone via forge.ops.eblu.me:2222 works - [ ] HTTPS clone via forge.eblu.me works - [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH - [ ] /swagger returns 403 - [ ] Rapid login attempts trigger 429 rate limit - [ ] fail2ban bans after 5 failed logins in 10 minutes - [ ] ArgoCD can still sync (SSH unaffected) - [ ] `mise run fly-shutoff` stops all public traffic - [ ] `mise run services-check` passes Reviewed-on: #278	2026-03-03 08:40:41 -08:00
Erich Blume	a32c99a252	Limit ollama to one loaded model and one parallel request Prevents OOM when switching between models — only one 14B model fits in 16GB VRAM at a time with KV cache for context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 21:23:12 -08:00
Erich Blume	203e3cd567	Add NodePort service for ollama LAN access Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 20:57:18 -08:00
Erich Blume	31d925814f	Deploy Ollama LLM server on ringtail (#277 ) ## Summary - Deploy Ollama as a new ArgoCD-managed service on ringtail's k3s cluster with GPU acceleration - Declarative model management via `models.txt` + sidecar sync script (mirrors kiwix torrent pattern) - Initial models: `qwen2.5:14b`, `deepseek-r1:14b`, `phi4:14b`, `gemma3:12b` - hostPath PV on `/mnt/storage1/ollama` for fast local model storage (200Gi) - Tailscale ingress at `ollama.ops.eblu.me` for API access from tailnet - Enable GPU time-slicing (`replicas: 2`) on nvidia-device-plugin so Frigate and Ollama share the RTX 4080 ## Deployment and Testing - [ ] Deploy nvidia-device-plugin changes first: `argocd app sync nvidia-device-plugin` - [ ] Verify GPU time-slicing: `kubectl describe node ringtail --context=k3s-ringtail` shows `nvidia.com/gpu: 2` - [ ] Sync `apps` app with `--revision feature/ollama-ringtail` - [ ] Set ollama app to branch: `argocd app set ollama --revision feature/ollama-ringtail && argocd app sync ollama` - [ ] Verify model-sync sidecar pulls models: `kubectl logs -n ollama deploy/ollama -c model-sync --context=k3s-ringtail` - [ ] Test API: `curl https://ollama.ops.eblu.me/api/tags` - [ ] Test inference: `curl https://ollama.ops.eblu.me/api/generate -d '{"model":"qwen2.5:14b","prompt":"Hello"}'` - [ ] Verify Frigate still works after GPU sharing change - [ ] After merge: `argocd app set ollama --revision main && argocd app sync ollama` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/277	2026-03-02 20:39:51 -08:00
Forgejo Actions	0f79c61c42	Update docs release to v1.12.1 - Built changelog from towncrier fragments [skip ci]	2026-03-02 18:17:07 -08:00
Forgejo Actions	847e47eaf3	Update docs release to v1.12.0 - Built changelog from towncrier fragments [skip ci]	2026-03-01 17:24:09 -08:00
Erich Blume	503775085d	Deploy authentik 2026.2.0 with migration ordering fix Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 16:32:10 -08:00
Erich Blume	90621e4155	Deploy authentik 2026.2.0 with entry_points fix Update image tag to v2026.2.0-78027eb-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 16:04:29 -08:00
Erich Blume	e2c650b027	Deploy authentik 2026.2.0 with BASE_DIR fix Update image tag to v2026.2.0-e49d966-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:55:50 -08:00
Erich Blume	c0e29476f3	Deploy authentik 2026.2.0 with TMPDIR fix Update image tag to v2026.2.0-b7bfb0b-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:53:09 -08:00
Erich Blume	38da372f94	Deploy authentik 2026.2.0 with /tmp fix Update image tag to v2026.2.0-2ac353b-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:51:17 -08:00
Erich Blume	098f3e517c	Deploy authentik 2026.2.0 (source-built) to ArgoCD Update image tag to v2026.2.0-efa9806-nix — the first source-built authentik container from the build-authentik-from-source chain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:44:35 -08:00
Erich Blume	02eb169403	Pin blumeops-pg to PostgreSQL 18.3 Replace floating :18 tag with pinned :18.3 (upstream out-of-cycle release fixing 18.2 regressions). Stamps service as reviewed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 16:25:32 -08:00
Erich Blume	776caa87f5	Sync Frigate zone coordinates from live API Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 07:52:09 -08:00
Forgejo Actions	fa223f8e3b	Update docs release to v1.11.5 - Built changelog from towncrier fragments [skip ci]	2026-02-26 07:56:02 -08:00
Erich Blume	be3cdad1cb	Add HA for CV and Docs: zero-downtime deploys (#273 ) ## Summary - Set `replicas: 2` with `maxUnavailable: 0` / `maxSurge: 1` on CV and Docs deployments so rolling updates never drop below 2 ready pods - Add PodDisruptionBudgets (`minAvailable: 1`) to protect against node drains and cluster maintenance - Add Fly.io cache purge step to `cv-deploy.yaml` workflow (docs already had this) so CV deploys don't serve stale cached content ## Deployment and Testing - [ ] `argocd app diff cv` / `argocd app diff docs` from branch - [ ] Deploy from branch: `argocd app set cv --revision feature/ha-cv-docs-zero-downtime && argocd app sync cv` - [ ] Verify 2 pods running: `kubectl get pods -n cv --context=minikube-indri` - [ ] Test rolling restart: `kubectl rollout restart deployment/cv -n cv --context=minikube-indri` - [ ] During rollout, confirm continuous availability via `curl -I https://cv.eblu.me` - [ ] After merge: reset ArgoCD to main, re-sync both apps Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/273	2026-02-26 07:53:21 -08:00
Erich Blume	fb83c5c577	Add explicit ExternalSecret defaults for SSA sync parity The external-secrets webhook injects conversionStrategy, decodingStrategy, and metadataPolicy defaults on admission. Declaring them explicitly prevents ArgoCD SSA from flagging the resource as OutOfSync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 07:02:54 -08:00
Erich Blume	db561c6b0e	Upgrade ArgoCD v3.2.6 → v3.3.2 with Server-Side Apply (#272 ) ## Summary - Upgrade ArgoCD from v3.2.6 to v3.3.2 - Enable `ServerSideApply=true` sync option (required by v3.3 — ApplicationSet CRD exceeds client-side apply annotation limit) - Update service-versions.yaml with review for argocd and 1password-connect ## Breaking changes reviewed - Server-Side Apply required: Added to syncOptions ✅ - Source Hydrator git notes: Not used — N/A - Application path cleaning removed: Not used — N/A - Settings API field restriction: Authenticated access only — N/A ## Deployment and Testing - [ ] Sync the `apps` app first (picks up SSA syncOption change) - [ ] `argocd app set argocd --revision feature/argocd-v3.3.2` - [ ] `argocd app sync argocd` - [ ] Verify all argocd pods running with v3.3.2 images - [ ] Verify other apps still sync correctly - [ ] After merge: `argocd app set argocd --revision main && argocd app sync argocd` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/272	2026-02-26 06:51:50 -08:00
Erich Blume	95c8424e62	Add Transmission metrics exporter and Grafana dashboard (#271 ) ## Summary - Add `metalmatze/transmission-exporter` as a sidecar container in the torrent deployment, exposing Prometheus metrics on port 19091 - Add metrics port to the torrent service for Prometheus scraping - Add Prometheus scrape job targeting the transmission exporter - Create Grafana dashboard with: - Overview stats (download/upload speed, active/total torrents) - Transfer speed timeseries (download + upload over time) - Transfer volume stats (total downloaded/uploaded in selected range) - Per-torrent download and upload rate timeseries - Per-torrent details table (ratio, uploaded, percent done) ## Deployment and Testing - [ ] Sync ArgoCD `torrent` app from branch — verify exporter sidecar starts - [ ] Verify exporter metrics: `kubectl exec` into pod, `curl localhost:19091/metrics` - [ ] Verify Prometheus scrapes it: check targets at prometheus.ops.eblu.me - [ ] Open Grafana, find "Transmission" dashboard, verify panels populate - [ ] Sync ArgoCD `prometheus` app from branch - [ ] Sync ArgoCD `grafana-config` app from branch Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/271	2026-02-25 22:23:33 -08:00
Erich Blume	03d71544ec	Add multi-cluster observability with ringtail metrics and dashboards (#270 ) ## Summary - Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs - Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests) - Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki - Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with: - Kubernetes Clusters dashboard — multi-cluster with `cluster` and `namespace` template variables - Ringtail (k3s) dashboard — dedicated ringtail view with GPU usage panels ## Deployment and Testing 1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`) 2. Sync `prometheus` → verify `cluster` label on scraped metrics 3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs 4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs 5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail 6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}` 7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values 8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods ## Notes - Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later - DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve - The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270	2026-02-25 22:01:00 -08:00
Erich Blume	2243f2e0a1	Filter driveway zone to person/dog/cat only in Frigate Parked car was being re-detected every few minutes at night due to IR illumination noise triggering motion detection. Restrict the driveway zone to [person, dog, cat] so cars and birds no longer create events there. Cars still alert via the driveway_entrance zone. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 20:45:07 -08:00
Erich Blume	de54b4e33d	Port CloudNative-PG off Helm to direct release manifest (#268 ) ## Summary - Point ArgoCD app directly at forge-mirrored upstream repo (`mirrors/cloudnative-pg`) instead of the Helm charts repo - Use `directory.include` to select the specific release manifest (`cnpg-1.27.1.yaml`) from the `releases/` directory - No vendored files, no Helm — upgrades are a two-line change (`targetRevision` + `directory.include`) - Delete unused `values.yaml` (was empty, all Helm defaults) ## Deployment and Testing - [ ] Register mirror repo in ArgoCD: `argocd repo add ssh://forgejo@forge.ops.eblu.me:2222/mirrors/cloudnative-pg.git --ssh-private-key-path <key>` - [ ] `argocd app set cloudnative-pg --revision feature/cnpg-direct-source && argocd app sync cloudnative-pg` - [ ] Verify operator pod running: `kubectl get pods -n cnpg-system --context=minikube-indri` - [ ] Verify CRDs exist: `kubectl get crd --context=minikube-indri \| grep cnpg` - [ ] Verify existing clusters healthy: `kubectl get clusters -A --context=minikube-indri` - [ ] After merge: `argocd app set cloudnative-pg --revision main && argocd app sync cloudnative-pg` ## Notes - The forge mirror was created via `mise run mirror-create` from `https://github.com/cloudnative-pg/cloudnative-pg.git` - ArgoCD may need the mirror repo added to its known repositories if the credential template doesn't already match `mirrors/*` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/268	2026-02-25 17:37:53 -08:00
Erich Blume	285ad4141f	Fix Frigate detection events rate metric name in Grafana dashboard The panel queried frigate_camera_events but the actual metric exposed by Frigate is frigate_camera_events_total with a "camera" label (not "camera_name"). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 16:51:57 -08:00
Forgejo Actions	4736c7e9bd	Update docs release to v1.11.4 - Built changelog from towncrier fragments [skip ci]	2026-02-25 07:04:23 -08:00
Erich Blume	5f9bc20345	Fix mirror org refs in ArgoCD apps and widen credential template (#266 ) ## Summary - Widen `repo-creds-forge` URL prefix from `/eblume/` to host-wide `/` so it matches repos in all forge orgs (fixes `mirrors/` repos not getting SSH credentials) - Update 8 ArgoCD app definitions from `eblume/<mirror>` → `mirrors/<mirror>` (immich-charts, cloudnative-pg-charts, external-secrets, connect-helm-charts) - Fix stale alloy clone comment in Ansible defaults - Bump immich v2.5.2 → v2.5.6 (bug-fix patches only) - Update ArgoCD README bootstrap command and credential docs ## Context Mirrors were migrated from `forge.ops.eblu.me/eblume/` to `forge.ops.eblu.me/mirrors/` in commit ``cd57814``. Container Dockerfiles and image tags were updated, but ArgoCD app definitions and the repo credential template were missed, causing `ComparisonError` on apps that source Helm charts from mirrored repos. ## Deployment 1. Sync the ArgoCD `argocd` app first (picks up the widened credential template) 2. Sync the `apps` app (picks up new repo URLs for all 8 apps) 3. Verify immich resolves its ComparisonError: `argocd app get immich` 4. Sync immich to deploy v2.5.6: `argocd app sync immich` 5. Spot-check: `argocd app get external-secrets`, `argocd app get cloudnative-pg`, `argocd app get 1password-connect` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/266	2026-02-25 06:55:53 -08:00
Erich Blume	4f8f2985c1	Update prometheus and teslamate image tags after mirror migration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 21:18:15 -08:00
Erich Blume	e0f9ebebdf	Update homepage, navidrome, ntfy, miniflux image tags after mirror migration Prometheus and teslamate builds still in progress — will update in a follow-up commit once their `33b7f0f` tags land. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 21:06:08 -08:00
Erich Blume	61ee6a4d38	Fix Grafana ConfigMap labels lost in configMapGenerator migration The hand-written configmap.yaml had app.kubernetes.io/name and app.kubernetes.io/instance labels; configMapGenerator dropped them. Add options.labels to both generator entries to restore parity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 14:46:20 -08:00
Erich Blume	9b44a8ec51	Add kustomize images: and configMapGenerator: across services (#264 ) ## Summary - Move hardcoded image tags to kustomization.yaml `images:` transformer across 22 services — image names in manifests become version-agnostic templates, with tags centralized in one place per service - Replace hand-written ConfigMap manifests with `configMapGenerator:` in 12 services — config data extracted to standalone files, generated ConfigMaps include content hashes that trigger automatic pod rollouts on changes - Create new `kustomization.yaml` for forgejo-runner and nvidia-device-plugin (switches ArgoCD from directory mode to kustomize mode, rendered output identical) ### Services modified Images only (8): cv, devpi, docs, kube-state-metrics, miniflux, navidrome, teslamate, torrent Images + configMapGenerator (10): alloy-k8s, forgejo-runner, frigate, grafana, homepage, kiwix, loki, mosquitto, ntfy, prometheus Images only, no configMapGenerator (4): authentik (skip blueprints — special YAML tags), tailscale-operator-base (Deployment only, CRD image fields left as-is) Skipped entirely (6): argocd (remote upstream), databases (no image fields), external-secrets, grafana-config (cross-kustomization dashboards), immich (Helm-managed), 1password-connect/cloudnative-pg (no kustomization.yaml) ### What changes at deploy time - images: — no functional diff, `kustomize build` produces identical output with tags - configMapGenerator: — ConfigMap names gain hash suffixes (e.g., `prometheus-config` → `prometheus-config-6f42fhctcb`) and all Deployment/StatefulSet/DaemonSet references are updated automatically. Pods will restart once per service on first sync due to the name change ## Test plan - [x] `kubectl kustomize` builds all 30 service directories successfully - [x] Image tags verified in rendered output for all modified services - [x] ConfigMap hash suffixes verified in rendered output - [x] ConfigMap references in Deployments/StatefulSets confirmed to use hashed names - [x] All pre-commit hooks pass (yamllint, shellcheck, prettier, etc.) - [ ] `argocd app diff` each service to confirm only expected ConfigMap name changes - [ ] Deploy from branch starting with a low-risk service (e.g., mosquitto) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/264	2026-02-24 14:25:19 -08:00
Erich Blume	86aeb60ec9	Fix TeslaMate dashboards: add database to PostgreSQL jsonData Grafana 12.x's grafana-postgresql-datasource plugin requires the database name in jsonData, not just the top-level database field. Without it, the frontend blocks all queries with "no default database configured", causing all TeslaMate panels to show "No Data." Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 13:49:07 -08:00
Erich Blume	495c3e8496	Fix Grafana OAuth role mapping from Authentik groups The INI parser was stripping outer single quotes from role_attribute_path = 'Admin', causing Grafana to evaluate 'Admin' as a JMESPath field identifier instead of a string literal. This resulted in all OAuth users getting the default Viewer role. Replaced with a proper group-based expression that checks for the 'admins' Authentik group and maps to Admin/Viewer accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 13:41:08 -08:00
Erich Blume	4acd2e58d4	Update prometheus and grafana to main-SHA container tags Prometheus: v3.9.1-74029e1 [branch] -> v3.9.1-2ba5d8a [main] Grafana: v12.3.3-09ac36b [branch] -> v12.3.3-d05d2fb [main] These images were built during PR development and referenced branch commits that won't survive branch cleanup. The [main] tags are identical rebuilds from the squash-merge commit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 09:58:09 -08:00
Erich Blume	2ba5d8a8aa	Port Prometheus to local container build (#262 ) All checks were successful Build Container (Nix) / detect (push) Successful in 2s Details Build Container / detect (push) Successful in 2s Details Build Container (Nix) / build (prometheus) (push) Successful in 2s Details Build Container / build (prometheus) (push) Successful in 7s Details ## Summary - Add three-stage Dockerfile for Prometheus v3.9.1 (Node UI → Go binaries → Alpine runtime) - Produces `prometheus` and `promtool` binaries with embedded web UI assets - Follows navidrome/ntfy pattern for supply chain control via Zot registry ## Deployment and Testing - [ ] `dagger call build --src=. --container-name=prometheus` succeeds - [ ] Container reports correct version via `prometheus --version` - [ ] `promtool --version` works - [ ] Update statefulset image reference after successful build - [ ] Deploy from branch: `argocd app set prometheus --revision <branch> && argocd app sync prometheus` - [ ] Health probes pass (`/-/healthy`, `/-/ready`) - [ ] Web UI loads, scrape targets work, remote write functions Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/262	2026-02-24 09:15:57 -08:00
Forgejo Actions	2f78d180e8	Update docs release to v1.11.3 - Built changelog from towncrier fragments [skip ci]	2026-02-23 21:04:33 -08:00
Erich Blume	d05d2fbaff	C2: Upgrade Grafana to 12.x with Nix container and Kustomize (#260 ) All checks were successful Build Container (Nix) / detect (push) Successful in 2s Details Build Container / detect (push) Successful in 1s Details Build Container (Nix) / build (grafana) (push) Successful in 2s Details Build Container / build (grafana) (push) Successful in 7s Details ## Summary Mikado chain to upgrade Grafana from 11.4.0 (Helm chart) to 12.x with: - Home-built Nix container image (`forge.ops.eblu.me/eblume/grafana`) - Kustomize manifests replacing the Helm chart - Single-source ArgoCD app ## Chain Goal: `upgrade-grafana` Leaves: `build-grafana-container`, `kustomize-grafana-deployment` Track with: `mise run docs-mikado upgrade-grafana` ## Test plan - [ ] Container builds successfully via Nix - [ ] Container pushed to registry - [ ] Kustomize manifests produce equivalent resources to current Helm - [ ] Pod runs, UI loads, OIDC works, datasources healthy - [ ] `mise run services-check` passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/260	2026-02-23 18:07:18 -08:00
Erich Blume	9b419abf24	Update RUNNER_LABELS to use runner-job-image:v0.19.11-4c5e0f0 Now that the image is built under the new name, point the forgejo runner at it. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 17:47:14 -08:00
Erich Blume	75fde54355	Fix Grafana TeslaMate dashboard folder provisioning (#253 ) ## Summary - `foldersFromFilesStructure` was `false` in Grafana's sidecar provider config, causing Grafana to ignore the subdirectory structure the sidecar creates from `grafana_folder` annotations - All 18 TeslaMate dashboards were appearing in the root "Dashboards" folder despite having `grafana_folder: "TeslaMate"` annotations on their ConfigMaps - Flipping to `true` makes Grafana replicate the sidecar's directory structure as UI folders ## Deployment and Testing - [ ] Sync `grafana` app: `argocd app sync grafana` - [ ] Verify TeslaMate dashboards appear under a "TeslaMate" folder in Grafana's dashboard list - [ ] Verify other dashboards remain in the root "Dashboards" folder Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/253	2026-02-22 18:38:51 -08:00
Erich Blume	6871bf32a8	Remove unused gpu panel in frigate dashboard	2026-02-22 18:26:47 -08:00
Erich Blume	2c6c6a244a	Fix Frigate Prometheus metrics & rebuild Grafana dashboard (#252 ) ## Summary - Prometheus scrape target: Changed from `frigate.frigate.svc.cluster.local:5000` (broken after ringtail migration) to `nvr.ops.eblu.me` via HTTPS through Caddy on indri - Grafana dashboard: Rebuilt for Frigate 0.17 metrics — 12 panels total: - Row 1 (stats): Uptime, Inference Speed, Camera FPS, Detection FPS, GPU Usage, GPU Temp - Row 2 (timeseries): CPU Usage, Memory Usage - Row 3 (timeseries): Camera FPS + Skipped FPS, GPU Usage + Memory over time - Row 4 (timeseries): Storage Usage, Detection Events (rate by camera/label) ## Deployment and Testing 1. Sync prometheus app on branch: ``` argocd app set prometheus --revision fix/frigate-metrics-dashboard && argocd app sync prometheus ``` 2. Check `prometheus.ops.eblu.me/targets` — frigate job should show UP 3. Sync grafana-config: ``` argocd app sync grafana-config ``` 4. Check `grafana.ops.eblu.me` — Frigate NVR dashboard should show live data 5. After merge: reset both apps to `--revision main` and sync Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/252	2026-02-22 18:14:17 -08:00
Forgejo Actions	dda7d719b3	Update docs release to v1.11.2 - Built changelog from towncrier fragments [skip ci]	2026-02-22 17:52:05 -08:00
Erich Blume	e655f4556e	Upgrade k8s forgejo-runner from v6.3.1 to v12.7.0 (#251 ) ## Summary Completes the `upgrade-k8s-runner` mikado chain. Both prerequisites (workflow validation in Dagger, config review against v12 defaults) were resolved in #250. - Bump runner image `code.forgejo.org/forgejo/runner:6.3.1` → `12.7.0` - Update `service-versions.yaml` to track new version - Mark goal card complete (remove `status: active`) ## Deployment and Testing After merge: 1. `argocd app sync forgejo-runner` 2. Verify runner registers in Forgejo admin → runners 3. Trigger a test workflow (e.g. `branch-cleanup.yaml` manual dispatch) Rollback: revert image tag to `6.3.1`, push, sync. Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/251	2026-02-22 17:43:39 -08:00
Erich Blume	0f6a1898f0	Prepare forgejo-runner v12 upgrade (leaf nodes) (#250 ) ## Summary - Review runner config against v12.7.0 defaults — added `shutdown_timeout: 3h`, no breaking changes found - Add `validate_workflows` Dagger function using `forgejo-runner validate --directory .` inside upstream container - All 6 workflows pass v12.7.0 schema validation - Wire `mise run validate-workflows` task and pre-commit hook on `.forgejo/workflows/` changes - Mark both leaf Mikado cards (`review-runner-config-v12`, `validate-workflows-against-v12`) complete ## Mikado State After merge, `upgrade-k8s-runner` goal card has no unmet dependencies — ready to execute the actual image bump in a follow-up PR. ## Test Plan - [x] `dagger call validate-workflows --src=.` passes (all 6 workflows OK) - [x] Pre-commit hooks pass - [ ] Reviewer: confirm `shutdown_timeout: 3h` addition to ConfigMap looks reasonable 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/250	2026-02-22 17:38:32 -08:00

1 2 3 4 5

239 commits