blumeops

Author	SHA1	Message	Date
Erich Blume	e0057b46e4	Wire ringtail blumeops-pg into backups + Grafana (#364 ) Prereq for the wave-1 decommission. The cutover moved paperless+teslamate (postgres) and mealie (SQLite) to ringtail, but borgmatic and the Grafana TeslaMate datasource still pointed at the minikube copies — the migrated live data was unbacked since cutover, and dropping the minikube DBs would break the TeslaMate dashboards. - Tailscale Service `blumeops-pg-ringtail` + Caddy L4 route `pg.ops.eblu.me:5434` - borgmatic: teslamate + paperless postgres → :5434; mealie SQLite → ssh:eblume@ringtail - Grafana TeslaMate datasource → pg.ops.eblu.me:5434 Deploy: sync databases-ringtail (tailscale svc) + grafana from branch; provision-indri --tags caddy,borgmatic; verify a backup run + dashboards. Unblocks the decommission PR. Reviewed-on: #364	2026-06-03 12:25:30 -07:00
Erich Blume	2fae0f7161	C0: switch grafana deployment to Recreate strategy Grafana uses an RWO PVC for SQLite + Bleve search index. RollingUpdate spawns the new pod before terminating the old one, so the new pod crashloops on the index lock until rollout timeout. Recreate terminates the old pod first, letting the new pod acquire the lock cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 06:33:26 -07:00
Erich Blume	495e45d01d	Address 6 critical Prowler IaC findings (mute + grafana RBAC tighten) (#340 ) ## Summary The weekly Prowler IaC scan reported 6 critical findings against `argocd/manifests/`. They split cleanly into two patterns: - Legitimate-by-design RBAC → mute with new compensating controls - `external-secrets-controller`, `external-secrets-cert-controller` manage `secrets` (KSV-0041) and the cert-controller mutates its own webhook configurations (KSV-0114). This is what the operator is for. New CC: `operator-purpose-bound-rbac`. - `kube-state-metrics` (both `minikube-indri` and `k3s-ringtail`) holds `list/watch` on secrets to expose `kube_secret_info` and `kube_secret_labels` metrics. KSM's metric schema only reads metadata, never the `data:` field. New CC: `kube-state-metrics-metadata-only`. - Over-broad RBAC → fix - `grafana-clusterrole` had `get/watch/list` on `secrets` because the dashboard-sidecar config used `RESOURCE=both` (ConfigMaps + Secrets). Nothing in the cluster labels Secrets with `grafana_dashboard=1`, so this was unused power. Switched both sidecar instances to `RESOURCE=configmap` and removed `secrets` from the ClusterRole. The IaC cronjob also did not previously pass `--mutelist-file`, which is why every IaC finding reported as unmuted regardless of mutelist configuration. The new `mutelist/iac.yaml` is bundled into the existing `prowler-mutelist` ConfigMap and mounted via `items:` selector. ## Test plan - [ ] `kubectl --context=minikube-indri kustomize argocd/manifests/prowler/` — already passes locally - [ ] `kubectl --context=minikube-indri kustomize argocd/manifests/grafana/` — already passes locally - [ ] Deploy from this branch via `argocd app set prowler --revision prowler-iac-mutelist && argocd app sync prowler` and same for `grafana` - [ ] Manually trigger the IaC cronjob and verify `MUTED=True` on the 6 critical findings (`kubectl --context=minikube-indri -n prowler create job --from=cronjob/prowler-iac-scan prowler-iac-test`) - [ ] Restart grafana pod and confirm dashboards still render (sidecar still finds them via ConfigMap watch) - [ ] After verify, `argocd app set <app> --revision main && argocd app sync <app>` post-merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #340	2026-04-29 10:43:32 -07:00
Erich Blume	fe0e913963	Switch Fly proxy to upstream keepalive pools (#337 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m37s Details ## Summary - Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools - Reuses TLS connections through the Tailscale tunnel instead of handshaking per request - Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS) ## Trade-off DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this. ## Still TODO on this branch - [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder) - [ ] Docs pass - [ ] Deploy from branch and verify latency improvement - [ ] Changelog fragment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #337	2026-04-17 16:39:52 -07:00
Erich Blume	db6d8af8b1	Update grafana-sidecar image tag to v2.6.0-61fcd5d (merge build) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 08:02:39 -07:00
Erich Blume	61fcd5d70a	Upgrade grafana-sidecar 1.28.0 → 2.6.0 + container.py port (#332 ) All checks were successful Build Container / detect (push) Successful in 4s Details Build Container / build-dagger (grafana-sidecar) (push) Successful in 1m50s Details ## Summary - Upgrade grafana-sidecar from 1.28.0 to 2.6.0 (the 2.x memory regression #462 is resolved; ~35MB static overhead is acceptable) - Port build from Dockerfile to native Dagger container.py - Add liveness/readiness probes using the new /healthz endpoint on port 8080 - Update docs to reflect container.py migration and remove stale pin note ## Test plan - [ ] Build container: `mise run container-build-and-release grafana-sidecar` - [ ] Update kustomization tag with new image tag - [ ] Deploy from branch: `argocd app set grafana --revision grafana-sidecar-2.6.0 && argocd app sync grafana` - [ ] Verify sidecar health endpoint: `kubectl exec -n monitoring <pod> -c grafana-sc-dashboard -- wget -qO- http://localhost:8080/healthz` - [ ] Verify dashboards load in Grafana UI - [ ] `mise run services-check` Reviewed-on: #332	2026-04-13 07:57:13 -07:00
Erich Blume	936d29bbe1	Fix UnPoller dashboard UIDs exceeding Grafana 12's 40-char limit Strip redundant "unifi-poller-" prefix from generated slugs, bringing UIDs from 45-48 chars down to 32-35 chars. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 07:03:39 -07:00
Erich Blume	b1e2811077	Upgrade Grafana 12.3.3 → 12.4.2 (#322 ) All checks were successful Build Container / detect (push) Successful in 2s Details Build Container / build-dockerfile (grafana) (push) Successful in 7s Details ## Summary - Bumps Grafana from 12.3.3 to 12.4.2 - Patches 7 CVEs, notably CVE-2026-27880 (unauthenticated OOM DoS, CVSS 7.5) and CVE-2026-27879 (authenticated OOM via resample queries) - No config changes required — reviewed alerting, datasources, OIDC, and feature toggles against 12.4.x breaking changes ## Breaking changes reviewed \| Change \| Impact \| \|--------\|--------\| \| Alerting: pending period applies to NoData/Error \| Net positive — reduces noise from transient blips \| \| Default notification uses empty receiver \| No impact — we explicitly set `ntfy-infra` \| \| Removed feature toggles (4) \| No impact — none configured \| \| OAuth ID token signature validation \| Low risk — verify OIDC login post-deploy \| \| OpsGenie deprecated \| No impact — using webhook \| ## Test plan - [ ] Container build completes at forge - [ ] Update kustomization.yaml with new image tag - [ ] `argocd app set grafana --revision upgrade/grafana-12.4.2 && argocd app sync grafana` - [ ] Verify Grafana UI loads at grafana.ops.eblu.me - [ ] Verify OIDC login via Authentik - [ ] Verify dashboards and datasources load - [ ] Check alerting rules are intact 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #322	2026-04-02 11:33:19 -07:00
Erich Blume	2c1652604b	Reduce PodNotReady alert lookback from 5m to 60s The 5-minute lookback window kept stale data from terminated pods visible during rollouts, causing the alert to sit in Pending for ~5 minutes after every routine deployment. 60s still covers two scrape cycles (30s interval) while clearing stale data much faster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 19:48:37 -07:00
Erich Blume	a37012385f	Tighten ArgoCDAppOutOfSync alert timing to clear faster after sync Reduced `for` from 30m to 5m and lookback window from 5m to 1m. The old values caused alerts to linger long after apps returned to Synced state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:44:09 -07:00
Erich Blume	6d65e6928c	C2: Deploy infrastructure alerting pipeline (#303 ) ## Summary Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications. Design: - Grafana Unified Alerting evaluates rules against Prometheus/Loki - ntfy webhook contact point delivers iOS notifications - Anti-noise policy: page once per 24h per alert group - Every alert links to a runbook in `docs/how-to/alerts/` - services-check eventually queries the alerting API instead of doing its own probes Chain (bottom-up): 1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy 2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure 3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks 4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API 5. `deploy-infra-alerting` — goal card 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #303	2026-03-22 14:52:56 -07:00
Erich Blume	3d2a97aaf9	Update kustomization tags to OCI-labeled builds (`613f05d`) Point all services at the `613f05d` images which carry the new consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4 version, no matching build) and docs/quartz (no `613f05d` build). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 06:34:12 -07:00
Erich Blume	c92b949a20	Fix UID sed to target root-level dashboard uid only The top-level "uid" in Grafana dashboard JSON is at 2-space indent near the end of the file, not the first occurrence. Match on ^ "uid" to avoid clobbering nested datasource uid references. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 12:56:50 -07:00
Erich Blume	334fbbb9e3	Fix TeslaMate/UnPoller dashboard UID sed clobbering datasource refs The previous sed replaced ALL "uid" fields in dashboard JSON files, including datasource references inside panels, causing dashboards to go dark. Scope the replacement to only the first occurrence (the top-level dashboard UID) using GNU sed 0,/pattern/ addressing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 12:53:00 -07:00
Erich Blume	6f88baeb91	Fix Grafana starred dashboards lost on pod restart Add init container to pre-populate ConfigMap dashboards before Grafana starts, eliminating the race between the sidecar and the provisioner that caused dashboard DB records to be deleted and re-created with new IDs. Also stamp stable UIDs on TeslaMate and UnPoller dashboards fetched from upstream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 12:40:44 -07:00
Erich Blume	b54d87e071	Fix shell syntax error in unpoller dashboard initcontainer Comments can't appear inside a for-in list in sh. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-16 15:59:03 -07:00
Erich Blume	4dc3e5cae2	Add UnPoller for UniFi network metrics (#298 ) All checks were successful Build Container (Nix) / detect (push) Successful in 2s Details Build Container / detect (push) Successful in 2s Details Build Container (Nix) / build (unpoller) (push) Successful in 2s Details Build Container / build (unpoller) (push) Successful in 7s Details ## Summary - Deploy UnPoller as a k8s service on indri to export UniFi controller metrics to Prometheus - Custom-built container from forge mirror (`containers/unpoller/Dockerfile`) - Credentials pulled from 1Password via external-secrets - Prometheus scrape job added, docs and service-versions updated ## Test plan - [ ] Build container: `mise run container-release unpoller v2.34.0` - [ ] Update kustomization tag with built image tag - [ ] Deploy from branch: `argocd app set unpoller --revision feature/unpoller && argocd app sync unpoller` - [ ] Verify pod connects to UX7 controller (check logs) - [ ] Confirm `unpoller` target appears in Prometheus - [ ] Query `unifi_` metrics in Grafana 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #298	2026-03-16 15:52:45 -07:00
Erich Blume	4ca2e39901	Externalize TeslaMate dashboards to forge mirror (#296 ) ## Summary - Replaces 18 TeslaMate dashboard ConfigMaps (713 KB / 22,080 lines) with a Grafana init container - Init container fetches dashboard JSON directly from `mirrors/teslamate` on forge, pinned to `v3.0.0` - Grafana's file provider picks them up from `/tmp/dashboards/TeslaMate/` via `foldersFromFilesStructure` - Non-TeslaMate dashboards remain as ConfigMaps (unchanged) ## How it works - New `init-teslamate-dashboards` init container uses busybox `wget` to fetch each JSON file from `https://forge.eblu.me/mirrors/teslamate/raw/tag/v3.0.0/grafana/dashboards/` - Files land in `/tmp/dashboards/TeslaMate/`, same emptyDir volume the sidecar uses - The sidecar continues to handle ConfigMap-based dashboards; the init container handles TeslaMate - Version pin is in the init container args (TESLAMATE_VERSION) ## Deployment and Testing - [ ] Sync `grafana` app from branch — verify init container runs and fetches dashboards - [ ] Sync `grafana-config` app from branch — verify TeslaMate ConfigMaps are pruned - [ ] Check Grafana UI: TeslaMate folder should still contain all 18 dashboards - [ ] Verify non-TeslaMate dashboards are unaffected - [ ] After merge: sync both apps from main Reviewed-on: #296	2026-03-15 18:31:19 -07:00
Erich Blume	6e8d11c6bb	Add :kustomized sentinel tag to manifest images, review devpi Bare image references in manifests were ambiguous — unclear whether the tag was intentionally omitted or managed by kustomize. Add :kustomized sentinel to all 37 image refs overridden by kustomize images transformer. Add sync notes for tailscale-operator proxyclass (CRD fields not processed by kustomize). Mark devpi reviewed (6.19.1 is current). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 08:15:06 -08:00
Erich Blume	c281fb5403	Add OpenTelemetry distributed tracing (Tempo + Beyla eBPF) (#286 ) ## Summary Adds the third observability pillar — distributed tracing — alongside existing metrics (Prometheus) and logs (Loki). - Grafana Tempo 2.10.1 on minikube-indri for trace storage with 7d retention, OTLP receivers, and `metrics_generator` that remote-writes span-metrics (RED) to Prometheus - Beyla eBPF auto-instrumentation via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes - Grafana integration — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking - Prometheus scrapes Tempo operational metrics ### Architecture ``` ringtail (k3s) indri (minikube) ┌──────────────────────┐ ┌─────────────────────┐ │ Alloy+Beyla (eBPF) │──OTLP HTTP────────→ │ Tempo │ │ ↳ Frigate, ntfy, │ via tailnet │ ↳ trace storage │ │ Ollama, Immich │ │ ↳ RED → Prometheus │ └──────────────────────┘ │ │ │ Grafana │ │ ↳ Tempo datasource │ └─────────────────────┘ ``` ### New files (12) - `docs/reference/services/tempo.md` — reference doc - `docs/changelog.d/feature-otel-tracing.feature.md` - `argocd/apps/tempo.yaml` + `argocd/manifests/tempo/` (6 files) - `argocd/apps/alloy-tracing-ringtail.yaml` + `argocd/manifests/alloy-tracing-ringtail/` (4 files) ### Modified files (6) - `argocd/manifests/grafana/datasources.yaml` — Tempo datasource + Loki derivedFields - `argocd/manifests/prometheus/prometheus.yml` — Tempo scrape target - `service-versions.yaml` — tempo + alloy-tracing-ringtail entries - `docs/reference/services/grafana.md` — Tempo in datasources table - `docs/reference/reference.md` — Tempo in services index - `docs/reference/operations/observability.md` — Tempo in components list ## Deployment and Testing - [ ] Sync `apps` app to pick up new Application definitions - [ ] `argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo` - [ ] Verify Tempo pod: `kubectl --context=minikube-indri get pods -n monitoring -l app=tempo` - [ ] Verify Tempo ready: port-forward 3200 and `curl localhost:3200/ready` - [ ] Verify Tailscale ingresses: `kubectl --context=minikube-indri get ingress -n monitoring` - [ ] `argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail` - [ ] Check Beyla discovery in alloy-tracing logs on ringtail - [ ] Sync grafana-config for updated datasources - [ ] Sync prometheus for updated scrape config - [ ] Test Grafana Tempo datasource connection - [ ] Generate test traffic and search traces in Grafana Explore → Tempo - [ ] After merge: reset all ArgoCD app revisions back to main Reviewed-on: #286	2026-03-05 10:51:07 -08:00
Erich Blume	3d065b94f9	Pin grafana-sidecar to main build tag v1.28.0-a2bb9ab (built from merge commit on main). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:51:01 -08:00
Erich Blume	a2bb9abbdb	Home-build grafana-sidecar container (#281 ) All checks were successful Build Container (Nix) / detect (push) Successful in 2s Details Build Container / detect (push) Successful in 2s Details Build Container (Nix) / build (grafana-sidecar) (push) Successful in 2s Details Build Container / build (grafana-sidecar) (push) Successful in 6s Details ## Summary - Home-build the k8s-sidecar container (`grafana-sidecar`) from forge mirror, replacing upstream `quay.io/kiwigrid/k8s-sidecar:1.28.0` - Pinned to v1.28.0 — v2.x deferred due to 135% memory regression and readOnlyRootFilesystem crashloop - Adds Dockerfile, service-versions entry, docs, and changelog fragment - Manifest switch to home-built image pending container build ## Deployment and Testing - [ ] `mise run container-build-and-release grafana-sidecar` - [ ] Update kustomization.yaml with built image tag - [ ] `argocd app set grafana --revision feature/grafana-sidecar && argocd app sync grafana` - [ ] Verify sidecar logs and dashboards at https://grafana.ops.eblu.me - [ ] Post-merge: `argocd app set grafana --revision main && argocd app sync grafana` Reviewed-on: #281	2026-03-03 13:48:24 -08:00
Erich Blume	61ee6a4d38	Fix Grafana ConfigMap labels lost in configMapGenerator migration The hand-written configmap.yaml had app.kubernetes.io/name and app.kubernetes.io/instance labels; configMapGenerator dropped them. Add options.labels to both generator entries to restore parity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 14:46:20 -08:00
Erich Blume	9b44a8ec51	Add kustomize images: and configMapGenerator: across services (#264 ) ## Summary - Move hardcoded image tags to kustomization.yaml `images:` transformer across 22 services — image names in manifests become version-agnostic templates, with tags centralized in one place per service - Replace hand-written ConfigMap manifests with `configMapGenerator:` in 12 services — config data extracted to standalone files, generated ConfigMaps include content hashes that trigger automatic pod rollouts on changes - Create new `kustomization.yaml` for forgejo-runner and nvidia-device-plugin (switches ArgoCD from directory mode to kustomize mode, rendered output identical) ### Services modified Images only (8): cv, devpi, docs, kube-state-metrics, miniflux, navidrome, teslamate, torrent Images + configMapGenerator (10): alloy-k8s, forgejo-runner, frigate, grafana, homepage, kiwix, loki, mosquitto, ntfy, prometheus Images only, no configMapGenerator (4): authentik (skip blueprints — special YAML tags), tailscale-operator-base (Deployment only, CRD image fields left as-is) Skipped entirely (6): argocd (remote upstream), databases (no image fields), external-secrets, grafana-config (cross-kustomization dashboards), immich (Helm-managed), 1password-connect/cloudnative-pg (no kustomization.yaml) ### What changes at deploy time - images: — no functional diff, `kustomize build` produces identical output with tags - configMapGenerator: — ConfigMap names gain hash suffixes (e.g., `prometheus-config` → `prometheus-config-6f42fhctcb`) and all Deployment/StatefulSet/DaemonSet references are updated automatically. Pods will restart once per service on first sync due to the name change ## Test plan - [x] `kubectl kustomize` builds all 30 service directories successfully - [x] Image tags verified in rendered output for all modified services - [x] ConfigMap hash suffixes verified in rendered output - [x] ConfigMap references in Deployments/StatefulSets confirmed to use hashed names - [x] All pre-commit hooks pass (yamllint, shellcheck, prettier, etc.) - [ ] `argocd app diff` each service to confirm only expected ConfigMap name changes - [ ] Deploy from branch starting with a low-risk service (e.g., mosquitto) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/264	2026-02-24 14:25:19 -08:00
Erich Blume	86aeb60ec9	Fix TeslaMate dashboards: add database to PostgreSQL jsonData Grafana 12.x's grafana-postgresql-datasource plugin requires the database name in jsonData, not just the top-level database field. Without it, the frontend blocks all queries with "no default database configured", causing all TeslaMate panels to show "No Data." Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 13:49:07 -08:00
Erich Blume	495c3e8496	Fix Grafana OAuth role mapping from Authentik groups The INI parser was stripping outer single quotes from role_attribute_path = 'Admin', causing Grafana to evaluate 'Admin' as a JMESPath field identifier instead of a string literal. This resulted in all OAuth users getting the default Viewer role. Replaced with a proper group-based expression that checks for the 'admins' Authentik group and maps to Admin/Viewer accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 13:41:08 -08:00
Erich Blume	4acd2e58d4	Update prometheus and grafana to main-SHA container tags Prometheus: v3.9.1-74029e1 [branch] -> v3.9.1-2ba5d8a [main] Grafana: v12.3.3-09ac36b [branch] -> v12.3.3-d05d2fb [main] These images were built during PR development and referenced branch commits that won't survive branch cleanup. The [main] tags are identical rebuilds from the squash-merge commit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 09:58:09 -08:00
Erich Blume	d05d2fbaff	C2: Upgrade Grafana to 12.x with Nix container and Kustomize (#260 ) All checks were successful Build Container (Nix) / detect (push) Successful in 2s Details Build Container / detect (push) Successful in 1s Details Build Container (Nix) / build (grafana) (push) Successful in 2s Details Build Container / build (grafana) (push) Successful in 7s Details ## Summary Mikado chain to upgrade Grafana from 11.4.0 (Helm chart) to 12.x with: - Home-built Nix container image (`forge.ops.eblu.me/eblume/grafana`) - Kustomize manifests replacing the Helm chart - Single-source ArgoCD app ## Chain Goal: `upgrade-grafana` Leaves: `build-grafana-container`, `kustomize-grafana-deployment` Track with: `mise run docs-mikado upgrade-grafana` ## Test plan - [ ] Container builds successfully via Nix - [ ] Container pushed to registry - [ ] Kustomize manifests produce equivalent resources to current Helm - [ ] Pod runs, UI loads, OIDC works, datasources healthy - [ ] `mise run services-check` passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/260	2026-02-23 18:07:18 -08:00
Erich Blume	75fde54355	Fix Grafana TeslaMate dashboard folder provisioning (#253 ) ## Summary - `foldersFromFilesStructure` was `false` in Grafana's sidecar provider config, causing Grafana to ignore the subdirectory structure the sidecar creates from `grafana_folder` annotations - All 18 TeslaMate dashboards were appearing in the root "Dashboards" folder despite having `grafana_folder: "TeslaMate"` annotations on their ConfigMaps - Flipping to `true` makes Grafana replicate the sidecar's directory structure as UI folders ## Deployment and Testing - [ ] Sync `grafana` app: `argocd app sync grafana` - [ ] Verify TeslaMate dashboards appear under a "TeslaMate" folder in Grafana's dashboard list - [ ] Verify other dashboards remain in the root "Dashboards" folder Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/253	2026-02-22 18:38:51 -08:00
Erich Blume	71cb256527	Deploy Authentik identity provider (C2 Mikado) (#227 ) ## Summary C2 Mikado chain for deploying Authentik as the SSO identity provider, replacing Dex. This PR will evolve over multiple sessions. Each iteration adds documentation (prerequisite cards) and eventually code as leaf nodes are resolved. ## Current Mikado State - Goal: `deploy-authentik` (active) - Leaf prerequisites: - `build-authentik-container` — Build Nix container image - `provision-authentik-database` — Create PostgreSQL database on CNPG cluster - `create-authentik-secrets` — Create 1Password item with credentials ## Process refinements - Updated agent-change-process with lessons from first attempt: reset code before committing cards, open PRs early ## Test plan - [ ] `mise run docs-mikado` shows correct dependency chain - [ ] Leaf nodes can be worked independently - [ ] Container builds on ringtail - [ ] Authentik starts and reaches healthy state - [ ] Forgejo OAuth2 connector works Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/227	2026-02-20 12:55:59 -08:00
Erich Blume	0cdc143227	Deploy Dex OIDC identity provider with Grafana SSO (#222 ) ## Summary - Deploys Dex OIDC identity provider on ringtail k3s cluster as central authentication service - Integrates Grafana as first SSO client via `auth.generic_oauth` - Uses Kubernetes CRD storage backend (no PVC needed) - All secrets (bcrypt hash, client secrets) injected via ExternalSecrets from 1Password item "Dex (blumeops)" - NixOS-built container image via `containers/dex/default.nix` ## Pre-requisites (manual, before deployment) 1. Create 1Password item "Dex (blumeops)" in `blumeops` vault with fields: - `password`: strong generated password for Dex login - `static-password-hash`: bcrypt hash of above (`htpasswd -BnC 10 eblume`, copy hash after `eblume:`) - `grafana-client-secret`: random 32-char hex (`openssl rand -hex 16`) 2. Build container: `mise run container-tag-and-release dex v1.0.0` ## Deployment sequence 1. Build container: `mise run container-tag-and-release dex v1.0.0` 2. Deploy Caddy: `mise run provision-indri -- --tags caddy` 3. Sync ArgoCD: `argocd app sync apps` → `argocd app sync dex` 4. Verify Dex: `curl https://dex.ops.eblu.me/.well-known/openid-configuration` 5. Sync Grafana: `argocd app sync grafana-config` → `argocd app sync grafana` 6. Test SSO: Visit `https://grafana.ops.eblu.me/login`, click "Sign in with Dex" ## Verification - [ ] Container image exists: `mise run container-list` shows `dex:v1.0.0-nix` - [ ] `curl https://dex.ops.eblu.me/.well-known/openid-configuration` returns valid OIDC discovery - [ ] `curl https://dex.ops.eblu.me/healthz` returns healthy - [ ] Grafana login shows "Sign in with Dex" button alongside local login - [ ] OIDC flow: click Dex → enter credentials → redirect back → logged in as Admin - [ ] Break-glass: local admin login still works - [ ] `mise run services-check` passes ## Files changed \| File \| Action \| Purpose \| \|------\|--------\|---------\| \| `containers/dex/default.nix` \| Create \| NixOS container build \| \| `argocd/apps/dex.yaml` \| Create \| ArgoCD app targeting ringtail \| \| `argocd/manifests/dex/*` (8 files) \| Create \| K8s manifests (RBAC, ExternalSecret, Deployment, Service, Ingress) \| \| `argocd/manifests/grafana-config/external-secret-dex-oauth.yaml` \| Create \| Grafana OIDC client secret \| \| `argocd/manifests/grafana-config/kustomization.yaml` \| Modify \| Add new ExternalSecret resource \| \| `argocd/manifests/grafana/values.yaml` \| Modify \| Add `auth.generic_oauth` config + envFromSecrets \| \| `ansible/roles/caddy/defaults/main.yml` \| Modify \| Add `dex.ops.eblu.me` reverse proxy entry \| \| `docs/changelog.d/feature-dex-oidc.feature.md` \| Create \| Changelog fragment \| Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/222	2026-02-19 20:24:24 -08:00
Erich Blume	e830a0f26b	Homepage dashboard improvements (#76 ) ## Summary - Fix ArgoCD icon (use `argo-cd.png` per Dashboard Icons naming) - Add Borgmatic backup metrics widget (time since last backup, archive size) - Add Sifaka NAS disk usage widget (used/total space) - Create `[[grafana]]` zk card with management notes ## What didn't work Attempted Grafana iframe embedding for a metrics panel but reverted: - Homepage iframe widget only supports height classes, not width - Some panels fail to load even with anonymous auth enabled - Documented in grafana zk card for future reference ## Deployment and Testing - [x] ArgoCD icon displays correctly - [x] Borgmatic metrics show time since backup and archive size - [x] NAS disk usage shows used/total bytes - [x] Grafana reverted to authenticated-only access 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/76	2026-01-30 15:05:02 -08:00
Erich Blume	272ddb213b	Add TeslaMate deployment for Tesla Model Y data logging (#47 ) ## Summary - Add TeslaMate k8s deployment with Tailscale ingress at tesla.tail8d86e.ts.net - Add teslamate user to CloudNativePG blumeops-pg cluster - Add TeslaMate PostgreSQL datasource to Grafana - Import 18 TeslaMate Grafana dashboards for charging, drives, efficiency, etc. - Add teslamate database to borgmatic backup configuration ## Deployment and Testing - [ ] Create 1Password items: "TeslaMate DB Password" and "TeslaMate Encryption Key" - [ ] Apply database user secret: `op inject -i argocd/manifests/databases/secret-teslamate.yaml.tpl \| kubectl apply -f -` - [ ] Sync blumeops-pg: `argocd app sync blumeops-pg` - [ ] Create teslamate database - [ ] Apply teslamate secrets (encryption key, db connection) - [ ] Apply Grafana datasource secret: `op inject -i argocd/manifests/grafana-config/secret-teslamate-datasource.yaml.tpl \| kubectl apply -f -` - [ ] Sync apps and teslamate: `argocd app sync apps teslamate grafana grafana-config` - [ ] Complete Tesla API OAuth flow at https://tesla.tail8d86e.ts.net - [ ] Verify data collection starts - [ ] Verify Grafana dashboards show data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/47	2026-01-22 21:25:44 -08:00
Erich Blume	17023085cb	Migrate observability stack to Kubernetes (#42 ) Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack. Summary - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal) - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net) - Add ACL rule for port 9187 (CNPG metrics) - Delete obsolete ansible roles for prometheus and loki Changes - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints - pulumi/policy.hujson - ACL for port 9187 - Deleted ansible/roles/prometheus/ and ansible/roles/loki/ Deployment and Testing - Stop prometheus and loki on indri - Sync ArgoCD apps (apps, prometheus, loki, grafana) - Run mise run provision-indri -- --tags alloy - Verify Grafana dashboards show data 🤖 Generated with https://claude.ai/claude-code Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42	2026-01-22 12:06:02 -08:00
Erich Blume	7e6742ad24	K8s Migration Phase 2: Grafana to Kubernetes (#30 ) ## Summary - Migrate Grafana from Homebrew/Ansible to Kubernetes deployment - Switch CloudNativePG to use forge-mirrored Helm chart (HTTPS, no auth needed) - Add Grafana Helm chart deployment via ArgoCD with multi-source pattern - Add Grafana config (Tailscale Ingress, 9 dashboard ConfigMaps) - Update Loki to bind 0.0.0.0 for k8s pod access via `host.containers.internal` ## Key Changes - `argocd/apps/grafana.yaml` - Grafana Helm chart Application - `argocd/apps/grafana-config.yaml` - Ingress + dashboard ConfigMaps - `argocd/apps/cloudnative-pg.yaml` - Now uses forge mirror instead of external Helm repo - `ansible/roles/loki/templates/loki-config.yaml.j2` - Bind 0.0.0.0 ## Deployment and Testing - [x] Deploy Loki config change: `mise run provision-indri -- --tags loki` - [x] Create namespace: `ki create namespace monitoring` - [x] Create secret: `op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl \| ki apply -f -` - [x] Sync ArgoCD apps (grafana, grafana-config) - [x] Verify Grafana works at https://grafana.tail8d86e.ts.net - [x] Remove svc:grafana from ansible tailscale_serve - [x] Stop brew grafana: `ssh indri 'brew services stop grafana'` - [x] Delete ansible grafana role 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/30	2026-01-19 14:40:25 -08:00

35 commits