The 5-minute lookback window kept stale data from terminated pods
visible during rollouts, causing the alert to sit in Pending for
~5 minutes after every routine deployment. 60s still covers two
scrape cycles (30s interval) while clearing stale data much faster.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduced `for` from 30m to 5m and lookback window from 5m to 1m. The old
values caused alerts to linger long after apps returned to Synced state.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.
**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes
**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #303
Point all services at the 613f05d images which carry the new
consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4
version, no matching build) and docs/quartz (no 613f05d build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The top-level "uid" in Grafana dashboard JSON is at 2-space indent
near the end of the file, not the first occurrence. Match on ^ "uid"
to avoid clobbering nested datasource uid references.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous sed replaced ALL "uid" fields in dashboard JSON files,
including datasource references inside panels, causing dashboards to
go dark. Scope the replacement to only the first occurrence (the
top-level dashboard UID) using GNU sed 0,/pattern/ addressing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add init container to pre-populate ConfigMap dashboards before Grafana
starts, eliminating the race between the sidecar and the provisioner
that caused dashboard DB records to be deleted and re-created with new
IDs. Also stamp stable UIDs on TeslaMate and UnPoller dashboards
fetched from upstream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Deploy UnPoller as a k8s service on indri to export UniFi controller metrics to Prometheus
- Custom-built container from forge mirror (`containers/unpoller/Dockerfile`)
- Credentials pulled from 1Password via external-secrets
- Prometheus scrape job added, docs and service-versions updated
## Test plan
- [ ] Build container: `mise run container-release unpoller v2.34.0`
- [ ] Update kustomization tag with built image tag
- [ ] Deploy from branch: `argocd app set unpoller --revision feature/unpoller && argocd app sync unpoller`
- [ ] Verify pod connects to UX7 controller (check logs)
- [ ] Confirm `unpoller` target appears in Prometheus
- [ ] Query `unifi_` metrics in Grafana
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #298
## Summary
- Replaces 18 TeslaMate dashboard ConfigMaps (713 KB / 22,080 lines) with a Grafana init container
- Init container fetches dashboard JSON directly from `mirrors/teslamate` on forge, pinned to `v3.0.0`
- Grafana's file provider picks them up from `/tmp/dashboards/TeslaMate/` via `foldersFromFilesStructure`
- Non-TeslaMate dashboards remain as ConfigMaps (unchanged)
## How it works
- New `init-teslamate-dashboards` init container uses busybox `wget` to fetch each JSON file from `https://forge.eblu.me/mirrors/teslamate/raw/tag/v3.0.0/grafana/dashboards/`
- Files land in `/tmp/dashboards/TeslaMate/`, same emptyDir volume the sidecar uses
- The sidecar continues to handle ConfigMap-based dashboards; the init container handles TeslaMate
- Version pin is in the init container args (TESLAMATE_VERSION)
## Deployment and Testing
- [ ] Sync `grafana` app from branch — verify init container runs and fetches dashboards
- [ ] Sync `grafana-config` app from branch — verify TeslaMate ConfigMaps are pruned
- [ ] Check Grafana UI: TeslaMate folder should still contain all 18 dashboards
- [ ] Verify non-TeslaMate dashboards are unaffected
- [ ] After merge: sync both apps from main
Reviewed-on: #296
Bare image references in manifests were ambiguous — unclear whether the
tag was intentionally omitted or managed by kustomize. Add :kustomized
sentinel to all 37 image refs overridden by kustomize images transformer.
Add sync notes for tailscale-operator proxyclass (CRD fields not processed
by kustomize). Mark devpi reviewed (6.19.1 is current).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The hand-written configmap.yaml had app.kubernetes.io/name and
app.kubernetes.io/instance labels; configMapGenerator dropped them.
Add options.labels to both generator entries to restore parity.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Move hardcoded image tags to kustomization.yaml `images:` transformer across **22 services** — image names in manifests become version-agnostic templates, with tags centralized in one place per service
- Replace hand-written ConfigMap manifests with `configMapGenerator:` in **12 services** — config data extracted to standalone files, generated ConfigMaps include content hashes that trigger automatic pod rollouts on changes
- Create new `kustomization.yaml` for **forgejo-runner** and **nvidia-device-plugin** (switches ArgoCD from directory mode to kustomize mode, rendered output identical)
### Services modified
**Images only (8):** cv, devpi, docs, kube-state-metrics, miniflux, navidrome, teslamate, torrent
**Images + configMapGenerator (10):** alloy-k8s, forgejo-runner, frigate, grafana, homepage, kiwix, loki, mosquitto, ntfy, prometheus
**Images only, no configMapGenerator (4):** authentik (skip blueprints — special YAML tags), tailscale-operator-base (Deployment only, CRD image fields left as-is)
**Skipped entirely (6):** argocd (remote upstream), databases (no image fields), external-secrets, grafana-config (cross-kustomization dashboards), immich (Helm-managed), 1password-connect/cloudnative-pg (no kustomization.yaml)
### What changes at deploy time
- **images:** — no functional diff, `kustomize build` produces identical output with tags
- **configMapGenerator:** — ConfigMap names gain hash suffixes (e.g., `prometheus-config` → `prometheus-config-6f42fhctcb`) and all Deployment/StatefulSet/DaemonSet references are updated automatically. Pods will restart once per service on first sync due to the name change
## Test plan
- [x] `kubectl kustomize` builds all 30 service directories successfully
- [x] Image tags verified in rendered output for all modified services
- [x] ConfigMap hash suffixes verified in rendered output
- [x] ConfigMap references in Deployments/StatefulSets confirmed to use hashed names
- [x] All pre-commit hooks pass (yamllint, shellcheck, prettier, etc.)
- [ ] `argocd app diff` each service to confirm only expected ConfigMap name changes
- [ ] Deploy from branch starting with a low-risk service (e.g., mosquitto)
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/264
Grafana 12.x's grafana-postgresql-datasource plugin requires the
database name in jsonData, not just the top-level database field.
Without it, the frontend blocks all queries with "no default database
configured", causing all TeslaMate panels to show "No Data."
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The INI parser was stripping outer single quotes from
role_attribute_path = 'Admin', causing Grafana to evaluate 'Admin'
as a JMESPath field identifier instead of a string literal. This
resulted in all OAuth users getting the default Viewer role.
Replaced with a proper group-based expression that checks for the
'admins' Authentik group and maps to Admin/Viewer accordingly.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prometheus: v3.9.1-74029e1 [branch] -> v3.9.1-2ba5d8a [main]
Grafana: v12.3.3-09ac36b [branch] -> v12.3.3-d05d2fb [main]
These images were built during PR development and referenced branch
commits that won't survive branch cleanup. The [main] tags are
identical rebuilds from the squash-merge commit.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- `foldersFromFilesStructure` was `false` in Grafana's sidecar provider config, causing Grafana to ignore the subdirectory structure the sidecar creates from `grafana_folder` annotations
- All 18 TeslaMate dashboards were appearing in the root "Dashboards" folder despite having `grafana_folder: "TeslaMate"` annotations on their ConfigMaps
- Flipping to `true` makes Grafana replicate the sidecar's directory structure as UI folders
## Deployment and Testing
- [ ] Sync `grafana` app: `argocd app sync grafana`
- [ ] Verify TeslaMate dashboards appear under a "TeslaMate" folder in Grafana's dashboard list
- [ ] Verify other dashboards remain in the root "Dashboards" folder
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/253
## Summary
C2 Mikado chain for deploying Authentik as the SSO identity provider, replacing Dex.
This PR will evolve over multiple sessions. Each iteration adds documentation (prerequisite cards) and eventually code as leaf nodes are resolved.
## Current Mikado State
- **Goal:** `deploy-authentik` (active)
- **Leaf prerequisites:**
- `build-authentik-container` — Build Nix container image
- `provision-authentik-database` — Create PostgreSQL database on CNPG cluster
- `create-authentik-secrets` — Create 1Password item with credentials
## Process refinements
- Updated agent-change-process with lessons from first attempt: reset code before committing cards, open PRs early
## Test plan
- [ ] `mise run docs-mikado` shows correct dependency chain
- [ ] Leaf nodes can be worked independently
- [ ] Container builds on ringtail
- [ ] Authentik starts and reaches healthy state
- [ ] Forgejo OAuth2 connector works
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/227
## Summary
- Fix ArgoCD icon (use `argo-cd.png` per Dashboard Icons naming)
- Add Borgmatic backup metrics widget (time since last backup, archive size)
- Add Sifaka NAS disk usage widget (used/total space)
- Create `[[grafana]]` zk card with management notes
## What didn't work
Attempted Grafana iframe embedding for a metrics panel but reverted:
- Homepage iframe widget only supports height classes, not width
- Some panels fail to load even with anonymous auth enabled
- Documented in grafana zk card for future reference
## Deployment and Testing
- [x] ArgoCD icon displays correctly
- [x] Borgmatic metrics show time since backup and archive size
- [x] NAS disk usage shows used/total bytes
- [x] Grafana reverted to authenticated-only access
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/76
Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack.
Summary
- Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal)
- Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses
- Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics
- Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net)
- Add ACL rule for port 9187 (CNPG metrics)
- Delete obsolete ansible roles for prometheus and loki
Changes
- argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress
- argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress
- argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications
- argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS
- argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint
- argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics
- ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints
- pulumi/policy.hujson - ACL for port 9187
- Deleted ansible/roles/prometheus/ and ansible/roles/loki/
Deployment and Testing
- Stop prometheus and loki on indri
- Sync ArgoCD apps (apps, prometheus, loki, grafana)
- Run mise run provision-indri -- --tags alloy
- Verify Grafana dashboards show data
🤖 Generated with https://claude.ai/claude-code
Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42