blumeops

Author	SHA1	Message	Date
Erich Blume	c22f9db1c8	C2(deploy-infra-alerting): finalize rewrite cards as historical docs Remove all Mikado frontmatter (status, branch, requires) from chain cards. Add changelog fragment. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 14:29:30 -07:00
Erich Blume	d53e6244a0	C2(deploy-infra-alerting): close refactor-services-check-to-query-alerts services-check now queries Grafana alerting API for covered checks, displaying firing alerts with summary and runbook links. Uncovered checks remain as direct probes. Grafana credentials from 1Password with graceful fallback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 14:24:47 -07:00
Erich Blume	2e2a33d7ca	C2(deploy-infra-alerting): close port-services-check-alerts 7 alert rules covering services-check probes: - ServiceProbeFailure (11 HTTP probes via Alloy blackbox) - PodNotReady (kube-state-metrics, both clusters) - PostgresClusterUnhealthy (CNPG collector) - TextfileStale (node_textfile_mtime_seconds) - FrigateCameraDown (frigate_camera_fps) - ArgoCDAppOutOfSync (argocd_app_info) 7 runbooks in docs/how-to/alerts/. Remaining uncovered: local indri services (brew/launchctl), ringtail SSH/tailscale, public Fly.io endpoints, k8s API health, frigate storage. These are effectively covered by downstream alerts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 14:23:42 -07:00
Erich Blume	52eed44542	C2(deploy-infra-alerting): impl refactor services-check to query alerts Replace covered checks with Grafana alerting API queries: - ServiceProbeFailure: 11 HTTP endpoints - TextfileStale: metrics textfile freshness - FrigateCameraDown: camera FPS - PodNotReady: pod readiness (both clusters) - PostgresClusterUnhealthy: database health - ArgoCDAppOutOfSync: ArgoCD sync status Uncovered checks remain as direct probes (SSH, launchctl, public endpoints, k8s API, frigate storage, some HTTP endpoints). Firing alerts display summary and clickable runbook link. Grafana credentials fetched from 1Password; graceful fallback if unavailable. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 14:21:42 -07:00
Erich Blume	cdd85c7ac9	C2(deploy-infra-alerting): impl fix TextfileStale to always return data Query all textfile mtimes (time() - node_textfile_mtime_seconds) and threshold at > 3600s, instead of filtering with > 3600 which returns empty results when everything is fresh. This means: - Fresh textfiles: query returns low values, threshold not met → OK - Stale textfiles: query returns high values, threshold met → Alerting - Missing textfiles: series vanishes, noDataState=Alerting → Alerting Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 14:13:20 -07:00
Erich Blume	da452e2bf5	C2(deploy-infra-alerting): impl fix TextfileStale noDataState to OK Same pattern as PodNotReady: when no textfiles are stale, the query returns no data. noDataState=Alerting incorrectly treats this as a problem. Changed to OK. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 14:10:05 -07:00
Erich Blume	d9ab004479	C2(deploy-infra-alerting): impl fix Grafana probe port (80 not 3000) Grafana's k8s Service maps port 80 → container port 3000. The blackbox probe was targeting port 3000 directly on the Service ClusterIP, which doesn't work — connection refused. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 14:05:25 -07:00
Erich Blume	957ee90fa2	C2(deploy-infra-alerting): impl add ArgoCD scrape and sync alert - Add ArgoCD metrics scrape target to Prometheus (argocd-metrics:8082) - Add ArgoCDAppOutOfSync alert: fires when argocd_app_info has sync_status != Synced for 30 minutes - Add runbook with diagnostic steps and common fixes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 13:45:34 -07:00
Erich Blume	2fa536e547	C2(deploy-infra-alerting): impl add textfile staleness and Frigate alerts - TextfileStale: fires when a .prom textfile on indri hasn't been updated in 1 hour (node_textfile_mtime_seconds). Covers borgmatic, zot, minikube, jellyfin exporters. - FrigateCameraDown: fires when frigate_camera_fps drops to 0 for 5m. - Add runbooks for both alerts. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 13:43:16 -07:00
Erich Blume	b2b0d6efa7	C2(deploy-infra-alerting): impl fix PodNotReady noDataState to OK No unhealthy pods = no query results = noData state. With noDataState set to NoData, Grafana fires an alert with empty labels ("Pod in is not ready"). Change to OK since no results means everything is healthy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 13:35:24 -07:00
Erich Blume	02e07aeb41	C2(deploy-infra-alerting): impl exclude Job-owned pods from PodNotReady CronJob pods (e.g., zim-watcher) are expected to complete and become not-ready. Exclude them with `unless on (namespace, pod) kube_pod_owner{owner_kind="Job"}`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 12:19:33 -07:00
Erich Blume	8e6a803076	C2(deploy-infra-alerting): impl add probes and alert rules for services-check coverage Extend Alloy blackbox probes: - Add prometheus, loki, grafana, teslamate, immich, navidrome - Now probing 11 services (was 5), covering most HTTP checks from services-check Add alert rules: - PostgresClusterUnhealthy: cnpg_collector_up < 1 for 3m (critical) - PodNotReady: kube_pod_status_ready{condition="true"} == 0 for 5m Add runbooks: - runbook-postgres-unhealthy.md - runbook-pod-not-ready.md Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 12:11:12 -07:00
Erich Blume	e33b0bc184	C2(deploy-infra-alerting): close first-alert-and-runbook End-to-end alerting pipeline verified: - ServiceProbeFailure alert rule evaluates 5 blackbox probes - Grafana custom payload produces ntfy-native JSON (topic, title, message, priority, actions) - Firing notification arrives on iOS with clean formatting - "Open Runbook" action button links to docs.eblu.me runbook - Resolved notification delivered on recovery Key discovery: Grafana webhook custom payload provisioning field is settings.payload.template (nested object), not payloadTemplate (flat string). Found by reading grafana/alerting Go source. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 12:05:43 -07:00
Erich Blume	4c0bd0055f	C2(deploy-infra-alerting): impl use custom payload for ntfy-native JSON Use the correct provisioning field name for Grafana webhook custom payloads: settings.payload.template (not payloadTemplate). Found by reading the Go source (grafana/alerting receivers/webhook/v1/config.go): Payload *CustomPayload `json:"payload,omitempty"` CustomPayload.Template string `json:"template,omitempty"` The template uses coll.Dict, coll.Append, and data.ToJSON to produce ntfy-native JSON with topic, title, message, priority, and action buttons linking to runbooks. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 11:59:26 -07:00
Erich Blume	94413f73ba	C2(deploy-infra-alerting): impl fix alert rule multi-series evaluation Add reduce step between Prometheus query and threshold to preserve per-service labels. Without it, Grafana can't distinguish the 5 probe_success series and errors with "duplicate results with labels {}". Chain: A (prometheus query) → B (reduce last) → C (threshold < 1) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 11:00:21 -07:00
Erich Blume	549c57ab82	C2(deploy-infra-alerting): impl add first alert rule and runbook - Add ServiceProbeFailure alert rule to Grafana alerting provisioning - Queries probe_success metric from Alloy blackbox exporter - Extracts service name from job label via label_replace - Fires after 2 minutes of failure, noDataState=Alerting - Annotations include summary with service name and runbook URL - Add runbook at docs/how-to/alerts/runbook-service-probe-failure.md - Covers all 5 probed services (miniflux, kiwix, transmission, devpi, argocd) - Diagnostic steps, common causes, silencing instructions - Add alerting section to observability.md reference doc Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 10:57:23 -07:00
Erich Blume	c1acc808d5	C2(deploy-infra-alerting): close configure-grafana-alerting-pipeline Pipeline verified: - Grafana unified alerting enabled with provisioned contact point and policy - ntfy webhook contact point delivering to infra-alerts topic - Notification policy: group_wait 1m, group_interval 12h, repeat_interval 24h - iOS push notifications confirmed working Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 10:51:02 -07:00
Erich Blume	261f20601a	C2(deploy-infra-alerting): impl configure grafana alerting pipeline - Enable unified alerting in grafana.ini - Create alerting.yaml provisioning file with: - ntfy-infra webhook contact point (POST to ntfy.ops.eblu.me/infra-alerts) - Notification policy: group_wait 1m, group_interval 12h, repeat_interval 24h - Message templates for title and runbook links - Mount alerting provisioning into Grafana deployment - Add alerting.yaml to kustomization configMapGenerator Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 10:35:36 -07:00
Erich Blume	1d5990a2f7	C2(deploy-infra-alerting): plan add alerting pipeline cards Mikado chain for deploying Grafana Unified Alerting with ntfy notifications, replacing manual services-check probes. Chain: configure-grafana-alerting-pipeline → first-alert-and-runbook → port-services-check-alerts → refactor-services-check-to-query-alerts → deploy-infra-alerting (goal) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 10:28:31 -07:00
Erich Blume	f1620abb17	Improve Frigate health checks to catch NFS and camera failures Replace single aggregate camera_fps check with per-camera FPS validation and NFS storage accessibility check. Motivated by an outage where Frigate API responded OK but NFS mount was inaccessible, causing "no frames" in UI. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 09:55:53 -07:00
Erich Blume	dcab489b60	agent memory ignore	2026-03-21 19:03:21 -07:00
Erich Blume	810340a328	Update service-versions.yaml for loki	2026-03-20 16:10:19 -07:00
Erich Blume	531a49abeb	C0 update deployment for loki to 3.6.7	2026-03-20 16:06:29 -07:00
Erich Blume	f9426b734c	Update loki to 3.6.7 (#302) Reviewed-on: #302	2026-03-20 16:02:28 -07:00
Erich Blume	0f0ee2a319	Update docs and kiwix kustomization tags to `613f05d` builds Also catches kiwix's transmission sidecar up from v4.0.6-r4 to v4.1.1-r1, matching the torrent service (upgraded in PR #282 but the kiwix sidecar was missed). No breaking changes — old RPC protocol is supported through 4.x. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 06:40:49 -07:00
Erich Blume	3d2a97aaf9	Update kustomization tags to OCI-labeled builds (`613f05d`) Point all services at the `613f05d` images which carry the new consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4 version, no matching build) and docs/quartz (no `613f05d` build). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 06:34:12 -07:00
Erich Blume	613f05dfde	Add consistent OCI labels to all container Dockerfiles Every container now carries title, description, version, source, and vendor labels per the OCI image spec. Version is derived from the existing CONTAINER_APP_VERSION ARG at build time. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 20:42:00 -07:00
Erich Blume	c92b949a20	Fix UID sed to target root-level dashboard uid only The top-level "uid" in Grafana dashboard JSON is at 2-space indent near the end of the file, not the first occurrence. Match on ^ "uid" to avoid clobbering nested datasource uid references. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 12:56:50 -07:00
Erich Blume	334fbbb9e3	Fix TeslaMate/UnPoller dashboard UID sed clobbering datasource refs The previous sed replaced ALL "uid" fields in dashboard JSON files, including datasource references inside panels, causing dashboards to go dark. Scope the replacement to only the first occurrence (the top-level dashboard UID) using GNU sed 0,/pattern/ addressing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 12:53:00 -07:00
Erich Blume	6f88baeb91	Fix Grafana starred dashboards lost on pod restart Add init container to pre-populate ConfigMap dashboards before Grafana starts, eliminating the race between the sidecar and the provisioner that caused dashboard DB records to be deleted and re-created with new IDs. Also stamp stable UIDs on TeslaMate and UnPoller dashboards fetched from upstream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 12:40:44 -07:00
Erich Blume	0dffdb9974	Add Claude Code subagents for infrastructure workflows Four project-scoped subagents that formalize existing mise task workflows as constrained, specialized AI agents: - infra-health: background health monitor (wraps services-check) - doc-reviewer: persistent-memory documentation reviewer - change-classifier: C0/C1/C2 triage before work begins - mikado-navigator: C2 chain state advisor (wraps docs-mikado) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 11:57:36 -07:00
Erich Blume	ef8c2118a1	Standardize USAGE pragmas and typer parsing across mise tasks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 11:42:01 -07:00
Erich Blume	86220b7b88	Update Prometheus deployment to v3.10.0-0d27797 C0 fix-forward: update kustomization newTag and mark service reviewed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 08:46:07 -07:00
Erich Blume	0d2779762a	Upgrade Prometheus to v3.10.0 (#301) ## Summary - Bump Prometheus from v3.9.1 to v3.10.0 in custom container Dockerfile - v3.10.0 adds distroless Docker image variants, new PromQL `fill` operators, and performance improvements - Dagger build tested locally — builds cleanly ## Remaining after merge - Update `kustomization.yaml` newTag with the auto-built image tag - Update `service-versions.yaml` (last-reviewed + current-version) - ArgoCD sync 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #301	2026-03-18 07:47:46 -07:00
Erich Blume	528d3da327	Review power.md: add ringtail, mark reviewed Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 07:37:31 -07:00
Erich Blume	e0dbcbd997	Update retention changelog to reflect final PVC decision Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:46:55 -07:00
Erich Blume	21ddc74cdc	Revert PVC size changes, add hostpath comment StatefulSet volumeClaimTemplates are immutable and minikube's hostpath provisioner doesn't enforce PVC size limits anyway. Add comments noting the data grows freely on the 1.8TB backing disk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:46:17 -07:00
Erich Blume	ef199b70f0	Increase Prometheus and Loki data retention Prometheus: 15d → 10y (3650d), PVC 20Gi → 200Gi Loki: 31d (744h) → 365d (8760h), PVC 20Gi → 50Gi Indri has 1.6 TB free on the minikube backing disk — the previous 15-day Prometheus retention was losing valuable long-term metrics data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:44:00 -07:00
Erich Blume	50d3b3b21e	Rename Borgmatic to Borg Backups on Homepage Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:34:13 -07:00
Erich Blume	e8bdecdb11	Rename Borgmatic dashboard to Borg Backups, add duration graph Rename dashboard title since borgmatic is just the execution layer. Add Backup Duration Over Time panel next to New Data Per Backup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:33:27 -07:00
Erich Blume	8425f56dc3	Add Fly.io dashboard to Homepage admin bookmarks Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:29:44 -07:00
Erich Blume	64afd40a29	Fix Grafana widget fields (lowercase) and hide Miniflux read count Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:28:41 -07:00
Erich Blume	98584d0d67	Trim Homepage widget metrics for cleaner layout - Forgejo: show only notifications and pull requests - Jellyfin: show only movies/series/episodes, hide now playing - Grafana: hide data sources, show dashboards and alerts only Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:26:15 -07:00
Erich Blume	443e090ec6	Enable equal height tiles in Homepage groups Add useEqualHeights: true so service tiles within each row expand to match the tallest tile, fixing uneven layout from widget metrics. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:23:53 -07:00
Erich Blume	b0ce9be30b	Fix Homepage layout: use row style with columns for full-width groups style: row makes each group span the full page width (one per row), while columns: 4 tiles services horizontally within each group. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:21:44 -07:00
Erich Blume	816fd552f0	Set Homepage to single-column group layout Add maxGroupColumns: 1 so each category gets its own full-width row, with service tiles arranged side-by-side within each group. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:19:40 -07:00
Erich Blume	96d0f668fd	Reorganize Homepage groups: add Home, move Grafana to Infrastructure Move NVR, Jellyfin, and DJ to new Home group. Move Grafana from Content to Infrastructure. Switch all layout groups from column to row style. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:15:52 -07:00
Erich Blume	3e9873d669	Fix borgmatic backup: use correct kubectl context on indri The Mealie SQLite dump hook used `minikube-indri` (the context name on gilbert), but on indri itself the context is just `minikube`. This caused the before_backup hook to fail, aborting all backups since the hook was added. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:07:44 -07:00
Erich Blume	cfe3391f1a	Bump Frigate retention and add recording health check Increase retention: continuous 3→180d, detections 14→30d, alerts 30→730d. Plenty of NFS headroom (~9.4 TiB free, ~6.6 GB/day for one camera). Add frigate-recording check to services-check that verifies camera_fps > 0, which would have caught the 6-day outage from the mqtt config removal. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 18:24:11 -07:00
Erich Blume	6617e44e5b	Fix Frigate crash: re-add required mqtt config section Frigate's config schema requires an `mqtt` field even when MQTT isn't used. Commit `40f1568` removed it along with Mosquitto, causing Frigate to fail validation on startup. Add `mqtt.enabled: false` to satisfy the schema without needing a broker. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 18:10:23 -07:00
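
Commits 334fbbb9e3 and c92b949a20 both narrow a sed replacement so that only the root-level dashboard "uid" is rewritten and nested datasource "uid" references survive. The same goal can be sketched with a JSON parser instead of sed. This is an illustration only, not the approach the repository takes; the function name `stamp_dashboard_uid`, the sample dashboard, and the UID value are all hypothetical.

```python
import json


def stamp_dashboard_uid(dashboard_json: str, new_uid: str) -> str:
    """Set only the root-level "uid"; nested datasource uids are left alone."""
    dashboard = json.loads(dashboard_json)
    # Assigning on the top-level object cannot touch keys nested in panels.
    dashboard["uid"] = new_uid
    return json.dumps(dashboard, indent=2)


# Hypothetical dashboard: the root "uid" sits near the end of the file,
# after a panel that carries its own datasource "uid".
sample = json.dumps(
    {
        "panels": [
            {"datasource": {"type": "prometheus", "uid": "prometheus-ds"}}
        ],
        "title": "Example",
        "uid": "upstream-random-uid",
    },
    indent=2,
)

stamped = json.loads(stamp_dashboard_uid(sample, "example-stable-uid"))
```

Parsing avoids depending on indentation or on which "uid" appears first in the file, at the cost of re-serializing the whole document, which may reorder whitespace relative to the upstream file.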

718 commits