Commit graph

701 commits

f7221f3990 Update ringtail flake inputs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 17:57:54 -07:00
6d65e6928c C2: Deploy infrastructure alerting pipeline (#303)
## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes
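
The anti-noise policy above can be pictured as a per-group throttle. This is a minimal Python sketch of the idea only, not Grafana's implementation: `PageThrottle` and `should_page` are hypothetical names, and in Grafana the same effect would come from the notification policy rather than hand-written code.

```python
from datetime import datetime, timedelta
from typing import Dict


class PageThrottle:
    """Allow at most one page per alert group within a fixed interval."""

    def __init__(self, interval: timedelta = timedelta(hours=24)) -> None:
        self.interval = interval
        self._last_paged: Dict[str, datetime] = {}

    def should_page(self, group: str, now: datetime) -> bool:
        last = self._last_paged.get(group)
        # Suppress the page if this group already paged inside the interval.
        if last is not None and now - last < self.interval:
            return False
        self._last_paged[group] = now
        return True
```

Each alert group is tracked independently, so a quiet group can still page while a noisy one is suppressed.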

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303
2026-03-22 14:52:56 -07:00
f1620abb17 Improve Frigate health checks to catch NFS and camera failures
Replace single aggregate camera_fps check with per-camera FPS validation
and NFS storage accessibility check. Motivated by an outage where Frigate
API responded OK but NFS mount was inaccessible, causing "no frames" in UI.
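
A minimal sketch of the per-camera validation described above. It assumes the stats payload is shaped like `{"cameras": {name: {"camera_fps": number}}}`; that shape, the function name, and the threshold parameter are assumptions for illustration, not the actual services-check code.

```python
from typing import Any, Dict, List


def cameras_without_frames(stats: Dict[str, Any], min_fps: float = 0.0) -> List[str]:
    """Return the names of cameras whose reported FPS is not above min_fps."""
    failing = []
    for name, camera in stats.get("cameras", {}).items():
        fps = camera.get("camera_fps")
        # A missing or non-numeric camera_fps is treated as a failure.
        if not isinstance(fps, (int, float)) or fps <= min_fps:
            failing.append(name)
    return sorted(failing)
```

An aggregate check (for example, a sum across cameras) can pass while one camera is dead; checking each camera separately reports exactly which ones have stopped producing frames.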

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 09:55:53 -07:00
dcab489b60 agent memory ignore 2026-03-21 19:03:21 -07:00
810340a328 Update service-versions.yaml for loki 2026-03-20 16:10:19 -07:00
531a49abeb C0 update deployment for loki to 3.6.7 2026-03-20 16:06:29 -07:00
f9426b734c Update loki to 3.6.7 (#302)
Reviewed-on: #302
2026-03-20 16:02:28 -07:00
0f0ee2a319 Update docs and kiwix kustomization tags to 613f05d builds
Also catches kiwix's transmission sidecar up from v4.0.6-r4 to
v4.1.1-r1, matching the torrent service (upgraded in PR #282 but
the kiwix sidecar was missed). No breaking changes — old RPC
protocol is supported through 4.x.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 06:40:49 -07:00
3d2a97aaf9 Update kustomization tags to OCI-labeled builds (613f05d)
Point all services at the 613f05d images which carry the new
consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4
version, no matching build) and docs/quartz (no 613f05d build).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 06:34:12 -07:00
613f05dfde Add consistent OCI labels to all container Dockerfiles
Every container now carries title, description, version, source, and
vendor labels per the OCI image spec. Version is derived from the
existing CONTAINER_APP_VERSION ARG at build time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 20:42:00 -07:00
c92b949a20 Fix UID sed to target root-level dashboard uid only
The top-level "uid" in Grafana dashboard JSON is at 2-space indent
near the end of the file, not the first occurrence. Match on ^  "uid"
to avoid clobbering nested datasource uid references.
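
The same anchoring idea, sketched in Python rather than sed. It assumes the dashboard JSON is pretty-printed with 2-space indentation, as the message describes; `stamp_root_uid` is a hypothetical helper, not the script used in the repository.

```python
import re

# Matches a "uid" key at exactly two spaces of indent, i.e. the root-level key
# in JSON pretty-printed with 2-space indentation. Nested keys sit deeper.
ROOT_UID = re.compile(r'^  "uid": "[^"]*"', re.MULTILINE)


def stamp_root_uid(dashboard_json: str, uid: str) -> str:
    """Replace only the root-level dashboard uid, leaving nested uids alone."""
    # A function replacement avoids backslash-escape processing of the uid.
    return ROOT_UID.sub(lambda _match: f'  "uid": "{uid}"', dashboard_json)
```

Anchoring on indentation does not depend on where the root key appears in the file, which is why it holds up when the root `uid` is not the first occurrence.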

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 12:56:50 -07:00
334fbbb9e3 Fix TeslaMate/UnPoller dashboard UID sed clobbering datasource refs
The previous sed replaced ALL "uid" fields in dashboard JSON files,
including datasource references inside panels, causing dashboards to
go dark. Scope the replacement to only the first occurrence (the
top-level dashboard UID) using GNU sed 0,/pattern/ addressing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 12:53:00 -07:00
6f88baeb91 Fix Grafana starred dashboards lost on pod restart
Add init container to pre-populate ConfigMap dashboards before Grafana
starts, eliminating the race between the sidecar and the provisioner
that caused dashboard DB records to be deleted and re-created with new
IDs. Also stamp stable UIDs on TeslaMate and UnPoller dashboards
fetched from upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 12:40:44 -07:00
0dffdb9974 Add Claude Code subagents for infrastructure workflows
Four project-scoped subagents that formalize existing mise task
workflows as constrained, specialized AI agents:
- infra-health: background health monitor (wraps services-check)
- doc-reviewer: persistent-memory documentation reviewer
- change-classifier: C0/C1/C2 triage before work begins
- mikado-navigator: C2 chain state advisor (wraps docs-mikado)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:57:36 -07:00
ef8c2118a1 Standardize USAGE pragmas and typer parsing across mise tasks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:42:01 -07:00
86220b7b88 Update Prometheus deployment to v3.10.0-0d27797
C0 fix-forward: update kustomization newTag and mark service reviewed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 08:46:07 -07:00
0d2779762a Upgrade Prometheus to v3.10.0 (#301)
## Summary
- Bump Prometheus from v3.9.1 to v3.10.0 in custom container Dockerfile
- v3.10.0 adds distroless Docker image variants, new PromQL `fill` operators, and performance improvements
- Dagger build tested locally — builds cleanly

## Remaining after merge
- Update `kustomization.yaml` newTag with the auto-built image tag
- Update `service-versions.yaml` (last-reviewed + current-version)
- ArgoCD sync

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #301
2026-03-18 07:47:46 -07:00
528d3da327 Review power.md: add ringtail, mark reviewed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 07:37:31 -07:00
e0dbcbd997 Update retention changelog to reflect final PVC decision
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:46:55 -07:00
21ddc74cdc Revert PVC size changes, add hostpath comment
StatefulSet volumeClaimTemplates are immutable and minikube's hostpath
provisioner doesn't enforce PVC size limits anyway. Add comments noting
the data grows freely on the 1.8TB backing disk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:46:17 -07:00
ef199b70f0 Increase Prometheus and Loki data retention
Prometheus: 15d → 10y (3650d), PVC 20Gi → 200Gi
Loki: 31d (744h) → 365d (8760h), PVC 20Gi → 50Gi

Indri has 1.6 TB free on the minikube backing disk — the previous
15-day Prometheus retention was losing valuable long-term metrics data.
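
The unit conversions in this message can be checked with a few lines of Python. This is only a worked arithmetic sketch; it treats a year as 365 days and ignores leap days.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # simplification: leap days ignored


def hours_to_days(hours: int) -> float:
    """Convert a retention period in hours to days."""
    return hours / HOURS_PER_DAY


def days_to_years(days: int) -> float:
    """Convert a retention period in days to (365-day) years."""
    return days / DAYS_PER_YEAR


print(hours_to_days(744))    # 31.0  (old Loki retention)
print(hours_to_days(8760))   # 365.0 (new Loki retention)
print(days_to_years(3650))   # 10.0  (new Prometheus retention)
```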

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:44:00 -07:00
50d3b3b21e Rename Borgmatic to Borg Backups on Homepage
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:34:13 -07:00
e8bdecdb11 Rename Borgmatic dashboard to Borg Backups, add duration graph
Rename dashboard title since borgmatic is just the execution layer.
Add Backup Duration Over Time panel next to New Data Per Backup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:33:27 -07:00
8425f56dc3 Add Fly.io dashboard to Homepage admin bookmarks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:29:44 -07:00
64afd40a29 Fix Grafana widget fields (lowercase) and hide Miniflux read count
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:28:41 -07:00
98584d0d67 Trim Homepage widget metrics for cleaner layout
- Forgejo: show only notifications and pull requests
- Jellyfin: show only movies/series/episodes, hide now playing
- Grafana: hide data sources, show dashboards and alerts only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:26:15 -07:00
443e090ec6 Enable equal height tiles in Homepage groups
Add `useEqualHeights: true` so service tiles within each row expand to
match the tallest tile, fixing uneven layout from widget metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:23:53 -07:00
b0ce9be30b Fix Homepage layout: use row style with columns for full-width groups
`style: row` makes each group span the full page width (one per row),
while `columns: 4` tiles services horizontally within each group.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:21:44 -07:00
816fd552f0 Set Homepage to single-column group layout
Add `maxGroupColumns: 1` so each category gets its own full-width row,
with service tiles arranged side-by-side within each group.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:19:40 -07:00
96d0f668fd Reorganize Homepage groups: add Home, move Grafana to Infrastructure
Move NVR, Jellyfin, and DJ to new Home group. Move Grafana from Content
to Infrastructure. Switch all layout groups from column to row style.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:15:52 -07:00
3e9873d669 Fix borgmatic backup: use correct kubectl context on indri
The Mealie SQLite dump hook used `minikube-indri` (the context name on
gilbert), but on indri itself the context is just `minikube`. This caused
the before_backup hook to fail, aborting all backups since the hook was added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:07:44 -07:00
cfe3391f1a Bump Frigate retention and add recording health check
Increase retention: continuous 3→180d, detections 14→30d, alerts 30→730d.
Plenty of NFS headroom (~9.4 TiB free, ~6.6 GB/day for one camera).
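
A rough check of the headroom claim, as a worked arithmetic sketch. It assumes "TiB" means 2^40 bytes, "GB" means 10^9 bytes, and that the ~6.6 GB/day figure stays constant; the helper names are illustrative.

```python
TIB = 2**40  # bytes in one tebibyte
GB = 10**9   # bytes in one gigabyte (decimal; an assumption about the units)


def days_of_headroom(free_bytes: float, bytes_per_day: float) -> float:
    """How many days of recording fit in the free space at a constant rate."""
    return free_bytes / bytes_per_day


def footprint_gb(days: int, gb_per_day: float = 6.6) -> float:
    """Approximate space used by `days` of continuous recording, in GB."""
    return days * gb_per_day


print(round(days_of_headroom(9.4 * TIB, 6.6 * GB)))  # 1566 days, over four years
print(round(footprint_gb(180)))                      # 1188 GB for the 180-day window
```

Under these assumptions the 180-day continuous window takes a little over a tenth of the free space.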

Add frigate-recording check to services-check that verifies camera_fps > 0,
which would have caught the 6-day outage from the mqtt config removal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 18:24:11 -07:00
6617e44e5b Fix Frigate crash: re-add required mqtt config section
Frigate's config schema requires an `mqtt` field even when MQTT isn't
used. Commit 40f1568 removed it along with Mosquitto, causing Frigate
to fail validation on startup. Add `mqtt.enabled: false` to satisfy
the schema without needing a broker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 18:10:23 -07:00
4f99b7edaa Update alloy kustomizations to local container tags
Point alloy-k8s at v1.14.0-61f02a0 (Dockerfile) and both ringtail
deployments at v1.14.0-61f02a0-nix (Nix build).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 16:55:55 -07:00
61f02a0335 Localize Alloy container image (#300)
## Summary

- Add `containers/alloy/` with dual Dockerfile + Nix build files for Grafana Alloy v1.14.0
- Both builds fetch source from forge mirror (`forge.ops.eblu.me/mirrors/alloy.git`), build the web UI (Node), then compile the Go binary with `netgo embedalloyui` tags
- Update all three alloy deployments (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail) to use `registry.ops.eblu.me/blumeops/alloy`
- `promtail_journal_enabled` tag omitted — requires systemd headers and none of our configs use `loki.source.journal`

## Build verification

- **Dockerfile:** Tested locally via `docker build`, binary reports `v1.14.0` with correct tags
- **Nix:** Tested on ringtail via `nix-build`, all three hashes (fetchgit, npmDeps, goModules) resolved and build succeeds

## Post-merge steps

1. Wait for CI to build the container from main (both Dockerfile and Nix workflows)
2. `mise run container-list alloy` to find the `[main]` tagged image
3. C0 follow-up to update `newTag` in all three kustomizations from `v1.14.0-placeholder` to the real tag
4. Sync ArgoCD apps and verify pods come up healthy

Reviewed-on: #300
2026-03-17 16:42:53 -07:00
Forgejo Actions
cdba9dca96 Update docs release to v1.14.2
- Built changelog from towncrier fragments

[skip ci]
2026-03-17 13:24:13 -07:00
1f000c8e39 Add last-updated subsort to docs-review, review gilbert card v1.14.2
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 13:22:01 -07:00
995478b91f Review jellyfin and automounter services
Both services current: jellyfin 10.11.6 (latest upstream),
automounter 1.11.0 (Mac App Store). Add missing frigate share
to automounter docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 13:06:23 -07:00
e5ce510fdc Fix plan-a-meal random recipe API queries
Mealie's orderBy=random requires a paginationSeed parameter, otherwise
the API returns 422. Added the seed to all random query examples.
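
A minimal sketch of building such a query in Python. Only `orderBy=random` and `paginationSeed` come from the message above; the `/api/recipes` path, the `perPage` parameter, and the helper name are assumptions for illustration.

```python
import secrets
from typing import Optional
from urllib.parse import urlencode


def random_recipes_url(base_url: str, per_page: int = 3, seed: Optional[str] = None) -> str:
    """Build a random-order recipe query that always carries a paginationSeed."""
    params = {
        "orderBy": "random",
        # Generate a fresh seed when the caller does not supply one.
        "paginationSeed": seed or secrets.token_hex(8),
        "perPage": per_page,  # assumed parameter name
    }
    return f"{base_url.rstrip('/')}/api/recipes?{urlencode(params)}"
```

Passing an explicit seed keeps the ordering reproducible across pages; omitting it yields a different shuffle per call.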

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:10:48 -07:00
6d7597670e Add plan-a-meal how-to for Mealie cooking timelines
Agent-facing guide for generating unified cooking timelines from
Mealie meal plans. Covers querying the API, picking balanced meals
(protein/carb/vegetable), and interleaving recipe steps into a
relative timeline so everything finishes together.
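
One way to picture the interleaving step is to schedule every recipe backwards from a shared serving time. This is a sketch under stated assumptions, not the guide's actual procedure: recipes are modeled as ordered lists of `(step, minutes)` pairs, steps within a recipe run back to back, and `build_timeline` is a hypothetical name.

```python
from typing import Dict, List, Tuple


def build_timeline(recipes: Dict[str, List[Tuple[str, int]]]) -> List[Tuple[int, str, str]]:
    """Return (minutes_before_serving, recipe, step), earliest start first."""
    events = []
    for recipe, steps in recipes.items():
        # Minutes before serving at which this recipe's next step must start.
        remaining = sum(minutes for _, minutes in steps)
        for step, minutes in steps:
            events.append((remaining, recipe, step))
            remaining -= minutes
    # Larger "minutes before serving" means an earlier start.
    return sorted(events, key=lambda event: -event[0])
```

For example, a 50-minute chicken and a 25-minute rice dish both finish at time zero, so the rice simply starts 25 minutes after the chicken.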

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:07:16 -07:00
3602ed7781 Add OpenAI integration to Mealie
Enable recipe parsing from images/photos, ingredient extraction, and
URL scraping via OpenAI API (gpt-4o). Rename ExternalSecret from
mealie-oidc to mealie-secrets to hold both OIDC and OpenAI credentials.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 07:12:51 -07:00
c2a1e168bd Update Mealie container tag to main build
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 21:59:48 -07:00
11330ebea0 Deploy Mealie recipe manager (#299)
## Summary

- Deploy Mealie (self-hosted recipe manager) on minikube-indri via ArgoCD
- Build container from source via forge mirror (`mirrors/mealie`) — multi-stage Dockerfile with Node.js frontend + Python/uv backend
- Add Caddy proxy entry for `meals.ops.eblu.me`
- Part of a larger meal planning pipeline: Mealie stores categorized recipes, a planner script selects balanced meals, and Ollama generates unified cooking timelines

## Status

- [x] Mirror mealie repo on forge
- [x] Dockerfile (from-source build)
- [x] ArgoCD app + k8s manifests
- [x] Caddy proxy entry
- [x] Service docs, routing table, app registry
- [ ] Local Dagger build test
- [ ] Container build + push to registry
- [ ] Update kustomization.yaml with real image tag
- [ ] Deploy and verify
- [ ] Provision Caddy

## Test plan

- Build container locally via `dagger call build --src=. --container-name=mealie`
- Trigger CI build via `mise run container-build-and-release mealie`
- Deploy from branch: `argocd app set mealie --revision deploy-mealie && argocd app sync mealie`
- Verify Mealie UI at `https://meals.ops.eblu.me`
- Verify API docs at `https://meals.ops.eblu.me/docs`

Reviewed-on: #299
2026-03-16 21:59:10 -07:00
b54d87e071 Fix shell syntax error in unpoller dashboard initcontainer
Comments can't appear inside a for-in list in sh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 15:59:03 -07:00
b0846ab5fa Update unpoller container tag to main build
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 15:54:03 -07:00
4dc3e5cae2 Add UnPoller for UniFi network metrics (#298)
## Summary
- Deploy UnPoller as a k8s service on indri to export UniFi controller metrics to Prometheus
- Custom-built container from forge mirror (`containers/unpoller/Dockerfile`)
- Credentials pulled from 1Password via external-secrets
- Prometheus scrape job added, docs and service-versions updated

## Test plan
- [ ] Build container: `mise run container-release unpoller v2.34.0`
- [ ] Update kustomization tag with built image tag
- [ ] Deploy from branch: `argocd app set unpoller --revision feature/unpoller && argocd app sync unpoller`
- [ ] Verify pod connects to UX7 controller (check logs)
- [ ] Confirm `unpoller` target appears in Prometheus
- [ ] Query `unifi_` metrics in Grafana

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #298
2026-03-16 15:52:45 -07:00
a29ced71b5 Upgrade borgmatic 2.0.13 → 2.1.3 (#297)
## Summary
- Upgraded borgmatic from 2.0.13 to 2.1.3 on indri (via mise/pipx)
- Key changes: improved borg warning handling, memory/performance improvements, `source_directories_must_exist` now defaults to true (already set in our config)
- Verified: config validates, dry-run passed against both sifaka (local) and borgbase (offsite) repos

## Borg Warnings Investigation
The main concern was 2.1.0's change to treat borg warnings as errors. In 2.1.3 this was partially reverted — "file not found" warnings (exit code 107) are back to being warnings. Our config already sets `source_directories_must_exist: true`, and all four source directories were verified present on indri.

## Test plan
- [x] `borgmatic --version` confirms 2.1.3
- [x] `borgmatic config validate` passes
- [x] `borgmatic create --dry-run` succeeds against both repositories
- [x] All source directories verified present on indri
- [ ] Verify next scheduled backup (2:00 AM) completes successfully

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #297
2026-03-16 11:05:24 -07:00
0f5377568d Review operations docs: add last-reviewed dates and improve troubleshooting
Mark run-1password-backup and troubleshooting as reviewed. Troubleshooting
gets inline wiki-links for all referenced services, a new ringtail/k3s
section, and a cross-reference to restart-indri.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 07:38:02 -07:00
f46a04b902 Restructure docs: consolidate, recategorize, and extract
- Consolidate 4 Authentik Nix derivation docs into one card
  (authentik-nix-build-components.md)
- Merge build-grafana-container + build-grafana-sidecar into
  build-grafana-images.md
- Move agent-change-process from how-to/ to explanation/ (it's a
  methodology doc, not a task guide)
- Extract Caddy custom build section from reference card into
  how-to/deployment/build-caddy-with-plugins.md
- Move expose-service-publicly from how-to/ to tutorials/ (it's a
  comprehensive walkthrough, not a quick task reference)
- Update all wiki-link references across affected docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:55:59 -07:00
ac01c2d6e2 Fix stale docs and shell quoting in devpi start script
- ArgoCD ref: correct Git Source URL to forge.ops.eblu.me:2222
- Authentik ref: add Zot as active OIDC client, blueprint, and secret
- Federated login: remove Zot from Future Work (completed in PR #236)
- devpi/start.sh: use bash array for command building (proper quoting)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:25:27 -07:00