## Summary
- Replace upstream `docker.io/library/redis:7-alpine` (Redis 7.4.8) with a nix-built container using Redis 8.2.3 from nixpkgs
- Introduce **attached service pattern**: `parent` field in service-versions.yaml, `<parent>-<component>` naming convention, and `assert pkgs.redis.version == version` in default.nix to prevent silent version drift on `flake.lock` updates
- Document the pattern in [[review-services]] so future attached services slot in cleanly
- Backfill `parent: grafana` on existing `grafana-sidecar` entry
## Version drift protection
1. `flake.lock` update bumps nixpkgs redis → `assert` in `default.nix` breaks `nix-build`
2. Developer updates `version` in `default.nix` → prek's `container-version-check` demands matching `service-versions.yaml` update
3. Both must agree before commit succeeds
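The three steps above reduce to a pair of equality checks. A minimal Python sketch of that logic follows; the real gates are the Nix `assert` and prek's `container-version-check`, so the function name, parameter names, and version numbers here are illustrative assumptions only:

```python
def check_version_agreement(nixpkgs_version, default_nix_version, service_versions_version):
    """Return a description of the first failing gate, or None if all versions agree.

    Illustrative only: the real gates are the Nix assert and the prek hook.
    """
    # Gate 1: mirrors `assert pkgs.redis.version == version` in default.nix.
    if nixpkgs_version != default_nix_version:
        return "nix-build: nixpkgs redis version differs from default.nix"
    # Gate 2: mirrors container-version-check against service-versions.yaml.
    if default_nix_version != service_versions_version:
        return "container-version-check: default.nix differs from service-versions.yaml"
    return None


# A flake.lock bump alone trips the first gate (version numbers are made up).
print(check_version_agreement("8.2.4", "8.2.3", "8.2.3"))
# Updating only default.nix trips the second gate.
print(check_version_agreement("8.2.4", "8.2.4", "8.2.3"))
# Only when all three agree does the commit go through.
print(check_version_agreement("8.2.4", "8.2.4", "8.2.4"))
```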
## Test plan
- [ ] Build container from branch on ringtail (`mise run container-build-and-release authentik-redis`)
- [ ] Update kustomization `newTag` to branch-built image tag
- [ ] Sync authentik ArgoCD app from branch (`argocd app set authentik --revision localize-redis && argocd app sync authentik`)
- [ ] Verify Authentik login, session persistence, and task queue still work
- [ ] After merge: C0 follow-up to update `newTag` to the main-built image tag
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #309
Bump from RC to latest stable (security fixes for config endpoint and
cross-camera auth). Add new 0.17 motion retention tier at 365 days and
reduce continuous retention from 180 to 30 days.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Uses the grafana_folder annotation to place the dashboard in the
existing folder created by alert rule provisioning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two panels: currently firing alerts (firing/pending/noData/error) and
recent state changes. Refreshes every 30s. Uses Grafana's built-in
alertlist panel type — no datasource needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dramatiq defaults to one worker process per CPU core. On ringtail (16 cores)
this spawned 16 processes, each loading the full Django app, exceeding the
1Gi memory limit and causing a crash loop (228 restarts over 7 days).
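
A back-of-the-envelope calculation shows why 16 processes cannot fit. The figures come from the message above; the helper name is illustrative:

```python
MEMORY_LIMIT_MIB = 1024  # the 1Gi pod memory limit
WORKER_PROCESSES = 16    # one worker process per CPU core on ringtail


def per_process_budget_mib(limit_mib, processes):
    """Memory each worker process may use before the pod exceeds its limit."""
    return limit_mib / processes


print(per_process_budget_mib(MEMORY_LIMIT_MIB, WORKER_PROCESSES))  # 64.0

# 228 restarts over 7 days works out to roughly one restart every 44 minutes.
minutes_between_restarts = 7 * 24 * 60 / 228
print(round(minutes_between_restarts, 1))  # 44.2
```

A budget of 64 MiB per process is well below what a process loading the full Django app would plausibly need, which is consistent with the crash loop described.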
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove `group: ""` from ignoreDifferences in tailscale-operator and
tailscale-operator-ringtail — ArgoCD normalizes away the empty string
field, so the live state never matches git.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v1.96.3 exists as a GitHub release but Docker Hub images for both
tailscale/tailscale and tailscale/k8s-operator haven't been published
yet (v1.94.2 is still latest). Revert the image tags; the fly/start.sh
`tailscale wait` improvement and review date stamps are retained.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Bump Tailscale operator, proxy containers, and init containers from v1.94.2 to v1.96.3 across both clusters (indri + ringtail via shared base kustomization)
- Replace hand-rolled `until tailscale status` polling loop in `fly/start.sh` with `tailscale wait --timeout 60s` (new in v1.96.2)
- Stamp kube-state-metrics review date (already current at v2.18.0)
## Notable upstream changes (v1.94.2 → v1.96.3)
- Go upgraded from 1.25 to 1.26
- `tailscale wait` command — blocks until daemon is running + interface has IP
- AuthKey policy now applies only when users are not logged in (behavioral change)
- Peer Relay improvements (metrics, EC2 IMDS, UDP socket scaling)
- UPnP stability fixes
## Deploy plan
1. Merge PR
2. Sync tailscale-operator on indri: `argocd app sync tailscale-operator`
3. Sync tailscale-operator on ringtail: `argocd app sync tailscale-operator-ringtail --server ringtail...`
4. Verify proxy pods roll with new image: `kubectl --context=minikube-indri -n tailscale get pods`
5. Verify ingress connectivity (spot-check a few `*.tail8d86e.ts.net` services)
6. Rebuild + deploy Fly proxy container (separate step, picks up `tailscale wait` change)
## Test plan
- [ ] ArgoCD diff looks clean for both apps before sync
- [ ] Proxy pods on indri come up healthy with v1.96.3 images
- [ ] Proxy pods on ringtail come up healthy with v1.96.3 images
- [ ] Tailscale ingress services remain reachable (e.g., grafana, prometheus)
- [ ] Fly proxy rebuild deploys successfully with `tailscale wait`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #304
## Summary
Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.
**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes
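
The anti-noise policy above ("page once per 24h per alert group") can be illustrated with a small Python sketch. This only demonstrates the intended semantics; in the actual design Grafana's notification policy does this work, and the class and group names below are hypothetical:

```python
from datetime import datetime, timedelta


class OncePerWindowNotifier:
    """Allow at most one notification per alert group within a time window."""

    def __init__(self, window=timedelta(hours=24)):
        self.window = window
        self._last_sent = {}  # alert group -> time of last notification

    def should_notify(self, group, now):
        last = self._last_sent.get(group)
        if last is not None and now - last < self.window:
            return False  # still inside the quiet window for this group
        self._last_sent[group] = now
        return True


notifier = OncePerWindowNotifier()
t0 = datetime(2025, 1, 1, 8, 0)
print(notifier.should_notify("blackbox", t0))                        # True
print(notifier.should_notify("blackbox", t0 + timedelta(hours=1)))   # False
print(notifier.should_notify("loki", t0 + timedelta(hours=1)))       # True
print(notifier.should_notify("blackbox", t0 + timedelta(hours=24)))  # True
```

Each group keeps its own timer, so a noisy group does not suppress the first page from a different group.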
**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #303
Also catches kiwix's transmission sidecar up from v4.0.6-r4 to
v4.1.1-r1, matching the torrent service (upgraded in PR #282 but
the kiwix sidecar was missed). No breaking changes — old RPC
protocol is supported through 4.x.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point all services at the 613f05d images which carry the new
consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4
version, no matching build) and docs/quartz (no 613f05d build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The top-level "uid" in Grafana dashboard JSON is at 2-space indent
near the end of the file, not the first occurrence. Match on `^  "uid"`
(two leading spaces) to avoid clobbering nested datasource uid references.
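
The same anchoring idea, sketched in Python for illustration (the actual fix uses sed; this assumes pretty-printed JSON with 2-space indentation, and the function name and sample document are hypothetical):

```python
import re

# Matches a "uid" key indented by exactly two spaces, i.e. the top-level key in
# pretty-printed dashboard JSON. Nested keys are indented further and do not match.
TOP_LEVEL_UID = re.compile(r'^  "uid": "[^"]*"', flags=re.MULTILINE)


def stamp_uid(dashboard_json, new_uid):
    """Replace the top-level dashboard uid, leaving nested uids untouched."""
    # A function replacement avoids backslash-escape processing of new_uid.
    return TOP_LEVEL_UID.sub(lambda m: f'  "uid": "{new_uid}"', dashboard_json, count=1)


sample = '''{
  "panels": [
    {
      "datasource": {
        "uid": "prometheus"
      }
    }
  ],
  "title": "Example",
  "uid": "old-uid"
}'''

print(stamp_uid(sample, "stable-uid"))
```

The nested datasource `"uid": "prometheus"` survives because its line starts with more than two spaces.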
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous sed replaced ALL "uid" fields in dashboard JSON files,
including datasource references inside panels, causing dashboards to
go dark. Scope the replacement to only the first occurrence (the
top-level dashboard UID) using GNU sed 0,/pattern/ addressing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add init container to pre-populate ConfigMap dashboards before Grafana
starts, eliminating the race between the sidecar and the provisioner
that caused dashboard DB records to be deleted and re-created with new
IDs. Also stamp stable UIDs on TeslaMate and UnPoller dashboards
fetched from upstream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
StatefulSet volumeClaimTemplates are immutable and minikube's hostpath
provisioner doesn't enforce PVC size limits anyway. Add comments noting
the data grows freely on the 1.8TB backing disk.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Rename dashboard title since borgmatic is just the execution layer.
Add Backup Duration Over Time panel next to New Data Per Backup.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Forgejo: show only notifications and pull requests
- Jellyfin: show only movies/series/episodes, hide now playing
- Grafana: hide data sources, show dashboards and alerts only
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `useEqualHeights: true` so service tiles within each row expand to
match the tallest tile, fixing uneven layout from widget metrics.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
`style: row` makes each group span the full page width (one per row),
while `columns: 4` tiles services horizontally within each group.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add `maxGroupColumns: 1` so each category gets its own full-width row,
with service tiles arranged side-by-side within each group.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move NVR, Jellyfin, and DJ to new Home group. Move Grafana from Content
to Infrastructure. Switch all layout groups from column to row style.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Increase retention: continuous 3→180d, detections 14→30d, alerts 30→730d.
Plenty of NFS headroom (~9.4 TiB free, ~6.6 GB/day for one camera).
Add frigate-recording check to services-check that verifies camera_fps > 0,
which would have caught the 6-day outage from the mqtt config removal.
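
A rough check of the headroom claim, as a Python sketch. It assumes "GB" means decimal gigabytes, a single camera, and counts only continuous recordings (the longer-lived detection and alert clips are not modeled); the helper name is illustrative:

```python
GB_PER_DAY = 6.6       # per camera, from the message above
CONTINUOUS_DAYS = 180  # new continuous retention
FREE_TIB = 9.4         # approximate free NFS space


def continuous_footprint_tib(gb_per_day, days):
    """Steady-state size of continuous recordings, converting decimal GB to TiB."""
    total_bytes = gb_per_day * 1e9 * days
    return total_bytes / 2**40


footprint = continuous_footprint_tib(GB_PER_DAY, CONTINUOUS_DAYS)
print(f"{footprint:.2f} TiB of {FREE_TIB} TiB free")  # 1.08 TiB of 9.4 TiB free
```

Under these assumptions, 180 days of continuous recording uses a bit over 1 TiB, a small fraction of the free space.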
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Frigate's config schema requires an `mqtt` field even when MQTT isn't
used. Commit 40f1568 removed it along with Mosquitto, causing Frigate
to fail validation on startup. Add `mqtt.enabled: false` to satisfy
the schema without needing a broker.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point alloy-k8s at v1.14.0-61f02a0 (Dockerfile) and both ringtail
deployments at v1.14.0-61f02a0-nix (Nix build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Add `containers/alloy/` with dual Dockerfile + Nix build files for Grafana Alloy v1.14.0
- Both builds fetch source from forge mirror (`forge.ops.eblu.me/mirrors/alloy.git`), build the web UI (Node), then compile the Go binary with `netgo embedalloyui` tags
- Update all three alloy deployments (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail) to use `registry.ops.eblu.me/blumeops/alloy`
- `promtail_journal_enabled` tag omitted — requires systemd headers and none of our configs use `loki.source.journal`
## Build verification
- **Dockerfile:** Tested locally via `docker build`, binary reports `v1.14.0` with correct tags
- **Nix:** Tested on ringtail via `nix-build`, all three hashes (fetchgit, npmDeps, goModules) resolved and build succeeds
## Post-merge steps
1. Wait for CI to build the container from main (both Dockerfile and Nix workflows)
2. `mise run container-list alloy` to find the `[main]` tagged image
3. C0 follow-up to update `newTag` in all three kustomizations from `v1.14.0-placeholder` to the real tag
4. Sync ArgoCD apps and verify pods come up healthy
Reviewed-on: #300
Enable recipe parsing from images/photos, ingredient extraction, and
URL scraping via OpenAI API (gpt-4o). Rename ExternalSecret from
mealie-oidc to mealie-secrets to hold both OIDC and OpenAI credentials.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Deploy UnPoller as a k8s service on indri to export UniFi controller metrics to Prometheus
- Custom-built container from forge mirror (`containers/unpoller/Dockerfile`)
- Credentials pulled from 1Password via external-secrets
- Prometheus scrape job added, docs and service-versions updated
## Test plan
- [ ] Build container: `mise run container-release unpoller v2.34.0`
- [ ] Update kustomization tag with built image tag
- [ ] Deploy from branch: `argocd app set unpoller --revision feature/unpoller && argocd app sync unpoller`
- [ ] Verify pod connects to UX7 controller (check logs)
- [ ] Confirm `unpoller` target appears in Prometheus
- [ ] Query `unifi_` metrics in Grafana
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #298
## Summary
- Replaces 18 TeslaMate dashboard ConfigMaps (713 KB / 22,080 lines) with a Grafana init container
- Init container fetches dashboard JSON directly from `mirrors/teslamate` on forge, pinned to `v3.0.0`
- Grafana's file provider picks them up from `/tmp/dashboards/TeslaMate/` via `foldersFromFilesStructure`
- Non-TeslaMate dashboards remain as ConfigMaps (unchanged)
## How it works
- New `init-teslamate-dashboards` init container uses busybox `wget` to fetch each JSON file from `https://forge.eblu.me/mirrors/teslamate/raw/tag/v3.0.0/grafana/dashboards/`
- Files land in `/tmp/dashboards/TeslaMate/`, same emptyDir volume the sidecar uses
- The sidecar continues to handle ConfigMap-based dashboards; the init container handles TeslaMate
- Version pin is in the init container args (`TESLAMATE_VERSION`)
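
To make the fetch layout concrete, here is a small Python sketch of how a source URL and destination path could be derived from the version pin. The real init container uses busybox `wget`; the function names and the `overview.json` filename are hypothetical, and only the base URL and target directory come from the description above:

```python
from urllib.parse import quote

BASE_URL = "https://forge.eblu.me/mirrors/teslamate/raw/tag"
DEST_DIR = "/tmp/dashboards/TeslaMate"


def dashboard_url(version, filename):
    """URL of one dashboard JSON file at the pinned tag."""
    return f"{BASE_URL}/{version}/grafana/dashboards/{quote(filename)}"


def dashboard_dest(filename):
    """Path inside the emptyDir volume shared with the sidecar."""
    return f"{DEST_DIR}/{filename}"


# "overview.json" is a placeholder filename used only for illustration.
print(dashboard_url("v3.0.0", "overview.json"))
print(dashboard_dest("overview.json"))
```

Keeping the version as a single parameter means an upgrade only touches the pin, not each URL.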
## Deployment and Testing
- [ ] Sync `grafana` app from branch — verify init container runs and fetches dashboards
- [ ] Sync `grafana-config` app from branch — verify TeslaMate ConfigMaps are pruned
- [ ] Check Grafana UI: TeslaMate folder should still contain all 18 dashboards
- [ ] Verify non-TeslaMate dashboards are unaffected
- [ ] After merge: sync both apps from main
Reviewed-on: #296
## Summary
- Mirrors `tailscale/tailscale` on forge (`mirrors/tailscale`)
- Replaces vendored `operator.yaml` (495 KB / 5,386 lines) with ArgoCD apps sourcing the upstream static manifest, pinned via `targetRevision: v1.94.2`
- Adds `tailscale-operator-base` app for indri and `tailscale-operator-base-ringtail` for ringtail
- Local kustomization retains only ProxyClass and DNSConfig custom resources
- Updates `[[tailscale-operator]]` doc to reflect new sourcing
## Deployment and Testing
- [ ] Register `mirrors/tailscale` repo in ArgoCD (it needs to know about the new repo)
- [ ] Sync `apps` app to pick up the new `tailscale-operator-base` app definitions
- [ ] Sync `tailscale-operator-base` — verify CRDs, RBAC, operator Deployment come up
- [ ] Sync `tailscale-operator` — verify ProxyClass, DNSConfig still apply cleanly
- [ ] Verify existing Tailscale Ingresses still work (ProxyGroup pods healthy)
- [ ] Repeat for ringtail cluster
- [ ] After merge: apps already point at tags, no revision reset needed
Reviewed-on: #295
The 27B Q4_K_M model needs ~7.3 GiB system RAM for CPU-offloaded layers
but only 6.8 GiB was available within the 22Gi cgroup. Bumping to 24Gi
and enabling flash attention (reduces KV cache memory) should provide
enough headroom.
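
The headroom estimate can be worked through in a few lines of Python. This sketch assumes the memory used by everything else in the pod stays constant after the bump, and it does not model the additional savings from flash attention; the helper name is illustrative:

```python
def available_after_bump(old_limit_gib, old_available_gib, new_limit_gib):
    """Memory available for CPU-offloaded layers, assuming other usage is unchanged."""
    other_usage = old_limit_gib - old_available_gib
    return new_limit_gib - other_usage


needed_gib = 7.3
available_gib = available_after_bump(22, 6.8, 24)
print(round(available_gib, 1), available_gib >= needed_gib)  # 8.8 True
```

Raising the limit by 2Gi turns a 0.5 GiB shortfall into roughly 1.5 GiB of slack.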
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 27B Q4_K_M model is ~17 GB, exceeding the 16 GB VRAM on the RTX 4080
by ~1 GB. Ollama will offload a few layers to CPU RAM, so the pod memory
limit needs headroom beyond the previous 16Gi.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Mosquitto has been dormant since frigate-notify switched from MQTT to
webapi polling (529ba10). Tear down live infra (ArgoCD app, namespace)
and remove all manifests, service-versions entry, services-check, and
doc references.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Add jobsync pod check (ringtail k3s) and HTTP endpoint to `services-check`
- Add JobSync entry to homepage dashboard under new "Apps" group
- Mark jobsync as reviewed at v1.1.4 (current with upstream)
- Changelog fragment added
## Deployment and Testing
- [ ] Sync homepage app from branch: `argocd app set homepage --revision review/jobsync && argocd app sync homepage`
- [ ] Verify JobSync appears on go.ops.eblu.me dashboard
- [ ] Run `mise run services-check` to verify new checks pass
- [ ] After merge: `argocd app set homepage --revision main && argocd app sync homepage`
Reviewed-on: #291