## Summary
- Upgrade grafana-sidecar from 1.28.0 to 2.6.0 (the 2.x memory regression #462 is resolved; ~35MB static overhead is acceptable)
- Port build from Dockerfile to native Dagger container.py
- Add liveness/readiness probes using the new /healthz endpoint on port 8080
- Update docs to reflect container.py migration and remove stale pin note
## Test plan
- [ ] Build container: `mise run container-build-and-release grafana-sidecar`
- [ ] Update kustomization tag with new image tag
- [ ] Deploy from branch: `argocd app set grafana --revision grafana-sidecar-2.6.0 && argocd app sync grafana`
- [ ] Verify sidecar health endpoint: `kubectl exec -n monitoring <pod> -c grafana-sc-dashboard -- wget -qO- http://localhost:8080/healthz`
- [ ] Verify dashboards load in Grafana UI
- [ ] `mise run services-check`
Reviewed-on: #332
Point miniflux kustomization at the main-built v2.2.19-138e23d image
(replacing the branch tag). Disable the OTLP metrics exporter at module
import time to prevent ~11s retry delays in CI — the env var must be set
inside the module, not the runner shell, because the SDK runs inside the
Dagger engine container.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Dagger Python SDK's OTLP metrics exporter hits a non-functional
local endpoint (500s), burning ~9s per retry cycle. Set
OTEL_METRICS_EXPORTER=none in the build-dagger CI job.
Also update navidrome kustomization to the main-SHA tag (c86b5d7).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Move Dagger module from `.dagger/` to repo root (`src/blumeops/`), rename `blumeops-ci` → `blumeops`
- Replace opaque `docker_build()` with native Dagger pipelines that surface full build errors per step
- Migrate navidrome as the first container (`containers/navidrome/container.py`)
- Upgrade navidrome from v0.60.3 to v0.61.1 (major artwork overhaul, SQLite FTS5 search, server-managed transcoding)
- Add `dagger call container-version` for CI version extraction without Dockerfile parsing
- All mise tasks (`container-list`, `container-version-check`, `container-build-and-release`) updated for hybrid mode
- Legacy `docker_build()` fallback preserved for all other containers
## Motivation
When navidrome v0.61.0 added a new Go build tag (`sqlite_fts5`), `docker_build()` showed only "exit code: 1". We had to run `docker build --progress=plain` manually to find `undefined: buildtags.SQLITE_FTS5`. Native Dagger pipelines show the full error inline.
## Container build dispatch needed
After merge, dispatch container build for navidrome:
```
mise run container-build-and-release navidrome --ref 470b4bd
```
## Deploy steps
1. Wait for container build to complete
2. Back up navidrome-data PVC (non-reversible DB migrations)
3. `argocd app set navidrome --revision main && argocd app sync navidrome`
4. Verify at https://dj.ops.eblu.me
## Future
Remaining containers migrate incrementally in follow-up PRs using the same pattern.
Reviewed-on: #330
preview.quality was at the top level (invalid); moved under record
with a valid preset (very_low). Also fix services-check to catch
Grafana "Alerting (NoData)" state which was silently passing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previews are ~4MB/hour at default quality (CRF 1), served over NFS from
sifaka. Reducing to CRF 8 shrinks preview files to improve review page
load times when scrubbing older footage.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strip redundant "unifi-poller-" prefix from generated slugs, bringing
UIDs from 45-48 chars down to 32-35 chars.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
C0 follow-up to #327: update from branch-SHA tags to main-SHA tags
after squash-merge rebuild.
indri: v2.18.0-f59f885
ringtail: v2.18.0-f59f885-nix
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Build kube-state-metrics v2.18.0 locally from forge mirror, replacing upstream `registry.k8s.io` image
- Dockerfile (two-stage Go build) for indri/minikube
- default.nix (buildGoModule + buildLayeredImage) for ringtail/k3s
- Both kustomization files updated with `newName` pointing to local registry
## Verification
- [x] Nix build succeeded on ringtail (`nix-build` → 10-layer image)
- [x] Dockerfile build succeeded locally (`dagger call build` → ~2min)
- [x] `container-version-check --all-files` passes (2.18.0 consistent across Dockerfile, nix, service-versions.yaml)
- [ ] CI builds container images from this branch
- [ ] Update kustomization `newTag` with SHA-tagged version from CI
- [ ] ArgoCD sync on both clusters
## Test plan
- Trigger CI build: `mise run container-build-and-release kube-state-metrics`
- Verify tags: `mise run container-list kube-state-metrics`
- Update newTag in kustomization files with CI-produced tag
- Sync ArgoCD on indri: `argocd app sync kube-state-metrics`
- Sync ArgoCD on ringtail: `argocd app sync kube-state-metrics --context=k3s-ringtail` (note: argocd uses its own auth, not kubectl context)
- Verify metrics still flowing to Prometheus
Reviewed-on: #327
Worker forks 4 Dramatiq processes each loading the full Django app
(~250MB each), hitting the 1Gi limit on startup. Ringtail has ample
RAM headroom.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
teslamate had superuser on the shared blumeops-pg cluster (which also
hosts miniflux and authentik). Downgraded to plain database owner with
extension ownership (cube, earthdistance) transferred manually so it
can still ALTER EXTENSION UPDATE. earthdistance is untrusted in PG so
DROP+CREATE would need temporary superuser escalation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Patch upgrade with bug fixes (diff normalization, installation ID cache).
Pin the upstream manifest URL to commit SHA for supply chain integrity.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Resolves 4 unmuted Prowler core_seccomp_profile_docker_default
findings on alloy, immich-server, immich-machine-learning, and
immich-valkey.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the Helm chart deployment with plain kustomize manifests following
the Authentik pattern (separate deployments per component). Consolidate
the immich-storage ArgoCD app into the main immich app. Add no-helm-policy
doc establishing kustomize as the standard deployment mechanism.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR #10470 merged 2026-03-30; initContainer workaround stays until a
Prowler release includes the fix (latest is 5.22.0).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Add `containers/tempo/Dockerfile` — two-stage Go build from forge mirror, modeled on loki
- Switch kustomization from upstream `grafana/tempo` to `registry.ops.eblu.me/blumeops/tempo`
- Bump Tempo 2.10.1 → 2.10.3
## Test plan
- [ ] Kick off container build via `mise run container-build-and-release tempo`
- [ ] Update kustomization `newTag` with built image tag
- [ ] Deploy from branch: `argocd app set tempo --revision local-tempo-container && argocd app sync tempo`
- [ ] Verify Tempo health: `curl tempo.ops.eblu.me/ready`
- [ ] Verify traces flowing in Grafana Tempo datasource
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #323
Patch upgrade picks up idempotent FetchTask API, offline registration
fix, cloudflare/circl security dep update, and custom gRPC user-agent.
No config defaults changed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Kingfisher exits 200 (findings) or 205 (validated findings) on success.
Normalize these to 0 so the CronJob completes instead of restarting.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Mirror repos cause scan failures (likely ephemeral storage or timeout).
Scan only eblume/ repos until we investigate the root cause.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Nix containers don't include a shell by default. The CronJob needs
/bin/bash for the inline script that generates timestamped filenames.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pod-level RuntimeDefault seccomp profile (07e9c81) overrides the
DinD sidecar's privileged flag in newer Kubernetes versions, blocking
Docker daemon syscalls. Set Unconfined explicitly on the DinD container
while keeping RuntimeDefault on the runner container.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Deploys MongoDB Kingfisher as a weekly CronJob on minikube-indri
- Scans all Forgejo repos (eblume + all orgs) for leaked secrets with live validation
- Produces timestamped HTML and JSON reports on sifaka NFS (`/volume1/reports/kingfisher/`)
- Forgejo API token sourced from 1Password via ExternalSecret
- Uses official `ghcr.io/mongodb/kingfisher:1.91.0` container image
- Runs Sunday 4am (after Prowler's 3am k8s scan)
## Resources
- CronJob, PV/PVC (sifaka NFS), ExternalSecret
- ArgoCD Application with manual sync + CreateNamespace
## Test plan
- [x] Sync ArgoCD `apps` app to pick up new kingfisher Application
- [x] Set `--revision feature/kingfisher-cronjob` on kingfisher app
- [x] Verify ExternalSecret creates the `kingfisher-forgejo-token` Secret
- [x] Trigger manual job: `kubectl create job --from=cronjob/kingfisher kingfisher-manual -n kingfisher --context=minikube-indri`
- [ ] Verify reports appear on sifaka at `/volume1/reports/kingfisher/`
- [ ] After merge: set `--revision main` and re-sync
Reviewed-on: #317
Resources were under wrong Helm value keys (server.resources,
machine-learning.resources) and never applied to pods. Move to correct
bjw-s chart paths (*.controllers.main.containers.main.resources).
Increase liveness/readiness probe timeouts from 1s to 5s to prevent
kubelet from killing healthy-but-busy pods during ML inference load.
Remove CPU limits (keep requests only) to avoid throttling.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Adds a second borgmatic config (`photos.yaml`) that backs up `/Volumes/photos` (sifaka SMB mount, ~128 GB) to a dedicated BorgBase repo (`immich-photos`), running daily at 4 AM
- Separate launchd agent (`mcquack.eblume.borgmatic-photos`) so photo backups run independently from the main backup
- Refactors `borgmatic_metrics` script to support multiple repos with a `repo` Prometheus label
- Updates Grafana "Borg Backups" dashboard with a `repo` template variable so you can filter/compare repos
- Docs updated: `backups.md`, `borgmatic.md`
## Prerequisites (manual)
- [x] Create `immich-photos` repo on BorgBase with same SSH key
- [ ] Upgrade BorgBase plan to Small ($24/yr) if currently on free tier (128 GB exceeds 10 GB limit)
- [ ] After deploy: `borg init` the new repo (borgmatic does this automatically on first run)
## Test plan
- [ ] Dry run: `mise run provision-indri -- --check --diff --tags borgmatic,borgmatic_metrics`
- [ ] Deploy borgmatic role and verify both configs deployed
- [ ] Run `borgmatic --config ~/.config/borgmatic/photos.yaml create --verbosity 1` manually for first backup (will take hours)
- [ ] Verify metrics script collects from both repos: `~/.local/bin/borgmatic-metrics && cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom`
- [ ] Sync grafana-config in ArgoCD and verify dashboard repo selector works
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #315
## Summary
- Add `authentik` database (blumeops-pg cluster) to borgmatic pg_dump backups
- Add `immich` database (immich-pg cluster) to borgmatic pg_dump backups
- For immich-pg: new borgmatic managed role with `pg_read_all_data`, ExternalSecret, Tailscale LoadBalancer service, and Caddy L4 TCP proxy on port 5433
- Update backup docs to reflect all four CNPG databases + mealie SQLite
## Deploy plan
Deploy order matters — k8s resources must exist before ansible can route to them:
1. **ArgoCD (databases app):** sync to pick up immich-pg borgmatic role, ExternalSecret, and Tailscale service
```
argocd app set blumeops-pg --revision feature/borgmatic-all-pg-backups
argocd app sync blumeops-pg
```
2. **Wait** for `immich-pg-tailscale` service to get a Tailscale IP and `immich-pg.tail8d86e.ts.net` to resolve
3. **Ansible (caddy):** deploy Caddy L4 route for port 5433
```
mise run provision-indri -- --tags caddy
```
4. **Ansible (borgmatic):** deploy updated config and .pgpass
```
mise run provision-indri -- --tags borgmatic
```
5. **Verify:** trigger a manual borgmatic run and check all four pg_dump streams succeed
```
borgmatic --verbosity 1 2>&1 | grep -E '(Dumping|ERROR)'
```
## Test plan
- [x] `kubectl kustomize` builds cleanly
- [x] `ansible --check --diff` for borgmatic and caddy show expected changes
- [ ] ArgoCD sync succeeds for databases app
- [ ] `immich-pg.tail8d86e.ts.net` resolves
- [ ] `pg.ops.eblu.me:5433` accepts connections
- [ ] `borgmatic --verbosity 1` dumps all four databases without errors
Reviewed-on: #314
The 5-minute lookback window kept stale data from terminated pods
visible during rollouts, causing the alert to sit in Pending for
~5 minutes after every routine deployment. 60s still covers two
scrape cycles (30s interval) while clearing stale data much faster.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reduced `for` from 30m to 5m and lookback window from 5m to 1m. The old
values caused alerts to linger long after apps returned to Synced state.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>