blumeops

Author	SHA1	Message	Date
Erich Blume	db6d8af8b1	Update grafana-sidecar image tag to v2.6.0-61fcd5d (merge build) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 08:02:39 -07:00
Erich Blume	61fcd5d70a	Upgrade grafana-sidecar 1.28.0 → 2.6.0 + container.py port (#332) ## Summary - Upgrade grafana-sidecar from 1.28.0 to 2.6.0 (the 2.x memory regression #462 is resolved; ~35MB static overhead is acceptable) - Port build from Dockerfile to native Dagger container.py - Add liveness/readiness probes using the new /healthz endpoint on port 8080 - Update docs to reflect container.py migration and remove stale pin note ## Test plan - [ ] Build container: `mise run container-build-and-release grafana-sidecar` - [ ] Update kustomization tag with new image tag - [ ] Deploy from branch: `argocd app set grafana --revision grafana-sidecar-2.6.0 && argocd app sync grafana` - [ ] Verify sidecar health endpoint: `kubectl exec -n monitoring <pod> -c grafana-sc-dashboard -- wget -qO- http://localhost:8080/healthz` - [ ] Verify dashboards load in Grafana UI - [ ] `mise run services-check` Reviewed-on: #332	2026-04-13 07:57:13 -07:00
Erich Blume	a18ec9d958	Update miniflux to main image tag, disable OTEL metrics in Dagger module Point miniflux kustomization at the main-built v2.2.19-138e23d image (replacing the branch tag). Disable the OTLP metrics exporter at module import time to prevent ~11s retry delays in CI — the env var must be set inside the module, not the runner shell, because the SDK runs inside the Dagger engine container. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:59:32 -07:00
Erich Blume	138e23d525	Miniflux 2.2.19 + container.py migration + ty typechecker (#331) ## Summary - Upgrade miniflux from 2.2.17 to 2.2.19 (security hardening, performance improvements) - Migrate miniflux from Dockerfile to native Dagger container.py build - Refactor `alpine_runtime()` helper to support existing users (nobody/65534) - Add `ty` (Astral) Python typechecker to prek hooks ## Test plan - [ ] `dagger call build --src=. --container-name=miniflux` succeeds - [ ] `dagger call container-version --container-name=miniflux` returns 2.2.19 - [ ] `mise run container-version-check` passes - [ ] `ty check` passes cleanly - [ ] `prek run --all-files` passes - [ ] CI builds container successfully - [ ] Miniflux healthcheck passes after deploy from branch Reviewed-on: #331	2026-04-12 08:54:32 -07:00
Erich Blume	94c937d588	Disable OTLP metrics exporter in CI, update navidrome to main tag The Dagger Python SDK's OTLP metrics exporter hits a non-functional local endpoint (500s), burning ~9s per retry cycle. Set OTEL_METRICS_EXPORTER=none in the build-dagger CI job. Also update navidrome kustomization to the main-SHA tag (`c86b5d7`). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 17:26:25 -07:00
Erich Blume	c86b5d7772	Native Dagger container builds + Navidrome v0.61.1 (#330) ## Summary - Move Dagger module from `.dagger/` to repo root (`src/blumeops/`), rename `blumeops-ci` → `blumeops` - Replace opaque `docker_build()` with native Dagger pipelines that surface full build errors per step - Migrate navidrome as the first container (`containers/navidrome/container.py`) - Upgrade navidrome from v0.60.3 to v0.61.1 (major artwork overhaul, SQLite FTS5 search, server-managed transcoding) - Add `dagger call container-version` for CI version extraction without Dockerfile parsing - All mise tasks (`container-list`, `container-version-check`, `container-build-and-release`) updated for hybrid mode - Legacy `docker_build()` fallback preserved for all other containers ## Motivation When navidrome v0.61.0 added a new Go build tag (`sqlite_fts5`), `docker_build()` showed only "exit code: 1". We had to run `docker build --progress=plain` manually to find `undefined: buildtags.SQLITE_FTS5`. Native Dagger pipelines show the full error inline. ## Container build dispatch needed After merge, dispatch container build for navidrome: ``` mise run container-build-and-release navidrome --ref `470b4bd` ``` ## Deploy steps 1. Wait for container build to complete 2. Back up navidrome-data PVC (non-reversible DB migrations) 3. `argocd app set navidrome --revision main && argocd app sync navidrome` 4. Verify at https://dj.ops.eblu.me ## Future Remaining containers migrate incrementally in follow-up PRs using the same pattern. Reviewed-on: #330	2026-04-11 17:11:56 -07:00
Erich Blume	5757df115d	Upgrade ollama from 0.17.5 to 0.20.4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 06:42:05 -07:00
Erich Blume	22fc615a28	Update paperless image tag to main build Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 19:01:02 -07:00
Erich Blume	07f52e9488	Deploy Paperless-ngx document management (#328) ## Summary - Add paperless-ngx (v2.20.13) as a new ArgoCD-managed service on indri - Dockerfile built from forge mirror (`mirrors/paperless-ngx`), multi-stage with s6-overlay - PostgreSQL database via `blumeops-pg` CNPG cluster, Redis sidecar for Celery - NFS document storage on sifaka (`/volume1/paperless`) - Authentik OIDC SSO via baked JSON blob from 1Password - Caddy route at `paperless.ops.eblu.me` - 1Password item "Paperless (blumeops)" created with all secrets ## Files - `containers/paperless/Dockerfile` — multi-stage build - `argocd/manifests/paperless/` — full k8s manifest set - `argocd/apps/paperless.yaml` — ArgoCD application - `argocd/manifests/databases/` — CNPG role + ExternalSecret - `ansible/roles/caddy/defaults/main.yml` — Caddy route - `service-versions.yaml` — version tracking entry - `docs/reference/services/paperless.md` — reference card ## Remaining deploy steps 1. Build container: `mise run container-build-and-release paperless` 2. Update kustomization.yaml `newTag` with actual image tag 3. Create Authentik application/provider for paperless 4. Create `paperless` database on blumeops-pg 5. Sync ArgoCD apps, then sync paperless from branch 6. Provision Caddy: `mise run provision-indri -- --tags caddy` 7. Verify at https://paperless.ops.eblu.me 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #328	2026-04-08 17:54:12 -07:00
Erich Blume	22b77ac141	Fix Frigate preview config and services-check NoData detection preview.quality was at the top level (invalid); moved under record with a valid preset (very_low). Also fix services-check to catch Grafana "Alerting (NoData)" state which was silently passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 11:12:42 -07:00
Erich Blume	ec63d560f3	Deploy authentik 2026.2.2 container to ringtail Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 10:56:50 -07:00
Erich Blume	0366a0346b	Set Frigate preview quality to CRF 8 for faster timeline loading Previews are ~4MB/hour at default quality (CRF 1), served over NFS from sifaka. Reducing to CRF 8 shrinks preview files to improve review page load times when scrubbing older footage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 08:43:43 -07:00
Erich Blume	936d29bbe1	Fix UnPoller dashboard UIDs exceeding Grafana 12's 40-char limit Strip redundant "unifi-poller-" prefix from generated slugs, bringing UIDs from 45-48 chars down to 32-35 chars. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 07:03:39 -07:00
Erich Blume	3c894e659d	Pin kube-state-metrics to main-SHA container tags C0 follow-up to #327: update from branch-SHA tags to main-SHA tags after squash-merge rebuild. indri: v2.18.0-f59f885 ringtail: v2.18.0-f59f885-nix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 16:10:14 -07:00
Erich Blume	f59f8859dc	Localize kube-state-metrics container (Dockerfile + nix) (#327) ## Summary - Build kube-state-metrics v2.18.0 locally from forge mirror, replacing upstream `registry.k8s.io` image - Dockerfile (two-stage Go build) for indri/minikube - default.nix (buildGoModule + buildLayeredImage) for ringtail/k3s - Both kustomization files updated with `newName` pointing to local registry ## Verification - [x] Nix build succeeded on ringtail (`nix-build` → 10-layer image) - [x] Dockerfile build succeeded locally (`dagger call build` → ~2min) - [x] `container-version-check --all-files` passes (2.18.0 consistent across Dockerfile, nix, service-versions.yaml) - [ ] CI builds container images from this branch - [ ] Update kustomization `newTag` with SHA-tagged version from CI - [ ] ArgoCD sync on both clusters ## Test plan - Trigger CI build: `mise run container-build-and-release kube-state-metrics` - Verify tags: `mise run container-list kube-state-metrics` - Update newTag in kustomization files with CI-produced tag - Sync ArgoCD on indri: `argocd app sync kube-state-metrics` - Sync ArgoCD on ringtail: `argocd app sync kube-state-metrics --context=k3s-ringtail` (note: argocd uses its own auth, not kubectl context) - Verify metrics still flowing to Prometheus Reviewed-on: #327	2026-04-07 16:09:25 -07:00
Erich Blume	84eda0301f	Bump authentik worker memory limit 1Gi → 2Gi (OOMKilled after ringtail restart) Worker forks 4 Dramatiq processes each loading the full Django app (~250MB each), hitting the 1Gi limit on startup. Ringtail has ample RAM headroom. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 15:39:29 -07:00
Erich Blume	efae404d1e	Remove superuser from teslamate PG role, transfer extension ownership teslamate had superuser on the shared blumeops-pg cluster (which also hosts miniflux and authentik). Downgraded to plain database owner with extension ownership (cube, earthdistance) transferred manually so it can still ALTER EXTENSION UPDATE. earthdistance is untrusted in PG so DROP+CREATE would need temporary superuser escalation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 15:36:39 -07:00
Erich Blume	1fd8aae8f6	Upgrade ArgoCD v3.3.2 → v3.3.6, SHA-pin install manifest Patch upgrade with bug fixes (diff normalization, installation ID cache). Pin the upstream manifest URL to commit SHA for supply chain integrity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 08:21:11 -07:00
Erich Blume	18fe172a54	Add seccomp RuntimeDefault profiles to alloy-k8s and immich pods Resolves 4 unmuted Prowler core_seccomp_profile_docker_default findings on alloy, immich-server, immich-machine-learning, and immich-valkey. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 10:21:23 -07:00
Forgejo Actions	370a3574b2	Update docs release to v1.15.4 - Built changelog from towncrier fragments [skip ci]	2026-04-06 07:53:54 -07:00
Erich Blume	c7e5af6d51	Migrate 1Password Connect from Helm to kustomize (1.8.1 → 1.8.2) (#326) ## Summary - Renders manifests from `connect-helm-charts v2.4.1` as plain kustomize (deployment + service) - Bumps 1Password Connect from 1.8.1 → 1.8.2 - Completes the no-helm-policy migration — all services now use kustomize - Retains all production hardening from the Helm chart (securityContext, runAsNonRoot, drop ALL, seccomp, resource limits) ## Changes - New: `deployment.yaml`, `service.yaml`, `kustomization.yaml` in `argocd/manifests/1password-connect/` - Rewritten: Both ArgoCD app definitions (indri + ringtail) — single source kustomize instead of multi-source Helm - Deleted: `values.yaml` (Helm values no longer needed) - Updated: `no-helm-policy.md`, `service-versions.yaml`, `README.md` ## Deployment plan 1. Sync `apps` app to pick up the new app definitions 2. `argocd app set 1password-connect --revision 1password-connect-kustomize` 3. `argocd app sync 1password-connect` — verify on indri 4. Repeat for ringtail 5. After merge: reset revision to main, re-sync both ## Test plan - [ ] `kubectl kustomize` renders cleanly (verified locally) - [ ] ArgoCD diff shows expected changes (Helm labels removed, images bumped) - [ ] Pods come up healthy on indri - [ ] External Secrets still resolves 1Password items - [ ] Repeat on ringtail Reviewed-on: #326	2026-04-06 07:31:40 -07:00
Forgejo Actions	facb803010	Update docs release to v1.15.3 - Built changelog from towncrier fragments [skip ci]	2026-04-05 21:24:25 -07:00
Erich Blume	5597e02467	Fix Homepage pod-selector for Immich (Helm labels → kustomize labels) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 12:12:48 -07:00
Erich Blume	64200a55c5	Migrate Immich from Helm chart to kustomize manifests (v2.5.6 → v2.6.3) Replace the Helm chart deployment with plain kustomize manifests following the Authentik pattern (separate deployments per component). Consolidate the immich-storage ArgoCD app into the main immich app. Add no-helm-policy doc establishing kustomize as the standard deployment mechanism. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 09:42:25 -07:00
Erich Blume	464e3222d2	Document upstream fix for Prowler --registry bug (pending release) PR #10470 merged 2026-03-30; initContainer workaround stays until a Prowler release includes the fix (latest is 5.22.0). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 20:21:19 -07:00
Erich Blume	306f580bdb	Point Tempo at main-built container v2.10.3-75f9ba4 C0 follow-up: update tag from branch-built image to main-SHA image. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-02 13:45:57 -07:00
Erich Blume	75f9ba4943	Build Tempo container from source (2.10.3) (#323) ## Summary - Add `containers/tempo/Dockerfile` — two-stage Go build from forge mirror, modeled on loki - Switch kustomization from upstream `grafana/tempo` to `registry.ops.eblu.me/blumeops/tempo` - Bump Tempo 2.10.1 → 2.10.3 ## Test plan - [ ] Kick off container build via `mise run container-build-and-release tempo` - [ ] Update kustomization `newTag` with built image tag - [ ] Deploy from branch: `argocd app set tempo --revision local-tempo-container && argocd app sync tempo` - [ ] Verify Tempo health: `curl tempo.ops.eblu.me/ready` - [ ] Verify traces flowing in Grafana Tempo datasource 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #323	2026-04-02 13:45:02 -07:00
Erich Blume	b1e2811077	Upgrade Grafana 12.3.3 → 12.4.2 (#322) ## Summary - Bumps Grafana from 12.3.3 to 12.4.2 - Patches 7 CVEs, notably CVE-2026-27880 (unauthenticated OOM DoS, CVSS 7.5) and CVE-2026-27879 (authenticated OOM via resample queries) - No config changes required — reviewed alerting, datasources, OIDC, and feature toggles against 12.4.x breaking changes ## Breaking changes reviewed \| Change \| Impact \| \|--------\|--------\| \| Alerting: pending period applies to NoData/Error \| Net positive — reduces noise from transient blips \| \| Default notification uses empty receiver \| No impact — we explicitly set `ntfy-infra` \| \| Removed feature toggles (4) \| No impact — none configured \| \| OAuth ID token signature validation \| Low risk — verify OIDC login post-deploy \| \| OpsGenie deprecated \| No impact — using webhook \| ## Test plan - [ ] Container build completes at forge - [ ] Update kustomization.yaml with new image tag - [ ] `argocd app set grafana --revision upgrade/grafana-12.4.2 && argocd app sync grafana` - [ ] Verify Grafana UI loads at grafana.ops.eblu.me - [ ] Verify OIDC login via Authentik - [ ] Verify dashboards and datasources load - [ ] Check alerting rules are intact 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #322	2026-04-02 11:33:19 -07:00
Forgejo Actions	2b7b21dc9b	Update docs release to v1.15.2 - Built changelog from towncrier fragments [skip ci]	2026-03-30 17:48:40 -07:00
Erich Blume	4059b3d27b	Add compensating controls framework and date-based report dirs (#320) ## Summary - Add `compensating-controls.yaml` tracking 9 named controls that justify suppressed security findings - Update all Prowler mutelist descriptions with `CC: <id>` references to named controls - Add `mise run review-compensating-controls` task — surfaces stalest control with all codebase references - Add [[review-compensating-controls]] how-to doc - Organize Prowler and Kingfisher reports into `YYYY-MM-DD` subdirectories ### Compensating controls \| ID \| Mitigates \| \|----\|-----------\| \| `single-user-cluster` \| Image cache abuse, RBAC breadth, system pod privileges \| \| `tailscale-network-isolation` \| Profiling endpoints, weak TLS, debug ports \| \| `local-registry` \| AlwaysPullImages gap \| \| `sso-gated-admin-tools` \| ArgoCD wildcard RBAC \| \| `operator-managed-pods` \| Tailscale proxy pod security settings \| \| `ephemeral-privileged-jobs` \| Prowler hostPID exposure \| \| `trusted-ci-only` \| Forgejo runner DinD \| \| `init-container-isolation` \| Grafana root init container \| \| `observability-stack-audit` \| Missing apiserver audit logging \| ## Test plan - [ ] `mise run review-compensating-controls` shows table and references - [ ] `kubectl kustomize argocd/manifests/prowler/` renders correctly - [ ] Sync prowler and kingfisher, verify next scan writes to dated subdirectory - [ ] Grep for `CC:` in mutelist files — every muted finding should have at least one 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #320	2026-03-30 17:44:11 -07:00
Erich Blume	a76e471d54	Add Prowler mutelist and fix kube-state-metrics seccomp (#319) ## Summary - Add mutelist files to suppress expected/accepted Prowler CIS findings from components we don't control - Mutelist files stored in `mutelist/` directory, grouped by category, merged at runtime via initContainer - Fix missing seccomp `RuntimeDefault` profile on kube-state-metrics deployment ### Mutelist categories \| File \| Checks \| Covers \| \|------\|--------\|--------\| \| `apiserver.yaml` \| 12 \| Minikube apiserver flags \| \| `control-plane.yaml` \| 3 \| Scheduler, controller-manager, kubelet \| \| `core-pod-security.yaml` \| 7 \| System pods, Tailscale operator, Grafana init, Prowler hostPID, forgejo-runner \| \| `rbac.yaml` \| 3 \| Built-in K8s roles, ArgoCD, CNPG \| Muted findings appear as `status=MUTED` in reports (not hidden), preserving audit trail. ### Not muted (follow-up) - Alloy, Immich pods missing seccomp — need separate investigation (Helm/operator-managed) ## Test plan - [ ] `kubectl kustomize argocd/manifests/prowler/` renders cleanly - [ ] Trigger manual scan: `kubectl --context=minikube-indri -n prowler create job prowler-mutelist-test --from=cronjob/prowler` - [ ] Verify initContainer merges successfully (check pod logs) - [ ] Verify muted findings show as `MUTED` in report - [ ] Sync kube-state-metrics and verify pod starts with seccomp profile 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #319	2026-03-30 17:22:31 -07:00
Erich Blume	1e391f96bb	Upgrade forgejo-runner 12.7.0 → 12.7.3, add service card Patch upgrade picks up idempotent FetchTask API, offline registration fix, cloudflare/circl security dep update, and custom gRPC user-agent. No config defaults changed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 16:31:06 -07:00
Erich Blume	b000efd6c3	Fix Kingfisher CronJob exit code handling Kingfisher exits 200 (findings) or 205 (validated findings) on success. Normalize these to 0 so the CronJob completes instead of restarting. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 07:16:02 -07:00
Erich Blume	457ab19416	Scope Kingfisher scan to eblume user only on ringtail Mirror repos cause scan failures (likely ephemeral storage or timeout). Scan only eblume/ repos until we investigate the root cause. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 07:11:52 -07:00
Erich Blume	2c1f0abefc	Deploy Kingfisher v165768b-0fe0eed-nix (tmp permissions fix) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 06:54:57 -07:00
Erich Blume	14f366f993	Deploy Kingfisher v165768b-c494b62-nix (/tmp fix) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 06:51:17 -07:00
Erich Blume	b01afb1c1d	Deploy Kingfisher v165768b-aa9cc70-nix (bash fix) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 06:47:26 -07:00
Erich Blume	aa9cc709ec	Fix Kingfisher container: add bash and coreutils for CronJob shell Nix containers don't include a shell by default. The CronJob needs /bin/bash for the inline script that generates timestamped filenames. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 06:45:39 -07:00
Erich Blume	f0c6845f0f	Deploy custom Kingfisher container v165768b-f9206bf-nix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 06:42:24 -07:00
Erich Blume	f9206bf10b	Build custom Kingfisher container from sporked deploy branch (#318) ## Summary - Add Dockerfile for Kingfisher built from source (sporked deploy branch) - Multi-stage: Rust build with Boost/vectorscan, debian-slim runtime - Switch CronJob from upstream `ghcr.io/mongodb/kingfisher` to `registry.ops.eblu.me/blumeops/kingfisher` - Add kingfisher to service-versions.yaml (version tracks upstream main SHA) - Document spork workflow in CLAUDE.md ## Test plan - [ ] Build container: `mise run container-build-and-release kingfisher 1d37d29` - [ ] Verify image on registry: `mise run container-list` - [ ] Update kustomization newTag - [ ] Sync ArgoCD kingfisher app from branch - [ ] Trigger manual CronJob and verify scan completes - [ ] Verify reports on sifaka Reviewed-on: #318	2026-03-30 06:34:49 -07:00
Erich Blume	924325ebd5	Fix DinD seccomp profile broken by RuntimeDefault rollout The pod-level RuntimeDefault seccomp profile (`07e9c81`) overrides the DinD sidecar's privileged flag in newer Kubernetes versions, blocking Docker daemon syscalls. Set Unconfined explicitly on the DinD container while keeping RuntimeDefault on the runner container. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 17:09:57 -07:00
Erich Blume	bb60369956	Simplify Kingfisher CronJob to HTML-only output Remove the second scan pass for JSON — one format is enough for now. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 21:50:54 -07:00
Erich Blume	35705faca2	Add Kingfisher secret scanner CronJob (#317) ## Summary - Deploys MongoDB Kingfisher as a weekly CronJob on minikube-indri - Scans all Forgejo repos (eblume + all orgs) for leaked secrets with live validation - Produces timestamped HTML and JSON reports on sifaka NFS (`/volume1/reports/kingfisher/`) - Forgejo API token sourced from 1Password via ExternalSecret - Uses official `ghcr.io/mongodb/kingfisher:1.91.0` container image - Runs Sunday 4am (after Prowler's 3am k8s scan) ## Resources - CronJob, PV/PVC (sifaka NFS), ExternalSecret - ArgoCD Application with manual sync + CreateNamespace ## Test plan - [x] Sync ArgoCD `apps` app to pick up new kingfisher Application - [x] Set `--revision feature/kingfisher-cronjob` on kingfisher app - [x] Verify ExternalSecret creates the `kingfisher-forgejo-token` Secret - [x] Trigger manual job: `kubectl create job --from=cronjob/kingfisher kingfisher-manual -n kingfisher --context=minikube-indri` - [ ] Verify reports appear on sifaka at `/volume1/reports/kingfisher/` - [ ] After merge: set `--revision main` and re-sync Reviewed-on: #317	2026-03-28 21:39:55 -07:00
Forgejo Actions	7fb6eff388	Update docs release to v1.15.1 - Built changelog from towncrier fragments [skip ci]	2026-03-28 09:15:21 -07:00
Erich Blume	b632cd9ffb	Fix Immich resource limits and probe timeouts Resources were under wrong Helm value keys (server.resources, machine-learning.resources) and never applied to pods. Move to correct bjw-s chart paths (*.controllers.main.containers.main.resources). Increase liveness/readiness probe timeouts from 1s to 5s to prevent kubelet from killing healthy-but-busy pods during ML inference load. Remove CPU limits (keep requests only) to avoid throttling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 22:36:32 -07:00
Erich Blume	c78b86c72c	Add offsite backup for immich photo library to BorgBase (#315) ## Summary - Adds a second borgmatic config (`photos.yaml`) that backs up `/Volumes/photos` (sifaka SMB mount, ~128 GB) to a dedicated BorgBase repo (`immich-photos`), running daily at 4 AM - Separate launchd agent (`mcquack.eblume.borgmatic-photos`) so photo backups run independently from the main backup - Refactors `borgmatic_metrics` script to support multiple repos with a `repo` Prometheus label - Updates Grafana "Borg Backups" dashboard with a `repo` template variable so you can filter/compare repos - Docs updated: `backups.md`, `borgmatic.md` ## Prerequisites (manual) - [x] Create `immich-photos` repo on BorgBase with same SSH key - [ ] Upgrade BorgBase plan to Small ($24/yr) if currently on free tier (128 GB exceeds 10 GB limit) - [ ] After deploy: `borg init` the new repo (borgmatic does this automatically on first run) ## Test plan - [ ] Dry run: `mise run provision-indri -- --check --diff --tags borgmatic,borgmatic_metrics` - [ ] Deploy borgmatic role and verify both configs deployed - [ ] Run `borgmatic --config ~/.config/borgmatic/photos.yaml create --verbosity 1` manually for first backup (will take hours) - [ ] Verify metrics script collects from both repos: `~/.local/bin/borgmatic-metrics && cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom` - [ ] Sync grafana-config in ArgoCD and verify dashboard repo selector works 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #315	2026-03-27 19:43:05 -07:00
Erich Blume	ca0c9354ee	Add borgmatic backups for authentik and immich databases (#314) ## Summary - Add `authentik` database (blumeops-pg cluster) to borgmatic pg_dump backups - Add `immich` database (immich-pg cluster) to borgmatic pg_dump backups - For immich-pg: new borgmatic managed role with `pg_read_all_data`, ExternalSecret, Tailscale LoadBalancer service, and Caddy L4 TCP proxy on port 5433 - Update backup docs to reflect all four CNPG databases + mealie SQLite ## Deploy plan Deploy order matters — k8s resources must exist before ansible can route to them: 1. ArgoCD (databases app): sync to pick up immich-pg borgmatic role, ExternalSecret, and Tailscale service ``` argocd app set blumeops-pg --revision feature/borgmatic-all-pg-backups argocd app sync blumeops-pg ``` 2. Wait for `immich-pg-tailscale` service to get a Tailscale IP and `immich-pg.tail8d86e.ts.net` to resolve 3. Ansible (caddy): deploy Caddy L4 route for port 5433 ``` mise run provision-indri -- --tags caddy ``` 4. Ansible (borgmatic): deploy updated config and .pgpass ``` mise run provision-indri -- --tags borgmatic ``` 5. Verify: trigger a manual borgmatic run and check all four pg_dump streams succeed ``` borgmatic --verbosity 1 2>&1 \| grep -E '(Dumping\|ERROR)' ``` ## Test plan - [x] `kubectl kustomize` builds cleanly - [x] `ansible --check --diff` for borgmatic and caddy show expected changes - [ ] ArgoCD sync succeeds for databases app - [ ] `immich-pg.tail8d86e.ts.net` resolves - [ ] `pg.ops.eblu.me:5433` accepts connections - [ ] `borgmatic --verbosity 1` dumps all four databases without errors Reviewed-on: #314	2026-03-27 16:59:58 -07:00
Erich Blume	831b82950a	Upgrade nvidia-device-plugin v0.18.2 → v0.19.0 and add reference card Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:19:24 -07:00
Erich Blume	2c1652604b	Reduce PodNotReady alert lookback from 5m to 60s The 5-minute lookback window kept stale data from terminated pods visible during rollouts, causing the alert to sit in Pending for ~5 minutes after every routine deployment. 60s still covers two scrape cycles (30s interval) while clearing stale data much faster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 19:48:37 -07:00
Erich Blume	a37012385f	Tighten ArgoCDAppOutOfSync alert timing to clear faster after sync Reduced `for` from 30m to 5m and lookback window from 5m to 1m. The old values caused alerts to linger long after apps returned to Synced state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:44:09 -07:00


391 commits