blumeops

Author	SHA1	Message	Date
Erich Blume	35ae171783	C0: fix sync button location in manage-forgejo-mirrors The verify step pointed to the main repo page, but the "Synchronize now" button is in the Mirror settings section of the settings page. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 07:15:07 -07:00
Erich Blume	08a1cb164a	C0: fix 1password export filename in backup how-to 1Password's desktop app names exports as 1PasswordExport-<uuid>-<timestamp>.1pux automatically — you can't choose the name. Procedure now points the task at that glob.	2026-05-22 21:36:13 -07:00
Erich Blume	ee51bcafb4	Rip out compensating-controls framework (#359 ) ## Summary Removes the compensating-controls (CC) framework. Prowler and Kingfisher continue to run weekly and produce reports; the Prowler mutelist YAML files stay in place but no longer carry \`CC: <id>\` prefixes — each entry now just keeps a free-form \`Description\` of why it's muted. The CC review cadence proved to be more process overhead than this single-operator homelab needed. ## What changed Deleted - \`compensating-controls.yaml\` — the CC registry - \`mise-tasks/review-compensating-controls\` — the staleness-review task - \`docs/how-to/operations/review-compensating-controls.md\` - \`docs/how-to/operations/record-review-evidence.md\` (was aspirational) - \`docs/explanation/compliance-mute-categories.md\` (proposed-future CC/NA/RA work) - 5 orphan \`+review-cc-\` / \`+compliance-mute-categories\` changelog fragments Modified* - 6 mutelist YAML files: stripped \`CC: <id>.\` prefix from every \`Description\` / \`statement\` field, kept the free-form text - \`mise-tasks/review-compliance-reports\`: removed CC mentions from docstrings, panel text, and the node-verification table title. Node-verification logic itself is unchanged. - \`docs/reference/operations/security.md\`: removed the "Compensating controls" section - \`docs/how-to/operations/read-compliance-reports.md\`: rewrote step 3 of "Acting on findings" to point at the mutelist YAML directly - \`docs/changelog.d/prowler-iac-mutelist.infra.md\`: rewrote to drop the "two new compensating controls" framing ## What did not change - All Prowler manifests (cronjobs, RBAC, PVs, kustomization) — scans still run on the same schedule - The Kingfisher deployment - The trivy-shim in the Prowler container — that's about Trivy ignorefile plumbing, independent of the CC concept - The mutelist entries themselves — each \`Resources\` list is unchanged; only the prose of \`Description\` was edited - \`CHANGELOG.md\` — historical releases are left as-is ## Test plan - [ ] Wait for human review before deploying — once merged, re-point ArgoCD: \`argocd app set prowler --revision main && argocd app sync prowler\` (no manifest changes besides the ConfigMap, so impact is limited to muted-finding descriptions in next week's report) - [ ] Confirm next weekly Prowler K8s CIS run (Sunday 3am) still completes and produces a report on sifaka - [ ] Confirm next weekly Prowler IaC run still honors \`trivyignore.yaml\` (the trivy shim is untouched but the ignorefile content was rewritten) - [ ] \`mise run review-compliance-reports\` — verify node-verification block still runs and prints the renamed table title Reviewed-on: #359	2026-05-22 21:08:53 -07:00
Erich Blume	947e4310c3	C2: migrate immich from minikube to ringtail (mikado chain) (#356 ) ## Summary C2 Mikado chain to move the entire Immich stack (server, ML, valkey, postgres) off `minikube-indri` and onto `k3s-ringtail`. Immich is the largest single tenant on minikube (~1.5 GiB resident) and minikube is currently memory-saturated (97% RAM, swapping). This is the first concrete chain in the broader indri-k8s decommission effort. This PR contains the planning layer only — 7 cards (1 goal + 6 prerequisites). Implementation cycles follow per the Mikado Branch Invariant. ## Goal end-state - Immich `server`, `machine-learning`, `valkey` on ringtail. - ML pod uses ringtail's RTX 4080 (performance win — currently CPU-only). - CNPG `immich-pg` (PG17 + VectorChord) runs on ringtail. - Library still on sifaka NFS — ringtail mounts the same path. - `photos.ops.eblu.me` reroutes through Caddy → ringtail ingress. - Minikube `immich` and `immich-pg` are removed. ## Cards \| Card \| Depends on \| \|---\|---\| \| `migrate-immich-to-ringtail` (goal) \| all six below \| \| `cnpg-on-ringtail` \| — \| \| `immich-pg-on-ringtail` \| cnpg-on-ringtail \| \| `immich-pg-data-migration` \| immich-pg-on-ringtail \| \| `sifaka-nfs-from-ringtail` \| — \| \| `immich-app-on-ringtail` \| immich-pg-on-ringtail, sifaka-nfs-from-ringtail \| \| `immich-cutover-and-decommission` \| immich-pg-data-migration, immich-app-on-ringtail \| ## Key constraints - No data loss. Downtime is acceptable; data loss is not. Two surfaces matter: postgres (ML embeddings, face data — slow to re-derive) and the library files (don't move, but NFS access from ringtail must be verified). - Migration method: Option A is a CNPG `externalCluster` basebackup → promote. Option B is `pg_dump`/`pg_restore` as a documented fallback. Either way, dry-run against a scratch cluster first. - Why pg moves too (not cross-cluster): keeping pg on minikube would block the whole decommission, and Immich is chatty with pg so tailnet round-trips would hurt. ## Test plan - [ ] Plan review — does the dependency graph make sense? - [ ] `mise run docs-mikado migrate-immich-to-ringtail` shows the chain correctly. - [ ] Per-card implementation cycles land separately (commit convention enforced by hook). Reviewed-on: #356	2026-05-13 16:46:17 -07:00
Erich Blume	292d354902	C1: deploy adelaide-baby-shower-app to ringtail k3s (#349 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m12s Details ## Summary Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app). ### What's included - ArgoCD app + manifests under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer. - Public surface (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett. - fail2ban filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default). - Caddy route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`). - Pulumi Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`. - Grafana APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter. - Container `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages. - Docs runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments. ### Defense layers on the public surface 1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list) 2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP) 3. django-axes (5 fails / 1h, keyed on username+ip_address) 4. edge `/admin/` block (returns 403 for anything that isn't login/logout) ## Prerequisites for the user to do (NOT in this PR) Halted on these per request — they touch shared/manual systems: - [x] NFS share on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000` - [ ] 1Password item `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git - [ ] Container build: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix` - [x] DNS: `mise run dns-up` after merge - [x] Fly cert: `fly certs add shower.eblu.me -a blumeops-proxy` - [ ] Caddy push: `mise run provision-indri -- --tags caddy` - [ ] Fly redeploy to pick up the new nginx block + fail2ban jail: `mise run fly-deploy` - [ ] ArgoCD sync: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging ## Test plan - [ ] Container builds successfully on nix-container-builder runner - [ ] Pod starts, migrations run, gunicorn answers on :8000 - [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean - [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response - [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead` - [ ] Grafana dashboard "Shower APM" appears and starts showing traffic - [ ] `mise run services-check` passes Reviewed-on: #349	2026-05-11 13:47:18 -07:00
Erich Blume	074887cd57	C0: docs — explanation article on compliance mute categories Captures the CC vs NA vs RA distinction surfaced during the 2026-05-03 weekly compliance review (CVE-2026-31789), and the image-scan mutelist gap that blocks acting on it. Links the new article from the review-compensating-controls how-to so it isn't orphaned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:19:53 -07:00
Erich Blume	a2c61b625d	C0: rotate-fly-deploy-token — fish+bash one-shot, op validator gotcha Combine mint+store into a single command with both fish and bash forms (the doc previously only showed manual paste). Document the 1Password CLI "Password item requires ps value" validator error and the placeholder-password workaround for Password-category items with empty primary password fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 13:42:57 -07:00
Erich Blume	f6e392b80c	C1: SHA-pin tooling dependencies (2026-04 cycle) (#344 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m45s Details ## Summary Monthly tooling dependency refresh, with a one-time conversion from version-tag pins (`rev = "vX.Y.Z"`, `image:tag`, `>=`) to SHA / digest pins everywhere. ## Changes - prek hooks: all `rev = "vX.Y.Z"` → commit SHA + `# vX.Y.Z` comment. Bumped trufflehog (3.94.0→3.95.2), kingfisher (1.91.0→1.97.0), ruff (0.15.7→0.15.12), shfmt (3.13.0→3.13.1), prettier (3.8.1→3.8.3), actionlint (1.7.11→1.7.12). - fly/Dockerfile: tag pins → `image@sha256:...` digest pins. Bumped nginx (1.29.6→1.30.0-alpine), tailscale (v1.94.1→v1.94.2 — still inside the safe pre-1.96.5 range), alloy (v1.14.1→v1.16.0). - mise-tasks: PEP 723 inline deps converted from `>=` to `==` (PEP 508 doesn't support hashes inline). All scripts pinned to current latest: rich 15.0.0, typer 0.25.0, pyyaml 6.0.3, httpx 0.28.1. - prek `additional_dependencies`: ansible-lint==26.4.0, ansible-core==2.20.5. - taplo-lint: pass `--no-schema`. Upstream's `--default-schema-catalogs` returns a format taplo v0.9.3 can't parse — we don't validate against TOML schemas anyway, so this turns off the broken catalog fetch. - docs/update-tooling-dependencies: documents the SHA-pin convention, `docker buildx imagetools inspect` for digest lookup, and `prek clean` before re-verifying (cache grows to several GiB). Forgejo workflow `actions/checkout@v6.0.2` was already at the latest SHA — no change. ## Test plan - [x] `prek run --all-files` passes after `prek clean` - [x] `deploy-fly` workflow builds and deploys the new fly image on merge - [x] `fly status -a blumeops-proxy` healthy after deploy - [x] Spot-check a few mise tasks (`mise run blumeops-tasks`, `mise run docs-check-links`) to confirm pinned deps resolve cleanly Reviewed-on: #344	2026-04-30 16:51:43 -07:00
Erich Blume	8d634861f6	C1: migrate cv + docs from minikube to indri-native (#342 ) ## Summary Replace the cv (`cv.eblu.me`) and docs (`docs.eblu.me`) minikube Deployments with indri-native ansible roles. Caddy serves the extracted release tarballs directly via a new `kind: static` service-block — no daemon, no nginx pod, no ProxyGroup ingress on the request path. Mirrors the rationale of the recent devpi migration; part of the broader minikube wind-down. ## What's in this commit - `ansible/roles/{cv,docs}` — sentinel-gated tarball download + extract into `~/{cv,docs}/content/` - `ansible/roles/caddy/` — new `kind: static` branch in the Caddyfile template (encoded gzip, immutable cache headers for fingerprinted assets, optional `try_html` for Quartz-style clean URLs, optional per-path `download_paths` for the resume PDF's `Content-Disposition`) - `ansible/playbooks/indri.yml` — wires `cv` and `docs` roles before `caddy` - `service-versions.yaml` — both services flip to `type: ansible`. `docs.current-version` stays at `1.28.2` for this commit so `container-version-check` keeps passing while `containers/quartz/Dockerfile` still exists; it moves to the docs release tag in the cleanup commit - `.forgejo/workflows/{cv-deploy,build-blumeops}.yaml` — deploy step now bumps `cv_version`/`docs_version` in the role defaults and pushes; running ansible + purging the Fly cache is manual from gilbert (matches devpi) - Docs: `docs/how-to/operations/{cv,docs}-on-indri.md`, updated `docs/reference/services/{cv,docs}.md`, changelog fragment ## What is not in this commit The dead artifacts. After PR review and successful cutover, a follow-up commit deletes: - `argocd/apps/{cv,docs}.yaml` and `argocd/manifests/{cv,docs}/` - `containers/cv/`, `containers/quartz/` - `CONTAINER_TO_SERVICE['quartz']` mapping in `mise-tasks/container-version-check` - bumps `docs.current-version` in `service-versions.yaml` to the release tag ## Cutover plan (manual, from gilbert, after review) 1. Take down old: - Remove the cv and docs Applications: `argocd app delete cv --cascade && argocd app delete docs --cascade` - Verify k8s namespaces gone: `kubectl --context=minikube-indri get ns \| grep -E '^(cv\|docs)\\b'` (should be empty) - Verify tailnet MagicDNS no longer advertises the VIPs: `nslookup cv.tail8d86e.ts.net` and `nslookup docs.tail8d86e.ts.net` should both fail 2. Bring up new: - `mise run provision-indri -- --tags cv,docs,caddy --check --diff` (already validated on branch) - `mise run provision-indri -- --tags cv,docs,caddy` - `fly ssh console -a blumeops-proxy -C "sh -c 'rm -rf /tmp/cache && nginx -s reload'"` 3. Verify: `mise run services-check` and the curl checks listed in `docs/how-to/operations/{cv,docs}-on-indri.md` 4. Cleanup commit + merge. Total expected downtime: minutes (not the few-hour budget you authorized). ## Test plan - [ ] `mise run provision-indri -- --tags cv,docs --check --diff` clean - [ ] `mise run provision-indri -- --tags caddy --check --diff` shows only the cv + docs blocks changing as previewed in the PR thread - [ ] After cutover: `cv.eblu.me`, `cv.ops.eblu.me`, `docs.eblu.me`, `docs.ops.eblu.me` all return 200 - [ ] `cv.eblu.me/resume.pdf` includes `Content-Disposition: attachment` - [ ] A clean Quartz URL (e.g. `docs.eblu.me/explanation/agent-change-process`) resolves to the right page - [ ] `mise run services-check` clean - [ ] `mise run service-review --type ansible` shows cv and docs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #342	2026-04-29 14:55:11 -07:00
Erich Blume	14ca0160ba	Migrate devpi from minikube to indri (launchd) (#341 ) ## Summary Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent. ## What changed - New ansible role `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box). - Caddy: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`. - Playbook: `indri.yml` gains pre_tasks for the root password and the new role. - service-versions.yaml: devpi flipped from `type: argocd` to `type: ansible`. - ArgoCD: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted. - Docs: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list. ## Already deployed (live on indri) - Service running: `launchctl list mcquack.eblume.devpi` → PID 53888 - `curl https://pypi.ops.eblu.me/+api` returns 200 ✅ - `mise run docs-mikado` works again ✅ - 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/` - Minikube namespace and PVC fully reclaimed ## Test plan - [ ] `mise run services-check` (after merge) - [ ] CI workflows that use devpi succeed - [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines) ## Context This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain. Reviewed-on: #341	2026-04-29 13:38:36 -07:00
Erich Blume	005e2a03ed	C0: split gandi-operations docs; add dns-acme-cleanup mise task Splits the nebulous gandi-operations how-to into two single-topic cards (manage-eblu-me-dns, rotate-gandi-pat) and adds a mise task for the recurring _acme-challenge TXT cleanup needed due to a value-comparison bug in libdns/gandi v1.1.0 that prevents certmagic's cleanup phase from removing presented TXT values. The gandi reference card is updated to drop the false "different credential from Pulumi PAT" claim — verified during the 2026-04-27 incident that Caddy and Pulumi share a single PAT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 09:48:46 -07:00
Erich Blume	72b27b7fd2	C0: docs — add mealie borg restore how-to Captures the procedure used to restore mealie's SQLite DB from a borgmatic archive after the post-DR wipe: extract from borg, snapshot the wiped DB, swap via a helper pod on the ReadWriteOnce PVC, fix UID 911 ownership. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 19:04:28 -07:00
Erich Blume	7d94b9073a	C0: docs — default argocd login to --sso; drop extraneous --grpc-web Now that argocd's Authentik OAuth2 client is public, `argocd login --sso` works for day-to-day use. Promote it to the default in AGENTS.md, argocd-cli reference, and troubleshooting; keep the admin/password flow documented as a break-glass fallback for when Authentik is unavailable. Also drops --grpc-web from every interactive login command — confirmed extraneous (login succeeds without it). Left in CI workflows and `argocd cluster add` untouched; those are different contexts that I didn't re-test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 10:43:21 -07:00
Erich Blume	1425bf1f5c	Upgrade forgejo-runner to v12.8, adopt server.connections, and clean up docs (#338 ) ## Summary - consolidate forgejo-runner how-to docs into current cards - upgrade the k8s forgejo-runner deployment to the latest v12.8.x runner image - switch the k8s runner from first-boot register flow to declarative server.connections config - keep the runner image on the native Dagger build path and update the surrounding manifests/secrets ## Notes - PR opened early for C1 review - implementation and deployment verification will follow in subsequent commits Reviewed-on: #338	2026-04-20 09:03:54 -07:00
Erich Blume	353e2785c3	docs: review zot oidc client card	2026-04-20 07:55:25 -07:00
Erich Blume	53a7374ac1	C0: drop fix-ntfy-nix-version mikado card Historical one-shot fix from the zot hardening chain — knowledge is self-evident in containers/ntfy/default.nix and container-version-check regex. Should have been removed at mikado finalization. Scrubbed the two wiki-link references in add-container-version-sync-check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 07:26:53 -07:00
Erich Blume	d26a6ae3b2	Update docs for Caddy routing and direct WireGuard peering Comprehensive docs pass reflecting the new Fly proxy architecture: - Fly proxy routes through Caddy on indri (not per-service TS Ingress) - Direct WireGuard peering via --port=41641 pinning - DERP relay performance lesson in Tailscale docs - Caddy now in public traffic path - indri tagged as flyio-target - Removed fly-reload references - Updated architecture diagrams and per-service setup guide - Added changelog fragment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 09:57:30 -07:00
Erich Blume	9bafe85b2b	Add teslamate extensions to DR restore procedure The earthdistance extension (depends on cube) must be created before restoring the teslamate database — discovered missing after 2026-04-13 DR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 08:12:26 -07:00
Erich Blume	fe0e913963	Switch Fly proxy to upstream keepalive pools (#337 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m37s Details ## Summary - Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools - Reuses TLS connections through the Tailscale tunnel instead of handshaking per request - Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS) ## Trade-off DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this. ## Still TODO on this branch - [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder) - [ ] Docs pass - [ ] Deploy from branch and verify latency improvement - [ ] Changelog fragment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #337	2026-04-17 16:39:52 -07:00
Erich Blume	3ecd888537	Switch container builds to manual-only workflow dispatch Shared Dagger helpers (src/blumeops/) affect all Dagger-built containers, making path-based auto-triggers unreliable. All builds now go through `mise run container-build-and-release <name>`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 14:25:14 -07:00
Erich Blume	223b134776	Document uv.lock as the source of devpi dependency in Dagger builds The lockfile bakes in devpi URLs — Dagger does a locked install, not fresh resolution. This is the mechanism behind the cold-cache failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:41:45 -07:00
Erich Blume	ccaef4c1a7	Document devpi cold cache failure mode and deploy teslamate v3.0.0-08c698e After a DR rebuild, devpi's empty cache causes race conditions under concurrent load — metadata is served but wheel files 404. Also deploys the first container.py-built teslamate image. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:38:06 -07:00
Erich Blume	08c698e833	Migrate teslamate to native Dagger container.py (#333 ) Some checks failed Build Container / detect (push) Successful in 2s Details Build Container / build-dagger (teslamate) (push) Failing after 6s Details ## Summary - Replace legacy Dockerfile with native Dagger `container.py` build - Two-stage pipeline: Elixir+Node builder, Debian slim runtime - Uses shared helpers (`clone_from_forge`, `oci_labels`) - Delete old Dockerfile (pipeline auto-discovers container.py) - Update build-container-image docs and mark service reviewed ## Test plan - [x] `dagger call build --src=. --container-name=teslamate` succeeds locally - [ ] CI container build passes - [ ] Deploy from branch and verify teslamate starts cleanly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #333	2026-04-14 07:20:52 -07:00
Erich Blume	4ca0630d76	Review enforce-tag-immutability doc: add review date and zot reference link Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:00:55 -07:00
Erich Blume	d7c3c687f4	Document DR rebuild procedure and update restart-indri - New how-to: rebuild-minikube-cluster with full bootstrap procedure validated during 2026-04-13 DR event - Update restart-indri: warn about minikube delete, macOS permission dialog on first Tailscale SSH, forgejo_actions_secrets dep cycle - Update disaster-recovery reference: link to rebuild procedure - Update CLAUDE.md: never run minikube delete Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:07:54 -07:00
Erich Blume	61fcd5d70a	Upgrade grafana-sidecar 1.28.0 → 2.6.0 + container.py port (#332 ) All checks were successful Build Container / detect (push) Successful in 4s Details Build Container / build-dagger (grafana-sidecar) (push) Successful in 1m50s Details ## Summary - Upgrade grafana-sidecar from 1.28.0 to 2.6.0 (the 2.x memory regression #462 is resolved; ~35MB static overhead is acceptable) - Port build from Dockerfile to native Dagger container.py - Add liveness/readiness probes using the new /healthz endpoint on port 8080 - Update docs to reflect container.py migration and remove stale pin note ## Test plan - [ ] Build container: `mise run container-build-and-release grafana-sidecar` - [ ] Update kustomization tag with new image tag - [ ] Deploy from branch: `argocd app set grafana --revision grafana-sidecar-2.6.0 && argocd app sync grafana` - [ ] Verify sidecar health endpoint: `kubectl exec -n monitoring <pod> -c grafana-sc-dashboard -- wget -qO- http://localhost:8080/healthz` - [ ] Verify dashboards load in Grafana UI - [ ] `mise run services-check` Reviewed-on: #332	2026-04-13 07:57:13 -07:00
Erich Blume	6e60287e99	Doc review: delete install-dagger-on-nix-runner, add service-versions ref card Outdated leaf card removed; zot.md now links to new service-versions reference card instead. Added reverse link from review-services. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:52:38 -07:00
Erich Blume	8d80a4a3a5	Rewrite runner-logs: API-based log fetching, multi-repo support Replace broken SSH+filesystem log retrieval with Forgejo web API endpoint. Fix CLI to use run numbers (not task IDs), add --repo for querying any forge repo (e.g. sporks), --limit/-n for listing size. Document runner-logs as the way to verify build success in CLAUDE.md and container build docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:42:58 -07:00
Erich Blume	138e23d525	Miniflux 2.2.19 + container.py migration + ty typechecker (#331 ) All checks were successful Build Container / detect (push) Successful in 3s Details Build Container / build-dagger (miniflux) (push) Successful in 1m3s Details ## Summary - Upgrade miniflux from 2.2.17 to 2.2.19 (security hardening, performance improvements) - Migrate miniflux from Dockerfile to native Dagger container.py build - Refactor `alpine_runtime()` helper to support existing users (nobody/65534) - Add `ty` (Astral) Python typechecker to prek hooks ## Test plan - [ ] `dagger call build --src=. --container-name=miniflux` succeeds - [ ] `dagger call container-version --container-name=miniflux` returns 2.2.19 - [ ] `mise run container-version-check` passes - [ ] `ty check` passes cleanly - [ ] `prek run --all-files` passes - [ ] CI builds container successfully - [ ] Miniflux healthcheck passes after deploy from branch Reviewed-on: #331	2026-04-12 08:54:32 -07:00
Erich Blume	c86b5d7772	Native Dagger container builds + Navidrome v0.61.1 (#330 ) All checks were successful Build Container / detect (push) Successful in 3s Details Build Container / build-dagger (navidrome) (push) Successful in 22m26s Details ## Summary - Move Dagger module from `.dagger/` to repo root (`src/blumeops/`), rename `blumeops-ci` → `blumeops` - Replace opaque `docker_build()` with native Dagger pipelines that surface full build errors per step - Migrate navidrome as the first container (`containers/navidrome/container.py`) - Upgrade navidrome from v0.60.3 to v0.61.1 (major artwork overhaul, SQLite FTS5 search, server-managed transcoding) - Add `dagger call container-version` for CI version extraction without Dockerfile parsing - All mise tasks (`container-list`, `container-version-check`, `container-build-and-release`) updated for hybrid mode - Legacy `docker_build()` fallback preserved for all other containers ## Motivation When navidrome v0.61.0 added a new Go build tag (`sqlite_fts5`), `docker_build()` showed only "exit code: 1". We had to run `docker build --progress=plain` manually to find `undefined: buildtags.SQLITE_FTS5`. Native Dagger pipelines show the full error inline. ## Container build dispatch needed After merge, dispatch container build for navidrome: ``` mise run container-build-and-release navidrome --ref `470b4bd` ``` ## Deploy steps 1. Wait for container build to complete 2. Back up navidrome-data PVC (non-reversible DB migrations) 3. `argocd app set navidrome --revision main && argocd app sync navidrome` 4. Verify at https://dj.ops.eblu.me ## Future Remaining containers migrate incrementally in follow-up PRs using the same pattern. Reviewed-on: #330	2026-04-11 17:11:56 -07:00
Erich Blume	a059d81314	Add review-compliance-reports task and reorganize report storage New mise task fetches Prowler reports from sifaka, parses with proper muted/unmuted distinction, shows week-over-week delta, and includes a scaffold for Kingfisher once JSON/CSV output is available upstream. Moved all legacy top-level reports on sifaka into date subdirectories to match the current CronJob output structure. Updated read-compliance-reports doc with task reference and links. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 10:16:46 -07:00
Erich Blume	64200a55c5	Migrate Immich from Helm chart to kustomize manifests (v2.5.6 → v2.6.3) Replace the Helm chart deployment with plain kustomize manifests following the Authentik pattern (separate deployments per component). Consolidate the immich-storage ArgoCD app into the main immich app. Add no-helm-policy doc establishing kustomize as the standard deployment mechanism. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 09:42:25 -07:00
Erich Blume	baee7ae54b	Review single-user-cluster control and add evidence collection card Stamp single-user-cluster last-reviewed to 2026-04-01 after verifying Tailscale ACLs and kubeconfig distribution. Add aspirational how-to card documenting what PCI DSS evidence collection would look like (CCW, artifacts, Drata workflow). Link from existing review process card. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 22:01:57 -07:00
Erich Blume	a18a424866	Pin NixOS service versions via nixpkgs-services overlay (#321 ) ## Summary - Add `nixpkgs-services` flake input pinned to a specific nixpkgs commit, with an overlay that pulls `forgejo-runner`, `snowflake`, and `k3s` from it instead of the rolling `nixpkgs` - Dagger `flake-update` pipeline now excludes `nixpkgs-services` via `--exclude` - Fix stale nix-container-builder version in service-versions.yaml (was 12.6.4, actually running 12.7.2) - Add k3s and minikube to service-versions.yaml tracking - Document the pinning approach in review-services how-to and ringtail reference ## Motivation During service review, discovered that flake updates had silently upgraded forgejo-runner from 12.6.4 → 12.7.2 without updating service-versions.yaml. This "sneak-in upgrade" bypasses the service review process. The overlay ensures these three services only change versions deliberately. ## Test plan - [ ] Verify `nix flake update` from `nixos/ringtail/` does not change `nixpkgs-services` lock entry - [ ] Verify `mise run provision-ringtail` builds successfully with the overlay - [ ] Confirm running service versions unchanged after deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #321	2026-04-01 21:37:57 -07:00
Erich Blume	4059b3d27b	Add compensating controls framework and date-based report dirs (#320 ) ## Summary - Add `compensating-controls.yaml` tracking 9 named controls that justify suppressed security findings - Update all Prowler mutelist descriptions with `CC: <id>` references to named controls - Add `mise run review-compensating-controls` task — surfaces stalest control with all codebase references - Add [[review-compensating-controls]] how-to doc - Organize Prowler and Kingfisher reports into `YYYY-MM-DD` subdirectories ### Compensating controls \| ID \| Mitigates \| \|----\|-----------\| \| `single-user-cluster` \| Image cache abuse, RBAC breadth, system pod privileges \| \| `tailscale-network-isolation` \| Profiling endpoints, weak TLS, debug ports \| \| `local-registry` \| AlwaysPullImages gap \| \| `sso-gated-admin-tools` \| ArgoCD wildcard RBAC \| \| `operator-managed-pods` \| Tailscale proxy pod security settings \| \| `ephemeral-privileged-jobs` \| Prowler hostPID exposure \| \| `trusted-ci-only` \| Forgejo runner DinD \| \| `init-container-isolation` \| Grafana root init container \| \| `observability-stack-audit` \| Missing apiserver audit logging \| ## Test plan - [ ] `mise run review-compensating-controls` shows table and references - [ ] `kubectl kustomize argocd/manifests/prowler/` renders correctly - [ ] Sync prowler and kingfisher, verify next scan writes to dated subdirectory - [ ] Grep for `CC:` in mutelist files — every muted finding should have at least one 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #320	2026-03-30 17:44:11 -07:00
Erich Blume	f9206bf10b	Build custom Kingfisher container from sporked deploy branch (#318 ) All checks were successful Build Container / detect (push) Successful in 2s Details Build Container / build-nix (kingfisher) (push) Successful in 12s Details ## Summary - Add Dockerfile for Kingfisher built from source (sporked deploy branch) - Multi-stage: Rust build with Boost/vectorscan, debian-slim runtime - Switch CronJob from upstream `ghcr.io/mongodb/kingfisher` to `registry.ops.eblu.me/blumeops/kingfisher` - Add kingfisher to service-versions.yaml (version tracks upstream main SHA) - Document spork workflow in CLAUDE.md ## Test plan - [ ] Build container: `mise run container-build-and-release kingfisher 1d37d29` - [ ] Verify image on registry: `mise run container-list` - [ ] Update kustomization newTag - [ ] Sync ArgoCD kingfisher app from branch - [ ] Trigger manual CronJob and verify scan completes - [ ] Verify reports on sifaka Reviewed-on: #318	2026-03-30 06:34:49 -07:00
Erich Blume	99df78664e	Note upstream history rewrite as a spork sync failure mode Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 08:22:11 -07:00
Erich Blume	6ecfaf02b6	Add spork strategy: tooling and documentation Spork-create mise task sets up a floating-branch soft-fork of a mirrored upstream project with daily mirror-sync via Forgejo Actions. Includes explanation card, how-to guides for setup and branch management, and the spork-create uv script. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 22:58:10 -07:00
Erich Blume	2bd1611ac1	Document sifaka NFS/Tailscale TUN troubleshooting Sifaka's Tailscale can revert to userspace networking after package updates, causing NFS mounts to fail because the NFS daemon sees 127.0.0.1 instead of the client's Tailscale IP. Added troubleshooting how-to doc and updated sifaka reference card with frigate export and TUN requirement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 09:12:00 -07:00
Erich Blume	3017f759a7	Migrate Forgejo from Homebrew to source build (#316 ) ## Summary - Migrate Forgejo from Homebrew to source-built binary with mcquack LaunchAgent - Matches the established pattern used by zot, caddy, and alloy - Upgrades to v14.0.3 (7 security fixes: PKCE bypass, OAuth scope bypass, open redirect, and more) ## Changes - Ansible role: Replace brew install/services with binary stat check + LaunchAgent - Paths: `/opt/homebrew/var/forgejo` → `~/forgejo`, binary at `~/code/3rd/forgejo/forgejo` - Run user: `forgejo` → `erichblume` (LaunchAgent user; SSH git user stays `forgejo`) - Docs: Updated Forgejo reference card, restart-indri guide - Service review: Stamped frigate-notify, cloudnative-pg, blumeops-pg as current ## One-time migration steps (manual, on indri) 1. Clone from Codeberg, add forge mirror remote 2. Check out v14.0.3, build with `make build && make forgejo` 3. Stop brew, `cp -a` data to `~/forgejo`, fix ownership 4. Run `provision-indri --tags forgejo` 5. Verify, then `brew uninstall forgejo` ## Data safety - `cp -a` preserves everything (repos, SQLite DB, LFS, sessions, OAuth config) - Brew version stays installed as rollback until verification passes - No schema changes between 14.0.2 → 14.0.3 Reviewed-on: #316	2026-03-28 08:19:23 -07:00
Erich Blume	66a47738dd	Add ringtail post-deploy maintenance: kernel check, generation pruning, GC Update manage-lockfile doc with post-deploy steps (kernel update detection, reboot guidance, generation pruning). Add prune-ringtail-generations mise task that keeps the 5 most recent generations plus the most recent one matching the booted kernel for safe rollback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:55:45 -07:00
Erich Blume	687e972713	Review CV doc and close build-dep review gap Fix stale CV service doc (URL, forge domain, container tag) and add guidance for reviewing build-time dependencies in private forge repos during service reviews. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:12:38 -07:00
Erich Blume	fe201a495c	Add Prowler IaC scanning of blumeops repo (Saturday 2am) Clone repo in init container, scan Dockerfiles and K8s manifests with Prowler's IaC provider (Trivy). Reports written to sifaka:/volume1/reports/prowler-iac/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:49:38 -07:00
Erich Blume	696024306c	Add Prowler image vulnerability scanning for blumeops containers All checks were successful Build Container / detect (push) Successful in 39s Details Build Container / build-dockerfile (prowler) (push) Successful in 10m15s Details Add Trivy to the Prowler container for image and IaC scanning. New CronJob (Saturday 3am) scans all blumeops/* images in the registry for CVEs, embedded secrets, and Dockerfile misconfigs. Reports written to sifaka:/volume1/reports/prowler-images/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:43:08 -07:00
Erich Blume	d021b3534f	Deploy Prowler CIS scanner (#310 ) All checks were successful Build Container / detect (push) Successful in 4s Details Build Container / build-dockerfile (prowler) (push) Successful in 10s Details ## Summary - Deploy Prowler 5 as a weekly CronJob on minikube-indri for CIS Kubernetes Benchmark v1.11 scanning - Custom slim container build (strips PowerShell, Trivy, and non-K8s providers from upstream) - Reports (HTML, CSV, JSON-OCSF) written to NFS share on sifaka at `/volume1/reports/prowler/` - Read-only ClusterRole for pod, RBAC, and control plane inspection - Host path mounts + hostPID for kubelet file permission checks ## Follow-ups - Mirror prowler-cloud/prowler on forge for supply chain control - Build and push container image, update kustomization.yaml newTag - Consider adding k3s-ringtail scanning (core + RBAC checks only) ## Test plan - [ ] Build container: `mise run container-release prowler v5.22.0` - [ ] Update `argocd/manifests/prowler/kustomization.yaml` newTag to built image tag - [ ] Sync ArgoCD: `argocd app sync apps && argocd app set prowler --revision deploy-prowler && argocd app sync prowler` - [ ] Trigger manual job: `kubectl create job --from=cronjob/prowler prowler-manual -n prowler --context=minikube-indri` - [ ] Verify reports appear on sifaka NFS share - [ ] `mise run services-check` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #310	2026-03-24 16:08:09 -07:00
Erich Blume	fd0bebb0fc	Localize authentik-redis container (#309 ) All checks were successful Build Container / detect (push) Successful in 3s Details Build Container / build-dockerfile (alloy) (push) Successful in 12s Details Build Container / build-dockerfile (ntfy) (push) Successful in 11s Details Build Container / build-nix (alloy) (push) Successful in 20s Details Build Container / build-nix (authentik) (push) Successful in 6m10s Details Build Container / build-nix (authentik-redis) (push) Successful in 20s Details Build Container / build-nix (ntfy) (push) Successful in 6s Details ## Summary - Replace upstream `docker.io/library/redis:7-alpine` (Redis 7.4.8) with a nix-built container using Redis 8.2.3 from nixpkgs - Introduce attached service pattern: `parent` field in service-versions.yaml, `<parent>-<component>` naming convention, and `assert pkgs.redis.version == version` in default.nix to prevent silent version drift on `flake.lock` updates - Document the pattern in [[review-services]] so future attached services slot in cleanly - Backfill `parent: grafana` on existing `grafana-sidecar` entry ## Version drift protection 1. `flake.lock` update bumps nixpkgs redis → `assert` in `default.nix` breaks `nix-build` 2. Developer updates `version` in `default.nix` → prek's `container-version-check` demands matching `service-versions.yaml` update 3. Both must agree before commit succeeds ## Test plan - [ ] Build container from branch on ringtail (`mise run container-build-and-release authentik-redis`) - [ ] Update kustomization `newTag` to branch-built image tag - [ ] Sync authentik ArgoCD app from branch (`argocd app set authentik --revision localize-redis && argocd app sync authentik`) - [ ] Verify Authentik login, session persistence, and task queue still work - [ ] After merge: C0 follow-up to update `newTag` to the main-built image tag 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #309	2026-03-24 13:27:36 -07:00
Erich Blume	fc45989a6c	Decommission JobSync service (#308 ) All checks were successful Build Container / detect (push) Successful in 3s Details ## Summary - Remove all JobSync infrastructure: ArgoCD app, k8s manifests, container build (nix), Caddy reverse proxy entry, Homepage dashboard entry, service-versions tracking, and all documentation - Runtime teardown already completed: ArgoCD app cascade-deleted (removes deployment, PVC, service, ingress, external-secret), forge mirror deleted, 1Password item archived, local clone removed ## Motivation Replacing JobSync with a datasette-based job tracking pipeline driven by mise tasks and a Claude agent frontend. JobSync's Next.js server actions don't expose a useful API for automation. ## Remaining manual steps after merge - Provision Caddy to remove the stale proxy route: `mise run provision-indri -- --tags caddy` - Sync Homepage: `argocd app sync homepage` - Verify namespace cleanup on ringtail: `kubectl get ns jobsync --context=k3s-ringtail` (should be gone) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #308	2026-03-24 08:44:23 -07:00
Erich Blume	bd0ff30d3f	Unify container build workflows (#306 ) All checks were successful Build Container / detect (push) Successful in 3s Details ## Summary - Merges `build-container.yaml` and `build-container-nix.yaml` into a single workflow - Detect job classifies each changed container by presence of `Dockerfile` and/or `default.nix` - Dockerfile containers build on `k8s` (indri) via Dagger; Nix containers build on `nix-container-builder` (ringtail) via nix-build + skopeo - Containers with both build files (alloy, nettest, ntfy) get built on both runners ## Test plan - [ ] Push a change to a Dockerfile-only container (e.g. grafana) — verify it builds on k8s only - [ ] Push a change to a nix-only container (e.g. jobsync) — verify it builds on nix-container-builder only - [ ] Push a change to a dual container (e.g. ntfy) — verify it builds on both runners - [ ] Test workflow_dispatch with a specific container name 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #306	2026-03-23 20:55:50 -07:00
Erich Blume	6d65e6928c	C2: Deploy infrastructure alerting pipeline (#303 ) ## Summary Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications. Design: - Grafana Unified Alerting evaluates rules against Prometheus/Loki - ntfy webhook contact point delivers iOS notifications - Anti-noise policy: page once per 24h per alert group - Every alert links to a runbook in `docs/how-to/alerts/` - services-check eventually queries the alerting API instead of doing its own probes Chain (bottom-up): 1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy 2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure 3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks 4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API 5. `deploy-infra-alerting` — goal card 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #303	2026-03-22 14:52:56 -07:00
Erich Blume	e5ce510fdc	Fix plan-a-meal random recipe API queries Mealie's orderBy=random requires a paginationSeed parameter, otherwise the API returns 422. Added the seed to all random query examples. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 11:10:48 -07:00

1 2 3 4

179 commits