blumeops

Author	SHA1	Message	Date
Erich Blume	b21d13fe20	C2(migrate-immich-to-ringtail): finalize chain — strip mikado frontmatter, add changelog Immich is fully migrated off minikube-indri onto k3s-ringtail. All six prerequisite cards plus the goal card converted to historical documentation by removing status/branch/requires Mikado frontmatter. Changelog fragment added at docs/changelog.d/migrate-immich-to-ringtail.infra.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:46:27 -07:00
Erich Blume	7400807be3	C2(migrate-immich-to-ringtail): close immich-cutover-and-decommission Sequence executed: 1. Quiesced source: immich-server + immich-machine-learning on minikube scaled to 0 (done in immich-pg-data-migration). 2. Deleted minikube immich-tailscale Ingress; waited for "photos" Tailscale device to deregister. 3. (Promote of ringtail pg was done in immich-pg-data-migration.) 4. Renamed ringtail ingress tls.hosts photos-ringtail -> photos. 5. Caddy was already pointing photos.ops.eblu.me -> photos.tail8d86e.ts.net so no Ansible change needed. 6. Smoke test: photos.ops.eblu.me/api/server/ping -> 200, /api/server/version -> {"major":2,"minor":6,"patch":3}. 7. Borgmatic continuity: added a ringtail immich-pg-tailscale Service (same FQDN as before, immich-pg.tail8d86e.ts.net). Verified borgmatic role can SELECT count(*) FROM asset over the tailnet (returned 12681, matches source). Decommission: - Deleted argocd Application "immich" with --cascade (clears Deployments, Services, etc. on minikube). - Pruned blumeops-pg Application against the branch which removed the Cluster immich-pg, its ExternalSecret, and the old immich-pg-tailscale Service from minikube. - Deleted leftover Released PVs on minikube. - Deleted the empty immich namespace on minikube. Did not verify minikube host memory drop directly (tailscale-ssh re-auth was prompting at the time). Caller should confirm via "docker stats minikube" once SSH is re-authenticated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:42:31 -07:00
Erich Blume	18e6c7ef5d	C2(migrate-immich-to-ringtail): close immich-app-on-ringtail All three pods Running, 1/1 Ready: - immich-server: v2.6.3, connected to ringtail pg + valkey ("/api/server/ping" returns 200, "/api/server/version" returns v2.6.3) - immich-machine-learning: CUDA variant, RTX 4080 attached (nvidia-smi shows 8 GiB used / 16 GiB total — shared with frigate via time-slicing), gunicorn workers booted - immich-valkey: upstream multi-arch docker.io/valkey/valkey:8.1.6 immich-db Secret in the immich namespace created manually with source's immich-pg-app password (matches minikube pattern). Tailscale ingress staging hostname: photos-ringtail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:23:24 -07:00
Erich Blume	674ca2ced9	C2(migrate-immich-to-ringtail): close immich-pg-data-migration Migration via CNPG pg_basebackup (Option A) completed cleanly. Sequence: 1. Stopped immich-server + immich-machine-learning on minikube (scaled to 0). valkey + source pg kept running. 2. Copied minikube's immich-pg-ca + immich-pg-replication secrets to ringtail as source-immich-pg-{ca,replication}. 3. Recreated the ringtail immich-pg Cluster with bootstrap.pg_basebackup, replica.enabled=true, externalClusters pointing at immich-pg.tail8d86e.ts.net via the streaming_replica TLS cert. 4. Basebackup completed in ~50s. Replica caught up streaming. 5. Verified row counts identical between source and replica: asset=12681, user=1, album=28, smart_search=9624, activity=0, asset_face=3917. 6. Promoted via replica.enabled=false. pg_is_in_recovery → false. Write test passed. All 7 expected extensions present in immich db (vector, vchord, cube, earthdistance, pg_trgm, unaccent, uuid-ossp). 7. Pruned bootstrap + externalClusters blocks; deleted out-of-band replication secrets. Source minikube immich-pg is intact and untouched — recovery path remains available until immich-cutover-and-decommission completes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:12:21 -07:00
Erich Blume	be5255d685	C2(migrate-immich-to-ringtail): close sifaka-nfs-from-ringtail Verified on k3s-ringtail: - Sifaka NFS export /volume1/photos covers 192.168.1.0/24 + 100.64.0.0/10. Ringtail at 192.168.1.21 is in scope; no DSM rule changes needed. - nfs-test pod mounted the share, read existing library/ thumbs/ backups/ encoded-video/ profile/, wrote a temp file, deleted it. - DNS resolution: sifaka → 192.168.1.203 (LAN). NFS traffic stays off tailnet, avoiding the sifaka-tailscale-userspace concern. - Committed PV + PVC bind on first apply (RWX, 2Ti). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:12:21 -07:00
Erich Blume	4c6695868d	C2(migrate-immich-to-ringtail): close immich-pg-on-ringtail Verified on k3s-ringtail: - Cluster immich-pg reached "Cluster in healthy state" (1/1 instance) - borgmatic role: rolcanlogin=t, member of pg_read_all_data - ExternalSecret immich-pg-borgmatic: Ready=True, username=borgmatic - Extensions vchord, vector, cube, earthdistance installed in postgres db (immich db extensions deferred to app startup per the card) 10 GiB local-path storage; same VectorChord image as minikube source. Bootstrap is empty initdb today; will be rewritten when immich-pg-data-migration picks its import method. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:12:21 -07:00
Erich Blume	e1fe5d2ea6	C2(migrate-immich-to-ringtail): close cnpg-on-ringtail Verified: cnpg-controller-manager pod Ready on k3s-ringtail; CRDs clusters.postgresql.cnpg.io etc. installed; ArgoCD app Synced/Healthy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:12:21 -07:00
Erich Blume	bca5c40663	C2(migrate-immich-to-ringtail): plan capture GPU contention + valkey arch on immich-app-on-ringtail Two discovered prereqs while bringing the immich stack up on ringtail: 1. nvidia-device-plugin time-slicing on ringtail advertises only 2 virtual GPUs. Frigate + Ollama consume both. immich-ml's nvidia.com/gpu:1 cannot schedule until replicas is bumped to >= 3. 2. The registry.ops.eblu.me/blumeops/valkey image was built on indri (arm64) and is single-arch. Pulling on ringtail (amd64) crashloops with "exec format error". Use the upstream multi-arch docker.io/valkey/valkey image directly until the mirror gets a multi-arch tag. Card body updated to capture both. Next impl incorporates the fixes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 13:12:09 -07:00
Erich Blume	355be3fbc4	C2(migrate-immich-to-ringtail): plan correct extension-verification on immich-pg-on-ringtail card CNPG's bootstrap.initdb.postInitSQL runs against the postgres superuser database, not the application database. Extensions declared there end up in the postgres db, not the immich db. The Immich app installs them in its own database at startup. This matches the existing minikube cluster's behavior — same Cluster CR, same effect. Adjusting the card's verification to reflect reality rather than (incorrectly) requiring extensions to be present in the immich db pre-app-deploy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 12:25:30 -07:00
Erich Blume	db37e7cc3e	C2(migrate-immich-to-ringtail): plan capture two discovered concerns 1. Registering new ArgoCD apps from a feature branch: the app-of-apps "apps" Application is self-managing (re-reads apps.yaml on every sync, which pins targetRevision: main). So setting its revision to a branch doesn't stick across syncs, and new app definitions on a branch are invisible to the cluster via the normal flow. The goal card now documents the kubectl-apply + per-new-app `argocd app set --revision <branch>` workaround. 2. Tailscale device-name collision on cutover. The minikube immich ingress claims tailnet hostname "photos" (tls.hosts: [photos]). The ringtail ingress can't claim the same name while minikube's is alive (Tailscale enforces uniqueness). Staging uses tls.hosts: [photos-ringtail], with the rename to "photos" baked into immich-cutover-and-decommission step 2 + step 5. Card dependency graph unchanged; no new cards. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 12:21:57 -07:00
Erich Blume	4623733695	C2(migrate-immich-to-ringtail): plan introduce mikado chain Goal: move immich (server, ML, valkey, postgres) off minikube-indri onto k3s-ringtail. Immich is the largest single tenant on minikube (~1.5 GiB resident) and minikube is memory-saturated. Prerequisite cards: - cnpg-on-ringtail - immich-pg-on-ringtail (requires cnpg-on-ringtail) - immich-pg-data-migration (requires immich-pg-on-ringtail) - sifaka-nfs-from-ringtail - immich-app-on-ringtail (requires immich-pg-on-ringtail, sifaka-nfs-from-ringtail) - immich-cutover-and-decommission (requires immich-pg-data-migration, immich-app-on-ringtail) Data loss is a critical failure; downtime is acceptable. The cutover plan favors a CNPG externalCluster basebackup (Option A) with pg_dump as the documented fallback (Option B). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 11:05:40 -07:00
Erich Blume	292d354902	C1: deploy adelaide-baby-shower-app to ringtail k3s (#349) ## Summary Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app). ### What's included - ArgoCD app + manifests under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer. - Public surface (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett. - fail2ban filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default). - Caddy route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`). - Pulumi Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`. - Grafana APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter. - Container `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages. - Docs runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments. ### Defense layers on the public surface 1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list) 2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP) 3. django-axes (5 fails / 1h, keyed on username+ip_address) 4. edge `/admin/` block (returns 403 for anything that isn't login/logout) ## Prerequisites for the user to do (NOT in this PR) Halted on these per request — they touch shared/manual systems: - [x] NFS share on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000` - [ ] 1Password item `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git - [ ] Container build: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix` - [x] DNS: `mise run dns-up` after merge - [x] Fly cert: `fly certs add shower.eblu.me -a blumeops-proxy` - [ ] Caddy push: `mise run provision-indri -- --tags caddy` - [ ] Fly redeploy to pick up the new nginx block + fail2ban jail: `mise run fly-deploy` - [ ] ArgoCD sync: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging ## Test plan - [ ] Container builds successfully on nix-container-builder runner - [ ] Pod starts, migrations run, gunicorn answers on :8000 - [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean - [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response - [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead` - [ ] Grafana dashboard "Shower APM" appears and starts showing traffic - [ ] `mise run services-check` passes Reviewed-on: #349	2026-05-11 13:47:18 -07:00
Erich Blume	074887cd57	C0: docs — explanation article on compliance mute categories Captures the CC vs NA vs RA distinction surfaced during the 2026-05-03 weekly compliance review (CVE-2026-31789), and the image-scan mutelist gap that blocks acting on it. Links the new article from the review-compensating-controls how-to so it isn't orphaned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:19:53 -07:00
Erich Blume	a2c61b625d	C0: rotate-fly-deploy-token — fish+bash one-shot, op validator gotcha Combine mint+store into a single command with both fish and bash forms (the doc previously only showed manual paste). Document the 1Password CLI "Password item requires ps value" validator error and the placeholder-password workaround for Password-category items with empty primary password fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 13:42:57 -07:00
Erich Blume	f6e392b80c	C1: SHA-pin tooling dependencies (2026-04 cycle) (#344) ## Summary Monthly tooling dependency refresh, with a one-time conversion from version-tag pins (`rev = "vX.Y.Z"`, `image:tag`, `>=`) to SHA / digest pins everywhere. ## Changes - prek hooks: all `rev = "vX.Y.Z"` → commit SHA + `# vX.Y.Z` comment. Bumped trufflehog (3.94.0→3.95.2), kingfisher (1.91.0→1.97.0), ruff (0.15.7→0.15.12), shfmt (3.13.0→3.13.1), prettier (3.8.1→3.8.3), actionlint (1.7.11→1.7.12). - fly/Dockerfile: tag pins → `image@sha256:...` digest pins. Bumped nginx (1.29.6→1.30.0-alpine), tailscale (v1.94.1→v1.94.2 — still inside the safe pre-1.96.5 range), alloy (v1.14.1→v1.16.0). - mise-tasks: PEP 723 inline deps converted from `>=` to `==` (PEP 508 doesn't support hashes inline). All scripts pinned to current latest: rich 15.0.0, typer 0.25.0, pyyaml 6.0.3, httpx 0.28.1. - prek `additional_dependencies`: ansible-lint==26.4.0, ansible-core==2.20.5. - taplo-lint: pass `--no-schema`. Upstream's `--default-schema-catalogs` returns a format taplo v0.9.3 can't parse — we don't validate against TOML schemas anyway, so this turns off the broken catalog fetch. - docs/update-tooling-dependencies: documents the SHA-pin convention, `docker buildx imagetools inspect` for digest lookup, and `prek clean` before re-verifying (cache grows to several GiB). Forgejo workflow `actions/checkout@v6.0.2` was already at the latest SHA — no change. ## Test plan - [x] `prek run --all-files` passes after `prek clean` - [x] `deploy-fly` workflow builds and deploys the new fly image on merge - [x] `fly status -a blumeops-proxy` healthy after deploy - [x] Spot-check a few mise tasks (`mise run blumeops-tasks`, `mise run docs-check-links`) to confirm pinned deps resolve cleanly Reviewed-on: #344	2026-04-30 16:51:43 -07:00
Erich Blume	8d634861f6	C1: migrate cv + docs from minikube to indri-native (#342) ## Summary Replace the cv (`cv.eblu.me`) and docs (`docs.eblu.me`) minikube Deployments with indri-native ansible roles. Caddy serves the extracted release tarballs directly via a new `kind: static` service-block — no daemon, no nginx pod, no ProxyGroup ingress on the request path. Mirrors the rationale of the recent devpi migration; part of the broader minikube wind-down. ## What's in this commit - `ansible/roles/{cv,docs}` — sentinel-gated tarball download + extract into `~/{cv,docs}/content/` - `ansible/roles/caddy/` — new `kind: static` branch in the Caddyfile template (encoded gzip, immutable cache headers for fingerprinted assets, optional `try_html` for Quartz-style clean URLs, optional per-path `download_paths` for the resume PDF's `Content-Disposition`) - `ansible/playbooks/indri.yml` — wires `cv` and `docs` roles before `caddy` - `service-versions.yaml` — both services flip to `type: ansible`. `docs.current-version` stays at `1.28.2` for this commit so `container-version-check` keeps passing while `containers/quartz/Dockerfile` still exists; it moves to the docs release tag in the cleanup commit - `.forgejo/workflows/{cv-deploy,build-blumeops}.yaml` — deploy step now bumps `cv_version`/`docs_version` in the role defaults and pushes; running ansible + purging the Fly cache is manual from gilbert (matches devpi) - Docs: `docs/how-to/operations/{cv,docs}-on-indri.md`, updated `docs/reference/services/{cv,docs}.md`, changelog fragment ## What is not in this commit The dead artifacts. After PR review and successful cutover, a follow-up commit deletes: - `argocd/apps/{cv,docs}.yaml` and `argocd/manifests/{cv,docs}/` - `containers/cv/`, `containers/quartz/` - `CONTAINER_TO_SERVICE['quartz']` mapping in `mise-tasks/container-version-check` - bumps `docs.current-version` in `service-versions.yaml` to the release tag ## Cutover plan (manual, from gilbert, after review) 1. Take down old: - Remove the cv and docs Applications: `argocd app delete cv --cascade && argocd app delete docs --cascade` - Verify k8s namespaces gone: `kubectl --context=minikube-indri get ns | grep -E '^(cv|docs)\b'` (should be empty) - Verify tailnet MagicDNS no longer advertises the VIPs: `nslookup cv.tail8d86e.ts.net` and `nslookup docs.tail8d86e.ts.net` should both fail 2. Bring up new: - `mise run provision-indri -- --tags cv,docs,caddy --check --diff` (already validated on branch) - `mise run provision-indri -- --tags cv,docs,caddy` - `fly ssh console -a blumeops-proxy -C "sh -c 'rm -rf /tmp/cache && nginx -s reload'"` 3. Verify: `mise run services-check` and the curl checks listed in `docs/how-to/operations/{cv,docs}-on-indri.md` 4. Cleanup commit + merge. Total expected downtime: minutes (not the few-hour budget you authorized). ## Test plan - [ ] `mise run provision-indri -- --tags cv,docs --check --diff` clean - [ ] `mise run provision-indri -- --tags caddy --check --diff` shows only the cv + docs blocks changing as previewed in the PR thread - [ ] After cutover: `cv.eblu.me`, `cv.ops.eblu.me`, `docs.eblu.me`, `docs.ops.eblu.me` all return 200 - [ ] `cv.eblu.me/resume.pdf` includes `Content-Disposition: attachment` - [ ] A clean Quartz URL (e.g. `docs.eblu.me/explanation/agent-change-process`) resolves to the right page - [ ] `mise run services-check` clean - [ ] `mise run service-review --type ansible` shows cv and docs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #342	2026-04-29 14:55:11 -07:00
Erich Blume	14ca0160ba	Migrate devpi from minikube to indri (launchd) (#341) ## Summary Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent. ## What changed - New ansible role `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box). - Caddy: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`. - Playbook: `indri.yml` gains pre_tasks for the root password and the new role. - service-versions.yaml: devpi flipped from `type: argocd` to `type: ansible`. - ArgoCD: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted. - Docs: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list. ## Already deployed (live on indri) - Service running: `launchctl list mcquack.eblume.devpi` → PID 53888 - `curl https://pypi.ops.eblu.me/+api` returns 200 ✅ - `mise run docs-mikado` works again ✅ - 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/` - Minikube namespace and PVC fully reclaimed ## Test plan - [ ] `mise run services-check` (after merge) - [ ] CI workflows that use devpi succeed - [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines) ## Context This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain. Reviewed-on: #341	2026-04-29 13:38:36 -07:00
Erich Blume	005e2a03ed	C0: split gandi-operations docs; add dns-acme-cleanup mise task Splits the nebulous gandi-operations how-to into two single-topic cards (manage-eblu-me-dns, rotate-gandi-pat) and adds a mise task for the recurring _acme-challenge TXT cleanup needed due to a value-comparison bug in libdns/gandi v1.1.0 that prevents certmagic's cleanup phase from removing presented TXT values. The gandi reference card is updated to drop the false "different credential from Pulumi PAT" claim — verified during the 2026-04-27 incident that Caddy and Pulumi share a single PAT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 09:48:46 -07:00
Erich Blume	72b27b7fd2	C0: docs — add mealie borg restore how-to Captures the procedure used to restore mealie's SQLite DB from a borgmatic archive after the post-DR wipe: extract from borg, snapshot the wiped DB, swap via a helper pod on the ReadWriteOnce PVC, fix UID 911 ownership. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 19:04:28 -07:00
Erich Blume	7d94b9073a	C0: docs — default argocd login to --sso; drop extraneous --grpc-web Now that argocd's Authentik OAuth2 client is public, `argocd login --sso` works for day-to-day use. Promote it to the default in AGENTS.md, argocd-cli reference, and troubleshooting; keep the admin/password flow documented as a break-glass fallback for when Authentik is unavailable. Also drops --grpc-web from every interactive login command — confirmed extraneous (login succeeds without it). Left in CI workflows and `argocd cluster add` untouched; those are different contexts that I didn't re-test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 10:43:21 -07:00
Erich Blume	1425bf1f5c	Upgrade forgejo-runner to v12.8, adopt server.connections, and clean up docs (#338) ## Summary - consolidate forgejo-runner how-to docs into current cards - upgrade the k8s forgejo-runner deployment to the latest v12.8.x runner image - switch the k8s runner from first-boot register flow to declarative server.connections config - keep the runner image on the native Dagger build path and update the surrounding manifests/secrets ## Notes - PR opened early for C1 review - implementation and deployment verification will follow in subsequent commits Reviewed-on: #338	2026-04-20 09:03:54 -07:00
Erich Blume	353e2785c3	docs: review zot oidc client card	2026-04-20 07:55:25 -07:00
Erich Blume	53a7374ac1	C0: drop fix-ntfy-nix-version mikado card Historical one-shot fix from the zot hardening chain — knowledge is self-evident in containers/ntfy/default.nix and container-version-check regex. Should have been removed at mikado finalization. Scrubbed the two wiki-link references in add-container-version-sync-check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-20 07:26:53 -07:00
Erich Blume	d26a6ae3b2	Update docs for Caddy routing and direct WireGuard peering Comprehensive docs pass reflecting the new Fly proxy architecture: - Fly proxy routes through Caddy on indri (not per-service TS Ingress) - Direct WireGuard peering via --port=41641 pinning - DERP relay performance lesson in Tailscale docs - Caddy now in public traffic path - indri tagged as flyio-target - Removed fly-reload references - Updated architecture diagrams and per-service setup guide - Added changelog fragment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 09:57:30 -07:00
Erich Blume	9bafe85b2b	Add teslamate extensions to DR restore procedure The earthdistance extension (depends on cube) must be created before restoring the teslamate database — discovered missing after 2026-04-13 DR. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 08:12:26 -07:00
Erich Blume	fe0e913963	Switch Fly proxy to upstream keepalive pools (#337) ## Summary - Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools - Reuses TLS connections through the Tailscale tunnel instead of handshaking per request - Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS) ## Trade-off DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this. ## Still TODO on this branch - [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder) - [ ] Docs pass - [ ] Deploy from branch and verify latency improvement - [ ] Changelog fragment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #337	2026-04-17 16:39:52 -07:00
Erich Blume	3ecd888537	Switch container builds to manual-only workflow dispatch Shared Dagger helpers (src/blumeops/) affect all Dagger-built containers, making path-based auto-triggers unreliable. All builds now go through `mise run container-build-and-release <name>`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 14:25:14 -07:00
Erich Blume	223b134776	Document uv.lock as the source of devpi dependency in Dagger builds The lockfile bakes in devpi URLs — Dagger does a locked install, not fresh resolution. This is the mechanism behind the cold-cache failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:41:45 -07:00
Erich Blume	ccaef4c1a7	Document devpi cold cache failure mode and deploy teslamate v3.0.0-08c698e After a DR rebuild, devpi's empty cache causes race conditions under concurrent load — metadata is served but wheel files 404. Also deploys the first container.py-built teslamate image. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:38:06 -07:00
Erich Blume	08c698e833	Migrate teslamate to native Dagger container.py (#333) ## Summary - Replace legacy Dockerfile with native Dagger `container.py` build - Two-stage pipeline: Elixir+Node builder, Debian slim runtime - Uses shared helpers (`clone_from_forge`, `oci_labels`) - Delete old Dockerfile (pipeline auto-discovers container.py) - Update build-container-image docs and mark service reviewed ## Test plan - [x] `dagger call build --src=. --container-name=teslamate` succeeds locally - [ ] CI container build passes - [ ] Deploy from branch and verify teslamate starts cleanly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #333	2026-04-14 07:20:52 -07:00
Erich Blume	4ca0630d76	Review enforce-tag-immutability doc: add review date and zot reference link Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:00:55 -07:00
Erich Blume	d7c3c687f4	Document DR rebuild procedure and update restart-indri - New how-to: rebuild-minikube-cluster with full bootstrap procedure validated during 2026-04-13 DR event - Update restart-indri: warn about minikube delete, macOS permission dialog on first Tailscale SSH, forgejo_actions_secrets dep cycle - Update disaster-recovery reference: link to rebuild procedure - Update CLAUDE.md: never run minikube delete Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:07:54 -07:00
Erich Blume	61fcd5d70a	Upgrade grafana-sidecar 1.28.0 → 2.6.0 + container.py port (#332) ## Summary - Upgrade grafana-sidecar from 1.28.0 to 2.6.0 (the 2.x memory regression #462 is resolved; ~35MB static overhead is acceptable) - Port build from Dockerfile to native Dagger container.py - Add liveness/readiness probes using the new /healthz endpoint on port 8080 - Update docs to reflect container.py migration and remove stale pin note ## Test plan - [ ] Build container: `mise run container-build-and-release grafana-sidecar` - [ ] Update kustomization tag with new image tag - [ ] Deploy from branch: `argocd app set grafana --revision grafana-sidecar-2.6.0 && argocd app sync grafana` - [ ] Verify sidecar health endpoint: `kubectl exec -n monitoring <pod> -c grafana-sc-dashboard -- wget -qO- http://localhost:8080/healthz` - [ ] Verify dashboards load in Grafana UI - [ ] `mise run services-check` Reviewed-on: #332	2026-04-13 07:57:13 -07:00
Erich Blume	6e60287e99	Doc review: delete install-dagger-on-nix-runner, add service-versions ref card Outdated leaf card removed; zot.md now links to new service-versions reference card instead. Added reverse link from review-services. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:52:38 -07:00
Erich Blume	8d80a4a3a5	Rewrite runner-logs: API-based log fetching, multi-repo support Replace broken SSH+filesystem log retrieval with Forgejo web API endpoint. Fix CLI to use run numbers (not task IDs), add --repo for querying any forge repo (e.g. sporks), --limit/-n for listing size. Document runner-logs as the way to verify build success in CLAUDE.md and container build docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:42:58 -07:00
Erich Blume	138e23d525	Miniflux 2.2.19 + container.py migration + ty typechecker (#331) ## Summary - Upgrade miniflux from 2.2.17 to 2.2.19 (security hardening, performance improvements) - Migrate miniflux from Dockerfile to native Dagger container.py build - Refactor `alpine_runtime()` helper to support existing users (nobody/65534) - Add `ty` (Astral) Python typechecker to prek hooks ## Test plan - [ ] `dagger call build --src=. --container-name=miniflux` succeeds - [ ] `dagger call container-version --container-name=miniflux` returns 2.2.19 - [ ] `mise run container-version-check` passes - [ ] `ty check` passes cleanly - [ ] `prek run --all-files` passes - [ ] CI builds container successfully - [ ] Miniflux healthcheck passes after deploy from branch Reviewed-on: #331	2026-04-12 08:54:32 -07:00
Erich Blume	c86b5d7772	Native Dagger container builds + Navidrome v0.61.1 (#330) ## Summary - Move Dagger module from `.dagger/` to repo root (`src/blumeops/`), rename `blumeops-ci` → `blumeops` - Replace opaque `docker_build()` with native Dagger pipelines that surface full build errors per step - Migrate navidrome as the first container (`containers/navidrome/container.py`) - Upgrade navidrome from v0.60.3 to v0.61.1 (major artwork overhaul, SQLite FTS5 search, server-managed transcoding) - Add `dagger call container-version` for CI version extraction without Dockerfile parsing - All mise tasks (`container-list`, `container-version-check`, `container-build-and-release`) updated for hybrid mode - Legacy `docker_build()` fallback preserved for all other containers ## Motivation When navidrome v0.61.0 added a new Go build tag (`sqlite_fts5`), `docker_build()` showed only "exit code: 1". We had to run `docker build --progress=plain` manually to find `undefined: buildtags.SQLITE_FTS5`. Native Dagger pipelines show the full error inline. ## Container build dispatch needed After merge, dispatch container build for navidrome: ``` mise run container-build-and-release navidrome --ref `470b4bd` ``` ## Deploy steps 1. Wait for container build to complete 2. Back up navidrome-data PVC (non-reversible DB migrations) 3. `argocd app set navidrome --revision main && argocd app sync navidrome` 4. Verify at https://dj.ops.eblu.me ## Future Remaining containers migrate incrementally in follow-up PRs using the same pattern. Reviewed-on: #330	2026-04-11 17:11:56 -07:00
Erich Blume	a059d81314	Add review-compliance-reports task and reorganize report storage New mise task fetches Prowler reports from sifaka, parses with proper muted/unmuted distinction, shows week-over-week delta, and includes a scaffold for Kingfisher once JSON/CSV output is available upstream. Moved all legacy top-level reports on sifaka into date subdirectories to match the current CronJob output structure. Updated read-compliance-reports doc with task reference and links. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 10:16:46 -07:00
Erich Blume	64200a55c5	Migrate Immich from Helm chart to kustomize manifests (v2.5.6 → v2.6.3) Replace the Helm chart deployment with plain kustomize manifests following the Authentik pattern (separate deployments per component). Consolidate the immich-storage ArgoCD app into the main immich app. Add no-helm-policy doc establishing kustomize as the standard deployment mechanism. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 09:42:25 -07:00
Erich Blume	baee7ae54b	Review single-user-cluster control and add evidence collection card Stamp single-user-cluster last-reviewed to 2026-04-01 after verifying Tailscale ACLs and kubeconfig distribution. Add aspirational how-to card documenting what PCI DSS evidence collection would look like (CCW, artifacts, Drata workflow). Link from existing review process card. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-01 22:01:57 -07:00
Erich Blume	a18a424866	Pin NixOS service versions via nixpkgs-services overlay (#321) ## Summary - Add `nixpkgs-services` flake input pinned to a specific nixpkgs commit, with an overlay that pulls `forgejo-runner`, `snowflake`, and `k3s` from it instead of the rolling `nixpkgs` - Dagger `flake-update` pipeline now excludes `nixpkgs-services` via `--exclude` - Fix stale nix-container-builder version in service-versions.yaml (was 12.6.4, actually running 12.7.2) - Add k3s and minikube to service-versions.yaml tracking - Document the pinning approach in review-services how-to and ringtail reference ## Motivation During service review, discovered that flake updates had silently upgraded forgejo-runner from 12.6.4 → 12.7.2 without updating service-versions.yaml. This "sneak-in upgrade" bypasses the service review process. The overlay ensures these three services only change versions deliberately. ## Test plan - [ ] Verify `nix flake update` from `nixos/ringtail/` does not change `nixpkgs-services` lock entry - [ ] Verify `mise run provision-ringtail` builds successfully with the overlay - [ ] Confirm running service versions unchanged after deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #321	2026-04-01 21:37:57 -07:00
Erich Blume	4059b3d27b	Add compensating controls framework and date-based report dirs (#320) ## Summary - Add `compensating-controls.yaml` tracking 9 named controls that justify suppressed security findings - Update all Prowler mutelist descriptions with `CC: <id>` references to named controls - Add `mise run review-compensating-controls` task — surfaces stalest control with all codebase references - Add [[review-compensating-controls]] how-to doc - Organize Prowler and Kingfisher reports into `YYYY-MM-DD` subdirectories ### Compensating controls \| ID \| Mitigates \| \|----\|-----------\| \| `single-user-cluster` \| Image cache abuse, RBAC breadth, system pod privileges \| \| `tailscale-network-isolation` \| Profiling endpoints, weak TLS, debug ports \| \| `local-registry` \| AlwaysPullImages gap \| \| `sso-gated-admin-tools` \| ArgoCD wildcard RBAC \| \| `operator-managed-pods` \| Tailscale proxy pod security settings \| \| `ephemeral-privileged-jobs` \| Prowler hostPID exposure \| \| `trusted-ci-only` \| Forgejo runner DinD \| \| `init-container-isolation` \| Grafana root init container \| \| `observability-stack-audit` \| Missing apiserver audit logging \| ## Test plan - [ ] `mise run review-compensating-controls` shows table and references - [ ] `kubectl kustomize argocd/manifests/prowler/` renders correctly - [ ] Sync prowler and kingfisher, verify next scan writes to dated subdirectory - [ ] Grep for `CC:` in mutelist files — every muted finding should have at least one 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #320	2026-03-30 17:44:11 -07:00
Erich Blume	f9206bf10b	Build custom Kingfisher container from sporked deploy branch (#318) ## Summary - Add Dockerfile for Kingfisher built from source (sporked deploy branch) - Multi-stage: Rust build with Boost/vectorscan, debian-slim runtime - Switch CronJob from upstream `ghcr.io/mongodb/kingfisher` to `registry.ops.eblu.me/blumeops/kingfisher` - Add kingfisher to service-versions.yaml (version tracks upstream main SHA) - Document spork workflow in CLAUDE.md ## Test plan - [ ] Build container: `mise run container-build-and-release kingfisher 1d37d29` - [ ] Verify image on registry: `mise run container-list` - [ ] Update kustomization newTag - [ ] Sync ArgoCD kingfisher app from branch - [ ] Trigger manual CronJob and verify scan completes - [ ] Verify reports on sifaka Reviewed-on: #318	2026-03-30 06:34:49 -07:00
Erich Blume	99df78664e	Note upstream history rewrite as a spork sync failure mode Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 08:22:11 -07:00
Erich Blume	6ecfaf02b6	Add spork strategy: tooling and documentation Spork-create mise task sets up a floating-branch soft-fork of a mirrored upstream project with daily mirror-sync via Forgejo Actions. Includes explanation card, how-to guides for setup and branch management, and the spork-create uv script. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 22:58:10 -07:00
Erich Blume	2bd1611ac1	Document sifaka NFS/Tailscale TUN troubleshooting Sifaka's Tailscale can revert to userspace networking after package updates, causing NFS mounts to fail because the NFS daemon sees 127.0.0.1 instead of the client's Tailscale IP. Added troubleshooting how-to doc and updated sifaka reference card with frigate export and TUN requirement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 09:12:00 -07:00
Erich Blume	3017f759a7	Migrate Forgejo from Homebrew to source build (#316) ## Summary - Migrate Forgejo from Homebrew to source-built binary with mcquack LaunchAgent - Matches the established pattern used by zot, caddy, and alloy - Upgrades to v14.0.3 (7 security fixes: PKCE bypass, OAuth scope bypass, open redirect, and more) ## Changes - Ansible role: Replace brew install/services with binary stat check + LaunchAgent - Paths: `/opt/homebrew/var/forgejo` → `~/forgejo`, binary at `~/code/3rd/forgejo/forgejo` - Run user: `forgejo` → `erichblume` (LaunchAgent user; SSH git user stays `forgejo`) - Docs: Updated Forgejo reference card, restart-indri guide - Service review: Stamped frigate-notify, cloudnative-pg, blumeops-pg as current ## One-time migration steps (manual, on indri) 1. Clone from Codeberg, add forge mirror remote 2. Check out v14.0.3, build with `make build && make forgejo` 3. Stop brew, `cp -a` data to `~/forgejo`, fix ownership 4. Run `provision-indri --tags forgejo` 5. Verify, then `brew uninstall forgejo` ## Data safety - `cp -a` preserves everything (repos, SQLite DB, LFS, sessions, OAuth config) - Brew version stays installed as rollback until verification passes - No schema changes between 14.0.2 → 14.0.3 Reviewed-on: #316	2026-03-28 08:19:23 -07:00
Erich Blume	66a47738dd	Add ringtail post-deploy maintenance: kernel check, generation pruning, GC Update manage-lockfile doc with post-deploy steps (kernel update detection, reboot guidance, generation pruning). Add prune-ringtail-generations mise task that keeps the 5 most recent generations plus the most recent one matching the booted kernel for safe rollback. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:55:45 -07:00
Erich Blume	687e972713	Review CV doc and close build-dep review gap Fix stale CV service doc (URL, forge domain, container tag) and add guidance for reviewing build-time dependencies in private forge repos during service reviews. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:12:38 -07:00
Erich Blume	fe201a495c	Add Prowler IaC scanning of blumeops repo (Saturday 2am) Clone repo in init container, scan Dockerfiles and K8s manifests with Prowler's IaC provider (Trivy). Reports written to sifaka:/volume1/reports/prowler-iac/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:49:38 -07:00

186 commits