Commit graph

177 commits

Author SHA1 Message Date
ee51bcafb4 Rip out compensating-controls framework (#359)
## Summary

Removes the compensating-controls (CC) framework. Prowler and Kingfisher continue to run weekly and produce reports; the Prowler mutelist YAML files stay in place but no longer carry \`CC: <id>\` prefixes — each entry now just keeps a free-form \`Description\` of why it's muted.

The CC review cadence proved to be more process overhead than this single-operator homelab needed.

## What changed

**Deleted**
- \`compensating-controls.yaml\` — the CC registry
- \`mise-tasks/review-compensating-controls\` — the staleness-review task
- \`docs/how-to/operations/review-compensating-controls.md\`
- \`docs/how-to/operations/record-review-evidence.md\` (was aspirational)
- \`docs/explanation/compliance-mute-categories.md\` (proposed-future CC/NA/RA work)
- 5 orphan \`+review-cc-*\` / \`+compliance-mute-categories\` changelog fragments

**Modified**
- 6 mutelist YAML files: stripped \`CC: <id>.\` prefix from every \`Description\` / \`statement\` field, kept the free-form text
- \`mise-tasks/review-compliance-reports\`: removed CC mentions from docstrings, panel text, and the node-verification table title. Node-verification logic itself is unchanged.
- \`docs/reference/operations/security.md\`: removed the "Compensating controls" section
- \`docs/how-to/operations/read-compliance-reports.md\`: rewrote step 3 of "Acting on findings" to point at the mutelist YAML directly
- \`docs/changelog.d/prowler-iac-mutelist.infra.md\`: rewrote to drop the "two new compensating controls" framing

## What did not change

- All Prowler manifests (cronjobs, RBAC, PVs, kustomization) — scans still run on the same schedule
- The Kingfisher deployment
- The trivy-shim in the Prowler container — that's about Trivy ignorefile plumbing, independent of the CC concept
- The mutelist entries themselves — each \`Resources\` list is unchanged; only the prose of \`Description\` was edited
- \`CHANGELOG.md\` — historical releases are left as-is

## Test plan

- [ ] Wait for human review before deploying — once merged, re-point ArgoCD: \`argocd app set prowler --revision main && argocd app sync prowler\` (no manifest changes besides the ConfigMap, so impact is limited to muted-finding descriptions in next week's report)
- [ ] Confirm next weekly Prowler K8s CIS run (Sunday 3am) still completes and produces a report on sifaka
- [ ] Confirm next weekly Prowler IaC run still honors \`trivyignore.yaml\` (the trivy shim is untouched but the ignorefile content was rewritten)
- [ ] \`mise run review-compliance-reports\` — verify node-verification block still runs and prints the renamed table title

Reviewed-on: #359
2026-05-22 21:08:53 -07:00
947e4310c3 C2: migrate immich from minikube to ringtail (mikado chain) (#356)
## Summary

C2 Mikado chain to move the entire Immich stack (server, ML, valkey,
postgres) off `minikube-indri` and onto `k3s-ringtail`. Immich is the
largest single tenant on minikube (~1.5 GiB resident) and minikube is
currently memory-saturated (97% RAM, swapping). This is the first
concrete chain in the broader indri-k8s decommission effort.

This PR contains the planning layer only — 7 cards (1 goal + 6
prerequisites). Implementation cycles follow per the Mikado Branch
Invariant.

## Goal end-state

- Immich `server`, `machine-learning`, `valkey` on ringtail.
- ML pod uses ringtail's RTX 4080 (performance win — currently
  CPU-only).
- CNPG `immich-pg` (PG17 + VectorChord) runs on ringtail.
- Library still on sifaka NFS — ringtail mounts the same path.
- `photos.ops.eblu.me` reroutes through Caddy → ringtail ingress.
- Minikube `immich` and `immich-pg` are removed.

## Cards

| Card | Depends on |
|---|---|
| `migrate-immich-to-ringtail` (goal) | all six below |
| `cnpg-on-ringtail` | — |
| `immich-pg-on-ringtail` | cnpg-on-ringtail |
| `immich-pg-data-migration` | immich-pg-on-ringtail |
| `sifaka-nfs-from-ringtail` | — |
| `immich-app-on-ringtail` | immich-pg-on-ringtail, sifaka-nfs-from-ringtail |
| `immich-cutover-and-decommission` | immich-pg-data-migration, immich-app-on-ringtail |

## Key constraints

- **No data loss.** Downtime is acceptable; data loss is not. Two
  surfaces matter: postgres (ML embeddings, face data — slow to
  re-derive) and the library files (don't move, but NFS access from
  ringtail must be verified).
- **Migration method:** Option A is a CNPG `externalCluster`
  basebackup → promote. Option B is `pg_dump`/`pg_restore` as a
  documented fallback. Either way, dry-run against a scratch
  cluster first.
- **Why pg moves too** (not cross-cluster): keeping pg on minikube
  would block the whole decommission, and Immich is chatty with pg
  so tailnet round-trips would hurt.

## Test plan

- [ ] Plan review — does the dependency graph make sense?
- [ ] `mise run docs-mikado migrate-immich-to-ringtail` shows the
      chain correctly.
- [ ] Per-card implementation cycles land separately (commit
      convention enforced by hook).

Reviewed-on: #356
2026-05-13 16:46:17 -07:00
292d354902 C1: deploy adelaide-baby-shower-app to ringtail k3s (#349)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m12s
## Summary

Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app).

### What's included

- **ArgoCD app + manifests** under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer.
- **Public surface** (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett.
- **fail2ban** filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default).
- **Caddy** route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`).
- **Pulumi** Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`.
- **Grafana** APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter.
- **Container** `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages.
- **Docs** runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments.

### Defense layers on the public surface

1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list)
2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP)
3. django-axes (5 fails / 1h, keyed on username+ip_address)
4. edge `/admin/` block (returns 403 for anything that isn't login/logout)

## Prerequisites for the user to do (NOT in this PR)

Halted on these per request — they touch shared/manual systems:

- [x] **NFS share** on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000`
- [ ] **1Password item** `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git
- [ ] **Container build**: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix`
- [x] **DNS**: `mise run dns-up` after merge
- [x] **Fly cert**: `fly certs add shower.eblu.me -a blumeops-proxy`
- [ ] **Caddy push**: `mise run provision-indri -- --tags caddy`
- [ ] **Fly redeploy** to pick up the new nginx block + fail2ban jail: `mise run fly-deploy`
- [ ] **ArgoCD sync**: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging

## Test plan

- [ ] Container builds successfully on nix-container-builder runner
- [ ] Pod starts, migrations run, gunicorn answers on :8000
- [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean
- [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response
- [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead`
- [ ] Grafana dashboard "Shower APM" appears and starts showing traffic
- [ ] `mise run services-check` passes

Reviewed-on: #349
2026-05-11 13:47:18 -07:00
074887cd57 C0: docs — explanation article on compliance mute categories
Captures the CC vs NA vs RA distinction surfaced during the 2026-05-03
weekly compliance review (CVE-2026-31789), and the image-scan mutelist
gap that blocks acting on it. Links the new article from the
review-compensating-controls how-to so it isn't orphaned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:19:53 -07:00
a2c61b625d C0: rotate-fly-deploy-token — fish+bash one-shot, op validator gotcha
Combine mint+store into a single command with both fish and bash
forms (the doc previously only showed manual paste). Document the
1Password CLI "Password item requires ps value" validator error and
the placeholder-password workaround for Password-category items with
empty primary password fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:42:57 -07:00
f6e392b80c C1: SHA-pin tooling dependencies (2026-04 cycle) (#344)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m45s
## Summary

Monthly tooling dependency refresh, with a one-time conversion from version-tag pins (`rev = "vX.Y.Z"`, `image:tag`, `>=`) to SHA / digest pins everywhere.

## Changes

- **prek hooks**: all `rev = "vX.Y.Z"` → commit SHA + `# vX.Y.Z` comment. Bumped trufflehog (3.94.0→3.95.2), kingfisher (1.91.0→1.97.0), ruff (0.15.7→0.15.12), shfmt (3.13.0→3.13.1), prettier (3.8.1→3.8.3), actionlint (1.7.11→1.7.12).
- **fly/Dockerfile**: tag pins → `image@sha256:...` digest pins. Bumped nginx (1.29.6→1.30.0-alpine), tailscale (v1.94.1→v1.94.2 — still inside the safe pre-1.96.5 range), alloy (v1.14.1→v1.16.0).
- **mise-tasks**: PEP 723 inline deps converted from `>=` to `==` (PEP 508 doesn't support hashes inline). All scripts pinned to current latest: rich 15.0.0, typer 0.25.0, pyyaml 6.0.3, httpx 0.28.1.
- **prek `additional_dependencies`**: ansible-lint==26.4.0, ansible-core==2.20.5.
- **taplo-lint**: pass `--no-schema`. Upstream's `--default-schema-catalogs` returns a format taplo v0.9.3 can't parse — we don't validate against TOML schemas anyway, so this turns off the broken catalog fetch.
- **docs/update-tooling-dependencies**: documents the SHA-pin convention, `docker buildx imagetools inspect` for digest lookup, and `prek clean` before re-verifying (cache grows to several GiB).

Forgejo workflow `actions/checkout@v6.0.2` was already at the latest SHA — no change.

## Test plan

- [x] `prek run --all-files` passes after `prek clean`
- [x] `deploy-fly` workflow builds and deploys the new fly image on merge
- [x] `fly status -a blumeops-proxy` healthy after deploy
- [x] Spot-check a few mise tasks (`mise run blumeops-tasks`, `mise run docs-check-links`) to confirm pinned deps resolve cleanly

Reviewed-on: #344
2026-04-30 16:51:43 -07:00
8d634861f6 C1: migrate cv + docs from minikube to indri-native (#342)
## Summary

Replace the cv (`cv.eblu.me`) and docs (`docs.eblu.me`) minikube Deployments with indri-native ansible roles. Caddy serves the extracted release tarballs directly via a new `kind: static` service-block — no daemon, no nginx pod, no ProxyGroup ingress on the request path. Mirrors the rationale of the recent devpi migration; part of the broader minikube wind-down.

## What's in this commit

- `ansible/roles/{cv,docs}` — sentinel-gated tarball download + extract into `~/{cv,docs}/content/`
- `ansible/roles/caddy/` — new `kind: static` branch in the Caddyfile template (encoded gzip, immutable cache headers for fingerprinted assets, optional `try_html` for Quartz-style clean URLs, optional per-path `download_paths` for the resume PDF's `Content-Disposition`)
- `ansible/playbooks/indri.yml` — wires `cv` and `docs` roles before `caddy`
- `service-versions.yaml` — both services flip to `type: ansible`. `docs.current-version` stays at `1.28.2` for this commit so `container-version-check` keeps passing while `containers/quartz/Dockerfile` still exists; it moves to the docs release tag in the cleanup commit
- `.forgejo/workflows/{cv-deploy,build-blumeops}.yaml` — deploy step now bumps `cv_version`/`docs_version` in the role defaults and pushes; running ansible + purging the Fly cache is manual from gilbert (matches devpi)
- Docs: `docs/how-to/operations/{cv,docs}-on-indri.md`, updated `docs/reference/services/{cv,docs}.md`, changelog fragment

## What is not in this commit

The dead artifacts. After PR review and successful cutover, a follow-up commit deletes:

- `argocd/apps/{cv,docs}.yaml` and `argocd/manifests/{cv,docs}/`
- `containers/cv/`, `containers/quartz/`
- `CONTAINER_TO_SERVICE['quartz']` mapping in `mise-tasks/container-version-check`
- bumps `docs.current-version` in `service-versions.yaml` to the release tag

## Cutover plan (manual, from gilbert, after review)

1. **Take down old:**
   - Remove the cv and docs Applications: `argocd app delete cv --cascade && argocd app delete docs --cascade`
   - Verify k8s namespaces gone: `kubectl --context=minikube-indri get ns | grep -E '^(cv|docs)\\b'` (should be empty)
   - Verify tailnet MagicDNS no longer advertises the VIPs: `nslookup cv.tail8d86e.ts.net` and `nslookup docs.tail8d86e.ts.net` should both fail
2. **Bring up new:**
   - `mise run provision-indri -- --tags cv,docs,caddy --check --diff` (already validated on branch)
   - `mise run provision-indri -- --tags cv,docs,caddy`
   - `fly ssh console -a blumeops-proxy -C "sh -c 'rm -rf /tmp/cache && nginx -s reload'"`
3. **Verify:** `mise run services-check` and the curl checks listed in `docs/how-to/operations/{cv,docs}-on-indri.md`
4. **Cleanup commit + merge.**

Total expected downtime: minutes (not the few-hour budget you authorized).

## Test plan
- [ ] `mise run provision-indri -- --tags cv,docs --check --diff` clean
- [ ] `mise run provision-indri -- --tags caddy --check --diff` shows only the cv + docs blocks changing as previewed in the PR thread
- [ ] After cutover: `cv.eblu.me`, `cv.ops.eblu.me`, `docs.eblu.me`, `docs.ops.eblu.me` all return 200
- [ ] `cv.eblu.me/resume.pdf` includes `Content-Disposition: attachment`
- [ ] A clean Quartz URL (e.g. `docs.eblu.me/explanation/agent-change-process`) resolves to the right page
- [ ] `mise run services-check` clean
- [ ] `mise run service-review --type ansible` shows cv and docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #342
2026-04-29 14:55:11 -07:00
14ca0160ba Migrate devpi from minikube to indri (launchd) (#341)
## Summary

Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent.

## What changed

- **New ansible role** `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box).
- **Caddy**: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`.
- **Playbook**: `indri.yml` gains pre_tasks for the root password and the new role.
- **service-versions.yaml**: devpi flipped from `type: argocd` to `type: ansible`.
- **ArgoCD**: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted.
- **Docs**: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list.

## Already deployed (live on indri)

- Service running: `launchctl list mcquack.eblume.devpi` → PID 53888
- `curl https://pypi.ops.eblu.me/+api` returns 200 
- `mise run docs-mikado` works again 
- 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/`
- Minikube namespace and PVC fully reclaimed

## Test plan

- [ ] `mise run services-check` (after merge)
- [ ] CI workflows that use devpi succeed
- [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines)

## Context

This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain.

Reviewed-on: #341
2026-04-29 13:38:36 -07:00
005e2a03ed C0: split gandi-operations docs; add dns-acme-cleanup mise task
Splits the nebulous gandi-operations how-to into two single-topic cards
(manage-eblu-me-dns, rotate-gandi-pat) and adds a mise task for the
recurring _acme-challenge TXT cleanup needed due to a value-comparison
bug in libdns/gandi v1.1.0 that prevents certmagic's cleanup phase from
removing presented TXT values.

The gandi reference card is updated to drop the false "different
credential from Pulumi PAT" claim — verified during the 2026-04-27
incident that Caddy and Pulumi share a single PAT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 09:48:46 -07:00
72b27b7fd2 C0: docs — add mealie borg restore how-to
Captures the procedure used to restore mealie's SQLite DB from a borgmatic
archive after the post-DR wipe: extract from borg, snapshot the wiped DB,
swap via a helper pod on the ReadWriteOnce PVC, fix UID 911 ownership.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:04:28 -07:00
7d94b9073a C0: docs — default argocd login to --sso; drop extraneous --grpc-web
Now that argocd's Authentik OAuth2 client is public, `argocd login --sso`
works for day-to-day use. Promote it to the default in AGENTS.md,
argocd-cli reference, and troubleshooting; keep the admin/password flow
documented as a break-glass fallback for when Authentik is unavailable.

Also drops --grpc-web from every interactive login command — confirmed
extraneous (login succeeds without it). Left in CI workflows and
`argocd cluster add` untouched; those are different contexts that I
didn't re-test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 10:43:21 -07:00
1425bf1f5c Upgrade forgejo-runner to v12.8, adopt server.connections, and clean up docs (#338)
## Summary
- consolidate forgejo-runner how-to docs into current cards
- upgrade the k8s forgejo-runner deployment to the latest v12.8.x runner image
- switch the k8s runner from first-boot register flow to declarative server.connections config
- keep the runner image on the native Dagger build path and update the surrounding manifests/secrets

## Notes
- PR opened early for C1 review
- implementation and deployment verification will follow in subsequent commits

Reviewed-on: #338
2026-04-20 09:03:54 -07:00
353e2785c3 docs: review zot oidc client card 2026-04-20 07:55:25 -07:00
53a7374ac1 C0: drop fix-ntfy-nix-version mikado card
Historical one-shot fix from the zot hardening chain — knowledge is
self-evident in containers/ntfy/default.nix and container-version-check
regex. Should have been removed at mikado finalization. Scrubbed the two
wiki-link references in add-container-version-sync-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 07:26:53 -07:00
d26a6ae3b2 Update docs for Caddy routing and direct WireGuard peering
Comprehensive docs pass reflecting the new Fly proxy architecture:
- Fly proxy routes through Caddy on indri (not per-service TS Ingress)
- Direct WireGuard peering via --port=41641 pinning
- DERP relay performance lesson in Tailscale docs
- Caddy now in public traffic path
- indri tagged as flyio-target
- Removed fly-reload references
- Updated architecture diagrams and per-service setup guide
- Added changelog fragment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 09:57:30 -07:00
9bafe85b2b Add teslamate extensions to DR restore procedure
The earthdistance extension (depends on cube) must be created before
restoring the teslamate database — discovered missing after 2026-04-13 DR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:12:26 -07:00
fe0e913963 Switch Fly proxy to upstream keepalive pools (#337)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m37s
## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuses TLS connections through the Tailscale tunnel instead of handshaking per request
- Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS)

## Trade-off

DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #337
2026-04-17 16:39:52 -07:00
3ecd888537 Switch container builds to manual-only workflow dispatch
Shared Dagger helpers (src/blumeops/) affect all Dagger-built containers,
making path-based auto-triggers unreliable. All builds now go through
`mise run container-build-and-release <name>`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 14:25:14 -07:00
223b134776 Document uv.lock as the source of devpi dependency in Dagger builds
The lockfile bakes in devpi URLs — Dagger does a locked install, not
fresh resolution. This is the mechanism behind the cold-cache failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 07:41:45 -07:00
ccaef4c1a7 Document devpi cold cache failure mode and deploy teslamate v3.0.0-08c698e
After a DR rebuild, devpi's empty cache causes race conditions under
concurrent load — metadata is served but wheel files 404. Also deploys
the first container.py-built teslamate image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 07:38:06 -07:00
08c698e833 Migrate teslamate to native Dagger container.py (#333)
Some checks failed
Build Container / detect (push) Successful in 2s
Build Container / build-dagger (teslamate) (push) Failing after 6s
## Summary
- Replace legacy Dockerfile with native Dagger `container.py` build
- Two-stage pipeline: Elixir+Node builder, Debian slim runtime
- Uses shared helpers (`clone_from_forge`, `oci_labels`)
- Delete old Dockerfile (pipeline auto-discovers container.py)
- Update build-container-image docs and mark service reviewed

## Test plan
- [x] `dagger call build --src=. --container-name=teslamate` succeeds locally
- [ ] CI container build passes
- [ ] Deploy from branch and verify teslamate starts cleanly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #333
2026-04-14 07:20:52 -07:00
4ca0630d76 Review enforce-tag-immutability doc: add review date and zot reference link
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 07:00:55 -07:00
d7c3c687f4 Document DR rebuild procedure and update restart-indri
- New how-to: rebuild-minikube-cluster with full bootstrap procedure
  validated during 2026-04-13 DR event
- Update restart-indri: warn about minikube delete, macOS permission
  dialog on first Tailscale SSH, forgejo_actions_secrets dep cycle
- Update disaster-recovery reference: link to rebuild procedure
- Update CLAUDE.md: never run minikube delete

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:07:54 -07:00
61fcd5d70a Upgrade grafana-sidecar 1.28.0 → 2.6.0 + container.py port (#332)
All checks were successful
Build Container / detect (push) Successful in 4s
Build Container / build-dagger (grafana-sidecar) (push) Successful in 1m50s
## Summary

- Upgrade grafana-sidecar from 1.28.0 to 2.6.0 (the 2.x memory regression #462 is resolved; ~35MB static overhead is acceptable)
- Port build from Dockerfile to native Dagger container.py
- Add liveness/readiness probes using the new /healthz endpoint on port 8080
- Update docs to reflect container.py migration and remove stale pin note

## Test plan

- [ ] Build container: `mise run container-build-and-release grafana-sidecar`
- [ ] Update kustomization tag with new image tag
- [ ] Deploy from branch: `argocd app set grafana --revision grafana-sidecar-2.6.0 && argocd app sync grafana`
- [ ] Verify sidecar health endpoint: `kubectl exec -n monitoring <pod> -c grafana-sc-dashboard -- wget -qO- http://localhost:8080/healthz`
- [ ] Verify dashboards load in Grafana UI
- [ ] `mise run services-check`

Reviewed-on: #332
2026-04-13 07:57:13 -07:00
6e60287e99 Doc review: delete install-dagger-on-nix-runner, add service-versions ref card
Outdated leaf card removed; zot.md now links to new service-versions
reference card instead. Added reverse link from review-services.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 09:52:38 -07:00
8d80a4a3a5 Rewrite runner-logs: API-based log fetching, multi-repo support
Replace broken SSH+filesystem log retrieval with Forgejo web API
endpoint. Fix CLI to use run numbers (not task IDs), add --repo
for querying any forge repo (e.g. sporks), --limit/-n for listing
size. Document runner-logs as the way to verify build success in
CLAUDE.md and container build docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 09:42:58 -07:00
138e23d525 Miniflux 2.2.19 + container.py migration + ty typechecker (#331)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (miniflux) (push) Successful in 1m3s
## Summary

- Upgrade miniflux from 2.2.17 to 2.2.19 (security hardening, performance improvements)
- Migrate miniflux from Dockerfile to native Dagger container.py build
- Refactor `alpine_runtime()` helper to support existing users (nobody/65534)
- Add `ty` (Astral) Python typechecker to prek hooks

## Test plan

- [ ] `dagger call build --src=. --container-name=miniflux` succeeds
- [ ] `dagger call container-version --container-name=miniflux` returns 2.2.19
- [ ] `mise run container-version-check` passes
- [ ] `ty check` passes cleanly
- [ ] `prek run --all-files` passes
- [ ] CI builds container successfully
- [ ] Miniflux healthcheck passes after deploy from branch

Reviewed-on: #331
2026-04-12 08:54:32 -07:00
c86b5d7772 Native Dagger container builds + Navidrome v0.61.1 (#330)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (navidrome) (push) Successful in 22m26s
## Summary
- Move Dagger module from `.dagger/` to repo root (`src/blumeops/`), rename `blumeops-ci` → `blumeops`
- Replace opaque `docker_build()` with native Dagger pipelines that surface full build errors per step
- Migrate navidrome as the first container (`containers/navidrome/container.py`)
- Upgrade navidrome from v0.60.3 to v0.61.1 (major artwork overhaul, SQLite FTS5 search, server-managed transcoding)
- Add `dagger call container-version` for CI version extraction without Dockerfile parsing
- All mise tasks (`container-list`, `container-version-check`, `container-build-and-release`) updated for hybrid mode
- Legacy `docker_build()` fallback preserved for all other containers

## Motivation
When navidrome v0.61.0 added a new Go build tag (`sqlite_fts5`), `docker_build()` showed only "exit code: 1". We had to run `docker build --progress=plain` manually to find `undefined: buildtags.SQLITE_FTS5`. Native Dagger pipelines show the full error inline.

## Container build dispatch needed
After merge, dispatch container build for navidrome:
```
mise run container-build-and-release navidrome --ref 470b4bd
```

## Deploy steps
1. Wait for container build to complete
2. Back up navidrome-data PVC (non-reversible DB migrations)
3. `argocd app set navidrome --revision main && argocd app sync navidrome`
4. Verify at https://dj.ops.eblu.me

## Future
Remaining containers migrate incrementally in follow-up PRs using the same pattern.

Reviewed-on: #330
2026-04-11 17:11:56 -07:00
a059d81314 Add review-compliance-reports task and reorganize report storage
New mise task fetches Prowler reports from sifaka, parses with proper
muted/unmuted distinction, shows week-over-week delta, and includes
a scaffold for Kingfisher once JSON/CSV output is available upstream.

Moved all legacy top-level reports on sifaka into date subdirectories
to match the current CronJob output structure. Updated
read-compliance-reports doc with task reference and links.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:16:46 -07:00
64200a55c5 Migrate Immich from Helm chart to kustomize manifests (v2.5.6 → v2.6.3)
Replace the Helm chart deployment with plain kustomize manifests following
the Authentik pattern (separate deployments per component). Consolidate
the immich-storage ArgoCD app into the main immich app. Add no-helm-policy
doc establishing kustomize as the standard deployment mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:42:25 -07:00
baee7ae54b Review single-user-cluster control and add evidence collection card
Stamp single-user-cluster last-reviewed to 2026-04-01 after verifying
Tailscale ACLs and kubeconfig distribution. Add aspirational how-to card
documenting what PCI DSS evidence collection would look like (CCW,
artifacts, Drata workflow). Link from existing review process card.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 22:01:57 -07:00
a18a424866 Pin NixOS service versions via nixpkgs-services overlay (#321)
## Summary

- Add `nixpkgs-services` flake input pinned to a specific nixpkgs commit, with an overlay that pulls `forgejo-runner`, `snowflake`, and `k3s` from it instead of the rolling `nixpkgs`
- Dagger `flake-update` pipeline now excludes `nixpkgs-services` via `--exclude`
- Fix stale nix-container-builder version in service-versions.yaml (was 12.6.4, actually running 12.7.2)
- Add k3s and minikube to service-versions.yaml tracking
- Document the pinning approach in review-services how-to and ringtail reference

## Motivation

During service review, discovered that flake updates had silently upgraded forgejo-runner from 12.6.4 → 12.7.2 without updating service-versions.yaml. This "sneak-in upgrade" bypasses the service review process. The overlay ensures these three services only change versions deliberately.

## Test plan

- [ ] Verify `nix flake update` from `nixos/ringtail/` does not change `nixpkgs-services` lock entry
- [ ] Verify `mise run provision-ringtail` builds successfully with the overlay
- [ ] Confirm running service versions unchanged after deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #321
2026-04-01 21:37:57 -07:00
4059b3d27b Add compensating controls framework and date-based report dirs (#320)
## Summary

- Add `compensating-controls.yaml` tracking 9 named controls that justify suppressed security findings
- Update all Prowler mutelist descriptions with `CC: <id>` references to named controls
- Add `mise run review-compensating-controls` task — surfaces stalest control with all codebase references
- Add [[review-compensating-controls]] how-to doc
- Organize Prowler and Kingfisher reports into `YYYY-MM-DD` subdirectories

### Compensating controls

| ID | Mitigates |
|----|-----------|
| `single-user-cluster` | Image cache abuse, RBAC breadth, system pod privileges |
| `tailscale-network-isolation` | Profiling endpoints, weak TLS, debug ports |
| `local-registry` | AlwaysPullImages gap |
| `sso-gated-admin-tools` | ArgoCD wildcard RBAC |
| `operator-managed-pods` | Tailscale proxy pod security settings |
| `ephemeral-privileged-jobs` | Prowler hostPID exposure |
| `trusted-ci-only` | Forgejo runner DinD |
| `init-container-isolation` | Grafana root init container |
| `observability-stack-audit` | Missing apiserver audit logging |

## Test plan

- [ ] `mise run review-compensating-controls` shows table and references
- [ ] `kubectl kustomize argocd/manifests/prowler/` renders correctly
- [ ] Sync prowler and kingfisher, verify next scan writes to dated subdirectory
- [ ] Grep for `CC:` in mutelist files — every muted finding should have at least one

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #320
2026-03-30 17:44:11 -07:00
f9206bf10b Build custom Kingfisher container from sporked deploy branch (#318)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 12s
## Summary

- Add Dockerfile for Kingfisher built from source (sporked deploy branch)
- Multi-stage: Rust build with Boost/vectorscan, debian-slim runtime
- Switch CronJob from upstream `ghcr.io/mongodb/kingfisher` to `registry.ops.eblu.me/blumeops/kingfisher`
- Add kingfisher to service-versions.yaml (version tracks upstream main SHA)
- Document spork workflow in CLAUDE.md

## Test plan

- [ ] Build container: `mise run container-build-and-release kingfisher 1d37d29`
- [ ] Verify image on registry: `mise run container-list`
- [ ] Update kustomization newTag
- [ ] Sync ArgoCD kingfisher app from branch
- [ ] Trigger manual CronJob and verify scan completes
- [ ] Verify reports on sifaka

Reviewed-on: #318
2026-03-30 06:34:49 -07:00
99df78664e Note upstream history rewrite as a spork sync failure mode
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 08:22:11 -07:00
6ecfaf02b6 Add spork strategy: tooling and documentation
Spork-create mise task sets up a floating-branch soft-fork of a
mirrored upstream project with daily mirror-sync via Forgejo Actions.
Includes explanation card, how-to guides for setup and branch
management, and the spork-create uv script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 22:58:10 -07:00
2bd1611ac1 Document sifaka NFS/Tailscale TUN troubleshooting
Sifaka's Tailscale can revert to userspace networking after package
updates, causing NFS mounts to fail because the NFS daemon sees
127.0.0.1 instead of the client's Tailscale IP. Added troubleshooting
how-to doc and updated sifaka reference card with frigate export and
TUN requirement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 09:12:00 -07:00
3017f759a7 Migrate Forgejo from Homebrew to source build (#316)
## Summary

- Migrate Forgejo from Homebrew to source-built binary with mcquack LaunchAgent
- Matches the established pattern used by zot, caddy, and alloy
- Upgrades to v14.0.3 (7 security fixes: PKCE bypass, OAuth scope bypass, open redirect, and more)

## Changes

- **Ansible role**: Replace brew install/services with binary stat check + LaunchAgent
- **Paths**: `/opt/homebrew/var/forgejo` → `~/forgejo`, binary at `~/code/3rd/forgejo/forgejo`
- **Run user**: `forgejo` → `erichblume` (LaunchAgent user; SSH git user stays `forgejo`)
- **Docs**: Updated Forgejo reference card, restart-indri guide
- **Service review**: Stamped frigate-notify, cloudnative-pg, blumeops-pg as current

## One-time migration steps (manual, on indri)

1. Clone from Codeberg, add forge mirror remote
2. Check out v14.0.3, build with `make build && make forgejo`
3. Stop brew, `cp -a` data to `~/forgejo`, fix ownership
4. Run `provision-indri --tags forgejo`
5. Verify, then `brew uninstall forgejo`

## Data safety

- `cp -a` preserves everything (repos, SQLite DB, LFS, sessions, OAuth config)
- Brew version stays installed as rollback until verification passes
- No schema changes between 14.0.2 → 14.0.3

Reviewed-on: #316
2026-03-28 08:19:23 -07:00
66a47738dd Add ringtail post-deploy maintenance: kernel check, generation pruning, GC
Update manage-lockfile doc with post-deploy steps (kernel update detection,
reboot guidance, generation pruning). Add prune-ringtail-generations mise
task that keeps the 5 most recent generations plus the most recent one
matching the booted kernel for safe rollback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:55:45 -07:00
687e972713 Review CV doc and close build-dep review gap
Fix stale CV service doc (URL, forge domain, container tag) and add
guidance for reviewing build-time dependencies in private forge repos
during service reviews.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:12:38 -07:00
fe201a495c Add Prowler IaC scanning of blumeops repo (Saturday 2am)
Clone repo in init container, scan Dockerfiles and K8s manifests
with Prowler's IaC provider (Trivy). Reports written to
sifaka:/volume1/reports/prowler-iac/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:49:38 -07:00
696024306c Add Prowler image vulnerability scanning for blumeops containers
All checks were successful
Build Container / detect (push) Successful in 39s
Build Container / build-dockerfile (prowler) (push) Successful in 10m15s
Add Trivy to the Prowler container for image and IaC scanning.
New CronJob (Saturday 3am) scans all blumeops/* images in the
registry for CVEs, embedded secrets, and Dockerfile misconfigs.
Reports written to sifaka:/volume1/reports/prowler-images/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:43:08 -07:00
d021b3534f Deploy Prowler CIS scanner (#310)
All checks were successful
Build Container / detect (push) Successful in 4s
Build Container / build-dockerfile (prowler) (push) Successful in 10s
## Summary
- Deploy Prowler 5 as a weekly CronJob on minikube-indri for CIS Kubernetes Benchmark v1.11 scanning
- Custom slim container build (strips PowerShell, Trivy, and non-K8s providers from upstream)
- Reports (HTML, CSV, JSON-OCSF) written to NFS share on sifaka at `/volume1/reports/prowler/`
- Read-only ClusterRole for pod, RBAC, and control plane inspection
- Host path mounts + hostPID for kubelet file permission checks

## Follow-ups
- Mirror prowler-cloud/prowler on forge for supply chain control
- Build and push container image, update kustomization.yaml newTag
- Consider adding k3s-ringtail scanning (core + RBAC checks only)

## Test plan
- [ ] Build container: `mise run container-release prowler v5.22.0`
- [ ] Update `argocd/manifests/prowler/kustomization.yaml` newTag to built image tag
- [ ] Sync ArgoCD: `argocd app sync apps && argocd app set prowler --revision deploy-prowler && argocd app sync prowler`
- [ ] Trigger manual job: `kubectl create job --from=cronjob/prowler prowler-manual -n prowler --context=minikube-indri`
- [ ] Verify reports appear on sifaka NFS share
- [ ] `mise run services-check`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #310
2026-03-24 16:08:09 -07:00
fd0bebb0fc Localize authentik-redis container (#309)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dockerfile (alloy) (push) Successful in 12s
Build Container / build-dockerfile (ntfy) (push) Successful in 11s
Build Container / build-nix (alloy) (push) Successful in 20s
Build Container / build-nix (authentik) (push) Successful in 6m10s
Build Container / build-nix (authentik-redis) (push) Successful in 20s
Build Container / build-nix (ntfy) (push) Successful in 6s
## Summary

- Replace upstream `docker.io/library/redis:7-alpine` (Redis 7.4.8) with a nix-built container using Redis 8.2.3 from nixpkgs
- Introduce **attached service pattern**: `parent` field in service-versions.yaml, `<parent>-<component>` naming convention, and `assert pkgs.redis.version == version` in default.nix to prevent silent version drift on `flake.lock` updates
- Document the pattern in [[review-services]] so future attached services slot in cleanly
- Backfill `parent: grafana` on existing `grafana-sidecar` entry

## Version drift protection

1. `flake.lock` update bumps nixpkgs redis → `assert` in `default.nix` breaks `nix-build`
2. Developer updates `version` in `default.nix` → prek's `container-version-check` demands matching `service-versions.yaml` update
3. Both must agree before commit succeeds

## Test plan

- [ ] Build container from branch on ringtail (`mise run container-build-and-release authentik-redis`)
- [ ] Update kustomization `newTag` to branch-built image tag
- [ ] Sync authentik ArgoCD app from branch (`argocd app set authentik --revision localize-redis && argocd app sync authentik`)
- [ ] Verify Authentik login, session persistence, and task queue still work
- [ ] After merge: C0 follow-up to update `newTag` to the main-built image tag

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #309
2026-03-24 13:27:36 -07:00
fc45989a6c Decommission JobSync service (#308)
All checks were successful
Build Container / detect (push) Successful in 3s
## Summary

- Remove all JobSync infrastructure: ArgoCD app, k8s manifests, container build (nix), Caddy reverse proxy entry, Homepage dashboard entry, service-versions tracking, and all documentation
- Runtime teardown already completed: ArgoCD app cascade-deleted (removes deployment, PVC, service, ingress, external-secret), forge mirror deleted, 1Password item archived, local clone removed

## Motivation

Replacing JobSync with a datasette-based job tracking pipeline driven by mise tasks and a Claude agent frontend. JobSync's Next.js server actions don't expose a useful API for automation.

## Remaining manual steps after merge

- Provision Caddy to remove the stale proxy route: `mise run provision-indri -- --tags caddy`
- Sync Homepage: `argocd app sync homepage`
- Verify namespace cleanup on ringtail: `kubectl get ns jobsync --context=k3s-ringtail` (should be gone)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #308
2026-03-24 08:44:23 -07:00
bd0ff30d3f Unify container build workflows (#306)
All checks were successful
Build Container / detect (push) Successful in 3s
## Summary
- Merges `build-container.yaml` and `build-container-nix.yaml` into a single workflow
- Detect job classifies each changed container by presence of `Dockerfile` and/or `default.nix`
- Dockerfile containers build on `k8s` (indri) via Dagger; Nix containers build on `nix-container-builder` (ringtail) via nix-build + skopeo
- Containers with both build files (alloy, nettest, ntfy) get built on both runners

## Test plan
- [ ] Push a change to a Dockerfile-only container (e.g. grafana) — verify it builds on k8s only
- [ ] Push a change to a nix-only container (e.g. jobsync) — verify it builds on nix-container-builder only
- [ ] Push a change to a dual container (e.g. ntfy) — verify it builds on both runners
- [ ] Test workflow_dispatch with a specific container name

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #306
2026-03-23 20:55:50 -07:00
6d65e6928c C2: Deploy infrastructure alerting pipeline (#303)
## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303
2026-03-22 14:52:56 -07:00
e5ce510fdc Fix plan-a-meal random recipe API queries
Mealie's orderBy=random requires a paginationSeed parameter, otherwise
the API returns 422. Added the seed to all random query examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:10:48 -07:00
6d7597670e Add plan-a-meal how-to for Mealie cooking timelines
Agent-facing guide for generating unified cooking timelines from
Mealie meal plans. Covers querying the API, picking balanced meals
(protein/carb/vegetable), and interleaving recipe steps into a
relative timeline so everything finishes together.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:07:16 -07:00
0f5377568d Review operations docs: add last-reviewed dates and improve troubleshooting
Mark run-1password-backup and troubleshooting as reviewed. Troubleshooting
gets inline wiki-links for all referenced services, a new ringtail/k3s
section, and a cross-reference to restart-indri.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 07:38:02 -07:00