blumeops

Author	SHA1	Message	Date
Erich Blume	4b5a0c376a	C1(unpoller-v3): bump kustomization to v3.2.0-1b27242 Built by build-container workflow run #559 from this branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 09:46:00 -07:00
Erich Blume	1b27242437	C1(unpoller-v3): upgrade v2.34.0 -> v3.2.0, migrate to container.py Major version bump from v2.34.0 to v3.2.0. Breaking changes upstream: - v3.0.0: UniFi network API shifts (later 10.x); metrics, events and logs may have changed names/labels. - v3.2.0: defaults to a 60s background poll feeding cached Prometheus scrapes (was on-demand poll per scrape). Set interval = 0 in up.conf to restore on-demand behavior if needed. Also migrate the container build from a Dockerfile to a native Dagger pipeline (containers/unpoller/container.py) using the shared helpers in blumeops.containers, following the navidrome/miniflux pattern. The build-container workflow already prefers container.py when present. Refresh last-reviewed and current-version in service-versions.yaml. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-28 09:34:12 -07:00
Erich Blume	f6febb1f77	C0: switch fly proxy deploy strategy to immediate Bluegreen kept timing out — the new green machine couldn't reach "started" within Fly's 5-minute deploy budget. The cold-start sequence (tailscaled → tailscale up → wait-for-MagicDNS → nginx startup) eats most of that, leaving no headroom for healthcheck propagation. For a single-machine proxy, bluegreen offers little benefit anyway: no warm second instance, so trading 5-10s of downtime for predictable completion is the right call.	2026-05-28 07:59:22 -07:00
Erich Blume	4e25180b0a	C0: clone blumeops via tailnet on ringtail provision Switch ringtail.yml from forge.eblu.me (Fly proxy, WAN) to forge.ops.eblu.me (Caddy on indri, tailnet). Ringtail is always on the tailnet — the WAN round-trip was overhead and made provision-ringtail fail any time Fly was slow or down.	2026-05-28 07:13:40 -07:00
Erich Blume	c00d7db507	Recurring maintenance batch (2026-05-27) (#360) Bundle of recurring overdue tasks: - Ringtail flake update - Security & compliance report review - Tooling deps bump (prek, fly, mise, forgejo workflows) - Top stale doc review - Top stale service review (if trivial) Larger items (service version bumps requiring upgrades, non-local container migration) split out as separate PRs. Reviewed-on: #360	2026-05-28 06:01:57 -07:00
Erich Blume	753fa9cb63	C0: disable VRR on ringtail DP-1 to stop OMEN panel flicker The OMEN 27i IPS pumps brightness when its refresh swings into the low VRR range during low-framerate content (game cutscenes), producing a ~20Hz flicker that compounds over a session until a reboot. GPU health is clean (no Xid/ECC/thermal); pinning fixed 165Hz eliminates it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 12:59:29 -07:00
Erich Blume	c09bd5b612	C0: cap systemd-coredump on ringtail to stop game-crash lockups Wine/Proton game segfaults (e.g. Diablo IV) produced multi-GB cores that systemd-coredump spent minutes compressing to disk, pinning the CPU and freezing the desktop. Cap ProcessSizeMax/ExternalSizeMax at 1G (oversized cores logged but skipped) and MaxUse at 2G to bound the store. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 11:54:32 -07:00
Erich Blume	35ae171783	C0: fix sync button location in manage-forgejo-mirrors The verify step pointed to the main repo page, but the "Synchronize now" button is in the Mirror settings section of the settings page. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-27 07:15:07 -07:00
Erich Blume	57fd88b269	C0: fix op item edit syntax in zot key rotation The pbpaste | op item edit ... "field[password]=-" stdin syntax is rejected by op 2.34 as "invalid JSON" — recent op versions treat piped input as a full JSON template, not a single field value. Procedure now uses an inline assignment via a local fish variable.	2026-05-22 21:50:43 -07:00
Erich Blume	08a1cb164a	C0: fix 1password export filename in backup how-to 1Password's desktop app names exports as 1PasswordExport-<uuid>-<timestamp>.1pux automatically — you can't choose the name. Procedure now points the task at that glob.	2026-05-22 21:36:13 -07:00
Erich Blume	d02bf062af	C0: review 1password reference card Added vault split (blumeops vs Personal), noted onepassword-connect runs on both indri and ringtail, and lifted op CLI guidance from agent memory into the card. Bumped last-reviewed.	2026-05-22 21:29:11 -07:00
Erich Blume	ee51bcafb4	Rip out compensating-controls framework (#359) ## Summary Removes the compensating-controls (CC) framework. Prowler and Kingfisher continue to run weekly and produce reports; the Prowler mutelist YAML files stay in place but no longer carry `CC: <id>` prefixes — each entry now just keeps a free-form `Description` of why it's muted. The CC review cadence proved to be more process overhead than this single-operator homelab needed. ## What changed Deleted - `compensating-controls.yaml` — the CC registry - `mise-tasks/review-compensating-controls` — the staleness-review task - `docs/how-to/operations/review-compensating-controls.md` - `docs/how-to/operations/record-review-evidence.md` (was aspirational) - `docs/explanation/compliance-mute-categories.md` (proposed-future CC/NA/RA work) - 5 orphan `+review-cc-` / `+compliance-mute-categories` changelog fragments Modified - 6 mutelist YAML files: stripped `CC: <id>.` prefix from every `Description` / `statement` field, kept the free-form text - `mise-tasks/review-compliance-reports`: removed CC mentions from docstrings, panel text, and the node-verification table title. Node-verification logic itself is unchanged. - `docs/reference/operations/security.md`: removed the "Compensating controls" section - `docs/how-to/operations/read-compliance-reports.md`: rewrote step 3 of "Acting on findings" to point at the mutelist YAML directly - `docs/changelog.d/prowler-iac-mutelist.infra.md`: rewrote to drop the "two new compensating controls" framing ## What did not change - All Prowler manifests (cronjobs, RBAC, PVs, kustomization) — scans still run on the same schedule - The Kingfisher deployment - The trivy-shim in the Prowler container — that's about Trivy ignorefile plumbing, independent of the CC concept - The mutelist entries themselves — each `Resources` list is unchanged; only the prose of `Description` was edited - `CHANGELOG.md` — historical releases are left as-is ## Test plan - [ ] Wait for human review before deploying — once merged, re-point ArgoCD: `argocd app set prowler --revision main && argocd app sync prowler` (no manifest changes besides the ConfigMap, so impact is limited to muted-finding descriptions in next week's report) - [ ] Confirm next weekly Prowler K8s CIS run (Sunday 3am) still completes and produces a report on sifaka - [ ] Confirm next weekly Prowler IaC run still honors `trivyignore.yaml` (the trivy shim is untouched but the ignorefile content was rewritten) - [ ] `mise run review-compliance-reports` — verify node-verification block still runs and prints the renamed table title Reviewed-on: #359	2026-05-22 21:08:53 -07:00
Erich Blume	2fae0f7161	C0: switch grafana deployment to Recreate strategy Grafana uses an RWO PVC for SQLite + Bleve search index. RollingUpdate spawns the new pod before terminating the old one, so the new pod crashloops on the index lock until rollout timeout. Recreate terminates the old pod first, letting the new pod acquire the lock cleanly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-19 06:33:26 -07:00
Erich Blume	1897eb1c5b	C0: move immich blackbox probe to ringtail alloy Immich migrated to ringtail's k3s cluster but the probe still targeted the in-cluster service DNS on indri's minikube, firing ServiceProbeFailure indefinitely. Moved the target into alloy-ringtail's config so the probe runs in the cluster where immich actually lives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 08:46:22 -07:00
Erich Blume	e222d47d45	C0: deploy shower v1.1.3 (kustomize newTag bump) Image v1.1.3-3645098-nix was built directly on ringtail and pushed via skopeo, bypassing the Forgejo runner: indri was severely overloaded (load avg 24.92, minikube VM at 344% CPU) and the workflow-dispatch endpoint timed out. The image content is identical to what the runner would have produced — same default.nix at commit `3645098` (on main), same NIX_PATH (current nixpkgs flake), same skopeo invocation. Tag short-sha matches the commit that defines the recipe so we aren't pinning to a ghost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 20:09:54 -07:00
Erich Blume	3645098bf1	C0: bump shower to v1.1.3 Wheel/sdist + FOD hashes probed on ringtail. Full nix-build verified end-to-end before commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 19:57:37 -07:00
Erich Blume	96dbbb3cbe	C0: add sn2-prelaunch wrapper to clear SN2 stale lockfiles UE5 writes Saved/running.dat as a "session in progress" marker. If the previous session exited uncleanly (SIGKILL, crash), it lingers, and SN2 pops up an invisible 0×0 Error dialog at next launch that the GameThread blocks on forever — visible only as a black screen with a spinning loader. Wrap the Steam command to clear the marker files before each launch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 12:26:10 -07:00
Erich Blume	815a0cc6e6	C0: shower — rebuild from main SHA (post-merge retag) PR #358 was squash-merged so the branch commit `b8c7783` baked into the prior image tag isn't reachable from main's history. Rebuild from main HEAD (`a33fa47`) and retag. Image content is byte-identical (FOD is content-addressed, inputs unchanged); only the SHA in the tag changes so future provenance tracing stays on main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 06:57:24 -07:00
Erich Blume	a33fa47b80	C1: deploy shower v1.1.2 (#358) ## Summary Deploys `adelaide-baby-shower-app` v1.1.2 to ringtail k3s. - Bumps `containers/shower/default.nix` `version` to 1.1.2. - Refreshes sdist + wheel `fetchurl` hashes against the forge PyPI artifacts. - Re-probed FOD `outputHash` on the nix-container-builder runner (ringtail) and pinned the new closure hash. - Bumps kustomize `newTag` to `v1.1.2-b8c7783-nix` (built from this branch's tip). - Bumps `service-versions.yaml` entry for shower to `1.1.2` / `last-reviewed: 2026-05-15`. ## Build provenance Built by Forgejo Actions run #553 on `nix-container-builder` (ringtail) at commit `b8c7783`. After merge a C0 follow-on will rebuild from main and retag so future provenance points at main history. ## Test plan - [ ] `argocd app set shower --revision shower-v1.1.2 && argocd app sync shower` deploys cleanly - [ ] Pod migrates the SQLite PV and serves at `shower.ops.eblu.me` / `shower.eblu.me` - [ ] No new errors in pod logs after `collectstatic` + gunicorn boot Reviewed-on: #358	2026-05-15 06:50:46 -07:00
Erich Blume	12314857d8	C0: add GE-Proton to ringtail Steam extraCompatPackages Lets Subnautica 2 (and any other game) opt into the GE-Proton build via Steam's per-game compatibility tool override, as a workaround for the Proton Experimental + DXVK D3D12 Mercuna hang. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-15 06:27:43 -07:00
Erich Blume	4d2bc9975f	C0: deploy shower v1.1.1 (kustomize newTag bump) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 20:51:10 -07:00
Erich Blume	4e117dc921	C0: pin shower v1.1.1 FOD outputHash (probed on ringtail) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 20:40:22 -07:00
Erich Blume	6e90c4c363	C0: bump shower to v1.1.1 (probe FOD hash) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-13 20:12:00 -07:00
Erich Blume	dc69b8c68b	C1: fix borgmatic shower SQLite dump (ssh to ringtail) (#357) ## Summary Nightly borgmatic backups have been failing for 2 days. Root cause: the shower SQLite dump `before_backup` hook (added in PR #349) referenced `kubectl --context=k3s-ringtail`, but indri's kubeconfig deliberately doesn't carry the ringtail credentials. The hook's failure aborted the entire run, taking out both the local sifaka repo and the BorgBase offsite. Verified the last good archive was `indri-2026-05-11T02:00`. ## Approach ssh into ringtail and run `k3s kubectl` there — no indri-side kubeconfig needed. `/etc/rancher/k3s/k3s.yaml` is mode 644 so no sudo required, and the existing ssh access from indri to ringtail works. Inline-shell quoting got hairy fast (fish on ringtail rejected `POD=...` bash syntax; the nix shower image lacks `tar` so `kubectl cp` fails). Pulled the dump logic into `~/bin/borgmatic-k8s-sqlite-dump`, deployed by the ansible role. Each dump entry now declares a `target`: - `local:<context>` — local kubectl with explicit context (mealie) - `ssh:<user@host>` — ssh + `k3s kubectl` on the cluster host (shower) Bytes come back via `kubectl exec ... -- cat` instead of `kubectl cp` since `cp` needs `tar` in the pod (nix-built containers don't bundle it). ## Test plan - [x] `mise run provision-indri -- --tags borgmatic --check --diff` shows expected diff - [x] Apply, helper script deployed at `~/bin/borgmatic-k8s-sqlite-dump` - [x] Helper invoked directly with `ssh:eblume@ringtail` produces a valid 288 KB SQLite file - [x] Full `borgmatic create` completes without errors — both mealie.db (1.7 MB) and shower.db (288 KB) appear in `~/.local/share/borgmatic/k8s-dumps/`, archive `indri-2026-05-13T17:31:02` written to sifaka borg repo 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #357	2026-05-13 18:55:50 -07:00
Erich Blume	947e4310c3	C2: migrate immich from minikube to ringtail (mikado chain) (#356) ## Summary C2 Mikado chain to move the entire Immich stack (server, ML, valkey, postgres) off `minikube-indri` and onto `k3s-ringtail`. Immich is the largest single tenant on minikube (~1.5 GiB resident) and minikube is currently memory-saturated (97% RAM, swapping). This is the first concrete chain in the broader indri-k8s decommission effort. This PR contains the planning layer only — 7 cards (1 goal + 6 prerequisites). Implementation cycles follow per the Mikado Branch Invariant. ## Goal end-state - Immich `server`, `machine-learning`, `valkey` on ringtail. - ML pod uses ringtail's RTX 4080 (performance win — currently CPU-only). - CNPG `immich-pg` (PG17 + VectorChord) runs on ringtail. - Library still on sifaka NFS — ringtail mounts the same path. - `photos.ops.eblu.me` reroutes through Caddy → ringtail ingress. - Minikube `immich` and `immich-pg` are removed. ## Cards | Card | Depends on | |---|---| | `migrate-immich-to-ringtail` (goal) | all six below | | `cnpg-on-ringtail` | — | | `immich-pg-on-ringtail` | cnpg-on-ringtail | | `immich-pg-data-migration` | immich-pg-on-ringtail | | `sifaka-nfs-from-ringtail` | — | | `immich-app-on-ringtail` | immich-pg-on-ringtail, sifaka-nfs-from-ringtail | | `immich-cutover-and-decommission` | immich-pg-data-migration, immich-app-on-ringtail | ## Key constraints - No data loss. Downtime is acceptable; data loss is not. Two surfaces matter: postgres (ML embeddings, face data — slow to re-derive) and the library files (don't move, but NFS access from ringtail must be verified). - Migration method: Option A is a CNPG `externalCluster` basebackup → promote. Option B is `pg_dump`/`pg_restore` as a documented fallback. Either way, dry-run against a scratch cluster first. - Why pg moves too (not cross-cluster): keeping pg on minikube would block the whole decommission, and Immich is chatty with pg so tailnet round-trips would hurt. ## Test plan - [ ] Plan review — does the dependency graph make sense? - [ ] `mise run docs-mikado migrate-immich-to-ringtail` shows the chain correctly. - [ ] Per-card implementation cycles land separately (commit convention enforced by hook). Reviewed-on: #356	2026-05-13 16:46:17 -07:00
Erich Blume	bc8ceb502b	Merge pull request 'C1: pin ringtail wired IP to 192.168.1.21 (static)' (#355) from ringtail-static-ip into main	2026-05-12 09:59:59 -07:00
Erich Blume	a4a30aad44	fix(ringtail): explicitly enable net.ipv4.ip_forward After the static IP change, k3s/flannel pod networking broke because ip_forward was 0. NixOS doesn't enable IP forwarding by default — it was previously being set implicitly somewhere in the NM-managed / scripted-DHCP path. With static networking we have to set it ourselves. Verified at runtime via sysctl -w before adding here; pod outbound came back immediately and Tailscale VIP services recovered without any pod restarts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 09:51:16 -07:00
Erich Blume	d0b5423135	C1: pin ringtail wired IP to 192.168.1.21 (static) Removes DHCP lease renewal as a failure mode on ringtail after an outage on 2026-05-12 where the IP and routes silently disappeared from enp5s0 without any kernel link event. NetworkManager stays enabled for wireless fallback but no longer manages the wired interface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 09:33:57 -07:00
Erich Blume	dc0916a548	C0: shower — rebuild from main SHA (post-merge retag) PR #354 was squash-merged so the branch commit `444ff91` baked into the prior image tag isn't reachable from main's history. Rebuild from main HEAD (`3c7967e`) and retag. Image content is byte-identical (FOD is content-addressed, inputs unchanged); only the SHA in the tag changes so future provenance tracing stays on main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 20:20:39 -07:00
Erich Blume	3c7967e445	C1: deploy shower v1.1.0 (phases + guest memories) (#354) ## Summary Deploys `adelaide-baby-shower-app` v1.1.0 to ringtail k3s. ### App changes (since v1.0.2) - Four-phase `ShowerState` replaces the boolean `locked` flag — `pre_event` → `party` → `prizes_locked` → `event_locked` — with a backfill migration that maps `locked=True → pre_event`, `locked=False → party`. - Guest memories: append-only photos + comments panel where guests can leave notes for the baby. Adds `GuestPhoto` + `GuestComment` models with file-extension validators and a max-size validator; new `shower.imaging` module for thumbnail generation. - Admin + QR polish: configurable host link, fixed "View Site" URL, guest-facing QR copy improvements, contest tweaks. Three Django migrations run automatically in the entrypoint against the SQLite PV: - `0009_shower_phase` - `0010_guest_memories` - `0011_book_description` No ConfigMap / env-var changes. The deploy uses `strategy: Recreate` with a single replica, so the old pod releases the data PVC before the new one mounts it and runs migrations. ### Container build changes The v1.1.0 tag exposed a latent issue with the Forgejo PyPI install path: - The recent commit `2d38418e` closed the forge package leak at the Fly edge by blocking `/api/packages/*` publicly. - Forgejo's PyPI simple index returns absolute file URLs hardcoded to its public `ROOT_URL` (`forge.eblu.me`), so pip-installing from the tailnet index URL still tries to download from `forge.eblu.me` → 403. - Previous shower builds escaped this because their FOD outputs were already in the nix store; bumping to a new version forced a fresh pip run that hit the block. Fix mirrors what we already do for the sdist: both wheel and sdist are pulled via direct `fetchurl` against `forge.ops.eblu.me`, then the wheel is copied to TMPDIR under its clean filename (nix store path's hash prefix breaks pip's wheel-filename parser) and handed to pip as a local path. The forge `--extra-index-url` is no longer needed. FOD outputHash pinned to `sha256-kTNOswobtkgyQmmqbQM8XO4vvaGg57nCuuZGbNXb0NM=` from run 547. Image: `registry.ops.eblu.me/blumeops/shower:v1.1.0-444ff91-nix`. ### Adjacent finding (already handled) The ringtail `gitea-runner-nix_container_builder` systemd unit was left `inactive` after the recent `provision-ringtail` (matches the known `sshd-restart-hangs-mux` lesson — the rebuild changed the unit's PATH closure + config.yaml, systemd stopped it, then the playbook hung before the activation could restart it). Manually started; the existing memory `lesson_provision_ringtail_ssh_hang.md` was extended to mention the runner as the canary service to check after provisions. ## Test plan - [ ] `argocd app diff shower --revision shower-v1.1.0` — review the manifest change - [ ] `argocd app set shower --revision shower-v1.1.0 && argocd app sync shower` - [ ] `kubectl --context=k3s-ringtail logs -n shower deploy/shower` — confirm migrations 0009/0010/0011 applied, no errors - [ ] Hit `https://shower.ops.eblu.me/` (tailnet) — splash page renders, phase indicator visible - [ ] Hit `https://shower.ops.eblu.me/host/` — host console loads, phase dropdown shows the four states - [ ] Hit `https://shower.eblu.me/` (public via Fly) — splash page still served - [ ] After merge: `argocd app set shower --revision main && argocd app sync shower` Reviewed-on: #354	2026-05-11 20:08:03 -07:00
Erich Blume	fbc1f7720e	C0: gitignore .claude/scheduled_tasks.lock Transient lock file written by the ScheduleWakeup harness tool when Claude paces its own work between long-running operations. Not config, not state worth checking in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:37:29 -07:00
Erich Blume	4133785119	C1: ringtail — weekly flake.lock update (#352) ## Summary - Recurring weekly lockfile refresh for `nixos/ringtail/flake.lock`. - Inputs updated: `disko`, `home-manager`, `nixpkgs`. - `nixpkgs-services` was deliberately skipped (per overlay convention — pinned services bump only on intentional update). - Generated via `dagger call flake-update --src=. --flake-path=nixos/ringtail`. ## Test plan - [x] `prek` hooks pass - [ ] After merge: `mise run provision-ringtail` to deploy - [ ] Then check for kernel update per [[manage-lockfile]] ## Notes - Not deployed from this PR — provisioning is a follow-up. Reviewed-on: #352	2026-05-11 16:13:07 -07:00
Erich Blume	145df76d06	C1: service review — mealie (v3.12.0 deployed; upstream v3.17.0) (#351) ## Summary - Recurring service review for `mealie`. - Upstream is at v3.17.0 (released 2026-05-06); deployed image is v3.12.0 — 5 minor versions behind. - Container is built locally from the forge mirror (`containers/mealie/Dockerfile`), so upgrade requires a fresh build + changelog review for breaking changes between v3.12 and v3.17. - Deferring the actual upgrade to a separate task; this PR just refreshes `last-reviewed` and captures the gap in `notes`. ## Test plan - [x] `prek` hooks pass - [ ] Follow-up: open task to bump `containers/mealie/Dockerfile` `CONTAINER_APP_VERSION`, build, and update kustomization tag ## Notes - No deployment changes in this PR. Reviewed-on: #351	2026-05-11 16:12:36 -07:00
Erich Blume	bb7efa850a	C1: doc review — replicating-blumeops tutorial (#350) ## Summary - Periodic doc review of `tutorials/replicating-blumeops.md` (was never reviewed). - Fixed 4 instances of "BluemeOps" → "BlumeOps" (also caught 1 in `contributing.md`). - Added `last-reviewed: 2026-05-11` and bumped `modified`. - Verified all wiki-link targets resolve. ## Test plan - [x] `prek` hooks pass (link checker, frontmatter checker) - [ ] Optional: `mise run docs-preview docs/tutorials/replicating-blumeops.md` Reviewed-on: #350	2026-05-11 16:11:35 -07:00
Erich Blume	f83be3bf37	C1: review CC observability-stack-audit (extend to k3s) (#353) ## Summary - Recurring compensating-control review (oldest stale control: 42 days). - Verified the control is in effect on both clusters: - `alloy-k8s` on minikube-indri — Synced/Healthy, DaemonSet 1/1 ready - `alloy-ringtail` on k3s-ringtail — Synced/Healthy - `loki` (`monitoring/loki-0`) — Running, receiving logs (52 restarts in 18h is worth watching but not blocking review) - Generalized the description: previously named only minikube, but the indri→ringtail migration means we now operate two clusters and both rely on this control. - Added a follow-up note: enabling native apiserver audit logging is far more tractable on k3s (`--audit-log-path` / `--audit-policy-file`) than it was on minikube — worth revisiting once the migration concludes. ## Test plan - [x] `prek` hooks pass - [x] Verified alloy + loki status via `kubectl --context=minikube-indri` and `argocd app get` ## Notes - No deployment changes. Reviewed-on: #353	2026-05-11 16:10:39 -07:00
Erich Blume	40d9a1ef9e	C0: shower — rebuild from main SHA (post-PR-349 retag) Standard squash-merge dance per docs/how-to/deployment/build-container-image.md#Squash-merge-and-container-tags — retags from v1.0.2-039d9b9-nix (branch SHA) to v1.0.2-292d354-nix ([main] tag from run 544 built off the merge commit). Functionally identical; preserves source traceability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 13:55:25 -07:00
Erich Blume	292d354902	C1: deploy adelaide-baby-shower-app to ringtail k3s (#349) ## Summary Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app). ### What's included - ArgoCD app + manifests under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer. - Public surface (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett. - fail2ban filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default). - Caddy route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`). - Pulumi Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`. - Grafana APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter. - Container `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages. - Docs runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments. ### Defense layers on the public surface 1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list) 2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP) 3. django-axes (5 fails / 1h, keyed on username+ip_address) 4. edge `/admin/` block (returns 403 for anything that isn't login/logout) ## Prerequisites for the user to do (NOT in this PR) Halted on these per request — they touch shared/manual systems: - [x] NFS share on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000` - [ ] 1Password item `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git - [ ] Container build: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix` - [x] DNS: `mise run dns-up` after merge - [x] Fly cert: `fly certs add shower.eblu.me -a blumeops-proxy` - [ ] Caddy push: `mise run provision-indri -- --tags caddy` - [ ] Fly redeploy to pick up the new nginx block + fail2ban jail: `mise run fly-deploy` - [ ] ArgoCD sync: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging ## Test plan - [ ] Container builds successfully on nix-container-builder runner - [ ] Pod starts, migrations run, gunicorn answers on :8000 - [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean - [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response - [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead` - [ ] Grafana dashboard "Shower APM" appears and starts showing traffic - [ ] `mise run services-check` passes Reviewed-on: #349	2026-05-11 13:47:18 -07:00
Erich Blume	eceb2b99ce	C0: bump homepage image to fixed-perms build (v1.11.0-678f26b-nix) Pulls in `678f26b0` (chowned /app/config). Resolves the EACCES crash loop on ringtail.	2026-05-10 21:16:34 -07:00
Erich Blume	678f26b0e7	C0: fix homepage container /app/config write permissions The previous Dockerfile chowned /app/config to 1000:1000 so the runtime user could seed missing skeleton configs (e.g. proxmox.yaml) and write /app/config/logs. The nix derivation didn't replicate that, so the new amd64 image crashed with EACCES on cold start (fixed-forward — caught during ringtail cutover, ArgoCD #348). Add fakeRootCommands to dockerTools to create /app and /app/config and chown them at build time. The deployment's ConfigMap subPath mounts leave the parent directory as image filesystem, so its ownership has to be set at build time, not at runtime.	2026-05-10 20:49:22 -07:00
Erich Blume	ad7a0ed105	Merge pull request 'C1: migrate homepage dashboard from minikube to ringtail (nix-built amd64)' (#348) from homepage-to-ringtail into main	2026-05-10 20:40:33 -07:00
Erich Blume	be54cc3411	C1: migrate homepage dashboard to ringtail k3s Repoint the ArgoCD Application destination from minikube to ringtail and bump the image tag to the new amd64 nix-built v1.11.0-b87f62e-nix. Rework services.yaml for the autodiscovery shift: 11 services that previously auto-populated via minikube Ingress annotations (ArgoCD, Immich, Kiwix, Mealie, Miniflux, Grafana, Prometheus, Navidrome, Paperless, TeslaMate, Transmission) become explicit static entries with their widget configs preserved. Conversely, the ringtail services that will now auto-populate (Frigate/NVR, Authentik, Ntfy) are removed from the static list to avoid duplicates; Ollama becomes newly visible. Add a Content group for Immich/Kiwix/Miniflux which previously lived under the autodiscovered "Content" group from annotations.	2026-05-10 20:37:03 -07:00
Erich Blume	b87f62e0f5	C1: nix-build homepage container for amd64 ringtail migration Replace Dockerfile (arm64-only, indri-built) with a nix derivation adapted from nixpkgs pkgs/by-name/ho/homepage-dashboard. Built via the nix-container-builder runner on ringtail, producing an amd64 image suitable for k3s. Includes the upstream Next.js file-system-cache patch to avoid prerender cache write failures on a read-only nix store path (nixpkgs issues #328621 and #458494). Pinned to v1.11.0 (current production version).	2026-05-10 20:32:38 -07:00
Erich Blume	8bc19fa460	C0: tailscale main-SHA rebuild for ringtail proxyclass Routine post-squash-merge cleanup. Bumps the ProxyClass image tag from the now-orphaned PR branch SHA (`67af7a8`) to the merge commit SHA (`0108b68`) so the deployed image stays traceable after branch cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 06:52:39 -07:00
Erich Blume	0108b68769	C1: mirror tailscale container locally for ringtail proxyclass (#347) ## Summary Adds the first cut of a local nix build for `docker.io/tailscale/tailscale` and rewires only the ringtail tailscale-operator overlay to use it. Indri's overlay continues pulling upstream — minikube on indri is being decommissioned in favor of ringtail's k3s, so investing in dual-cluster routing here would be wasted churn. ## Changes - `containers/tailscale/default.nix` — `buildGoModule` over `cmd/tailscale`, `cmd/tailscaled`, `cmd/containerboot`; packaged via `dockerTools.buildLayeredImage` with `cacert`, `iptables` (legacy symlink to match upstream Synology compat), `iproute2`, `tzdata`, `busybox`. - `argocd/manifests/tailscale-operator-ringtail/kustomization.yaml` — kustomize `images:` rewrite swapping `docker.io/tailscale/tailscale` → `registry.ops.eblu.me/blumeops/tailscale:v1.94.2-67af7a8-nix`. - `docs/changelog.d/mirror-tailscale-container.infra.md` — fragment. ## Pin rationale v1.94.2 matches `service-versions.yaml:96` and the current ProxyClass exactly — this PR is "make it local," not "upgrade tailscale." Version bumps come as follow-up C0/C1 changes once we decide to test newer (v1.96.x had a Fly-side MagicDNS regression; v1.98.0 is current upstream stable). ## Test plan - [x] Image built successfully on ringtail nix-container-builder (run #528). - [x] Image visible in registry: `registry.ops.eblu.me/blumeops/tailscale:v1.94.2-67af7a8-nix`. - [ ] Deploy from branch: `argocd app set tailscale-operator-ringtail --revision mirror-tailscale-container && argocd app sync tailscale-operator-ringtail`. - [ ] Verify proxy pods restart with new image and existing tailnet ingresses (e.g., authentik, immich, tempo) keep resolving. - [ ] After merge: rebuild on main SHA, update kustomization, run `services-check`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #347	2026-05-06 06:50:31 -07:00
Erich Blume	6f0d80ca1e	C0: doc review — index.md, add ringtail to infra overview Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 06:14:40 -07:00
Erich Blume	39b042e638	C0: service review — caddy v2.11.2 (current latest, healthy) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 06:11:15 -07:00
Erich Blume	24e5490259	C0: review CC init-container-isolation — defer retirement to post-ringtail Runtime grafana pod matches the manifest and the CC's claim; bumped last-reviewed. Noted that retiring init-chown-data in favor of fsGroup alone should wait until grafana migrates to ringtail's k3s, since the storage backend will change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:31:13 -07:00
Erich Blume	074887cd57	C0: docs — explanation article on compliance mute categories Captures the CC vs NA vs RA distinction surfaced during the 2026-05-03 weekly compliance review (CVE-2026-31789), and the image-scan mutelist gap that blocks acting on it. Links the new article from the review-compensating-controls how-to so it isn't orphaned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:19:53 -07:00
Erich Blume	9fb5442ccd	C0: kiwix — doc review, fix Adding Archives source path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:46:16 -07:00
Erich Blume	f16e1c81f1	C0: zot — upgrade indri registry to v2.1.16 Security fixes only (TLS verification on metrics client, CORS Allow-Credentials suppression on wildcard origin, manifest/API-key body-size limits, dependabot bumps). No config changes required; re-built from source on indri and bounced launchagent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:41:07 -07:00

1,011 commits