Commit graph

1,029 commits

Author SHA1 Message Date
6bcbca3ca0 wave-1: neutralize minikube paperless/teslamate/mealie (replicas 0, drop ingress)
These migrated to ringtail. Set replicas: 0 (prevents resurrecting the old
instances and double-writing the now-ringtail-owned databases) and remove the
tailscale Ingress from each (the names tesla/meals/paperless were handed off
to the -ringtail ingresses at cutover; a re-created minikube ingress would
steal them back). Service/PVC/ExternalSecrets retained for rollback. Manifest
deletion + source-DB drop come in a later decommission PR.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:31:43 -07:00
4016a86d3f paperless-ringtail: amd64 valkey (-nix tag) + /data mount for redis sidecar
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:09:19 -07:00
852eaba3ac paperless-ringtail: redis as native sidecar so migrate init can reach it
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:07:23 -07:00
8293f5197c paperless-ringtail: add migrate initContainer (Nix split has no s6 migrate step)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:04:50 -07:00
727818480d teslamate-ringtail: use cookie-fixed image v3.0.0-191be1b-nix
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:01:01 -07:00
191be1b2cf teslamate: keep Erlang release cookie (removeCookie = false)
The Nix mixRelease strips releases/COOKIE by default and expects
RELEASE_COOKIE at runtime, but teslamate's start script reads the file
and crash-loops without it. teslamate is single-node (no distributed
Erlang exposed beyond :4000), so keeping the build-generated cookie in
the release is the simplest self-contained fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:54:21 -07:00
18dc9a143c wave-1 ringtail: app manifests + ArgoCD apps (paperless, teslamate, mealie)
Staging deployments on ringtail k3s, in parallel with the minikube apps
until per-service cutover. Each uses the Nix image built at 1d4cbbf
(paperless v2.20.15, mealie v3.16.0, teslamate v3.0.0, all -nix tags) and
points postgres at the in-cluster ringtail blumeops-pg.

- paperless: redesigned as web/worker/beat/consumer + redis in one pod
  (Nix image has no s6 supervisor); media on a ringtail-suffixed NFS PV
  (needs a sifaka rule for ringtail).
- mealie: single gunicorn; SQLite PVC (local-path) copied at cutover.
- teslamate: stateless; DATABASE_HOST already in-cluster, unchanged.

ArgoCD apps target ringtail (https://ringtail.tail8d86e.ts.net:6443).
Not synced yet; deploy-from-branch + cutover is the next step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:42:22 -07:00
1d4cbbfb84 databases-ringtail: add blumeops-pg cluster for wave-1 (paperless, teslamate)
CNPG Cluster on ringtail to receive the paperless + teslamate databases
migrated off the minikube blumeops-pg via cold pg_dump/pg_restore. Mirrors
the minikube cluster (managed roles eblume/borgmatic/paperless/teslamate,
scram pg_hba) on ringtail's local-path storage, scoped to wave-1 roles
(miniflux + authentik stay put for later waves). Apps reach it in-cluster
at blumeops-pg-rw.databases.svc.cluster.local — same name as on minikube.

Database creation is deferred to cutover: paperless restores into the
bootstrap database; teslamate's DB is created by the eblume superuser at
its cutover (the dump's earthdistance extension is untrusted). The four
ExternalSecrets reuse the same 1Password items as the minikube cluster.
Not yet synced; deploy waits for review. kustomize build verified.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:26:39 -07:00
39686c8a2e teslamate: port container from Dagger to Nix (default.nix)
teslamate is not in nixpkgs, so this is a from-scratch beamPackages
mixRelease: an Elixir/Phoenix release with npm-built assets. Replaces
container.py (+ entrypoint.sh, now inlined as the image Entrypoint).

Pins erlang_27 + elixir_1_18 from the shared nixos-unstable rev (teslamate
needs elixir ~> 1.17; stays off the default OTP 28). Source from the forge
mirror, pinned by the v3.0.0 tag commit. Assets build in-release via npm ci
(esbuild + sass are devDeps; esbuild platform binary is optional) + the
custom node scripts/build.js, then mix phx.digest. ex_cldr locale data is
pre-fetched and pointed at via LOCALES to avoid compile-time GitHub
downloads the build sandbox blocks. Version unchanged (v3.0.0). Build
verified on ringtail (exit 0, ~134 MB image).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 09:09:01 -07:00
4dbd93c4fc mealie: port container from Dockerfile to Nix (default.nix)
Wrap nixpkgs mealie in dockerTools.buildLayeredImage, replacing the
Node+Python Dockerfile build. nixpkgs ships a single `mealie` gunicorn
entrypoint serving the prebuilt frontend, so this is a clean single-
process wrap; the run wrapper mirrors the NixOS module (init_db Alembic
migrations, then gunicorn). DB stays SQLite on the mealie-data PVC.

Self-pins nixos-unstable (stable lags at 3.9.2) for mealie 3.16.0 -- a
forward 4-minor bump from v3.12.0 (the previously-deferred upgrade).
Breaking-change review v3.13-v3.16: no schema breaks, SQLite auto-migrates
forward; remaining changes minor (see service-versions.yaml notes). Source
PVC retained for rollback. Build verified on ringtail (exit 0, assert ok).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 08:44:23 -07:00
43047423c4 paperless: port container from Dockerfile to Nix (default.nix)
Wrap nixpkgs paperless-ngx in dockerTools.buildLayeredImage, replacing
the s6-overlay Dockerfile build. The package bundles the full OCR/imaging
closure (tesseract, qpdf, jbig2enc, unpaper, pngquant, ocrmypdf, pikepdf)
and nltk data, so the image stays lean. Unlike the s6 image, this runs as
four containers on ringtail sharing one image (web/worker/beat/consumer);
the web wrapper mirrors the NixOS module's granian + PYTHONPATH invocation.

Self-pins nixos-unstable (stable lags at 2.19.6) for paperless-ngx 2.20.15
-- a same-minor forward patch bump from the v2.20.13 Dockerfile build.
Build verified on ringtail (nix-build, exit 0, version assert passes).

Also fixes pre-existing shower version drift (service-versions 1.1.2 ->
1.1.3 to match its default.nix) so container-version-check passes; the
paperless service-versions edit widens that check to all containers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 08:39:07 -07:00
944a1570cf docs: add wave-1 ringtail migration runbook + changelog
Docs-first for the C1 migration of paperless, teslamate, and mealie off
minikube-indri (OOM-saturated, kernel OOM-killer thrashing apiserver)
onto k3s-ringtail. Cold, downtime-tolerant cutover; postgres preserved
via dump/restore from a quiesced source, mealie SQLite PVC copied.
Linked as the next chain from [[migrate-immich-to-ringtail]].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 08:38:00 -07:00
40bd929820 C0: remove visible GNU Terry Pratchett from naughty.html body
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 37s
GNU lives in the overhead — the X-Clacks-Overhead header — never on the
visible page. Keep the header, drop the footer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 20:55:05 -07:00
a36a18aaa6 C0: black-hole /mirrors/* at Fly edge + name-and-shame scrapers
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 35s
A $29.60 Fly bill traced to ~1.25 TB/30d egress on forge.eblu.me (99.95% of
all proxy egress), ~71% of it AI scrapers (Meta meta-externalagent, OpenAI
GPTBot, Amazonbot, Bytespider) crawling the public mirror repos' infinite
git-history URL space and timing out Forgejo. robots.txt already disallowed
/mirrors/ but those agents ignore it, so enforce at the edge: return 403 (^~
to beat the regex asset locations), served as a roll-of-dishonour page with an
X-Naughty-Scrapers header. Mirrors stay reachable on the tailnet via
forge.ops.eblu.me. Tier 2 (UA denylist + Anubis) and the Cloudflare rejection
are documented in docs/explanation/ai-scraper-mitigation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 20:52:20 -07:00
e0064de83d C0: update ringtail flake inputs (nixpkgs, disko)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 15:52:09 -07:00
f588638331 C0: rebuild valkey from squashed main commit
Image tags from PR #362 (v8.1.7-02859c5{,-nix}) referenced a branch
SHA that no longer exists on main after squash-merge. Rebuilt both
the dagger arm64 and nix amd64 variants from the squashed commit
(ecded30) and updated paperless + immich-ringtail to the new tags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 14:53:21 -07:00
ecded30073 Make valkey local on ringtail (nix amd64) + bump to 8.1.7 (#362)
## Summary

Weekly "make one non-local container local" pickup: immich-ringtail still pulled `docker.io/valkey/valkey:8.1.6` because the existing `containers/valkey/container.py` build was arm64-only.

- Adds `containers/valkey/default.nix` — nix-built amd64 valkey image, packaged by the ringtail nix-container-builder runner using `pkgs.dockerTools.buildLayeredImage`. Mirrors the existing `containers/authentik-redis/default.nix` pattern.
- `containers/valkey/container.py` keeps building the Alpine arm64 image for paperless on indri. Bumped both builds to upstream valkey 8.1.7 (Alpine 3.22 now ships `8.1.7-r0`; nixpkgs has 8.1.7).
- Splits `VERSION` (upstream app) from `ALPINE_PIN` (apk pin) in `container.py` so both build files can declare the same upstream version and pass `container-version-check`.
- Updates `service-versions.yaml`: current-version 8.1.7, refreshed last-reviewed, upstream-source now points at the canonical valkey-io releases page.
- Switches kustomizations:
  - `immich-ringtail/kustomization.yaml`: `docker.io/valkey/valkey:8.1.6` → `registry.ops.eblu.me/blumeops/valkey:v8.1.7-02859c5-nix`, comment updated.
  - `paperless/kustomization.yaml`: `v8.1.6-r0-fabca04` → `v8.1.7-02859c5`.

## Build

build-container run #563: both jobs succeeded after a transient runner crash on the first dispatch (#562 build-nix), which surfaced two separate bugs whose fixes landed in a separate C0 on main:

- `runner-logs` silently returned 0 with no output when the log file didn't exist on indri
- `ssh indri` swallowed remote exit codes (fish login shell), which the wrapper now works around via a stdout marker

## Test plan

- [ ] `argocd app set immich-ringtail --revision valkey-nix && argocd app sync immich-ringtail`
- [ ] `argocd app set paperless --revision valkey-nix && argocd app sync paperless`
- [ ] Both valkey pods come Ready and start serving on :6379
- [ ] Immich app + paperless can read/write their respective cache
- [ ] After merge: rebuild from squashed main commit + update kustomization tags (squash-tag follow-up)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #362
2026-05-28 14:51:09 -07:00
1ce381cb6e C0: surface missing-log failures in runner-logs
`mise run runner-logs <run> -j <n>` previously silently succeeded with
no output when forgejo had no log for the task. Two layered causes:

1. zstdcat exits 0 even when the file is missing (writes "can't stat
   … -- ignored" to stderr).
2. ssh to indri runs fish, which silently drops the remote exit code so
   the subprocess returncode is always 0.

Probe `test -f` over SSH and parse a stdout marker (EXISTS / MISSING) to
detect the missing-log case, then report it explicitly with the indri
path and a hint about action_task.log_in_storage = 0 so the operator
knows where to look next.
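
A minimal sketch of the stdout-marker probe, not the actual task code. The
function names are hypothetical, and wrapping the test in `sh -c` assumes a
POSIX sh is available on the remote host so the fish login shell never has
to interpret the `&&`/`||` chain.

```python
import shlex
import subprocess


def build_probe_command(host: str, path: str) -> list[str]:
    """Build an ssh argv that prints EXISTS or MISSING for a remote file."""
    # Assumption: run the test under sh, not the remote login shell.
    script = f"test -f {shlex.quote(path)} && echo EXISTS || echo MISSING"
    return ["ssh", host, "sh -c " + shlex.quote(script)]


def parse_probe_output(stdout: str) -> bool:
    """Return True for EXISTS, False for MISSING; raise if neither marker is seen."""
    lines = [line.strip() for line in stdout.splitlines() if line.strip()]
    if "EXISTS" in lines:
        return True
    if "MISSING" in lines:
        return False
    raise RuntimeError(f"no EXISTS/MISSING marker in probe output: {stdout!r}")


def remote_log_exists(host: str, path: str) -> bool:
    """Probe over ssh; the exit code is ignored because it is unreliable here."""
    result = subprocess.run(
        build_probe_command(host, path),
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_probe_output(result.stdout)
```

Deciding from the marker rather than the return code means the check still
works when the remote shell reports 0 regardless of what happened.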

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 14:36:33 -07:00
e703d25efe C0: rebuild unpoller container from squashed main commit
Image was previously tagged with the unpoller-v3 branch SHA (1b27242),
which doesn't exist in main's history after squash-merge. Rebuilt from
the squashed commit so the tag references a reachable commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 10:10:21 -07:00
4d1f4af25b Upgrade unpoller v2.34.0 → v3.2.0, migrate to container.py (#361)
## Summary

- Service Review pickup: unpoller (last reviewed 73 days ago).
- Upgrades unpoller from v2.34.0 to v3.2.0 (major version bump).
- Migrates the container build from a Dockerfile to a native Dagger pipeline (`containers/unpoller/container.py`) following the navidrome / miniflux pattern.
- Refreshes `service-versions.yaml` (last-reviewed, current-version).

## Breaking changes (upstream)

- **v3.0.0** — UniFi network API shifts (later 10.x). Some metric / event / log names and labels may have changed. Worth a follow-up sweep of the unpoller Grafana dashboard for missing series.
- **v3.2.0** — defaults to a 60s background poll feeding cached Prometheus scrapes (was on-demand poll per scrape). To restore previous behavior, set `interval = 0` in `up.conf`. Leaving the new default in this PR — every-15s scrapes will simply serve from cache, which is fine for our use.

## Build

- Image: `registry.ops.eblu.me/blumeops/unpoller:v3.2.0-1b27242`
- Built by build-container workflow run #559 from this branch.

## Test plan

- [ ] `argocd app set unpoller --revision unpoller-v3 && argocd app sync unpoller`
- [ ] Pod comes Ready
- [ ] Verify metrics exported (`Site/Client/UAP/USG/USW` counts in logs, `unpoller_*` series in Prometheus)
- [ ] Spot-check unpoller Grafana dashboard for missing series after the v3 API shift
- [ ] After merge: `argocd app set unpoller --revision main && argocd app sync unpoller`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #361
2026-05-28 09:59:46 -07:00
f6febb1f77 C0: switch fly proxy deploy strategy to immediate
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 34s
Bluegreen kept timing out — the new green machine couldn't reach
"started" within Fly's 5-minute deploy budget. The cold-start sequence
(tailscaled → tailscale up → wait-for-MagicDNS → nginx startup) eats
most of that, leaving no headroom for healthcheck propagation.

For a single-machine proxy, bluegreen offers little benefit anyway:
no warm second instance, so trading 5-10s of downtime for predictable
completion is the right call.
2026-05-28 07:59:22 -07:00
4e25180b0a C0: clone blumeops via tailnet on ringtail provision
Switch ringtail.yml from forge.eblu.me (Fly proxy, WAN) to
forge.ops.eblu.me (Caddy on indri, tailnet). Ringtail is always
on the tailnet — the WAN round-trip was overhead and made
provision-ringtail fail any time Fly was slow or down.
2026-05-28 07:13:40 -07:00
c00d7db507 Recurring maintenance batch (2026-05-27) (#360)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 14m10s
Bundle of recurring overdue tasks:

- Ringtail flake update
- Security & compliance report review
- Tooling deps bump (prek, fly, mise, forgejo workflows)
- Top stale doc review
- Top stale service review (if trivial)

Larger items (service version bumps requiring upgrades, non-local container migration) split out as separate PRs.

Reviewed-on: #360
2026-05-28 06:01:57 -07:00
Erich Blume
753fa9cb63 C0: disable VRR on ringtail DP-1 to stop OMEN panel flicker
The OMEN 27i IPS pumps brightness when its refresh swings into the low
VRR range during low-framerate content (game cutscenes), producing a
~20Hz flicker that compounds over a session until a reboot. GPU health
is clean (no Xid/ECC/thermal); pinning fixed 165Hz eliminates it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:59:29 -07:00
Erich Blume
c09bd5b612 C0: cap systemd-coredump on ringtail to stop game-crash lockups
Wine/Proton game segfaults (e.g. Diablo IV) produced multi-GB cores that
systemd-coredump spent minutes compressing to disk, pinning the CPU and
freezing the desktop. Cap ProcessSizeMax/ExternalSizeMax at 1G (oversized
cores logged but skipped) and MaxUse at 2G to bound the store.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:54:32 -07:00
35ae171783 C0: fix sync button location in manage-forgejo-mirrors
The verify step pointed to the main repo page, but the "Synchronize now"
button is in the Mirror settings section of the settings page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:15:07 -07:00
57fd88b269 C0: fix op item edit syntax in zot key rotation
The pbpaste | op item edit ... "field[password]=-" stdin syntax is
rejected by op 2.34 as "invalid JSON" — recent op versions treat
piped input as a full JSON template, not a single field value.
Procedure now uses an inline assignment via a local fish variable.
2026-05-22 21:50:43 -07:00
08a1cb164a C0: fix 1password export filename in backup how-to
1Password's desktop app names exports as
1PasswordExport-<uuid>-<timestamp>.1pux automatically — you can't
choose the name. Procedure now points the task at that glob.
2026-05-22 21:36:13 -07:00
d02bf062af C0: review 1password reference card
Added vault split (blumeops vs Personal), noted onepassword-connect
runs on both indri and ringtail, and lifted op CLI guidance from
agent memory into the card. Bumped last-reviewed.
2026-05-22 21:29:11 -07:00
ee51bcafb4 Rip out compensating-controls framework (#359)
## Summary

Removes the compensating-controls (CC) framework. Prowler and Kingfisher continue to run weekly and produce reports; the Prowler mutelist YAML files stay in place but no longer carry `CC: <id>` prefixes — each entry now just keeps a free-form `Description` of why it's muted.

The CC review cadence proved to be more process overhead than this single-operator homelab needed.

## What changed

**Deleted**
- `compensating-controls.yaml` — the CC registry
- `mise-tasks/review-compensating-controls` — the staleness-review task
- `docs/how-to/operations/review-compensating-controls.md`
- `docs/how-to/operations/record-review-evidence.md` (was aspirational)
- `docs/explanation/compliance-mute-categories.md` (proposed-future CC/NA/RA work)
- 5 orphan `+review-cc-*` / `+compliance-mute-categories` changelog fragments

**Modified**
- 6 mutelist YAML files: stripped `CC: <id>.` prefix from every `Description` / `statement` field, kept the free-form text
- `mise-tasks/review-compliance-reports`: removed CC mentions from docstrings, panel text, and the node-verification table title. Node-verification logic itself is unchanged.
- `docs/reference/operations/security.md`: removed the "Compensating controls" section
- `docs/how-to/operations/read-compliance-reports.md`: rewrote step 3 of "Acting on findings" to point at the mutelist YAML directly
- `docs/changelog.d/prowler-iac-mutelist.infra.md`: rewrote to drop the "two new compensating controls" framing

## What did not change

- All Prowler manifests (cronjobs, RBAC, PVs, kustomization) — scans still run on the same schedule
- The Kingfisher deployment
- The trivy-shim in the Prowler container — that's about Trivy ignorefile plumbing, independent of the CC concept
- The mutelist entries themselves — each `Resources` list is unchanged; only the prose of `Description` was edited
- `CHANGELOG.md` — historical releases are left as-is

## Test plan

- [ ] Wait for human review before deploying — once merged, re-point ArgoCD: `argocd app set prowler --revision main && argocd app sync prowler` (no manifest changes besides the ConfigMap, so impact is limited to muted-finding descriptions in next week's report)
- [ ] Confirm next weekly Prowler K8s CIS run (Sunday 3am) still completes and produces a report on sifaka
- [ ] Confirm next weekly Prowler IaC run still honors `trivyignore.yaml` (the trivy shim is untouched but the ignorefile content was rewritten)
- [ ] `mise run review-compliance-reports` — verify node-verification block still runs and prints the renamed table title

Reviewed-on: #359
2026-05-22 21:08:53 -07:00
2fae0f7161 C0: switch grafana deployment to Recreate strategy
Grafana uses an RWO PVC for SQLite + Bleve search index. RollingUpdate
spawns the new pod before terminating the old one, so the new pod
crashloops on the index lock until rollout timeout. Recreate terminates
the old pod first, letting the new pod acquire the lock cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 06:33:26 -07:00
1897eb1c5b C0: move immich blackbox probe to ringtail alloy
Immich migrated to ringtail's k3s cluster but the probe still targeted
the in-cluster service DNS on indri's minikube, firing ServiceProbeFailure
indefinitely. Moved the target into alloy-ringtail's config so the probe
runs in the cluster where immich actually lives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 08:46:22 -07:00
e222d47d45 C0: deploy shower v1.1.3 (kustomize newTag bump)
Image v1.1.3-3645098-nix was built directly on ringtail and pushed via
skopeo, bypassing the Forgejo runner: indri was severely overloaded
(load avg 24.92, minikube VM at 344% CPU) and the workflow-dispatch
endpoint timed out. The image content is identical to what the runner
would have produced — same default.nix at commit 3645098 (on main),
same NIX_PATH (current nixpkgs flake), same skopeo invocation. Tag
short-sha matches the commit that defines the recipe so we aren't
pinning to a ghost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 20:09:54 -07:00
3645098bf1 C0: bump shower to v1.1.3
Wheel/sdist + FOD hashes probed on ringtail. Full nix-build verified
end-to-end before commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 19:57:37 -07:00
Erich Blume
96dbbb3cbe C0: add sn2-prelaunch wrapper to clear SN2 stale lockfiles
UE5 writes Saved/running.dat as a "session in progress" marker. If
the previous session exited uncleanly (SIGKILL, crash), it lingers,
and SN2 pops up an invisible 0×0 Error dialog at next launch that
the GameThread blocks on forever — visible only as a black screen
with a spinning loader. Wrap the Steam command to clear the marker
files before each launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 12:26:10 -07:00
815a0cc6e6 C0: shower — rebuild from main SHA (post-merge retag)
PR #358 was squash-merged so the branch commit b8c7783 baked into the
prior image tag isn't reachable from main's history. Rebuild from main
HEAD (a33fa47) and retag. Image content is byte-identical (FOD is
content-addressed, inputs unchanged); only the SHA in the tag changes
so future provenance tracing stays on main.
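
A hedged sketch of how the "tag SHA must be reachable from main" rule could
be checked. The tag layout (`v<version>-<short-sha>[-nix]`) is inferred from
the tags quoted in this log, and the helper names are hypothetical.
`git merge-base --is-ancestor` exits 0 only when the commit is an ancestor
of the branch.

```python
import re
import subprocess
from typing import NamedTuple, Optional


class ImageTag(NamedTuple):
    version: str
    short_sha: str
    variant: Optional[str]


# Assumed layout: v<version>-<7 hex chars>, optionally followed by -nix.
_TAG = re.compile(r"^v(?P<version>.+)-(?P<sha>[0-9a-f]{7})(?:-(?P<variant>nix))?$")


def parse_tag(tag: str) -> ImageTag:
    """Split an image tag into upstream version, short SHA, and variant."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognized image tag: {tag!r}")
    return ImageTag(match["version"], match["sha"], match["variant"])


def reachable_from(short_sha: str, branch: str = "main", repo: str = ".") -> bool:
    """True if the commit is an ancestor of `branch` in the given checkout."""
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", short_sha, branch],
        capture_output=True,
        check=False,
    )
    # A non-zero exit covers both "not an ancestor" and "unknown commit".
    return result.returncode == 0
```

For example, `parse_tag("v1.1.2-b8c7783-nix").short_sha` yields `b8c7783`,
which `reachable_from` would reject after a squash-merge drops that commit.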

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 06:57:24 -07:00
a33fa47b80 C1: deploy shower v1.1.2 (#358)
## Summary

Deploys `adelaide-baby-shower-app` **v1.1.2** to ringtail k3s.

- Bumps `containers/shower/default.nix` `version` to 1.1.2.
- Refreshes sdist + wheel `fetchurl` hashes against the forge PyPI artifacts.
- Re-probed FOD `outputHash` on the nix-container-builder runner (ringtail) and pinned the new closure hash.
- Bumps kustomize `newTag` to `v1.1.2-b8c7783-nix` (built from this branch's tip).
- Bumps `service-versions.yaml` entry for shower to `1.1.2` / `last-reviewed: 2026-05-15`.

## Build provenance

Built by Forgejo Actions run #553 on `nix-container-builder` (ringtail) at commit `b8c7783`. After merge a C0 follow-on will rebuild from main and retag so future provenance points at main history.

## Test plan

- [ ] `argocd app set shower --revision shower-v1.1.2 && argocd app sync shower` deploys cleanly
- [ ] Pod migrates the SQLite PV and serves at `shower.ops.eblu.me` / `shower.eblu.me`
- [ ] No new errors in pod logs after `collectstatic` + gunicorn boot

Reviewed-on: #358
2026-05-15 06:50:46 -07:00
Erich Blume
12314857d8 C0: add GE-Proton to ringtail Steam extraCompatPackages
Lets Subnautica 2 (and any other game) opt into the GE-Proton
build via Steam's per-game compatibility tool override, as a
workaround for the Proton Experimental + DXVK D3D12 Mercuna hang.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 06:27:43 -07:00
4d2bc9975f C0: deploy shower v1.1.1 (kustomize newTag bump)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:51:10 -07:00
4e117dc921 C0: pin shower v1.1.1 FOD outputHash (probed on ringtail)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:40:22 -07:00
6e90c4c363 C0: bump shower to v1.1.1 (probe FOD hash)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:12:00 -07:00
dc69b8c68b C1: fix borgmatic shower SQLite dump (ssh to ringtail) (#357)
## Summary

Nightly borgmatic backups have been failing for 2 days. Root cause: the
shower SQLite dump `before_backup` hook (added in PR #349) referenced
`kubectl --context=k3s-ringtail`, but indri's kubeconfig deliberately
doesn't carry the ringtail credentials. The hook's failure aborted the
entire run, taking out *both* the local sifaka repo and the BorgBase
offsite. Verified the last good archive was `indri-2026-05-11T02:00`.

## Approach

ssh into ringtail and run `k3s kubectl` there — no indri-side
kubeconfig needed. `/etc/rancher/k3s/k3s.yaml` is mode 644 so no sudo
required, and the existing ssh access from indri to ringtail works.

Inline-shell quoting got hairy fast (fish on ringtail rejected `POD=...`
bash syntax; the nix shower image lacks `tar` so `kubectl cp` fails).
Pulled the dump logic into `~/bin/borgmatic-k8s-sqlite-dump`, deployed
by the ansible role. Each dump entry now declares a `target`:

- `local:<context>` — local kubectl with explicit context (mealie)
- `ssh:<user@host>` — ssh + `k3s kubectl` on the cluster host (shower)

Bytes come back via `kubectl exec ... -- cat` instead of `kubectl cp`
since `cp` needs `tar` in the pod (nix-built containers don't bundle it).
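
The target scheme can be sketched as follows. This is an illustration in
Python, not the deployed `borgmatic-k8s-sqlite-dump` helper; the function
names, namespace, pod, and database path are hypothetical.

```python
def kubectl_prefix(target: str) -> list[str]:
    """Map a dump target to the argv prefix that reaches kubectl."""
    kind, sep, value = target.partition(":")
    if not sep or not value:
        raise ValueError(f"target must look like local:<context> or ssh:<user@host>: {target!r}")
    if kind == "local":
        # Local kubectl with an explicit context.
        return ["kubectl", f"--context={value}"]
    if kind == "ssh":
        # k3s ships its own kubectl on the cluster host.
        return ["ssh", value, "k3s", "kubectl"]
    raise ValueError(f"unknown target kind: {kind!r}")


def dump_command(target: str, namespace: str, pod: str, db_path: str) -> list[str]:
    """Build the argv that streams the SQLite file to stdout via `cat`."""
    # `kubectl exec ... -- cat` avoids `kubectl cp`, which needs tar in the pod.
    return kubectl_prefix(target) + ["exec", "-n", namespace, pod, "--", "cat", db_path]
```

Note that in the `ssh:` case the arguments are re-joined and interpreted by
the remote shell, so this sketch assumes names without spaces or shell
metacharacters.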

## Test plan

- [x] `mise run provision-indri -- --tags borgmatic --check --diff` shows expected diff
- [x] Apply, helper script deployed at `~/bin/borgmatic-k8s-sqlite-dump`
- [x] Helper invoked directly with `ssh:eblume@ringtail` produces a valid 288 KB SQLite file
- [x] Full `borgmatic create` completes without errors — both mealie.db (1.7 MB) and shower.db (288 KB) appear in `~/.local/share/borgmatic/k8s-dumps/`, archive `indri-2026-05-13T17:31:02` written to sifaka borg repo

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #357
2026-05-13 18:55:50 -07:00
947e4310c3 C2: migrate immich from minikube to ringtail (mikado chain) (#356)
## Summary

C2 Mikado chain to move the entire Immich stack (server, ML, valkey,
postgres) off `minikube-indri` and onto `k3s-ringtail`. Immich is the
largest single tenant on minikube (~1.5 GiB resident) and minikube is
currently memory-saturated (97% RAM, swapping). This is the first
concrete chain in the broader indri-k8s decommission effort.

This PR contains the planning layer only — 7 cards (1 goal + 6
prerequisites). Implementation cycles follow per the Mikado Branch
Invariant.

## Goal end-state

- Immich `server`, `machine-learning`, `valkey` on ringtail.
- ML pod uses ringtail's RTX 4080 (performance win — currently
  CPU-only).
- CNPG `immich-pg` (PG17 + VectorChord) runs on ringtail.
- Library still on sifaka NFS — ringtail mounts the same path.
- `photos.ops.eblu.me` reroutes through Caddy → ringtail ingress.
- Minikube `immich` and `immich-pg` are removed.

## Cards

| Card | Depends on |
|---|---|
| `migrate-immich-to-ringtail` (goal) | all six below |
| `cnpg-on-ringtail` | — |
| `immich-pg-on-ringtail` | cnpg-on-ringtail |
| `immich-pg-data-migration` | immich-pg-on-ringtail |
| `sifaka-nfs-from-ringtail` | — |
| `immich-app-on-ringtail` | immich-pg-on-ringtail, sifaka-nfs-from-ringtail |
| `immich-cutover-and-decommission` | immich-pg-data-migration, immich-app-on-ringtail |
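
One way to sanity-check the table is to derive an implementation order from
it. The sketch below uses Python's standard `graphlib`; it is illustrative
only and is not how `mise run docs-mikado` is implemented.

```python
from graphlib import TopologicalSorter

# Card -> set of cards it depends on, transcribed from the table above.
CARDS = {
    "migrate-immich-to-ringtail": {
        "cnpg-on-ringtail",
        "immich-pg-on-ringtail",
        "immich-pg-data-migration",
        "sifaka-nfs-from-ringtail",
        "immich-app-on-ringtail",
        "immich-cutover-and-decommission",
    },
    "cnpg-on-ringtail": set(),
    "immich-pg-on-ringtail": {"cnpg-on-ringtail"},
    "immich-pg-data-migration": {"immich-pg-on-ringtail"},
    "sifaka-nfs-from-ringtail": set(),
    "immich-app-on-ringtail": {"immich-pg-on-ringtail", "sifaka-nfs-from-ringtail"},
    "immich-cutover-and-decommission": {
        "immich-pg-data-migration",
        "immich-app-on-ringtail",
    },
}


def implementation_order(cards: dict[str, set[str]]) -> list[str]:
    """Return the cards with every prerequisite ahead of its dependents."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(cards).static_order())
```

The two cards with no prerequisites (`cnpg-on-ringtail` and
`sifaka-nfs-from-ringtail`) can start in either order; the goal card always
comes last.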

## Key constraints

- **No data loss.** Downtime is acceptable; data loss is not. Two
  surfaces matter: postgres (ML embeddings, face data — slow to
  re-derive) and the library files (don't move, but NFS access from
  ringtail must be verified).
- **Migration method:** Option A is a CNPG `externalCluster`
  basebackup → promote. Option B is `pg_dump`/`pg_restore` as a
  documented fallback. Either way, dry-run against a scratch
  cluster first.
- **Why pg moves too** (not cross-cluster): keeping pg on minikube
  would block the whole decommission, and Immich is chatty with pg
  so tailnet round-trips would hurt.

## Test plan

- [ ] Plan review — does the dependency graph make sense?
- [ ] `mise run docs-mikado migrate-immich-to-ringtail` shows the
      chain correctly.
- [ ] Per-card implementation cycles land separately (commit
      convention enforced by hook).

Reviewed-on: #356
2026-05-13 16:46:17 -07:00
bc8ceb502b Merge pull request 'C1: pin ringtail wired IP to 192.168.1.21 (static)' (#355) from ringtail-static-ip into main
2026-05-12 09:59:59 -07:00
a4a30aad44 fix(ringtail): explicitly enable net.ipv4.ip_forward
After the static IP change, k3s/flannel pod networking broke because
ip_forward was 0. NixOS doesn't enable IP forwarding by default — it
was previously being set implicitly somewhere in the NM-managed /
scripted-DHCP path. With static networking we have to set it ourselves.

Verified at runtime via sysctl -w before adding here; pod outbound
came back immediately and Tailscale VIP services recovered without
any pod restarts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 09:51:16 -07:00
d0b5423135 C1: pin ringtail wired IP to 192.168.1.21 (static)
Removes DHCP lease renewal as a failure mode on ringtail after an outage
on 2026-05-12 where the IP and routes silently disappeared from enp5s0
without any kernel link event. NetworkManager stays enabled for wireless
fallback but no longer manages the wired interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 09:33:57 -07:00
dc0916a548 C0: shower — rebuild from main SHA (post-merge retag)
PR #354 was squash-merged so the branch commit 444ff91 baked into the
prior image tag isn't reachable from main's history. Rebuild from main
HEAD (3c7967e) and retag. Image content is byte-identical (FOD is
content-addressed, inputs unchanged); only the SHA in the tag changes
so future provenance tracing stays on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 20:20:39 -07:00
3c7967e445 C1: deploy shower v1.1.0 (phases + guest memories) (#354)
## Summary

Deploys `adelaide-baby-shower-app` **v1.1.0** to ringtail k3s.

### App changes (since v1.0.2)

- **Four-phase `ShowerState`** replaces the boolean `locked` flag — `pre_event` → `party` → `prizes_locked` → `event_locked` — with a backfill migration that maps `locked=True → pre_event`, `locked=False → party`.
- **Guest memories**: append-only photos + comments panel where guests can leave notes for the baby. Adds `GuestPhoto` + `GuestComment` models with file-extension validators and a max-size validator; new `shower.imaging` module for thumbnail generation.
- **Admin + QR polish**: configurable host link, fixed "View Site" URL, guest-facing QR copy improvements, contest tweaks.
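
For reference, the phase sequence and the backfill rule from the first
bullet can be sketched in plain Python. This is not the app's Django model
or migration; the `Phase` enum and `backfill_phase` names are hypothetical.

```python
from enum import Enum


class Phase(str, Enum):
    PRE_EVENT = "pre_event"
    PARTY = "party"
    PRIZES_LOCKED = "prizes_locked"
    EVENT_LOCKED = "event_locked"


# Sequence as stated: pre_event -> party -> prizes_locked -> event_locked.
PHASE_ORDER = (Phase.PRE_EVENT, Phase.PARTY, Phase.PRIZES_LOCKED, Phase.EVENT_LOCKED)


def backfill_phase(locked: bool) -> Phase:
    """Map the legacy boolean to a phase: True -> pre_event, False -> party."""
    return Phase.PRE_EVENT if locked else Phase.PARTY
```

Only the first two phases are reachable from the backfill, since the old
flag carried a single bit of state.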

Three Django migrations run automatically in the entrypoint against the SQLite PV:
- `0009_shower_phase`
- `0010_guest_memories`
- `0011_book_description`

No ConfigMap / env-var changes. The deploy uses `strategy: Recreate` with a single replica, so the old pod releases the data PVC before the new one mounts it and runs migrations.

### Container build changes

The v1.1.0 tag exposed a latent issue with the Forgejo PyPI install path:

- The recent commit [2d38418e](2d38418e) closed the forge package leak at the Fly edge by blocking `/api/packages/*` publicly.
- Forgejo's PyPI simple index returns absolute file URLs hardcoded to its public `ROOT_URL` (`forge.eblu.me`), so pip-installing from the tailnet index URL still tries to download from `forge.eblu.me` → 403.
- Previous shower builds escaped this because their FOD outputs were already in the nix store; bumping to a new version forced a fresh pip run that hit the block.

Fix mirrors what we already do for the sdist: both wheel and sdist are pulled via direct `fetchurl` against `forge.ops.eblu.me`, then the wheel is copied to TMPDIR under its clean filename (nix store path's hash prefix breaks pip's wheel-filename parser) and handed to pip as a local path. The forge `--extra-index-url` is no longer needed.
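
The filename cleanup can be illustrated as follows. This is a Python sketch
of the idea, not the shell in `default.nix`; it assumes the usual
`/nix/store/<32-character hash>-<name>` layout, and the helper names are
hypothetical.

```python
import re
import shutil
from pathlib import Path

# Assumed store layout: a 32-character hash, a dash, then the real filename.
_STORE_PREFIX = re.compile(r"^[0-9a-z]{32}-")


def clean_wheel_name(store_path: str) -> str:
    """Drop the store hash prefix so the basename is a plain wheel filename."""
    return _STORE_PREFIX.sub("", Path(store_path).name)


def stage_wheel(store_path: str, tmpdir: str) -> Path:
    """Copy the wheel into tmpdir under its clean name and return the new path."""
    dest = Path(tmpdir) / clean_wheel_name(store_path)
    shutil.copyfile(store_path, dest)
    return dest
```

The staged path can then be passed to pip as a local file, which is why the
extra index URL is no longer needed.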

FOD outputHash pinned to `sha256-kTNOswobtkgyQmmqbQM8XO4vvaGg57nCuuZGbNXb0NM=` from run 547. Image: `registry.ops.eblu.me/blumeops/shower:v1.1.0-444ff91-nix`.

### Adjacent finding (already handled)

The ringtail `gitea-runner-nix_container_builder` systemd unit was left `inactive` after the recent `provision-ringtail` (matches the known `sshd-restart-hangs-mux` lesson — the rebuild changed the unit's PATH closure + config.yaml, systemd stopped it, then the playbook hung before the activation could restart it). Manually started; the existing memory `lesson_provision_ringtail_ssh_hang.md` was extended to mention the runner as the canary service to check after provisions.

## Test plan

- [ ] `argocd app diff shower --revision shower-v1.1.0` — review the manifest change
- [ ] `argocd app set shower --revision shower-v1.1.0 && argocd app sync shower`
- [ ] `kubectl --context=k3s-ringtail logs -n shower deploy/shower` — confirm migrations 0009/0010/0011 applied, no errors
- [ ] Hit `https://shower.ops.eblu.me/` (tailnet) — splash page renders, phase indicator visible
- [ ] Hit `https://shower.ops.eblu.me/host/` — host console loads, phase dropdown shows the four states
- [ ] Hit `https://shower.eblu.me/` (public via Fly) — splash page still served
- [ ] After merge: `argocd app set shower --revision main && argocd app sync shower`

Reviewed-on: #354
2026-05-11 20:08:03 -07:00
fbc1f7720e C0: gitignore .claude/scheduled_tasks.lock
Transient lock file written by the ScheduleWakeup harness tool when
Claude paces its own work between long-running operations. Not config,
not state worth checking in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:37:29 -07:00
4133785119 C1: ringtail — weekly flake.lock update (#352)
## Summary
- Recurring weekly lockfile refresh for `nixos/ringtail/flake.lock`.
- Inputs updated: `disko`, `home-manager`, `nixpkgs`.
- `nixpkgs-services` was deliberately skipped (per overlay convention — pinned services bump only on intentional update).
- Generated via `dagger call flake-update --src=. --flake-path=nixos/ringtail`.

## Test plan
- [x] `prek` hooks pass
- [ ] After merge: `mise run provision-ringtail` to deploy
- [ ] Then check for kernel update per [[manage-lockfile]]

## Notes
- Not deployed from this PR — provisioning is a follow-up.

Reviewed-on: #352
2026-05-11 16:13:07 -07:00