Commit graph

1,003 commits

Author SHA1 Message Date
aad76fc3e0 C2(migrate-immich-to-ringtail): impl rename ringtail immich ingress photos-ringtail -> photos
Minikube immich-tailscale Ingress was deleted; the "photos" Tailscale
device name is now free. Renaming the ringtail ingress claims it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:27:04 -07:00
18e6c7ef5d C2(migrate-immich-to-ringtail): close immich-app-on-ringtail
All three pods Running, 1/1 Ready:
- immich-server: v2.6.3, connected to ringtail pg + valkey
  ("/api/server/ping" returns 200, "/api/server/version" returns
  v2.6.3)
- immich-machine-learning: CUDA variant, RTX 4080 attached
  (nvidia-smi shows 8 GiB used / 16 GiB total — shared with
  frigate via time-slicing), gunicorn workers booted
- immich-valkey: upstream multi-arch docker.io/valkey/valkey:8.1.6

immich-db Secret in the immich namespace created manually with
source's immich-pg-app password (matches minikube pattern).
Tailscale ingress staging hostname: photos-ringtail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:23:24 -07:00
5a9596c7d9 C2(migrate-immich-to-ringtail): impl add immich Deployments + bump GPU time-slicing
- argocd/manifests/immich-ringtail/: full port of the immich stack
  (server, ML, valkey, services, ingress, pvc-ml-cache) from
  argocd/manifests/immich/, with ringtail-specific tweaks:
  - deployment-ml: runtimeClassName=nvidia, nvidia.com/gpu:1 limit,
    -cuda image tag
  - deployment-valkey + kustomization: drop the
    registry.ops.eblu.me/blumeops/valkey mirror (arm64-only), use
    upstream docker.io/valkey/valkey:8.1.6 (multi-arch)
  - ingress-tailscale: tls.hosts=[photos-ringtail] for staging
- argocd/apps/immich-ringtail.yaml: new ArgoCD app (manual sync,
  ringtail destination)
- argocd/manifests/nvidia-device-plugin/time-slicing-config.yaml:
  bump replicas 2 -> 4 so the ringtail GPU can be shared by
  frigate + ollama + immich-ml

The immich-db Secret in the immich namespace is created manually
(matching minikube pattern) — see argocd/apps/immich-ringtail.yaml
header for the procedure.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:14:07 -07:00
674ca2ced9 C2(migrate-immich-to-ringtail): close immich-pg-data-migration
Migration via CNPG pg_basebackup (Option A) completed cleanly.

Sequence:
1. Stopped immich-server + immich-machine-learning on minikube
   (scaled to 0). valkey + source pg kept running.
2. Copied minikube's immich-pg-ca + immich-pg-replication secrets
   to ringtail as source-immich-pg-{ca,replication}.
3. Recreated the ringtail immich-pg Cluster with
   bootstrap.pg_basebackup, replica.enabled=true, externalClusters
   pointing at immich-pg.tail8d86e.ts.net via the streaming_replica
   TLS cert.
4. Basebackup completed in ~50s. Replica caught up streaming.
5. Verified row counts identical between source and replica:
   asset=12681, user=1, album=28, smart_search=9624,
   activity=0, asset_face=3917.
6. Promoted via replica.enabled=false. pg_is_in_recovery → false.
   Write test passed. All 7 expected extensions present in immich
   db (vector, vchord, cube, earthdistance, pg_trgm, unaccent,
   uuid-ossp).
7. Pruned bootstrap + externalClusters blocks; deleted out-of-band
   replication secrets.

Source minikube immich-pg is intact and untouched — recovery path
remains available until immich-cutover-and-decommission completes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:21 -07:00
e59bbc9348 C2(migrate-immich-to-ringtail): impl prune externalClusters + bootstrap from immich-pg manifest
Migration done, cluster promoted. Pruning the externalClusters block
and bootstrap.pg_basebackup reference eliminates the footgun where a
future replica.enabled=true would demote this primary against the
stale minikube source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:21 -07:00
431d538ab1 C2(migrate-immich-to-ringtail): impl promote ringtail immich-pg from replica to primary
Row counts verified equal between source (minikube) and replica
(ringtail) across asset (12681), user (1), album (28),
smart_search (9624), activity (0), asset_face (3917). Source immich
is scaled to 0 — no writes since the basebackup completed.

Flipping replica.enabled=false to promote. The externalClusters and
bootstrap.pg_basebackup blocks are left in place as documentation
(CNPG ignores them after initialization).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:21 -07:00
5752f00343 C2(migrate-immich-to-ringtail): impl bootstrap immich-pg via pg_basebackup from minikube
Replaces the initdb bootstrap with a pg_basebackup from the minikube
source over the tailnet (immich-pg.tail8d86e.ts.net). The ringtail
cluster starts in replica mode (replica.enabled=true), streaming WAL
from the source. Promotion happens by flipping replica.enabled=false
after the replica catches up and the source is quiesced.

Uses the source's streaming_replica TLS cert + CA, copied to ringtail
as out-of-band secrets (source-immich-pg-replication,
source-immich-pg-ca) — the standard CNPG-to-CNPG migration auth path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:21 -07:00
be5255d685 C2(migrate-immich-to-ringtail): close sifaka-nfs-from-ringtail
Verified on k3s-ringtail:
- Sifaka NFS export /volume1/photos covers 192.168.1.0/24 +
  100.64.0.0/10. Ringtail at 192.168.1.21 is in scope; no DSM rule
  changes needed.
- nfs-test pod mounted the share, read existing library/ thumbs/
  backups/ encoded-video/ profile/, wrote a temp file, deleted it.
- DNS resolution: sifaka → 192.168.1.203 (LAN). NFS traffic stays
  off tailnet, avoiding the sifaka-tailscale-userspace concern.
- Committed PV + PVC bind on first apply (RWX, 2Ti).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:21 -07:00
9f8d627ce8 C2(migrate-immich-to-ringtail): impl add ringtail-side NFS PV/PVC for immich library
Mirrors argocd/manifests/immich/pv-nfs.yaml + pvc.yaml. PV renamed
to immich-library-nfs-pv-ringtail to avoid confusion with the
minikube side (PVs are cluster-scoped; both can coexist).

Initial kustomization.yaml in argocd/manifests/immich-ringtail/
holds just the storage bits today; deployments/services/ingress
will be added in immich-app-on-ringtail.

Verified: PVC binds to PV on k3s-ringtail; mount test from a
busybox pod read existing photo library dirs, wrote and deleted a
test file. DNS resolves sifaka to 192.168.1.203 so NFS traffic
stays on the LAN, off the tailnet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:21 -07:00
4c6695868d C2(migrate-immich-to-ringtail): close immich-pg-on-ringtail
Verified on k3s-ringtail:
- Cluster immich-pg reached "Cluster in healthy state" (1/1 instance)
- borgmatic role: rolcanlogin=t, member of pg_read_all_data
- ExternalSecret immich-pg-borgmatic: Ready=True, username=borgmatic
- Extensions vchord, vector, cube, earthdistance installed in postgres db
  (immich db extensions deferred to app startup per the card)

10 GiB local-path storage; same VectorChord image as minikube source.
Bootstrap is empty initdb today; will be rewritten when
immich-pg-data-migration picks its import method.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:21 -07:00
1d9d8867fb C2(migrate-immich-to-ringtail): impl add immich-pg cluster + app on ringtail
Mirror of argocd/manifests/databases/immich-pg.yaml on ringtail:
- Same VectorChord image (PG17 + VectorChord 0.5.0)
- Same extensions (vector, vchord, cube, earthdistance) via postInitSQL
- Same managed borgmatic role with pg_read_all_data
- 10 GiB local-path storage (matches minikube source)
- shared_preload_libraries: vchord.so
- Empty initdb today; bootstrap block will be rewritten when
  immich-pg-data-migration picks its import method.

ArgoCD app databases-ringtail targets ringtail/databases.
ExternalSecret reuses the onepassword-blumeops ClusterSecretStore that
already exists on ringtail via external-secrets-ringtail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:21 -07:00
e1fe5d2ea6 C2(migrate-immich-to-ringtail): close cnpg-on-ringtail
Verified: cnpg-controller-manager pod Ready on k3s-ringtail; CRDs
clusters.postgresql.cnpg.io etc. installed; ArgoCD app Synced/Healthy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:21 -07:00
b37ac0750f C2(migrate-immich-to-ringtail): impl add cloudnative-pg-ringtail ArgoCD app
Sibling of cloudnative-pg.yaml targeting k3s-ringtail. Same mirror
(mirrors/cloudnative-pg) and release (v1.27.1), same sync options.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:21 -07:00
bca5c40663 C2(migrate-immich-to-ringtail): plan capture GPU contention + valkey arch on immich-app-on-ringtail
Two discovered prereqs while bringing the immich stack up on ringtail:

1. nvidia-device-plugin time-slicing on ringtail advertises only 2
   virtual GPUs. Frigate + Ollama consume both. immich-ml's
   nvidia.com/gpu:1 cannot schedule until replicas is bumped to >= 3.
2. The registry.ops.eblu.me/blumeops/valkey image was built on indri
   (arm64) and is single-arch. Pulling on ringtail (amd64)
   crashloops with "exec format error". Use the upstream multi-arch
   docker.io/valkey/valkey image directly until the mirror gets a
   multi-arch tag.

Card body updated to capture both. Next impl incorporates the fixes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 13:12:09 -07:00
355be3fbc4 C2(migrate-immich-to-ringtail): plan correct extension-verification on immich-pg-on-ringtail card
CNPG's bootstrap.initdb.postInitSQL runs against the postgres
superuser database, not the application database. Extensions
declared there end up in the postgres db, not the immich db. The
Immich app installs them in its own database at startup.

This matches the existing minikube cluster's behavior — same
Cluster CR, same effect. Adjusting the card's verification to
reflect reality rather than (incorrectly) requiring extensions to
be present in the immich db pre-app-deploy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:25:30 -07:00
db37e7cc3e C2(migrate-immich-to-ringtail): plan capture two discovered concerns
1. Registering new ArgoCD apps from a feature branch: the app-of-apps
   "apps" Application is self-managing (re-reads apps.yaml on every
   sync, which pins targetRevision: main). So setting its revision to
   a branch doesn't stick across syncs, and new app definitions on a
   branch are invisible to the cluster via the normal flow. The goal
   card now documents the kubectl-apply + per-new-app `argocd app set
   --revision <branch>` workaround.

2. Tailscale device-name collision on cutover. The minikube immich
   ingress claims tailnet hostname "photos" (tls.hosts: [photos]).
   The ringtail ingress can't claim the same name while minikube's is
   alive (Tailscale enforces uniqueness). Staging uses
   tls.hosts: [photos-ringtail], with the rename to "photos" baked
   into immich-cutover-and-decommission step 2 + step 5.

Card dependency graph unchanged; no new cards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 12:21:57 -07:00
4623733695 C2(migrate-immich-to-ringtail): plan introduce mikado chain
Goal: move immich (server, ML, valkey, postgres) off minikube-indri
onto k3s-ringtail. Immich is the largest single tenant on minikube
(~1.5 GiB resident) and minikube is memory-saturated.

Prerequisite cards:
- cnpg-on-ringtail
- immich-pg-on-ringtail (requires cnpg-on-ringtail)
- immich-pg-data-migration (requires immich-pg-on-ringtail)
- sifaka-nfs-from-ringtail
- immich-app-on-ringtail (requires immich-pg-on-ringtail, sifaka-nfs-from-ringtail)
- immich-cutover-and-decommission (requires immich-pg-data-migration, immich-app-on-ringtail)

Data loss is a critical failure; downtime is acceptable. The cutover
plan favors a CNPG externalCluster basebackup (Option A) with pg_dump
as the documented fallback (Option B).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 11:05:40 -07:00
bc8ceb502b Merge pull request 'C1: pin ringtail wired IP to 192.168.1.21 (static)' (#355) from ringtail-static-ip into main 2026-05-12 09:59:59 -07:00
a4a30aad44 fix(ringtail): explicitly enable net.ipv4.ip_forward
After the static IP change, k3s/flannel pod networking broke because
ip_forward was 0. NixOS doesn't enable IP forwarding by default — it
was previously being set implicitly somewhere in the NM-managed /
scripted-DHCP path. With static networking we have to set it ourselves.

Verified at runtime via sysctl -w before adding here; pod outbound
came back immediately and Tailscale VIP services recovered without
any pod restarts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 09:51:16 -07:00
d0b5423135 C1: pin ringtail wired IP to 192.168.1.21 (static)
Removes DHCP lease renewal as a failure mode on ringtail after an outage
on 2026-05-12 where the IP and routes silently disappeared from enp5s0
without any kernel link event. NetworkManager stays enabled for wireless
fallback but no longer manages the wired interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 09:33:57 -07:00
dc0916a548 C0: shower — rebuild from main SHA (post-merge retag)
PR #354 was squash-merged so the branch commit 444ff91 baked into the
prior image tag isn't reachable from main's history. Rebuild from main
HEAD (3c7967e) and retag. Image content is byte-identical (FOD is
content-addressed, inputs unchanged); only the SHA in the tag changes
so future provenance tracing stays on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 20:20:39 -07:00
3c7967e445 C1: deploy shower v1.1.0 (phases + guest memories) (#354)
## Summary

Deploys `adelaide-baby-shower-app` **v1.1.0** to ringtail k3s.

### App changes (since v1.0.2)

- **Four-phase `ShowerState`** replaces the boolean `locked` flag — `pre_event` → `party` → `prizes_locked` → `event_locked` — with a backfill migration that maps `locked=True → pre_event`, `locked=False → party`.
- **Guest memories**: append-only photos + comments panel where guests can leave notes for the baby. Adds `GuestPhoto` + `GuestComment` models with file-extension validators and a max-size validator; new `shower.imaging` module for thumbnail generation.
- **Admin + QR polish**: configurable host link, fixed "View Site" URL, guest-facing QR copy improvements, contest tweaks.

Three Django migrations run automatically in the entrypoint against the SQLite PV:
- `0009_shower_phase`
- `0010_guest_memories`
- `0011_book_description`

No ConfigMap / env-var changes. The deploy uses `strategy: Recreate` with a single replica, so the old pod releases the data PVC before the new one mounts it and runs migrations.

### Container build changes

The v1.1.0 tag exposed a latent issue with the Forgejo PyPI install path:

- The recent commit [2d38418e](2d38418e) closed the forge package leak at the Fly edge by blocking `/api/packages/*` publicly.
- Forgejo's PyPI simple index returns absolute file URLs hardcoded to its public `ROOT_URL` (`forge.eblu.me`), so pip-installing from the tailnet index URL still tries to download from `forge.eblu.me` → 403.
- Previous shower builds escaped this because their FOD outputs were already in the nix store; bumping to a new version forced a fresh pip run that hit the block.

Fix mirrors what we already do for the sdist: both wheel and sdist are pulled via direct `fetchurl` against `forge.ops.eblu.me`, then the wheel is copied to TMPDIR under its clean filename (nix store path's hash prefix breaks pip's wheel-filename parser) and handed to pip as a local path. The forge `--extra-index-url` is no longer needed.

FOD outputHash pinned to `sha256-kTNOswobtkgyQmmqbQM8XO4vvaGg57nCuuZGbNXb0NM=` from run 547. Image: `registry.ops.eblu.me/blumeops/shower:v1.1.0-444ff91-nix`.

### Adjacent finding (already handled)

The ringtail `gitea-runner-nix_container_builder` systemd unit was left `inactive` after the recent `provision-ringtail` (matches the known `sshd-restart-hangs-mux` lesson — the rebuild changed the unit's PATH closure + config.yaml, systemd stopped it, then the playbook hung before the activation could restart it). Manually started; the existing memory `lesson_provision_ringtail_ssh_hang.md` was extended to mention the runner as the canary service to check after provisions.

## Test plan

- [ ] `argocd app diff shower --revision shower-v1.1.0` — review the manifest change
- [ ] `argocd app set shower --revision shower-v1.1.0 && argocd app sync shower`
- [ ] `kubectl --context=k3s-ringtail logs -n shower deploy/shower` — confirm migrations 0009/0010/0011 applied, no errors
- [ ] Hit `https://shower.ops.eblu.me/` (tailnet) — splash page renders, phase indicator visible
- [ ] Hit `https://shower.ops.eblu.me/host/` — host console loads, phase dropdown shows the four states
- [ ] Hit `https://shower.eblu.me/` (public via Fly) — splash page still served
- [ ] After merge: `argocd app set shower --revision main && argocd app sync shower`

Reviewed-on: #354
2026-05-11 20:08:03 -07:00
fbc1f7720e C0: gitignore .claude/scheduled_tasks.lock
Transient lock file written by the ScheduleWakeup harness tool when
Claude paces its own work between long-running operations. Not config,
not state worth checking in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:37:29 -07:00
4133785119 C1: ringtail — weekly flake.lock update (#352)
## Summary
- Recurring weekly lockfile refresh for `nixos/ringtail/flake.lock`.
- Inputs updated: `disko`, `home-manager`, `nixpkgs`.
- `nixpkgs-services` was deliberately skipped (per overlay convention — pinned services bump only on intentional update).
- Generated via `dagger call flake-update --src=. --flake-path=nixos/ringtail`.

## Test plan
- [x] `prek` hooks pass
- [ ] After merge: `mise run provision-ringtail` to deploy
- [ ] Then check for kernel update per [[manage-lockfile]]

## Notes
- Not deployed from this PR — provisioning is a follow-up.

Reviewed-on: #352
2026-05-11 16:13:07 -07:00
145df76d06 C1: service review — mealie (v3.12.0 deployed; upstream v3.17.0) (#351)
## Summary
- Recurring service review for `mealie`.
- Upstream is at **v3.17.0** (released 2026-05-06); deployed image is **v3.12.0** — 5 minor versions behind.
- Container is built locally from the forge mirror (`containers/mealie/Dockerfile`), so upgrade requires a fresh build + changelog review for breaking changes between v3.12 and v3.17.
- Deferring the actual upgrade to a separate task; this PR just refreshes `last-reviewed` and captures the gap in `notes`.

## Test plan
- [x] `prek` hooks pass
- [ ] Follow-up: open task to bump `containers/mealie/Dockerfile` `CONTAINER_APP_VERSION`, build, and update kustomization tag

## Notes
- No deployment changes in this PR.

Reviewed-on: #351
2026-05-11 16:12:36 -07:00
bb7efa850a C1: doc review — replicating-blumeops tutorial (#350)
## Summary
- Periodic doc review of `tutorials/replicating-blumeops.md` (was never reviewed).
- Fixed 4 instances of "BluemeOps" → "BlumeOps" (also caught 1 in `contributing.md`).
- Added `last-reviewed: 2026-05-11` and bumped `modified`.
- Verified all wiki-link targets resolve.

## Test plan
- [x] `prek` hooks pass (link checker, frontmatter checker)
- [ ] Optional: `mise run docs-preview docs/tutorials/replicating-blumeops.md`

Reviewed-on: #350
2026-05-11 16:11:35 -07:00
f83be3bf37 C1: review CC observability-stack-audit (extend to k3s) (#353)
## Summary
- Recurring compensating-control review (oldest stale control: 42 days).
- Verified the control is in effect on both clusters:
  - `alloy-k8s` on minikube-indri — Synced/Healthy, DaemonSet 1/1 ready
  - `alloy-ringtail` on k3s-ringtail — Synced/Healthy
  - `loki` (`monitoring/loki-0`) — Running, receiving logs (52 restarts in 18h is worth watching but not blocking review)
- Generalized the description: previously named only minikube, but the indri→ringtail migration means we now operate two clusters and both rely on this control.
- Added a follow-up note: enabling native apiserver audit logging is far more tractable on k3s (`--audit-log-path` / `--audit-policy-file`) than it was on minikube — worth revisiting once the migration concludes.

## Test plan
- [x] `prek` hooks pass
- [x] Verified alloy + loki status via `kubectl --context=minikube-indri` and `argocd app get`

## Notes
- No deployment changes.

Reviewed-on: #353
2026-05-11 16:10:39 -07:00
40d9a1ef9e C0: shower — rebuild from main SHA (post-PR-349 retag)
Standard squash-merge dance per
docs/how-to/deployment/build-container-image.md#Squash-merge-and-container-tags
— retags from v1.0.2-039d9b9-nix (branch SHA) to v1.0.2-292d354-nix
([main] tag from run 544 built off the merge commit). Functionally
identical; preserves source traceability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 13:55:25 -07:00
292d354902 C1: deploy adelaide-baby-shower-app to ringtail k3s (#349)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m12s
## Summary

Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app).

### What's included

- **ArgoCD app + manifests** under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer.
- **Public surface** (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett.
- **fail2ban** filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default).
- **Caddy** route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`).
- **Pulumi** Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`.
- **Grafana** APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter.
- **Container** `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages.
- **Docs** runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments.

### Defense layers on the public surface

1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list)
2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP)
3. django-axes (5 fails / 1h, keyed on username+ip_address)
4. edge `/admin/` block (returns 403 for anything that isn't login/logout)

## Prerequisites for the user to do (NOT in this PR)

Halted on these per request — they touch shared/manual systems:

- [x] **NFS share** on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000`
- [ ] **1Password item** `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git
- [ ] **Container build**: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix`
- [x] **DNS**: `mise run dns-up` after merge
- [x] **Fly cert**: `fly certs add shower.eblu.me -a blumeops-proxy`
- [ ] **Caddy push**: `mise run provision-indri -- --tags caddy`
- [ ] **Fly redeploy** to pick up the new nginx block + fail2ban jail: `mise run fly-deploy`
- [ ] **ArgoCD sync**: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging

## Test plan

- [ ] Container builds successfully on nix-container-builder runner
- [ ] Pod starts, migrations run, gunicorn answers on :8000
- [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean
- [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response
- [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead`
- [ ] Grafana dashboard "Shower APM" appears and starts showing traffic
- [ ] `mise run services-check` passes

Reviewed-on: #349
2026-05-11 13:47:18 -07:00
eceb2b99ce C0: bump homepage image to fixed-perms build (v1.11.0-678f26b-nix)
Pulls in 678f26b0 (chowned /app/config). Resolves the EACCES crash loop
on ringtail.
2026-05-10 21:16:34 -07:00
678f26b0e7 C0: fix homepage container /app/config write permissions
The previous Dockerfile chowned /app/config to 1000:1000 so the runtime
user could seed missing skeleton configs (e.g. proxmox.yaml) and write
/app/config/logs. The nix derivation didn't replicate that, so the new
amd64 image crashed with EACCES on cold start (fixed-forward — caught
during ringtail cutover, ArgoCD #348).

Add fakeRootCommands to dockerTools to create /app and /app/config and
chown them at build time. The deployment's ConfigMap subPath mounts
leave the parent directory as image filesystem, so its ownership has to
be set at build time, not at runtime.
2026-05-10 20:49:22 -07:00
ad7a0ed105 Merge pull request 'C1: migrate homepage dashboard from minikube to ringtail (nix-built amd64)' (#348) from homepage-to-ringtail into main 2026-05-10 20:40:33 -07:00
be54cc3411 C1: migrate homepage dashboard to ringtail k3s
Repoint the ArgoCD Application destination from minikube to ringtail and
bump the image tag to the new amd64 nix-built v1.11.0-b87f62e-nix.

Rework services.yaml for the autodiscovery shift: 11 services that
previously auto-populated via minikube Ingress annotations (ArgoCD,
Immich, Kiwix, Mealie, Miniflux, Grafana, Prometheus, Navidrome,
Paperless, TeslaMate, Transmission) become explicit static entries with
their widget configs preserved. Conversely, the ringtail services that
will now auto-populate (Frigate/NVR, Authentik, Ntfy) are removed from
the static list to avoid duplicates; Ollama becomes newly visible.

Add a Content group for Immich/Kiwix/Miniflux which previously lived
under the autodiscovered "Content" group from annotations.
2026-05-10 20:37:03 -07:00
b87f62e0f5 C1: nix-build homepage container for amd64 ringtail migration
Replace Dockerfile (arm64-only, indri-built) with a nix derivation
adapted from nixpkgs pkgs/by-name/ho/homepage-dashboard. Built via the
nix-container-builder runner on ringtail, producing an amd64 image
suitable for k3s.

Includes the upstream Next.js file-system-cache patch to avoid
prerender cache write failures on a read-only nix store path
(nixpkgs issues #328621 and #458494).

Pinned to v1.11.0 (current production version).
2026-05-10 20:32:38 -07:00
8bc19fa460 C0: tailscale main-SHA rebuild for ringtail proxyclass
Routine post-squash-merge cleanup. Bumps the ProxyClass image tag from
the now-orphaned PR branch SHA (67af7a8) to the merge commit SHA
(0108b68) so the deployed image stays traceable after branch cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 06:52:39 -07:00
0108b68769 C1: mirror tailscale container locally for ringtail proxyclass (#347)
## Summary

Adds the first cut of a local nix build for `docker.io/tailscale/tailscale` and rewires only the ringtail tailscale-operator overlay to use it. Indri's overlay continues pulling upstream — minikube on indri is being decommissioned in favor of ringtail's k3s, so investing in dual-cluster routing here would be wasted churn.

## Changes

- `containers/tailscale/default.nix` — `buildGoModule` over `cmd/tailscale`, `cmd/tailscaled`, `cmd/containerboot`; packaged via `dockerTools.buildLayeredImage` with `cacert`, `iptables` (legacy symlink to match upstream Synology compat), `iproute2`, `tzdata`, `busybox`.
- `argocd/manifests/tailscale-operator-ringtail/kustomization.yaml` — kustomize `images:` rewrite swapping `docker.io/tailscale/tailscale` → `registry.ops.eblu.me/blumeops/tailscale:v1.94.2-67af7a8-nix`.
- `docs/changelog.d/mirror-tailscale-container.infra.md` — fragment.

## Pin rationale

v1.94.2 matches `service-versions.yaml:96` and the current ProxyClass exactly — this PR is "make it local," not "upgrade tailscale." Version bumps come as follow-up C0/C1 changes once we decide to test newer (v1.96.x had a Fly-side MagicDNS regression; v1.98.0 is current upstream stable).

## Test plan

- [x] Image built successfully on ringtail nix-container-builder (run #528).
- [x] Image visible in registry: `registry.ops.eblu.me/blumeops/tailscale:v1.94.2-67af7a8-nix`.
- [ ] Deploy from branch: `argocd app set tailscale-operator-ringtail --revision mirror-tailscale-container && argocd app sync tailscale-operator-ringtail`.
- [ ] Verify proxy pods restart with new image and existing tailnet ingresses (e.g., authentik, immich, tempo) keep resolving.
- [ ] After merge: rebuild on main SHA, update kustomization, run `services-check`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #347
2026-05-06 06:50:31 -07:00
6f0d80ca1e C0: doc review — index.md, add ringtail to infra overview
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 06:14:40 -07:00
39b042e638 C0: service review — caddy v2.11.2 (current latest, healthy)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 06:11:15 -07:00
24e5490259 C0: review CC init-container-isolation — defer retirement to post-ringtail
Runtime grafana pod matches the manifest and the CC's claim; bumped
last-reviewed. Noted that retiring init-chown-data in favor of fsGroup
alone should wait until grafana migrates to ringtail's k3s, since the
storage backend will change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:31:13 -07:00
074887cd57 C0: docs — explanation article on compliance mute categories
Captures the CC vs NA vs RA distinction surfaced during the 2026-05-03
weekly compliance review (CVE-2026-31789), and the image-scan mutelist
gap that blocks acting on it. Links the new article from the
review-compensating-controls how-to so it isn't orphaned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:19:53 -07:00
9fb5442ccd C0: kiwix — doc review, fix Adding Archives source path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:46:16 -07:00
f16e1c81f1 C0: zot — upgrade indri registry to v2.1.16
Security fixes only (TLS verification on metrics client, CORS
Allow-Credentials suppression on wildcard origin, manifest/API-key
body-size limits, dependabot bumps). No config changes required;
re-built from source on indri and bounced launchagent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:41:07 -07:00
a2c61b625d C0: rotate-fly-deploy-token — fish+bash one-shot, op validator gotcha
Combine mint+store into a single command with both fish and bash
forms (the doc previously only showed manual paste). Document the
1Password CLI "Password item requires ps value" validator error and
the placeholder-password workaround for Password-category items with
empty primary password fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:42:57 -07:00
2c0917b266 C0: valkey — bump kustomization tags to main-branch SHA
Routine post-merge follow-up after #346. Branch SHA tag (946fa75) replaced
with the main-SHA-built tag (fabca04) so paperless and immich reference an
image traceable to a commit on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:47:16 -07:00
fabca04771 Mirror valkey 8.1 locally for paperless and immich (#346)
## Summary

- Add native Dagger build of valkey 8.1.6-r0 on Alpine 3.22 at `containers/valkey/`
- Swap paperless redis sidecar and immich-valkey from `docker.io/valkey/valkey:8.1-alpine` to `registry.ops.eblu.me/blumeops/valkey:v8.1.6-r0-946fa75`
- Resolves the DR-2026-04 TODO in paperless kustomization about multi-arch redis

## Why

Move toward fully locally-built containers for supply chain control. Paperless and immich both pulled the same upstream tag — one mirror serves both. Authentik's nix-built Redis stays separate (different image entirely).

## Risk

Low. Both sidecars are stateless caches:
- paperless redis: no volumeMount (in-pod localhost, pure memory)
- immich-valkey: `emptyDir` (cache only)

Pod restart rebuilds the cache. Smoke-tested locally (PING/SET/GET roundtrip on `valkey 8.1.6` with `--bind 0.0.0.0 --protected-mode no`).

## Test plan

- [ ] After merge: `mise run container-build-and-release valkey` to rebuild with main SHA
- [ ] Update kustomizations to the `[main]` SHA tag (C0 follow-up)
- [ ] `argocd app sync paperless` and `argocd app sync immich`
- [ ] Verify pods come up healthy (paperless OCR queue functional, immich job queue functional)
- [ ] `mise run services-check`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #346
2026-05-01 17:40:03 -07:00
f84f5f02b3 C0: review compensating control trusted-ci-only
Verified Forgejo runner is registered only to forge.ops.eblu.me and the
forge has registration disabled, so no untrusted users can trigger
privileged CI. Tightened notes to reflect the closed-forge mechanism
(not a per-repo allow-list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:49:22 -07:00
4aa0872949 C0: review ollama doc — refresh image, models, last-reviewed
Bumped documented image tag to 0.20.4 (matches kustomization newTag),
added the two qwen3.5 models from models.txt, and stamped the card.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:42:33 -07:00
2d55303213 C0: alloy native macOS on indri — upgrade to v1.16.0
Completes the v1.16.0 fleet upgrade for the fourth alloy service
(type: ansible, built from source on indri). Binary built on gilbert
with Go 1.26.2 + CGO, scp'd to indri, codesigned, LaunchAgent reloaded.
Service reports clean WAL replay and resumed metric/log shipping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:36:38 -07:00
55563afc7e C0: alloy — bump kustomization tags to main-branch SHA
Per the build-container-image squash-merge convention, rebuild alloy v1.16.0
container images from the main SHA (9564435) and update the three alloy
kustomizations to reference :v1.16.0-9564435[-nix] instead of the branch
SHA :v1.16.0-26a3ab5[-nix] left over from #345.

Both images were rebuilt locally on gilbert (dagger) and ringtail (nix)
because indri is still under heavy macOS memory-compressor pressure (see
separate ticket); CI on indri can't reliably run the dagger publish step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:31:27 -07:00
9564435b11 Alloy V1.16.0 (#345)
Bump Grafana Alloy v1.14.0 → v1.16.0 across all four services (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail; alloy native ansible). Also migrate the indri build path from `Dockerfile` to a native Dagger `container.py` per the build-container-image migration playbook.

## Highlights from upstream
- v1.15: database observability promoted to stable, OTel Collector → v0.147.0
- v1.16: clustering for `loki.source.kubernetes_events`, MySQL exporter 0.19.0
- One pre-existing breaking change in v1.15 (`loki.source.awsfirehose` undocumented metric prefix rename) — not used here.

## Build infra
Alloy v1.16.0's go.mod requires Go 1.26.2. The nix derivation now uses `pkgs.go_1_26` with `GOTOOLCHAIN=local` to avoid auto-downloading a toolchain blob that violated the fixed-output rule.

## Test plan
- [ ] CI: `mise run container-build-and-release alloy --ref alloy-v1.16.0` (dispatched as run 522; nix job to be re-triggered with the v1.16.0 goModules outputHash once the local ringtail build surfaces it)
- [ ] After CI green, bump `images[].newTag` in three kustomizations to the new `-<sha>` and `-<sha>-nix` tags, deploy from this branch via `argocd app set <app> --revision alloy-v1.16.0 && argocd app sync <app>`
- [ ] Manual rebuild of macOS native binary on gilbert (per ansible/roles/alloy README) and `mise run provision-indri -- --tags alloy --check --diff`
- [ ] `mise run services-check` after merge & redeploy

Reviewed-on: #345
2026-05-01 08:05:37 -07:00