C1: deploy adelaide-baby-shower-app to ringtail k3s #349

Merged
eblume merged 20 commits from shower-app-deploy into main 2026-05-11 13:47:20 -07:00
Owner

Summary

Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: shower.eblu.me (public, via Fly proxy) and shower.ops.eblu.me (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at adelaide-baby-shower-app.

What's included

  • ArgoCD app + manifests under argocd/manifests/shower/ (deployment, service, ProxyGroup ingress, ConfigMap for DJANGO_DEBUG/DJANGO_ADMIN_URL, ExternalSecret for DJANGO_SECRET_KEY from 1Password item Shower (blumeops), NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer.
  • Public surface (fly/): new shower.eblu.me server block proxying to shower.ops.eblu.me. /admin/ returns 403 at the edge except /admin/login/ and /admin/logout/, which are rate-limited via a new shower_auth zone. X-Clacks-Overhead on. GNU Terry Pratchett.
  • fail2ban filter (shower-admin-login.conf) matching 401/403/429 on /admin/login/ and jail (shower.conf) with maxretry=5/findtime=600/bantime=3600. The nginx-deny action was generalized to take a per-jail nginx_deny_file so the shower has its own deny list (forge keeps using the legacy default).
  • Caddy route on indri (shower.ops.eblu.me → https://shower.tail8d86e.ts.net).
  • Pulumi Gandi CNAME shower.eblu.me → blumeops-proxy.fly.dev..
  • Grafana APM dashboard configmap-shower-apm.yaml (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring docs-apm.json with a host="shower.eblu.me" filter.
  • Container containers/shower/default.nix: dockerTools.buildLayeredImage with a nixpkgs Python and a startup wrapper that creates /app/data/.venv, pip-installs adelaide-baby-shower-app==1.0.0 from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A local_settings.py shim pins DATABASES.NAME/MEDIA_ROOT/STATIC_ROOT to absolute paths so they don't end up in site-packages.
  • Docs runbook at docs/how-to/operations/shower-app.md linked from the apps registry, plus changelog fragments.
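
A minimal sketch of what the local_settings.py shim described in the Container bullet could look like. The SQLite path and media mount match the paths named elsewhere in this PR; the STATIC_ROOT location is an assumption, since the PR does not state it.

```python
# local_settings.py (sketch): pin runtime paths to absolute locations so
# nothing defaults to a directory inside site-packages.
from pathlib import PurePosixPath

DATA_DIR = PurePosixPath("/app/data")    # RWO local-path PVC (SQLite)
MEDIA_DIR = PurePosixPath("/app/media")  # RWX NFS-backed PVC

DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.sqlite3",
        "NAME": str(DATA_DIR / "db.sqlite3"),
    }
}
MEDIA_ROOT = str(MEDIA_DIR)
# Assumption: the real STATIC_ROOT is not given in this PR.
STATIC_ROOT = str(DATA_DIR / "staticfiles")
```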

Defense layers on the public surface

  1. fly nginx geo+fail2ban $shower_banned (per-service deny list)
  2. fly nginx limit_req zone=shower_auth (3 r/s per Fly-Client-IP)
  3. django-axes (5 fails / 1h, keyed on username+ip_address)
  4. edge /admin/ block (returns 403 for anything that isn't login/logout)
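
To make layer 1 concrete, here is a rough Python model of the fail2ban filter and jail thresholds. It assumes nginx's "combined" access-log format with the client IP first; the real filter regex in shower-admin-login.conf is not reproduced here, and the function names are hypothetical.

```python
import re

# Assumption: nginx "combined" log format, client IP as the first field.
LOGIN_FAILURE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"(?:GET|POST) /admin/login/\S* HTTP/[\d.]+" (?P<status>401|403|429) '
)

def failed_login_ip(line):
    """Return the client IP if the line is a failed /admin/login/ hit, else None."""
    match = LOGIN_FAILURE.match(line)
    return match.group("ip") if match else None

def should_ban(failure_times, now, maxretry=5, findtime=600):
    """True once maxretry failures fall inside the trailing findtime window (seconds)."""
    recent = [t for t in failure_times if now - findtime <= t <= now]
    return len(recent) >= maxretry

sample = ('203.0.113.7 - - [11/May/2026:13:00:00 +0000] '
          '"POST /admin/login/ HTTP/1.1" 403 153 "-" "curl/8.0"')
print(failed_login_ip(sample))
```

The defaults mirror the jail's maxretry=5/findtime=600; bantime=3600 only governs how long the resulting deny entry lives, so it is not modeled.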

Prerequisites for the user to do (NOT in this PR)

Halted on these per request — they touch shared/manual systems:

  • NFS share on sifaka: /volume1/shower, NFS rule for ringtail RW, chown 1000:1000
  • 1Password item Shower (blumeops) in the blumeops vault with a freshly minted secret-key field (openssl rand -base64 48) — do NOT reuse anything that has lived in git
  • Container build: mise run container-build-and-release shower, then update images[].newTag in argocd/manifests/shower/kustomization.yaml to the resulting v1.0.0-<sha>-nix
  • DNS: mise run dns-up after merge
  • Fly cert: fly certs add shower.eblu.me -a blumeops-proxy
  • Caddy push: mise run provision-indri -- --tags caddy
  • Fly redeploy to pick up the new nginx block + fail2ban jail: mise run fly-deploy
  • ArgoCD sync: argocd app set shower --revision shower-app-deploy && argocd app sync shower to test from this branch before merging

Test plan

  • Container builds successfully on nix-container-builder runner
  • Pod starts, migrations run, gunicorn answers on :8000
  • kubectl --context=k3s-ringtail -n shower logs deploy/shower clean
  • curl -sf https://shower.ops.eblu.me/ returns the splash page (tailnet)
  • curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/ returns 200 (pre-DNS verification)
  • curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/ returns 403 (edge block)
  • curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/ returns a Django login response
  • After DNS is up: curl -I https://shower.eblu.me/ returns 200 with X-Clacks-Overhead
  • Grafana dashboard "Shower APM" appears and starts showing traffic
  • mise run services-check passes
Adds the Adelaide / Heidi / Addie baby shower app — a Django guest
splash, raffle picker, and prize-assignment console — on ringtail k3s.
Public landing at shower.eblu.me (via fly proxy), tailnet admin at
shower.ops.eblu.me. App source: forge.eblu.me/eblume/adelaide-baby-shower-app,
wheel-published to the Forgejo Packages PyPI index.

Manifests under argocd/manifests/shower/: NFS-backed PVC for /app/media,
local-path PVC for SQLite, ExternalSecret pulling DJANGO_SECRET_KEY from
1Password (item "Shower (blumeops)"), Tailscale ProxyGroup ingress.

Defense-in-depth for the public surface:
  - /admin/ blocked at the fly edge except /admin/login/ and /admin/logout/
  - shower_auth rate limit on the login path
  - new fail2ban filter+jail with a per-service shower-deny.conf
    (nginx-deny action generalized to accept nginx_deny_file)
  - django-axes (5 / 1h) keyed on (username, ip_address)
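
For reference, the django-axes policy above would typically be expressed with settings along these lines. This is a sketch, assuming a django-axes release that supports AXES_LOCKOUT_PARAMETERS; the app's actual settings are not shown in this PR.

```python
from datetime import timedelta

# Lock out after 5 failed attempts ...
AXES_FAILURE_LIMIT = 5
# ... for one hour ...
AXES_COOLOFF_TIME = timedelta(hours=1)
# ... keyed on the (username, ip_address) combination. The nested list
# means "both together", as opposed to either one on its own.
AXES_LOCKOUT_PARAMETERS = [["username", "ip_address"]]
```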

Plus: Caddy route on indri, Pulumi gandi CNAME, Grafana APM dashboard
mirroring docs-apm.json, runbook at how-to/operations/shower-app.md,
and a service-versions entry. X-Clacks-Overhead set on the new server
block — GNU Terry Pratchett.

Build: containers/shower/default.nix uses dockerTools to ship a
nixpkgs Python plus a startup wrapper that installs the wheel into
/app/data/.venv on first boot and execs gunicorn. Lets the wheel come
from forge PyPI without pinning hashes for every transitive dep.

Prerequisites tracked in the runbook (not yet executed):
  - NFS share sifaka:/volume1/shower (manual Synology step)
  - 1Password item "Shower (blumeops)" with secret-key field
  - container build via `mise run container-build-and-release shower`
  - Pulumi dns-up after merge
  - fly certs add shower.eblu.me

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three follow-ups on the shower deployment branch:

1. containers/shower/default.nix now uses buildPythonPackage to install
   the adelaide-baby-shower-app wheel + its deps at nix build time. The
   wheel comes from the forge PyPI index with a pinned SRI hash. The
   entrypoint no longer does pip-at-boot — it just runs migrations,
   collectstatic, and execs gunicorn.

2. ansible/roles/borgmatic/defaults/main.yml:
   - Adds shower to borgmatic_k8s_sqlite_dumps (context k3s-ringtail)
     so /app/data/db.sqlite3 is dumped via kubectl exec on every run.
   - Adds /Volumes/shower (sifaka SMB mount on indri) to
     borgmatic_source_directories so prize-photo media gets archived.

3. NFS share docs corrected to match the real on-sifaka pattern:
   exports allowlist 192.168.1.0/24 + 100.64.0.0/10 with all_squash to
   admin (matching frigate/paperless/etc.), not "Squash=No mapping".
   The pod's runAsUser doesn't need to match an on-disk uid because
   all_squash rewrites every write to admin:users.

Also adds a missing service-versions entry for the tailscale container
introduced in PR #347 — pre-existing gap surfaced by the
container-version-check hook on this commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The buildPythonPackage approach with `propagatedBuildInputs = [ python.pkgs.django ... ]` doesn't work:

  1. nixpkgs python314Packages.django still aliases to Django 4.2 LTS,
     which doesn't support Python 3.14.
  2. django-axes from nixpkgs pulls selenium + browser fonts into its
     check phase, and the nix sandbox can't provide those (fontconfig
     errors, then build dep tree collapses).

Switching to authentik's FOD pattern instead: a single fixed-output
derivation that pip-installs the adelaide-baby-shower-app wheel + every
transitive dep from forge PyPI into a target dir. FODs get network
access in exchange for a pinned output hash, so the closure stays
reproducible.

outputHash is set to fakeHash for the first build — the runner will
print the real hash on failure; a follow-up commit will pin it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 534 failed with 'fixed-output derivations must not reference store
paths: ... gcc-14.3.0-lib' because pip-installed wheels pulled stdenv
into the venv (Python's setup, gcc-lib runtime references).

Adapts authentik's two-stage pattern:
- pyDepsFOD: pip-installs into the venv, then strips every nix store
  ref it can find (find+remove-references-to). Output is fully
  self-contained — pinned by outputHash.
- pyDeps (non-FOD wrapper): copies the FOD output and runs
  autoPatchelfHook against runtime buildInputs (libstdc++, zlib, image
  libs for pillow). This restores RPATHs on the .so files that pillow
  and scipy ship, against the real on-image library locations.
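
A rough Python model of what the strip step in pyDepsFOD achieves. It is not how remove-references-to is implemented: it only illustrates the idea of overwriting the hash part of every store path with a same-length placeholder so the output no longer references real store paths. The placeholder value and the function name are assumptions for illustration.

```python
import re

# Store paths look like /nix/store/<32-char base32 hash>-<name>.
# Nix's base32 alphabet omits the letters e, o, t and u.
STORE_REF = re.compile(rb"/nix/store/([0-9a-df-np-sv-z]{32})-")

# Same length as a real hash, so binary offsets stay intact; "e" is not
# in the alphabet, so it can never collide with a real path.
PLACEHOLDER = b"/nix/store/" + b"e" * 32 + b"-"

def scrub_store_refs(blob: bytes) -> bytes:
    """Replace the hash of every store path reference with the placeholder."""
    return STORE_REF.sub(PLACEHOLDER, blob)

# Hypothetical hash, made up for the example.
fake_hash = b"0123456789abcdfghijklmnpqrsvwxyz"
blob = b"\x00RPATH=/nix/store/" + fake_hash + b"-gcc-14.3.0-lib/lib\x00"
scrubbed = scrub_store_refs(blob)
```

Because the scrub breaks the RPATHs it touches, the non-FOD pyDeps wrapper has to restore them afterwards, which is the job autoPatchelfHook does in the second stage.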

outputHash still fakeHash — next build prints the real one.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Build 536 finished cleanly with the strip-refs FOD + autopatchelf
wrapper. The [branch] tag is fine for ArgoCD branch-revision testing;
a follow-up C0 will rebuild from main and re-pin to the [main] SHA tag
after merge, per docs/how-to/deployment/build-container-image.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets a re-run of `mise run fly-setup` (e.g. after a fly-app rebuild or
when bootstrapping fresh) re-issue the cert without remembering the
ad-hoc `fly certs add` we did during this deployment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ -0,0 +6,4 @@
data:
DJANGO_DEBUG: "0"
# Admin lives behind the tailnet; the public proxy blocks /admin/ except
# /admin/login/ and /admin/logout/. /host/'s "Django admin" link follows
Author
Owner

Hmm can you please remind me why /admin/login and /admin/logout need to be accessible on WAN? Can't we just forward any logins/logouts to the tailnet hostname as well, and thus not expose admin login on WAN at all?

eblume marked this conversation as resolved
@ -0,0 +1,224 @@
---
title: Shower App on Ringtail
Author
Owner

This is a good how-to article, but let's also have a reference page for this app too - aim for a 30s read time, just basic facts and links to other cards.

eblume marked this conversation as resolved
PR review caught that we didn't need an admin login surface on WAN.
App v1.0.1 adds DJANGO_PUBLIC_URL_BASE so QR codes generated from
/host/ (now tailnet-only) still point at shower.eblu.me for guest
phones — that closes the loop and lets us strip the WAN admin surface
entirely.
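
A sketch of how a public-URL setting like DJANGO_PUBLIC_URL_BASE can be used when generating QR targets. The helper name, the fallback behaviour, and the /raffle/ path are hypothetical; the app's real implementation is not part of this PR.

```python
import os

def public_url(path, environ=os.environ):
    """Build an absolute guest-facing URL from DJANGO_PUBLIC_URL_BASE.

    Falls back to the relative path when the variable is unset
    (assumed behaviour, for illustration only).
    """
    base = environ.get("DJANGO_PUBLIC_URL_BASE", "").rstrip("/")
    if not base:
        return path
    return f"{base}/{path.lstrip('/')}"

# Rendered on the tailnet-only /host/ page, but pointing at the public host:
env = {"DJANGO_PUBLIC_URL_BASE": "https://shower.eblu.me"}
print(public_url("/raffle/", env))
```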

Container:
  - bump version to 1.0.1
  - outputHash → fakeHash (build will print the real one)
  - entrypoint still does migrate + collectstatic before gunicorn —
    the app is small enough that auto-migration is fine

Manifests:
  - configmap adds DJANGO_PUBLIC_URL_BASE=https://shower.eblu.me

Fly nginx (shower.eblu.me):
  - drop the /admin/(login|logout) carveout
  - 403 anything under /admin/ AND /host/ with a "tailnet only" pointer
  - drop the shower_auth limit_req zone and $shower_banned geo
  - drop the shower-admin-login fail2ban filter + jail
  - drop the shower-deny.conf touch from start.sh
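
The resulting edge behaviour can be modeled as a simple prefix check. This is a Python model of the routing decision only, assuming plain prefix matching; it is not the nginx configuration itself, and the names are hypothetical.

```python
# Paths that are served on the tailnet hostname only.
TAILNET_ONLY_PREFIXES = ("/admin/", "/host/")

def edge_status(path):
    """Return 403 for tailnet-only paths, or None when the request is proxied upstream."""
    if path.startswith(TAILNET_ONLY_PREFIXES):
        return 403
    return None

for p in ("/", "/admin/login/", "/host/", "/static/css/site.css"):
    print(p, edge_status(p))
```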

Docs:
  - rename how-to docs/how-to/operations/shower-app.md →
    shower-on-ringtail.md (mirrors cv-on-indri / docs-on-indri)
  - new reference card docs/reference/services/shower-app.md per PR
    review comment 2 (≈30s read; quick facts + cross-links)
  - rewrite Defense layers section: collapses to general rate limit +
    django-axes on the tailnet-side login (the only credential surface)
  - rewrite the .infra.md changelog fragment to match
  - add a 'Create the admin user' step (kubectl exec createsuperuser)
    so first-time deploys aren't locked out

The nginx-deny action's per-jail `nginx_deny_file` generalization
stays — harmless future-proofing for the next public service.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two complementary fixes for the deploy that just landed:

1. Pod was 0/1 Running because the readiness probe sends
   `Host: shower.ops.eblu.me` and the app's hardcoded ALLOWED_HOSTS
   only includes `shower.eblu.me`. settings.py exposes a
   DJANGO_ALLOWED_HOSTS env-var extras hook for exactly this case —
   wired into the configmap.

2. `kubectl exec deploy/shower -- python -m django <cmd>` returned
   "No module named django" because PYTHONPATH lived only inside the
   entrypoint script. Moved PYTHONPATH, DJANGO_SETTINGS_MODULE, PATH,
   and HOME into the image's Env block so exec'd shells inherit them.
   The entrypoint now just runs the boot sequence; the exports are
   redundant (image Env covers them) and gone.
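
For fix 1, an env-var extras hook of this kind usually looks something like the sketch below. The comma-separated format and the helper name are assumptions; the app's settings.py is not shown in this PR.

```python
import os

def allowed_hosts(base, environ=os.environ):
    """Merge hardcoded hosts with extras from DJANGO_ALLOWED_HOSTS.

    Assumption: extras are a comma-separated list. Order is preserved
    and duplicates are dropped.
    """
    raw = environ.get("DJANGO_ALLOWED_HOSTS", "")
    extras = [host.strip() for host in raw.split(",") if host.strip()]
    return list(dict.fromkeys([*base, *extras]))

# With the configmap entry in place, the readiness probe's Host header passes:
ALLOWED_HOSTS = allowed_hosts(
    ["shower.eblu.me"],
    {"DJANGO_ALLOWED_HOSTS": "shower.ops.eblu.me"},
)
```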

FOD inputs are unchanged so outputHash stays valid; no fakeHash dance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Doc said "Store the auth key in 1Password as well for the `fly-setup`
mise task" right next to the description of fly-setup, which reads
the key from Pulumi state, not 1Password. No code path anywhere reads
this key from 1P — the instruction is vestigial from an earlier
design and confused us during the v1.0.1 rotation when the
flyio-proxy-key expired.

Rewrite the section to:
  - point at `mise run fly-setup` as the canonical path
  - state explicitly that Pulumi state is the only source of truth
  - document the rotation recipe (tailnet-up --replace=<urn> +
    fly-setup + fly-deploy) for the next time this 90-day key lapses

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
App v1.0.2 ships WhiteNoise for /static/ and /media/, so the
blumeops-side workaround is no longer needed:

  - containers/shower/default.nix: drop the WhiteNoise pip dep + the
    middleware-injection block from local_settings. The shim is back
    to just path overrides (DATABASES.NAME, MEDIA_ROOT, STATIC_ROOT).
  - version → 1.0.2, outputHash → fakeHash for re-pinning.
  - service-versions.yaml mirrored.

fly/nginx.conf: cache /static/ (1y) and /media/ (1d) per location for
shower.eblu.me. /static/ filenames are content-hashed thanks to
CompressedManifestStaticFilesStorage so a year is safe and invalidation
is automatic on the next collectstatic.
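
Why a one-year TTL is safe can be seen from how content-hashed names behave. The sketch below imitates the naming scheme used by Django's manifest storage (a truncated MD5 of the file content inserted before the extension); it is an illustration, not Django's code.

```python
import hashlib
from pathlib import PurePosixPath

def hashed_name(name: str, content: bytes) -> str:
    """Insert a 12-hex-char content digest before the file extension."""
    digest = hashlib.md5(content).hexdigest()[:12]
    path = PurePosixPath(name)
    return str(path.with_name(f"{path.stem}.{digest}{path.suffix}"))

v1 = hashed_name("css/cropper.min.css", b"body { color: red }")
v2 = hashed_name("css/cropper.min.css", b"body { color: blue }")
# Different content yields a different URL, so a cached copy of v1 is
# simply never requested again after the next collectstatic.
print(v1, v2)
```

/media/ files keep their upload names, which is why they get the shorter 1d TTL.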

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
forge.eblu.me's package registry (/api/packages/* and /api/v1/packages/*)
served anonymous reads to the world even for private-repo releases —
Forgejo's per-user visibility treats packages as world-readable when
the owner's Visibility is Public, and we keep eblume Public so the
profile page stays open. The sdist downloads include full source
trees of private repos; that's the leak.

The fix is to keep the user public but block /api/packages/* and
/api/v1/packages/* at the proxy edge. forge.ops.eblu.me (tailnet) is
untouched, so CI workflows + gilbert's uv + the nix-container-builder
still work — they just need to use the tailnet hostname.

Three consumers updated to forge.ops.eblu.me:
  - containers/shower/default.nix (the FOD pip --extra-index-url)
  - ansible/roles/cv/defaults/main.yml (cv_release_url for generic package)
  - chezmoi-tracked fish dotfiles (devpi.fish + conf.d/pypi.fish) —
    edited in chezmoi source, user will apply separately

The blumeops repo had no other forge-pypi consumers (audited: workers,
runner-job-image, ansible roles, container builds). Doc references in
changelog fragments + comments left as-is — they describe history.

The proper long-term fix is to move private packages to a Limited-
visibility Forgejo org instead of relying on a proxy-side block (see
queued Todoist for the migration plan). Edge block stays as
defense in depth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The wheel ships config/ and shower/ only (per pyproject hatchling
config), leaving the repo's top-level static/ dir — Sortable.min.js,
cropper.min.js, cropper.min.css, prize-placeholder.svg — behind. At
runtime, host_dashboard.html's {% static 'css/cropper.min.css' %}
hits the manifest, CompressedManifestStaticFilesStorage raises
ValueError on the missing entry, /host/ returns 500.

Fix on the deploy side: fetch the sdist via fetchurl (pinned SRI hash
from forge PyPI), extract its top-level static/ subtree into a
non-FOD derivation, lay it down at /app/static in the image. The
local_settings shim adds /app/static to STATICFILES_DIRS so
collectstatic at boot picks the vendored assets up alongside the
Django admin's own static files.
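
The shim change amounts to appending one directory to STATICFILES_DIRS. A minimal sketch, with a hypothetical helper name:

```python
def with_vendored_static(staticfiles_dirs, vendored="/app/static"):
    """Return STATICFILES_DIRS with the vendored asset dir appended once."""
    dirs = list(staticfiles_dirs)
    if vendored not in dirs:
        dirs.append(vendored)
    return dirs

# In local_settings.py this would wrap whatever the app already defines:
STATICFILES_DIRS = with_vendored_static([])
```

Keeping the helper idempotent means a later app release that adds /app/static itself would not produce a duplicate entry.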

Sdist URL is forge.ops.eblu.me/api/packages/... (tailnet) — matches
the just-landed edge block on forge.eblu.me/api/packages/*. The
nix-container-builder runner on ringtail is on the tailnet, so the
FOD fetch works.

App doesn't change. v1.0.3 is no longer needed for the static gap —
the wheel's "packages = [config, shower]" pattern stays as-is, and we
treat the sdist as the canonical bundle for the assets the wheel
intentionally omits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Default `general` zone (10r/s burst=20) is tuned for internet drive-by
traffic. At the party, 30 guests scanning the splash QR from one
venue-wifi NAT'd public IP would each fetch HTML + ~5 static assets
within a few seconds — easily clearing burst=20, and the second-wave
guests would see 503 with no auto-retry.

New shower_general zone (50r/s burst=200) absorbs that simultaneous-
load spike. Sustained exploit scanning above 50r/s still trips it, and
burst=200 is still a hard cap on instantaneous spikes. Note that the
45.88.138.44 burst we already saw in Loki (~30 req in 2s, roughly
15r/s) is above the default zone's 10r/s but below the new 50r/s, so a
burst of that size alone would pass.

Self-healing: `limit_req` is a leaky bucket, so there is no persistent
ban and nothing to manually flush. A guest who trips it auto-recovers
within ~1s; tuning here is about not tripping it on legit traffic in
the first place.
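
The sizing argument can be checked with a small simulation. This is a simplified model of `limit_req` accounting for a single client key (integer milli-request bookkeeping, no `nodelay`/`delay` handling), not nginx's actual code. The worst case assumed here is all 30 guests' requests (1 HTML + ~5 assets each, 180 total) arriving at the same instant from one NAT'd IP.

```python
def simulate_limit_req(arrivals_ms, rate, burst):
    """Return (accepted, rejected) for one client key.

    arrivals_ms: sorted request times in milliseconds
    rate: sustained rate in requests/second
    burst: extra requests tolerated above the rate
    """
    excess = 0   # milli-requests currently above the sustained rate
    last = None  # time of the last accepted request
    accepted = rejected = 0
    for now in arrivals_ms:
        if last is None:
            candidate = 0
        else:
            # Drain at `rate` since the last accepted request, add this one.
            candidate = max(0, excess - rate * (now - last) + 1000)
        if candidate > burst * 1000:
            rejected += 1  # state is left untouched on rejection
            continue
        excess, last = candidate, now
        accepted += 1
    return accepted, rejected

party = [0] * 180  # 30 guests x 6 requests, same instant (assumed worst case)
old_zone = simulate_limit_req(party, rate=10, burst=20)
new_zone = simulate_limit_req(party, rate=50, burst=200)
print(old_zone, new_zone)  # (21, 159) (180, 0)
```

Under this model the default zone rejects most of the wave, while shower_general admits all of it with headroom to spare.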

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
eblume merged commit 292d354902 into main 2026-05-11 13:47:20 -07:00