blumeops

Author	SHA1	Message	Date
Erich Blume	d0b5423135	C1: pin ringtail wired IP to 192.168.1.21 (static) Removes DHCP lease renewal as a failure mode on ringtail after an outage on 2026-05-12 where the IP and routes silently disappeared from enp5s0 without any kernel link event. NetworkManager stays enabled for wireless fallback but no longer manages the wired interface. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-12 09:33:57 -07:00
Erich Blume	dc0916a548	C0: shower — rebuild from main SHA (post-merge retag) PR #354 was squash-merged so the branch commit `444ff91` baked into the prior image tag isn't reachable from main's history. Rebuild from main HEAD (`3c7967e`) and retag. Image content is byte-identical (FOD is content-addressed, inputs unchanged); only the SHA in the tag changes so future provenance tracing stays on main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 20:20:39 -07:00
Erich Blume	3c7967e445	C1: deploy shower v1.1.0 (phases + guest memories) (#354 ) ## Summary Deploys `adelaide-baby-shower-app` v1.1.0 to ringtail k3s. ### App changes (since v1.0.2) - Four-phase `ShowerState` replaces the boolean `locked` flag — `pre_event` → `party` → `prizes_locked` → `event_locked` — with a backfill migration that maps `locked=True → pre_event`, `locked=False → party`. - Guest memories: append-only photos + comments panel where guests can leave notes for the baby. Adds `GuestPhoto` + `GuestComment` models with file-extension validators and a max-size validator; new `shower.imaging` module for thumbnail generation. - Admin + QR polish: configurable host link, fixed "View Site" URL, guest-facing QR copy improvements, contest tweaks. Three Django migrations run automatically in the entrypoint against the SQLite PV: - `0009_shower_phase` - `0010_guest_memories` - `0011_book_description` No ConfigMap / env-var changes. The deploy uses `strategy: Recreate` with a single replica, so the old pod releases the data PVC before the new one mounts it and runs migrations. ### Container build changes The v1.1.0 tag exposed a latent issue with the Forgejo PyPI install path: - The recent commit [`2d38418e`](`2d38418e`) closed the forge package leak at the Fly edge by blocking `/api/packages/*` publicly. - Forgejo's PyPI simple index returns absolute file URLs hardcoded to its public `ROOT_URL` (`forge.eblu.me`), so pip-installing from the tailnet index URL still tries to download from `forge.eblu.me` → 403. - Previous shower builds escaped this because their FOD outputs were already in the nix store; bumping to a new version forced a fresh pip run that hit the block. Fix mirrors what we already do for the sdist: both wheel and sdist are pulled via direct `fetchurl` against `forge.ops.eblu.me`, then the wheel is copied to TMPDIR under its clean filename (nix store path's hash prefix breaks pip's wheel-filename parser) and handed to pip as a local path. The forge `--extra-index-url` is no longer needed. FOD outputHash pinned to `sha256-kTNOswobtkgyQmmqbQM8XO4vvaGg57nCuuZGbNXb0NM=` from run 547. Image: `registry.ops.eblu.me/blumeops/shower:v1.1.0-444ff91-nix`. ### Adjacent finding (already handled) The ringtail `gitea-runner-nix_container_builder` systemd unit was left `inactive` after the recent `provision-ringtail` (matches the known `sshd-restart-hangs-mux` lesson — the rebuild changed the unit's PATH closure + config.yaml, systemd stopped it, then the playbook hung before the activation could restart it). Manually started; the existing memory `lesson_provision_ringtail_ssh_hang.md` was extended to mention the runner as the canary service to check after provisions. ## Test plan - [ ] `argocd app diff shower --revision shower-v1.1.0` — review the manifest change - [ ] `argocd app set shower --revision shower-v1.1.0 && argocd app sync shower` - [ ] `kubectl --context=k3s-ringtail logs -n shower deploy/shower` — confirm migrations 0009/0010/0011 applied, no errors - [ ] Hit `https://shower.ops.eblu.me/` (tailnet) — splash page renders, phase indicator visible - [ ] Hit `https://shower.ops.eblu.me/host/` — host console loads, phase dropdown shows the four states - [ ] Hit `https://shower.eblu.me/` (public via Fly) — splash page still served - [ ] After merge: `argocd app set shower --revision main && argocd app sync shower` Reviewed-on: #354	2026-05-11 20:08:03 -07:00
Erich Blume	fbc1f7720e	C0: gitignore .claude/scheduled_tasks.lock Transient lock file written by the ScheduleWakeup harness tool when Claude paces its own work between long-running operations. Not config, not state worth checking in. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:37:29 -07:00
Erich Blume	4133785119	C1: ringtail — weekly flake.lock update (#352 ) ## Summary - Recurring weekly lockfile refresh for `nixos/ringtail/flake.lock`. - Inputs updated: `disko`, `home-manager`, `nixpkgs`. - `nixpkgs-services` was deliberately skipped (per overlay convention — pinned services bump only on intentional update). - Generated via `dagger call flake-update --src=. --flake-path=nixos/ringtail`. ## Test plan - [x] `prek` hooks pass - [ ] After merge: `mise run provision-ringtail` to deploy - [ ] Then check for kernel update per [[manage-lockfile]] ## Notes - Not deployed from this PR — provisioning is a follow-up. Reviewed-on: #352	2026-05-11 16:13:07 -07:00
Erich Blume	145df76d06	C1: service review — mealie (v3.12.0 deployed; upstream v3.17.0) (#351 ) ## Summary - Recurring service review for `mealie`. - Upstream is at v3.17.0 (released 2026-05-06); deployed image is v3.12.0 — 5 minor versions behind. - Container is built locally from the forge mirror (`containers/mealie/Dockerfile`), so upgrade requires a fresh build + changelog review for breaking changes between v3.12 and v3.17. - Deferring the actual upgrade to a separate task; this PR just refreshes `last-reviewed` and captures the gap in `notes`. ## Test plan - [x] `prek` hooks pass - [ ] Follow-up: open task to bump `containers/mealie/Dockerfile` `CONTAINER_APP_VERSION`, build, and update kustomization tag ## Notes - No deployment changes in this PR. Reviewed-on: #351	2026-05-11 16:12:36 -07:00
Erich Blume	bb7efa850a	C1: doc review — replicating-blumeops tutorial (#350 ) ## Summary - Periodic doc review of `tutorials/replicating-blumeops.md` (was never reviewed). - Fixed 4 instances of "BluemeOps" → "BlumeOps" (also caught 1 in `contributing.md`). - Added `last-reviewed: 2026-05-11` and bumped `modified`. - Verified all wiki-link targets resolve. ## Test plan - [x] `prek` hooks pass (link checker, frontmatter checker) - [ ] Optional: `mise run docs-preview docs/tutorials/replicating-blumeops.md` Reviewed-on: #350	2026-05-11 16:11:35 -07:00
Erich Blume	f83be3bf37	C1: review CC observability-stack-audit (extend to k3s) (#353 ) ## Summary - Recurring compensating-control review (oldest stale control: 42 days). - Verified the control is in effect on both clusters: - `alloy-k8s` on minikube-indri — Synced/Healthy, DaemonSet 1/1 ready - `alloy-ringtail` on k3s-ringtail — Synced/Healthy - `loki` (`monitoring/loki-0`) — Running, receiving logs (52 restarts in 18h is worth watching but not blocking review) - Generalized the description: previously named only minikube, but the indri→ringtail migration means we now operate two clusters and both rely on this control. - Added a follow-up note: enabling native apiserver audit logging is far more tractable on k3s (`--audit-log-path` / `--audit-policy-file`) than it was on minikube — worth revisiting once the migration concludes. ## Test plan - [x] `prek` hooks pass - [x] Verified alloy + loki status via `kubectl --context=minikube-indri` and `argocd app get` ## Notes - No deployment changes. Reviewed-on: #353	2026-05-11 16:10:39 -07:00
Erich Blume	40d9a1ef9e	C0: shower — rebuild from main SHA (post-PR-349 retag) Standard squash-merge dance per docs/how-to/deployment/build-container-image.md#Squash-merge-and-container-tags — retags from v1.0.2-039d9b9-nix (branch SHA) to v1.0.2-292d354-nix ([main] tag from run 544 built off the merge commit). Functionally identical; preserves source traceability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 13:55:25 -07:00
Erich Blume	292d354902	C1: deploy adelaide-baby-shower-app to ringtail k3s (#349 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m12s Details ## Summary Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app). ### What's included - ArgoCD app + manifests under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer. - Public surface (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett. - fail2ban filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default). - Caddy route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`). - Pulumi Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`. - Grafana APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter. - Container `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages. - Docs runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments. ### Defense layers on the public surface 1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list) 2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP) 3. django-axes (5 fails / 1h, keyed on username+ip_address) 4. edge `/admin/` block (returns 403 for anything that isn't login/logout) ## Prerequisites for the user to do (NOT in this PR) Halted on these per request — they touch shared/manual systems: - [x] NFS share on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000` - [ ] 1Password item `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git - [ ] Container build: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix` - [x] DNS: `mise run dns-up` after merge - [x] Fly cert: `fly certs add shower.eblu.me -a blumeops-proxy` - [ ] Caddy push: `mise run provision-indri -- --tags caddy` - [ ] Fly redeploy to pick up the new nginx block + fail2ban jail: `mise run fly-deploy` - [ ] ArgoCD sync: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging ## Test plan - [ ] Container builds successfully on nix-container-builder runner - [ ] Pod starts, migrations run, gunicorn answers on :8000 - [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean - [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response - [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead` - [ ] Grafana dashboard "Shower APM" appears and starts showing traffic - [ ] `mise run services-check` passes Reviewed-on: #349	2026-05-11 13:47:18 -07:00
Erich Blume	eceb2b99ce	C0: bump homepage image to fixed-perms build (v1.11.0-678f26b-nix) Pulls in `678f26b0` (chowned /app/config). Resolves the EACCES crash loop on ringtail.	2026-05-10 21:16:34 -07:00
Erich Blume	678f26b0e7	C0: fix homepage container /app/config write permissions The previous Dockerfile chowned /app/config to 1000:1000 so the runtime user could seed missing skeleton configs (e.g. proxmox.yaml) and write /app/config/logs. The nix derivation didn't replicate that, so the new amd64 image crashed with EACCES on cold start (fixed-forward — caught during ringtail cutover, ArgoCD #348). Add fakeRootCommands to dockerTools to create /app and /app/config and chown them at build time. The deployment's ConfigMap subPath mounts leave the parent directory as image filesystem, so its ownership has to be set at build time, not at runtime.	2026-05-10 20:49:22 -07:00
Erich Blume	ad7a0ed105	Merge pull request 'C1: migrate homepage dashboard from minikube to ringtail (nix-built amd64)' (#348 ) from homepage-to-ringtail into main	2026-05-10 20:40:33 -07:00
Erich Blume	be54cc3411	C1: migrate homepage dashboard to ringtail k3s Repoint the ArgoCD Application destination from minikube to ringtail and bump the image tag to the new amd64 nix-built v1.11.0-b87f62e-nix. Rework services.yaml for the autodiscovery shift: 11 services that previously auto-populated via minikube Ingress annotations (ArgoCD, Immich, Kiwix, Mealie, Miniflux, Grafana, Prometheus, Navidrome, Paperless, TeslaMate, Transmission) become explicit static entries with their widget configs preserved. Conversely, the ringtail services that will now auto-populate (Frigate/NVR, Authentik, Ntfy) are removed from the static list to avoid duplicates; Ollama becomes newly visible. Add a Content group for Immich/Kiwix/Miniflux which previously lived under the autodiscovered "Content" group from annotations.	2026-05-10 20:37:03 -07:00
Erich Blume	b87f62e0f5	C1: nix-build homepage container for amd64 ringtail migration Replace Dockerfile (arm64-only, indri-built) with a nix derivation adapted from nixpkgs pkgs/by-name/ho/homepage-dashboard. Built via the nix-container-builder runner on ringtail, producing an amd64 image suitable for k3s. Includes the upstream Next.js file-system-cache patch to avoid prerender cache write failures on a read-only nix store path (nixpkgs issues #328621 and #458494). Pinned to v1.11.0 (current production version).	2026-05-10 20:32:38 -07:00
Erich Blume	8bc19fa460	C0: tailscale main-SHA rebuild for ringtail proxyclass Routine post-squash-merge cleanup. Bumps the ProxyClass image tag from the now-orphaned PR branch SHA (`67af7a8`) to the merge commit SHA (`0108b68`) so the deployed image stays traceable after branch cleanup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 06:52:39 -07:00
Erich Blume	0108b68769	C1: mirror tailscale container locally for ringtail proxyclass (#347 ) ## Summary Adds the first cut of a local nix build for `docker.io/tailscale/tailscale` and rewires only the ringtail tailscale-operator overlay to use it. Indri's overlay continues pulling upstream — minikube on indri is being decommissioned in favor of ringtail's k3s, so investing in dual-cluster routing here would be wasted churn. ## Changes - `containers/tailscale/default.nix` — `buildGoModule` over `cmd/tailscale`, `cmd/tailscaled`, `cmd/containerboot`; packaged via `dockerTools.buildLayeredImage` with `cacert`, `iptables` (legacy symlink to match upstream Synology compat), `iproute2`, `tzdata`, `busybox`. - `argocd/manifests/tailscale-operator-ringtail/kustomization.yaml` — kustomize `images:` rewrite swapping `docker.io/tailscale/tailscale` → `registry.ops.eblu.me/blumeops/tailscale:v1.94.2-67af7a8-nix`. - `docs/changelog.d/mirror-tailscale-container.infra.md` — fragment. ## Pin rationale v1.94.2 matches `service-versions.yaml:96` and the current ProxyClass exactly — this PR is "make it local," not "upgrade tailscale." Version bumps come as follow-up C0/C1 changes once we decide to test newer (v1.96.x had a Fly-side MagicDNS regression; v1.98.0 is current upstream stable). ## Test plan - [x] Image built successfully on ringtail nix-container-builder (run #528). - [x] Image visible in registry: `registry.ops.eblu.me/blumeops/tailscale:v1.94.2-67af7a8-nix`. - [ ] Deploy from branch: `argocd app set tailscale-operator-ringtail --revision mirror-tailscale-container && argocd app sync tailscale-operator-ringtail`. - [ ] Verify proxy pods restart with new image and existing tailnet ingresses (e.g., authentik, immich, tempo) keep resolving. - [ ] After merge: rebuild on main SHA, update kustomization, run `services-check`. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #347	2026-05-06 06:50:31 -07:00
Erich Blume	6f0d80ca1e	C0: doc review — index.md, add ringtail to infra overview Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 06:14:40 -07:00
Erich Blume	39b042e638	C0: service review — caddy v2.11.2 (current latest, healthy) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 06:11:15 -07:00
Erich Blume	24e5490259	C0: review CC init-container-isolation — defer retirement to post-ringtail Runtime grafana pod matches the manifest and the CC's claim; bumped last-reviewed. Noted that retiring init-chown-data in favor of fsGroup alone should wait until grafana migrates to ringtail's k3s, since the storage backend will change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:31:13 -07:00
Erich Blume	074887cd57	C0: docs — explanation article on compliance mute categories Captures the CC vs NA vs RA distinction surfaced during the 2026-05-03 weekly compliance review (CVE-2026-31789), and the image-scan mutelist gap that blocks acting on it. Links the new article from the review-compensating-controls how-to so it isn't orphaned. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 18:19:53 -07:00
Erich Blume	9fb5442ccd	C0: kiwix — doc review, fix Adding Archives source path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:46:16 -07:00
Erich Blume	f16e1c81f1	C0: zot — upgrade indri registry to v2.1.16 Security fixes only (TLS verification on metrics client, CORS Allow-Credentials suppression on wildcard origin, manifest/API-key body-size limits, dependabot bumps). No config changes required; re-built from source on indri and bounced launchagent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:41:07 -07:00
Erich Blume	a2c61b625d	C0: rotate-fly-deploy-token — fish+bash one-shot, op validator gotcha Combine mint+store into a single command with both fish and bash forms (the doc previously only showed manual paste). Document the 1Password CLI "Password item requires ps value" validator error and the placeholder-password workaround for Password-category items with empty primary password fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 13:42:57 -07:00
Erich Blume	2c0917b266	C0: valkey — bump kustomization tags to main-branch SHA Routine post-merge follow-up after #346. Branch SHA tag (`946fa75`) replaced with the main-SHA-built tag (`fabca04`) so paperless and immich reference an image traceable to a commit on main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 17:47:16 -07:00
Erich Blume	fabca04771	Mirror valkey 8.1 locally for paperless and immich (#346 ) ## Summary - Add native Dagger build of valkey 8.1.6-r0 on Alpine 3.22 at `containers/valkey/` - Swap paperless redis sidecar and immich-valkey from `docker.io/valkey/valkey:8.1-alpine` to `registry.ops.eblu.me/blumeops/valkey:v8.1.6-r0-946fa75` - Resolves the DR-2026-04 TODO in paperless kustomization about multi-arch redis ## Why Move toward fully locally-built containers for supply chain control. Paperless and immich both pulled the same upstream tag — one mirror serves both. Authentik's nix-built Redis stays separate (different image entirely). ## Risk Low. Both sidecars are stateless caches: - paperless redis: no volumeMount (in-pod localhost, pure memory) - immich-valkey: `emptyDir` (cache only) Pod restart rebuilds the cache. Smoke-tested locally (PING/SET/GET roundtrip on `valkey 8.1.6` with `--bind 0.0.0.0 --protected-mode no`). ## Test plan - [ ] After merge: `mise run container-build-and-release valkey` to rebuild with main SHA - [ ] Update kustomizations to the `[main]` SHA tag (C0 follow-up) - [ ] `argocd app sync paperless` and `argocd app sync immich` - [ ] Verify pods come up healthy (paperless OCR queue functional, immich job queue functional) - [ ] `mise run services-check` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #346	2026-05-01 17:40:03 -07:00
Erich Blume	f84f5f02b3	C0: review compensating control trusted-ci-only Verified Forgejo runner is registered only to forge.ops.eblu.me and the forge has registration disabled, so no untrusted users can trigger privileged CI. Tightened notes to reflect the closed-forge mechanism (not a per-repo allow-list). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:49:22 -07:00
Erich Blume	4aa0872949	C0: review ollama doc — refresh image, models, last-reviewed Bumped documented image tag to 0.20.4 (matches kustomization newTag), added the two qwen3.5 models from models.txt, and stamped the card. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:42:33 -07:00
Erich Blume	2d55303213	C0: alloy native macOS on indri — upgrade to v1.16.0 Completes the v1.16.0 fleet upgrade for the fourth alloy service (type: ansible, built from source on indri). Binary built on gilbert with Go 1.26.2 + CGO, scp'd to indri, codesigned, LaunchAgent reloaded. Service reports clean WAL replay and resumed metric/log shipping. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:36:38 -07:00
Erich Blume	55563afc7e	C0: alloy — bump kustomization tags to main-branch SHA Per the build-container-image squash-merge convention, rebuild alloy v1.16.0 container images from the main SHA (`9564435`) and update the three alloy kustomizations to reference :v1.16.0-9564435[-nix] instead of the branch SHA :v1.16.0-26a3ab5[-nix] left over from #345. Both images were rebuilt locally on gilbert (dagger) and ringtail (nix) because indri is still under heavy macOS memory-compressor pressure (see separate ticket); CI on indri can't reliably run the dagger publish step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:31:27 -07:00
Erich Blume	9564435b11	Alloy V1.16.0 (#345 ) Bump Grafana Alloy v1.14.0 → v1.16.0 across all four services (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail; alloy native ansible). Also migrate the indri build path from `Dockerfile` to a native Dagger `container.py` per the build-container-image migration playbook. ## Highlights from upstream - v1.15: database observability promoted to stable, OTel Collector → v0.147.0 - v1.16: clustering for `loki.source.kubernetes_events`, MySQL exporter 0.19.0 - One pre-existing breaking change in v1.15 (`loki.source.awsfirehose` undocumented metric prefix rename) — not used here. ## Build infra Alloy v1.16.0's go.mod requires Go 1.26.2. The nix derivation now uses `pkgs.go_1_26` with `GOTOOLCHAIN=local` to avoid auto-downloading a toolchain blob that violated the fixed-output rule. ## Test plan - [ ] CI: `mise run container-build-and-release alloy --ref alloy-v1.16.0` (dispatched as run 522; nix job to be re-triggered with the v1.16.0 goModules outputHash once the local ringtail build surfaces it) - [ ] After CI green, bump `images[].newTag` in three kustomizations to the new `-<sha>` and `-<sha>-nix` tags, deploy from this branch via `argocd app set <app> --revision alloy-v1.16.0 && argocd app sync <app>` - [ ] Manual rebuild of macOS native binary on gilbert (per ansible/roles/alloy README) and `mise run provision-indri -- --tags alloy --check --diff` - [ ] `mise run services-check` after merge & redeploy Reviewed-on: #345	2026-05-01 08:05:37 -07:00
Erich Blume	7fed166c18	Update ringtail flake inputs Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-30 16:55:08 -07:00
Erich Blume	f6e392b80c	C1: SHA-pin tooling dependencies (2026-04 cycle) (#344 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m45s Details ## Summary Monthly tooling dependency refresh, with a one-time conversion from version-tag pins (`rev = "vX.Y.Z"`, `image:tag`, `>=`) to SHA / digest pins everywhere. ## Changes - prek hooks: all `rev = "vX.Y.Z"` → commit SHA + `# vX.Y.Z` comment. Bumped trufflehog (3.94.0→3.95.2), kingfisher (1.91.0→1.97.0), ruff (0.15.7→0.15.12), shfmt (3.13.0→3.13.1), prettier (3.8.1→3.8.3), actionlint (1.7.11→1.7.12). - fly/Dockerfile: tag pins → `image@sha256:...` digest pins. Bumped nginx (1.29.6→1.30.0-alpine), tailscale (v1.94.1→v1.94.2 — still inside the safe pre-1.96.5 range), alloy (v1.14.1→v1.16.0). - mise-tasks: PEP 723 inline deps converted from `>=` to `==` (PEP 508 doesn't support hashes inline). All scripts pinned to current latest: rich 15.0.0, typer 0.25.0, pyyaml 6.0.3, httpx 0.28.1. - prek `additional_dependencies`: ansible-lint==26.4.0, ansible-core==2.20.5. - taplo-lint: pass `--no-schema`. Upstream's `--default-schema-catalogs` returns a format taplo v0.9.3 can't parse — we don't validate against TOML schemas anyway, so this turns off the broken catalog fetch. - docs/update-tooling-dependencies: documents the SHA-pin convention, `docker buildx imagetools inspect` for digest lookup, and `prek clean` before re-verifying (cache grows to several GiB). Forgejo workflow `actions/checkout@v6.0.2` was already at the latest SHA — no change. ## Test plan - [x] `prek run --all-files` passes after `prek clean` - [x] `deploy-fly` workflow builds and deploys the new fly image on merge - [x] `fly status -a blumeops-proxy` healthy after deploy - [x] Spot-check a few mise tasks (`mise run blumeops-tasks`, `mise run docs-check-links`) to confirm pinned deps resolve cleanly Reviewed-on: #344	2026-04-30 16:51:43 -07:00
Erich Blume	5096223b48	C1: clean up cv + docs minikube artifacts (#343 ) ## Summary Follow-up to #342. The cv and docs services are now live on indri (Caddy file_server backed by ansible-managed tarball extraction) and verified working. This PR removes the dead minikube artifacts and the tooling shims that referenced them. ## Changes Deletions: - ``argocd/apps/{cv,docs}.yaml`` - ``argocd/manifests/{cv,docs}/`` (deployment, service, ingress, pdb, kustomization) - ``containers/{cv,quartz}/`` (Dockerfiles + start scripts) Tooling: - ``mise-tasks/container-version-check``: remove the ``quartz``→``docs`` CONTAINER_TO_SERVICE mapping (containers/quartz no longer exists) - ``service-versions.yaml``: bump ``docs.current-version`` to ``v1.16.0`` (the blumeops docs release tag) and trim the migration-window comment ## Live state context The argocd Applications ``cv`` and ``docs`` were already deleted from the cluster manually as part of the cutover; this PR just removes the YAML files that the ``apps`` app-of-apps was still ingesting. After merge, ``argocd app sync apps`` will reconcile and the ``apps`` Application returns to Synced. The Caddyfile ``handle_errors`` bug that briefly crashed all ``*.ops.eblu.me`` services during cutover is fixed in a separate C0 (``2ee53fe``) on main, not here. ## Test plan - [x] ``mise run container-version-check --all-files`` clean - [x] ``mise run service-review --type ansible`` shows cv at 1.0.3, docs at v1.16.0 - [ ] After merge: ``argocd app sync apps`` returns clean (cv/docs entries gone, no children to reconcile) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #343	2026-04-29 15:18:39 -07:00
Erich Blume	2ee53fe375	C0: fix Caddyfile try_html — handle_errors can't nest inside handle{} The kind=static branch added in #342 put handle_errors inside the @host handle{} block. handle_errors is a top-level site-block directive, not an ordered HTTP handler, so Caddy refuses to load the config: parsing caddyfile tokens for 'handle': directive 'handle_errors' is not an ordered HTTP handler This crash-loops the whole reverse proxy and takes down every *.ops.eblu.me service. Tripped today during the live cv/docs cutover. Fix: drop handle_errors and append /404.html as the final try_files candidate. The 404 page is served with status 200 instead of 404, but that's acceptable for a human-facing curated 404 — the page renders correctly. Documented inline. The running Caddy on indri already has the fixed config (deployed manually during the cutover); this lands the fix in main so future provision-indri --tags caddy runs don't re-break it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:16:44 -07:00
Erich Blume	8d634861f6	C1: migrate cv + docs from minikube to indri-native (#342 ) ## Summary Replace the cv (`cv.eblu.me`) and docs (`docs.eblu.me`) minikube Deployments with indri-native ansible roles. Caddy serves the extracted release tarballs directly via a new `kind: static` service-block — no daemon, no nginx pod, no ProxyGroup ingress on the request path. Mirrors the rationale of the recent devpi migration; part of the broader minikube wind-down. ## What's in this commit - `ansible/roles/{cv,docs}` — sentinel-gated tarball download + extract into `~/{cv,docs}/content/` - `ansible/roles/caddy/` — new `kind: static` branch in the Caddyfile template (encoded gzip, immutable cache headers for fingerprinted assets, optional `try_html` for Quartz-style clean URLs, optional per-path `download_paths` for the resume PDF's `Content-Disposition`) - `ansible/playbooks/indri.yml` — wires `cv` and `docs` roles before `caddy` - `service-versions.yaml` — both services flip to `type: ansible`. `docs.current-version` stays at `1.28.2` for this commit so `container-version-check` keeps passing while `containers/quartz/Dockerfile` still exists; it moves to the docs release tag in the cleanup commit - `.forgejo/workflows/{cv-deploy,build-blumeops}.yaml` — deploy step now bumps `cv_version`/`docs_version` in the role defaults and pushes; running ansible + purging the Fly cache is manual from gilbert (matches devpi) - Docs: `docs/how-to/operations/{cv,docs}-on-indri.md`, updated `docs/reference/services/{cv,docs}.md`, changelog fragment ## What is not in this commit The dead artifacts. After PR review and successful cutover, a follow-up commit deletes: - `argocd/apps/{cv,docs}.yaml` and `argocd/manifests/{cv,docs}/` - `containers/cv/`, `containers/quartz/` - `CONTAINER_TO_SERVICE['quartz']` mapping in `mise-tasks/container-version-check` - bumps `docs.current-version` in `service-versions.yaml` to the release tag ## Cutover plan (manual, from gilbert, after review) 1. Take down old: - Remove the cv and docs Applications: `argocd app delete cv --cascade && argocd app delete docs --cascade` - Verify k8s namespaces gone: `kubectl --context=minikube-indri get ns \| grep -E '^(cv\|docs)\\b'` (should be empty) - Verify tailnet MagicDNS no longer advertises the VIPs: `nslookup cv.tail8d86e.ts.net` and `nslookup docs.tail8d86e.ts.net` should both fail 2. Bring up new: - `mise run provision-indri -- --tags cv,docs,caddy --check --diff` (already validated on branch) - `mise run provision-indri -- --tags cv,docs,caddy` - `fly ssh console -a blumeops-proxy -C "sh -c 'rm -rf /tmp/cache && nginx -s reload'"` 3. Verify: `mise run services-check` and the curl checks listed in `docs/how-to/operations/{cv,docs}-on-indri.md` 4. Cleanup commit + merge. Total expected downtime: minutes (not the few-hour budget you authorized). ## Test plan - [ ] `mise run provision-indri -- --tags cv,docs --check --diff` clean - [ ] `mise run provision-indri -- --tags caddy --check --diff` shows only the cv + docs blocks changing as previewed in the PR thread - [ ] After cutover: `cv.eblu.me`, `cv.ops.eblu.me`, `docs.eblu.me`, `docs.ops.eblu.me` all return 200 - [ ] `cv.eblu.me/resume.pdf` includes `Content-Disposition: attachment` - [ ] A clean Quartz URL (e.g. `docs.eblu.me/explanation/agent-change-process`) resolves to the right page - [ ] `mise run services-check` clean - [ ] `mise run service-review --type ansible` shows cv and docs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #342	2026-04-29 14:55:11 -07:00
Erich Blume	a529d60f60	C0: remove containers/devpi/ build artifact Devpi now runs natively on indri (uv venv via ansible role), so the Dagger container build at containers/devpi/ is unused. Removing it. Also updated dagger.md examples to use 'miniflux' as the example container-name argument. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 13:40:45 -07:00
Erich Blume	14ca0160ba	Migrate devpi from minikube to indri (launchd) (#341 ) ## Summary Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent. ## What changed - New ansible role `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box). - Caddy: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`. - Playbook: `indri.yml` gains pre_tasks for the root password and the new role. - service-versions.yaml: devpi flipped from `type: argocd` to `type: ansible`. - ArgoCD: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted. - Docs: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list. ## Already deployed (live on indri) - Service running: `launchctl list mcquack.eblume.devpi` → PID 53888 - `curl https://pypi.ops.eblu.me/+api` returns 200 ✅ - `mise run docs-mikado` works again ✅ - 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/` - Minikube namespace and PVC fully reclaimed ## Test plan - [ ] `mise run services-check` (after merge) - [ ] CI workflows that use devpi succeed - [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines) ## Context This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain. Reviewed-on: #341	2026-04-29 13:38:36 -07:00
Erich Blume	f4a24595b1	C0: review CC ephemeral-privileged-jobs Verified TTL=604800s and hostPID limited to ephemeral Prowler CronJob on indri. Noted that alloy-tracing on ringtail also uses hostPID but is out of scope until Prowler scans ringtail (tracked in Todoist). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 11:09:34 -07:00
Erich Blume	817acc5e5e	C0: transmission doc — review and correct storage/monitoring details Marked last-reviewed: 2026-04-29. Fixed the storage layout table — `/config/` is an emptyDir (ephemeral), not NFS, and the watch directory is disabled. Documented the transmission-exporter sidecar that exposes Prometheus metrics on port 19091. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 11:00:01 -07:00
Erich Blume	4d76fd5de5	C0: prowler — rebuild image against main HEAD Squash-merge of #340 changed the SHA. Bump prowler tag from v5.23.0-2daf629 (PR branch) to v5.23.0-495e45d (main HEAD) so the Dockerfile changes are present in the image deployed off main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:49:27 -07:00
Erich Blume	495e45d01d	Address 6 critical Prowler IaC findings (mute + grafana RBAC tighten) (#340 ) ## Summary The weekly Prowler IaC scan reported 6 critical findings against `argocd/manifests/`. They split cleanly into two patterns: - Legitimate-by-design RBAC → mute with new compensating controls - `external-secrets-controller`, `external-secrets-cert-controller` manage `secrets` (KSV-0041) and the cert-controller mutates its own webhook configurations (KSV-0114). This is what the operator is for. New CC: `operator-purpose-bound-rbac`. - `kube-state-metrics` (both `minikube-indri` and `k3s-ringtail`) holds `list/watch` on secrets to expose `kube_secret_info` and `kube_secret_labels` metrics. KSM's metric schema only reads metadata, never the `data:` field. New CC: `kube-state-metrics-metadata-only`. - Over-broad RBAC → fix - `grafana-clusterrole` had `get/watch/list` on `secrets` because the dashboard-sidecar config used `RESOURCE=both` (ConfigMaps + Secrets). Nothing in the cluster labels Secrets with `grafana_dashboard=1`, so this was unused power. Switched both sidecar instances to `RESOURCE=configmap` and removed `secrets` from the ClusterRole. The IaC cronjob also did not previously pass `--mutelist-file`, which is why every IaC finding reported as unmuted regardless of mutelist configuration. The new `mutelist/iac.yaml` is bundled into the existing `prowler-mutelist` ConfigMap and mounted via `items:` selector. ## Test plan - [ ] `kubectl --context=minikube-indri kustomize argocd/manifests/prowler/` — already passes locally - [ ] `kubectl --context=minikube-indri kustomize argocd/manifests/grafana/` — already passes locally - [ ] Deploy from this branch via `argocd app set prowler --revision prowler-iac-mutelist && argocd app sync prowler` and same for `grafana` - [ ] Manually trigger the IaC cronjob and verify `MUTED=True` on the 6 critical findings (`kubectl --context=minikube-indri -n prowler create job --from=cronjob/prowler-iac-scan prowler-iac-test`) - [ ] Restart grafana pod and confirm dashboards still render (sidecar still finds them via ConfigMap watch) - [ ] After verify, `argocd app set <app> --revision main && argocd app sync <app>` post-merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #340	2026-04-29 10:43:32 -07:00
Erich Blume	718e0a0043	C0: review-compliance-reports — summarize image and IaC scans Previously only the K8s CIS in-cluster scan was processed; the weekly container-image and IaC Prowler scans were running on schedule but never reviewed. Now each scan gets its own status / severity / week-over-week delta, with top-N grouped tables (by check ID and resource) for the high-volume image and IaC outputs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 12:18:06 -07:00
Erich Blume	cfb6d7a7aa	C0: service-review — mark cv reviewed 2026-04-27 No version bump; build deps (jinja2, pyyaml) still loose-pinned and fine. Known issue: deployed v1.0.3 package predates phone-hide commit; tracked separately in Todoist by user. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 11:57:33 -07:00
Erich Blume	c9eb188e05	C0: blumeops-tasks — replace ambiguous due:+N with "Nd overdue" The signed offset format read as "due in 5 days" rather than "5 days overdue", causing misreads. Switch to self-explanatory text: "5d overdue" / "due in 2d" / "due today". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 11:49:46 -07:00
Erich Blume	4a37ffcdc2	C0: CLAUDE.md — import AGENTS.md instead of redirecting to it Claude Code only auto-loads CLAUDE.md. The prose shim told agents to go read AGENTS.md, which is easy to skip. Replacing the shim with `@AGENTS.md` inlines AGENTS.md content into the session prompt, so the startup rules (ai-docs, blumeops-tasks, change classification) land in context unconditionally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 11:41:13 -07:00
Erich Blume	f9d9e00057	C0: blumeops-tasks — show due offset + recurrence, sort by overdue-ness Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 11:18:16 -07:00
Erich Blume	005e2a03ed	C0: split gandi-operations docs; add dns-acme-cleanup mise task Splits the nebulous gandi-operations how-to into two single-topic cards (manage-eblu-me-dns, rotate-gandi-pat) and adds a mise task for the recurring _acme-challenge TXT cleanup needed due to a value-comparison bug in libdns/gandi v1.1.0 that prevents certmagic's cleanup phase from removing presented TXT values. The gandi reference card is updated to drop the false "different credential from Pulumi PAT" claim — verified during the 2026-04-27 incident that Caddy and Pulumi share a single PAT. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-27 09:48:46 -07:00
Erich Blume	72b27b7fd2	C0: docs — add mealie borg restore how-to Captures the procedure used to restore mealie's SQLite DB from a borgmatic archive after the post-DR wipe: extract from borg, snapshot the wiped DB, swap via a helper pod on the ReadWriteOnce PVC, fix UID 911 ownership. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 19:04:28 -07:00
Erich Blume	34fa2ef28a	C0: ringtail — restore sway default keybindings, fix fuzzel border config Extend (not replace) home-manager's default sway keybindings via lib.mkOptionDefault, with lib.mkForce on the custom overrides that conflict with defaults. Add Mod+F1 cheatsheet binding (fuzzel-filterable). Move fuzzel's border-radius/border-width out of [main] into a proper [border] section with the expected short names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 12:16:02 -07:00

1 2 3 4 5 ...

984 commits