blumeops

Author	SHA1	Message	Date
Erich Blume	9ed6272dc7	heph docs: spoke sync uses direct http://...:8787, not Caddy HTTPS hephd's sync client is plain-HTTP-only — a Caddy https hub-url fails with a confusing 'error sending request' (HTTP connector rejects the https scheme). Spokes sync over the direct tailnet URL; heph.ops.eblu.me is for the PWA only. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 06:43:54 -07:00
Erich Blume	dc9a951eb2	heph: pin RUSTUP_TOOLCHAIN=stable for build + self-update The launchagent and ansible run without mise activation, so a bare cargo/rustc shim falls back to rustup's default toolchain — which lagged heph's rust-version floor (1.89) on both indri (1.87) and gilbert (1.84), silently failing the build. Pin the channel explicitly in the bootstrap env and the plist. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 06:18:19 -07:00
Erich Blume	d99c962fe1	Add hephaestus sync hub to indri (launchagent, PWA, device-code OIDC) Deploy hephd --mode server on indri as a self-updating LaunchAgent managed by Ansible (ansible/roles/heph, tag heph), making indri the canonical heph hub for the hub-and-spoke task/context system. - Server mode on 0.0.0.0:8787, self-update every 10 minutes (cargo install from the public forge URL; ~/.cargo/bin on the agent PATH). - heph-pwa shell served via --web-root straight from a version-pinned checkout, TLS-terminated at heph.ops.eblu.me through Caddy (new caddy_services entry). - New Authentik device-code (RFC 8628) OIDC app 'heph' (public client) plus a default-device-code-flow bound to the default brand's flow_device_code. - Docs: new services/hephaestus.md service card (incl. Path A seeding runbook and the gilbert spoke caveat), indri.md service list, changelog fragment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-04 22:14:24 -07:00
Erich Blume	bb55fa9566	Recurring review sweep: 4 doc cards + nvidia-device-plugin v0.19.2 (#366) Knocks out the two daily recurring review tasks (doc review + service review) in one PR. ## Doc review (4 never-reviewed reference cards, `last-reviewed: 2026-06-04`) - cluster.md — Kubernetes version v1.34.0 → v1.35.0; refreshed the stale ringtail workload list and noted the in-progress minikube→k3s migration (points to `[[ringtail]]` as the canonical list). - ntfy.md / tempo.md / alloy.md — corrected image references: these are now locally-built `registry.ops.eblu.me/blumeops/` nix containers (ntfy v2.19.2, tempo v2.10.3, alloy-k8s v1.16.0), not upstream Docker Hub. Fly.io alloy binary bumped to v1.16.1. ## Service review - nvidia-device-plugin (ringtail GPU): v0.19.0 → v0.19.2. Upstream patch releases — CDI/Tegra fixes + dependency bumps, no breaking changes for our manifest-based CDI + RuntimeClass setup (the service-account change in the notes is helm-only). ## Not in this PR (need container rebuilds, deferred) The other stale services are locally-built nix images, so upgrading them is a forge-runner rebuild rather than a clean tag bump — left untouched (not date-bumped, so they resurface): prometheus (v3.10.0→v3.12.0), loki (3.6.7→3.7.2), kube-state-metrics, homepage. Happy to do these as a follow-up rebuild PR. ## Deploy / verify Not yet deployed — `nvidia-device-plugin` still points at `main`. After review: ``` argocd app set nvidia-device-plugin --revision reviews-jun4 && argocd app sync nvidia-device-plugin # after merge: argocd app set nvidia-device-plugin --revision main && argocd app sync nvidia-device-plugin ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #366	2026-06-04 13:37:02 -07:00
Erich Blume	2148714584	C0: retire Todoist blumeops-tasks; point task discovery at heph Replace the Todoist-backed blumeops-tasks mise task with `heph list --project Blumeops --json` (hephaestus, now at v1 prototype on gilbert). Update task-discovery, rotation-reminder, and zk references across docs; note the zk zettelkasten is migrating into heph docs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-03 21:32:10 -07:00
Erich Blume	57fd88b269	C0: fix op item edit syntax in zot key rotation The pbpaste | op item edit ... "field[password]=-" stdin syntax is rejected by op 2.34 as "invalid JSON" — recent op versions treat piped input as a full JSON template, not a single field value. Procedure now uses an inline assignment via a local fish variable.	2026-05-22 21:50:43 -07:00
Erich Blume	d02bf062af	C0: review 1password reference card Added vault split (blumeops vs Personal), noted onepassword-connect runs on both indri and ringtail, and lifted op CLI guidance from agent memory into the card. Bumped last-reviewed.	2026-05-22 21:29:11 -07:00
Erich Blume	292d354902	C1: deploy adelaide-baby-shower-app to ringtail k3s (#349) ## Summary Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app). ### What's included - ArgoCD app + manifests under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer. - Public surface (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett. - fail2ban filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default). - Caddy route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`). - Pulumi Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`. - Grafana APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter. - Container `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages. - Docs runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments. ### Defense layers on the public surface 1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list) 2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP) 3. django-axes (5 fails / 1h, keyed on username+ip_address) 4. edge `/admin/` block (returns 403 for anything that isn't login/logout) ## Prerequisites for the user to do (NOT in this PR) Halted on these per request — they touch shared/manual systems: - [x] NFS share on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000` - [ ] 1Password item `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git - [ ] Container build: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix` - [x] DNS: `mise run dns-up` after merge - [x] Fly cert: `fly certs add shower.eblu.me -a blumeops-proxy` - [ ] Caddy push: `mise run provision-indri -- --tags caddy` - [ ] Fly redeploy to pick up the new nginx block + fail2ban jail: `mise run fly-deploy` - [ ] ArgoCD sync: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging ## Test plan - [ ] Container builds successfully on nix-container-builder runner - [ ] Pod starts, migrations run, gunicorn answers on :8000 - [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean - [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response - [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead` - [ ] Grafana dashboard "Shower APM" appears and starts showing traffic - [ ] `mise run services-check` passes Reviewed-on: #349	2026-05-11 13:47:18 -07:00
Erich Blume	9fb5442ccd	C0: kiwix — doc review, fix Adding Archives source path Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-04 17:46:16 -07:00
Erich Blume	4aa0872949	C0: review ollama doc — refresh image, models, last-reviewed Bumped documented image tag to 0.20.4 (matches kustomization newTag), added the two qwen3.5 models from models.txt, and stamped the card. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 10:42:33 -07:00
Erich Blume	8d634861f6	C1: migrate cv + docs from minikube to indri-native (#342) ## Summary Replace the cv (`cv.eblu.me`) and docs (`docs.eblu.me`) minikube Deployments with indri-native ansible roles. Caddy serves the extracted release tarballs directly via a new `kind: static` service-block — no daemon, no nginx pod, no ProxyGroup ingress on the request path. Mirrors the rationale of the recent devpi migration; part of the broader minikube wind-down. ## What's in this commit - `ansible/roles/{cv,docs}` — sentinel-gated tarball download + extract into `~/{cv,docs}/content/` - `ansible/roles/caddy/` — new `kind: static` branch in the Caddyfile template (encoded gzip, immutable cache headers for fingerprinted assets, optional `try_html` for Quartz-style clean URLs, optional per-path `download_paths` for the resume PDF's `Content-Disposition`) - `ansible/playbooks/indri.yml` — wires `cv` and `docs` roles before `caddy` - `service-versions.yaml` — both services flip to `type: ansible`. `docs.current-version` stays at `1.28.2` for this commit so `container-version-check` keeps passing while `containers/quartz/Dockerfile` still exists; it moves to the docs release tag in the cleanup commit - `.forgejo/workflows/{cv-deploy,build-blumeops}.yaml` — deploy step now bumps `cv_version`/`docs_version` in the role defaults and pushes; running ansible + purging the Fly cache is manual from gilbert (matches devpi) - Docs: `docs/how-to/operations/{cv,docs}-on-indri.md`, updated `docs/reference/services/{cv,docs}.md`, changelog fragment ## What is not in this commit The dead artifacts. After PR review and successful cutover, a follow-up commit deletes: - `argocd/apps/{cv,docs}.yaml` and `argocd/manifests/{cv,docs}/` - `containers/cv/`, `containers/quartz/` - `CONTAINER_TO_SERVICE['quartz']` mapping in `mise-tasks/container-version-check` - bumps `docs.current-version` in `service-versions.yaml` to the release tag ## Cutover plan (manual, from gilbert, after review) 1. Take down old: - Remove the cv and docs Applications: `argocd app delete cv --cascade && argocd app delete docs --cascade` - Verify k8s namespaces gone: `kubectl --context=minikube-indri get ns | grep -E '^(cv|docs)\b'` (should be empty) - Verify tailnet MagicDNS no longer advertises the VIPs: `nslookup cv.tail8d86e.ts.net` and `nslookup docs.tail8d86e.ts.net` should both fail 2. Bring up new: - `mise run provision-indri -- --tags cv,docs,caddy --check --diff` (already validated on branch) - `mise run provision-indri -- --tags cv,docs,caddy` - `fly ssh console -a blumeops-proxy -C "sh -c 'rm -rf /tmp/cache && nginx -s reload'"` 3. Verify: `mise run services-check` and the curl checks listed in `docs/how-to/operations/{cv,docs}-on-indri.md` 4. Cleanup commit + merge. Total expected downtime: minutes (not the few-hour budget you authorized). ## Test plan - [ ] `mise run provision-indri -- --tags cv,docs --check --diff` clean - [ ] `mise run provision-indri -- --tags caddy --check --diff` shows only the cv + docs blocks changing as previewed in the PR thread - [ ] After cutover: `cv.eblu.me`, `cv.ops.eblu.me`, `docs.eblu.me`, `docs.ops.eblu.me` all return 200 - [ ] `cv.eblu.me/resume.pdf` includes `Content-Disposition: attachment` - [ ] A clean Quartz URL (e.g. `docs.eblu.me/explanation/agent-change-process`) resolves to the right page - [ ] `mise run services-check` clean - [ ] `mise run service-review --type ansible` shows cv and docs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #342	2026-04-29 14:55:11 -07:00
Erich Blume	14ca0160ba	Migrate devpi from minikube to indri (launchd) (#341) ## Summary Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent. ## What changed - New ansible role `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box). - Caddy: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`. - Playbook: `indri.yml` gains pre_tasks for the root password and the new role. - service-versions.yaml: devpi flipped from `type: argocd` to `type: ansible`. - ArgoCD: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted. - Docs: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list. ## Already deployed (live on indri) - Service running: `launchctl list mcquack.eblume.devpi` → PID 53888 - `curl https://pypi.ops.eblu.me/+api` returns 200 ✅ - `mise run docs-mikado` works again ✅ - 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/` - Minikube namespace and PVC fully reclaimed ## Test plan - [ ] `mise run services-check` (after merge) - [ ] CI workflows that use devpi succeed - [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines) ## Context This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain. Reviewed-on: #341	2026-04-29 13:38:36 -07:00
Erich Blume	817acc5e5e	C0: transmission doc — review and correct storage/monitoring details Marked last-reviewed: 2026-04-29. Fixed the storage layout table — `/config/` is an emptyDir (ephemeral), not NFS, and the watch directory is disabled. Documented the transmission-exporter sidecar that exposes Prometheus metrics on port 19091. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 11:00:01 -07:00
Erich Blume	72b27b7fd2	C0: docs — add mealie borg restore how-to Captures the procedure used to restore mealie's SQLite DB from a borgmatic archive after the post-DR wipe: extract from borg, snapshot the wiped DB, swap via a helper pod on the ReadWriteOnce PVC, fix UID 911 ownership. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 19:04:28 -07:00
Erich Blume	1425bf1f5c	Upgrade forgejo-runner to v12.8, adopt server.connections, and clean up docs (#338) ## Summary - consolidate forgejo-runner how-to docs into current cards - upgrade the k8s forgejo-runner deployment to the latest v12.8.x runner image - switch the k8s runner from first-boot register flow to declarative server.connections config - keep the runner image on the native Dagger build path and update the surrounding manifests/secrets ## Notes - PR opened early for C1 review - implementation and deployment verification will follow in subsequent commits Reviewed-on: #338	2026-04-20 09:03:54 -07:00
Erich Blume	51a878cddb	C0: review navidrome reference doc	2026-04-18 20:25:19 -07:00
Erich Blume	d26a6ae3b2	Update docs for Caddy routing and direct WireGuard peering Comprehensive docs pass reflecting the new Fly proxy architecture: - Fly proxy routes through Caddy on indri (not per-service TS Ingress) - Direct WireGuard peering via --port=41641 pinning - DERP relay performance lesson in Tailscale docs - Caddy now in public traffic path - indri tagged as flyio-target - Removed fly-reload references - Updated architecture diagrams and per-service setup guide - Added changelog fragment Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 09:57:30 -07:00
Erich Blume	fe0e913963	Switch Fly proxy to upstream keepalive pools (#337) ## Summary - Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools - Reuses TLS connections through the Tailscale tunnel instead of handshaking per request - Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS) ## Trade-off DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this. ## Still TODO on this branch - [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder) - [ ] Docs pass - [ ] Deploy from branch and verify latency improvement - [ ] Changelog fragment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #337	2026-04-17 16:39:52 -07:00
Erich Blume	6e60287e99	Doc review: delete install-dagger-on-nix-runner, add service-versions ref card Outdated leaf card removed; zot.md now links to new service-versions reference card instead. Added reverse link from review-services. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:52:38 -07:00
Erich Blume	07f52e9488	Deploy Paperless-ngx document management (#328) ## Summary - Add paperless-ngx (v2.20.13) as a new ArgoCD-managed service on indri - Dockerfile built from forge mirror (`mirrors/paperless-ngx`), multi-stage with s6-overlay - PostgreSQL database via `blumeops-pg` CNPG cluster, Redis sidecar for Celery - NFS document storage on sifaka (`/volume1/paperless`) - Authentik OIDC SSO via baked JSON blob from 1Password - Caddy route at `paperless.ops.eblu.me` - 1Password item "Paperless (blumeops)" created with all secrets ## Files - `containers/paperless/Dockerfile` — multi-stage build - `argocd/manifests/paperless/` — full k8s manifest set - `argocd/apps/paperless.yaml` — ArgoCD application - `argocd/manifests/databases/` — CNPG role + ExternalSecret - `ansible/roles/caddy/defaults/main.yml` — Caddy route - `service-versions.yaml` — version tracking entry - `docs/reference/services/paperless.md` — reference card ## Remaining deploy steps 1. Build container: `mise run container-build-and-release paperless` 2. Update kustomization.yaml `newTag` with actual image tag 3. Create Authentik application/provider for paperless 4. Create `paperless` database on blumeops-pg 5. Sync ArgoCD apps, then sync paperless from branch 6. Provision Caddy: `mise run provision-indri -- --tags caddy` 7. Verify at https://paperless.ops.eblu.me 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #328	2026-04-08 17:54:12 -07:00
Erich Blume	efae404d1e	Remove superuser from teslamate PG role, transfer extension ownership teslamate had superuser on the shared blumeops-pg cluster (which also hosts miniflux and authentik). Downgraded to plain database owner with extension ownership (cube, earthdistance) transferred manually so it can still ALTER EXTENSION UPDATE. earthdistance is untrusted in PG so DROP+CREATE would need temporary superuser escalation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 15:36:39 -07:00
Erich Blume	fc34a7da5b	Review postgresql.md: add authentik user/db, immich-pg borgmatic secret Doc review found the authentik database, user, and external secret were missing, along with the immich-pg borgmatic secret. Added Cluster column to Users table for clarity. Set last-reviewed: 2026-04-07. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 15:21:48 -07:00
Erich Blume	f42fa2d558	Remove stale Helm chart mirror references from forgejo docs All Helm chart mirrors (grafana-helm-charts, connect-helm-charts, cloudnative-pg-charts) have been deleted from forge. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 07:37:21 -07:00
Erich Blume	64200a55c5	Migrate Immich from Helm chart to kustomize manifests (v2.5.6 → v2.6.3) Replace the Helm chart deployment with plain kustomize manifests following the Authentik pattern (separate deployments per component). Consolidate the immich-storage ArgoCD app into the main immich app. Add no-helm-policy doc establishing kustomize as the standard deployment mechanism. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 09:42:25 -07:00
Erich Blume	1e391f96bb	Upgrade forgejo-runner 12.7.0 → 12.7.3, add service card Patch upgrade picks up idempotent FetchTask API, offline registration fix, cloudflare/circl security dep update, and custom gRPC user-agent. No config defaults changed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 16:31:06 -07:00
Erich Blume	f9206bf10b	Build custom Kingfisher container from sporked deploy branch (#318) ## Summary - Add Dockerfile for Kingfisher built from source (sporked deploy branch) - Multi-stage: Rust build with Boost/vectorscan, debian-slim runtime - Switch CronJob from upstream `ghcr.io/mongodb/kingfisher` to `registry.ops.eblu.me/blumeops/kingfisher` - Add kingfisher to service-versions.yaml (version tracks upstream main SHA) - Document spork workflow in CLAUDE.md ## Test plan - [ ] Build container: `mise run container-build-and-release kingfisher 1d37d29` - [ ] Verify image on registry: `mise run container-list` - [ ] Update kustomization newTag - [ ] Sync ArgoCD kingfisher app from branch - [ ] Trigger manual CronJob and verify scan completes - [ ] Verify reports on sifaka Reviewed-on: #318	2026-03-30 06:34:49 -07:00
Erich Blume	bb60369956	Simplify Kingfisher CronJob to HTML-only output Remove the second scan pass for JSON — one format is enough for now. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 21:50:54 -07:00
Erich Blume	2808ffd450	Document Kingfisher secret scanner service Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 21:47:37 -07:00
Erich Blume	3017f759a7	Migrate Forgejo from Homebrew to source build (#316) ## Summary - Migrate Forgejo from Homebrew to source-built binary with mcquack LaunchAgent - Matches the established pattern used by zot, caddy, and alloy - Upgrades to v14.0.3 (7 security fixes: PKCE bypass, OAuth scope bypass, open redirect, and more) ## Changes - Ansible role: Replace brew install/services with binary stat check + LaunchAgent - Paths: `/opt/homebrew/var/forgejo` → `~/forgejo`, binary at `~/code/3rd/forgejo/forgejo` - Run user: `forgejo` → `erichblume` (LaunchAgent user; SSH git user stays `forgejo`) - Docs: Updated Forgejo reference card, restart-indri guide - Service review: Stamped frigate-notify, cloudnative-pg, blumeops-pg as current ## One-time migration steps (manual, on indri) 1. Clone from Codeberg, add forge mirror remote 2. Check out v14.0.3, build with `make build && make forgejo` 3. Stop brew, `cp -a` data to `~/forgejo`, fix ownership 4. Run `provision-indri --tags forgejo` 5. Verify, then `brew uninstall forgejo` ## Data safety - `cp -a` preserves everything (repos, SQLite DB, LFS, sessions, OAuth config) - Brew version stays installed as rollback until verification passes - No schema changes between 14.0.2 → 14.0.3 Reviewed-on: #316	2026-03-28 08:19:23 -07:00
Erich Blume	c78b86c72c	Add offsite backup for immich photo library to BorgBase (#315) ## Summary - Adds a second borgmatic config (`photos.yaml`) that backs up `/Volumes/photos` (sifaka SMB mount, ~128 GB) to a dedicated BorgBase repo (`immich-photos`), running daily at 4 AM - Separate launchd agent (`mcquack.eblume.borgmatic-photos`) so photo backups run independently from the main backup - Refactors `borgmatic_metrics` script to support multiple repos with a `repo` Prometheus label - Updates Grafana "Borg Backups" dashboard with a `repo` template variable so you can filter/compare repos - Docs updated: `backups.md`, `borgmatic.md` ## Prerequisites (manual) - [x] Create `immich-photos` repo on BorgBase with same SSH key - [ ] Upgrade BorgBase plan to Small ($24/yr) if currently on free tier (128 GB exceeds 10 GB limit) - [ ] After deploy: `borg init` the new repo (borgmatic does this automatically on first run) ## Test plan - [ ] Dry run: `mise run provision-indri -- --check --diff --tags borgmatic,borgmatic_metrics` - [ ] Deploy borgmatic role and verify both configs deployed - [ ] Run `borgmatic --config ~/.config/borgmatic/photos.yaml create --verbosity 1` manually for first backup (will take hours) - [ ] Verify metrics script collects from both repos: `~/.local/bin/borgmatic-metrics && cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom` - [ ] Sync grafana-config in ArgoCD and verify dashboard repo selector works 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #315	2026-03-27 19:43:05 -07:00
Erich Blume	831b82950a	Upgrade nvidia-device-plugin v0.18.2 → v0.19.0 and add reference card Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:19:24 -07:00
Erich Blume	687e972713	Review CV doc and close build-dep review gap Fix stale CV service doc (URL, forge domain, container tag) and add guidance for reviewing build-time dependencies in private forge repos during service reviews. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:12:38 -07:00
Erich Blume	b97e37543f	Deploy Tor Snowflake proxy on ringtail (#311) ## Summary - Add Snowflake proxy as a native systemd service on ringtail (NixOS) - Uses `pkgs.snowflake` from nixpkgs (v2.11.0) - Hardened systemd unit with DynamicUser, ProtectSystem=strict, 512MB memory limit - Prometheus metrics enabled on localhost:9999 ## What is Snowflake? A Tor pluggable transport that helps censored users reach the Tor network via WebRTC. This is NOT a Tor exit node — traffic exits through Tor exit nodes operated by others. The proxy operator cannot see traffic content (double-encrypted) and destination servers never see the proxy's IP. ## Changes - `nixos/ringtail/configuration.nix` — new systemd service definition - `docs/reference/services/snowflake-proxy.md` — service reference card - `docs/reference/infrastructure/ringtail.md` — updated systemd services section - `service-versions.yaml` — added entry (type: nixos) ## Deploy plan After review, deploy via `mise run provision-ringtail`. Service starts automatically. ## Test plan - [ ] `mise run provision-ringtail` succeeds - [ ] `ssh ringtail 'systemctl status snowflake-proxy'` shows active - [ ] `ssh ringtail 'journalctl -u snowflake-proxy --no-pager -n 20'` shows broker connections - [ ] `ssh ringtail 'curl -s localhost:9999/metrics'` returns Prometheus metrics Reviewed-on: #311	2026-03-24 20:51:40 -07:00
Erich Blume	fe201a495c	Add Prowler IaC scanning of blumeops repo (Saturday 2am) Clone repo in init container, scan Dockerfiles and K8s manifests with Prowler's IaC provider (Trivy). Reports written to sifaka:/volume1/reports/prowler-iac/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:49:38 -07:00
Erich Blume	696024306c	Add Prowler image vulnerability scanning for blumeops containers Add Trivy to the Prowler container for image and IaC scanning. New CronJob (Saturday 3am) scans all blumeops/* images in the registry for CVEs, embedded secrets, and Dockerfile misconfigs. Reports written to sifaka:/volume1/reports/prowler-images/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:43:08 -07:00
Erich Blume	d021b3534f	Deploy Prowler CIS scanner (#310) ## Summary - Deploy Prowler 5 as a weekly CronJob on minikube-indri for CIS Kubernetes Benchmark v1.11 scanning - Custom slim container build (strips PowerShell, Trivy, and non-K8s providers from upstream) - Reports (HTML, CSV, JSON-OCSF) written to NFS share on sifaka at `/volume1/reports/prowler/` - Read-only ClusterRole for pod, RBAC, and control plane inspection - Host path mounts + hostPID for kubelet file permission checks ## Follow-ups - Mirror prowler-cloud/prowler on forge for supply chain control - Build and push container image, update kustomization.yaml newTag - Consider adding k3s-ringtail scanning (core + RBAC checks only) ## Test plan - [ ] Build container: `mise run container-release prowler v5.22.0` - [ ] Update `argocd/manifests/prowler/kustomization.yaml` newTag to built image tag - [ ] Sync ArgoCD: `argocd app sync apps && argocd app set prowler --revision deploy-prowler && argocd app sync prowler` - [ ] Trigger manual job: `kubectl create job --from=cronjob/prowler prowler-manual -n prowler --context=minikube-indri` - [ ] Verify reports appear on sifaka NFS share - [ ] `mise run services-check` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #310	2026-03-24 16:08:09 -07:00
Erich Blume	fc45989a6c	Decommission JobSync service (#308) ## Summary - Remove all JobSync infrastructure: ArgoCD app, k8s manifests, container build (nix), Caddy reverse proxy entry, Homepage dashboard entry, service-versions tracking, and all documentation - Runtime teardown already completed: ArgoCD app cascade-deleted (removes deployment, PVC, service, ingress, external-secret), forge mirror deleted, 1Password item archived, local clone removed ## Motivation Replacing JobSync with a datasette-based job tracking pipeline driven by mise tasks and a Claude agent frontend. JobSync's Next.js server actions don't expose a useful API for automation. ## Remaining manual steps after merge - Provision Caddy to remove the stale proxy route: `mise run provision-indri -- --tags caddy` - Sync Homepage: `argocd app sync homepage` - Verify namespace cleanup on ringtail: `kubectl get ns jobsync --context=k3s-ringtail` (should be gone) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #308	2026-03-24 08:44:23 -07:00
Erich Blume	06e721841c	Review 12 reference docs: fix stale image refs, expand stubs, add cross-refs Replace hardcoded image tags in Quick Reference tables with pointers to kustomization manifests (tags drift with every container release). Fix Prometheus CNPG scrape target, remove misleading .ts.net URLs, expand external-secrets stub, add backup/disaster-recovery cross-references. Limit doc-reviewer agent to one doc per cycle. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 09:51:57 -07:00
Erich Blume	995478b91f	Review jellyfin and automounter services Both services current: jellyfin 10.11.6 (latest upstream), automounter 1.11.0 (Mac App Store). Add missing frigate share to automounter docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 13:06:23 -07:00
Erich Blume	6d7597670e	Add plan-a-meal how-to for Mealie cooking timelines Agent-facing guide for generating unified cooking timelines from Mealie meal plans. Covers querying the API, picking balanced meals (protein/carb/vegetable), and interleaving recipe steps into a relative timeline so everything finishes together. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 11:07:16 -07:00
Erich Blume	11330ebea0	Deploy Mealie recipe manager (#299 ) ## Summary - Deploy Mealie (self-hosted recipe manager) on minikube-indri via ArgoCD - Build container from source via forge mirror (`mirrors/mealie`) — multi-stage Dockerfile with Node.js frontend + Python/uv backend - Add Caddy proxy entry for `meals.ops.eblu.me` - Part of a larger meal planning pipeline: Mealie stores categorized recipes, a planner script selects balanced meals, and Ollama generates unified cooking timelines ## Status - [x] Mirror mealie repo on forge - [x] Dockerfile (from-source build) - [x] ArgoCD app + k8s manifests - [x] Caddy proxy entry - [x] Service docs, routing table, app registry - [ ] Local Dagger build test - [ ] Container build + push to registry - [ ] Update kustomization.yaml with real image tag - [ ] Deploy and verify - [ ] Provision Caddy ## Test plan - Build container locally via `dagger call build --src=. --container-name=mealie` - Trigger CI build via `mise run container-build-and-release mealie` - Deploy from branch: `argocd app set mealie --revision deploy-mealie && argocd app sync mealie` - Verify Mealie UI at `https://meals.ops.eblu.me` - Verify API docs at `https://meals.ops.eblu.me/docs` Reviewed-on: #299	2026-03-16 21:59:10 -07:00
Erich Blume	f46a04b902	Restructure docs: consolidate, recategorize, and extract - Consolidate 4 Authentik Nix derivation docs into one card (authentik-nix-build-components.md) - Merge build-grafana-container + build-grafana-sidecar into build-grafana-images.md - Move agent-change-process from how-to/ to explanation/ (it's a methodology doc, not a task guide) - Extract Caddy custom build section from reference card into how-to/deployment/build-caddy-with-plugins.md - Move expose-service-publicly from how-to/ to tutorials/ (it's a comprehensive walkthrough, not a quick task reference) - Update all wiki-link references across affected docs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 19:55:59 -07:00
Erich Blume	ac01c2d6e2	Fix stale docs and shell quoting in devpi start script - ArgoCD ref: correct Git Source URL to forge.ops.eblu.me:2222 - Authentik ref: add Zot as active OIDC client, blueprint, and secret - Federated login: remove Zot from Future Work (completed in PR #236) - devpi/start.sh: use bash array for command building (proper quoting) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 19:25:27 -07:00
Erich Blume	272ea1e767	Upgrade Caddy v2.10.2 → v2.11.2, fix forge mirrors (#294 ) ## Summary - Upgrade Caddy from v2.10.2 to v2.11.2 (7 CVE fixes across v2.11.1 and v2.11.2) - Create `mirrors/caddy-l4` forge mirror for Layer 4 plugin - Migrate all `~/code/3rd` clones on indri from `localhost:3001` to HTTPS `forge.ops.eblu.me/mirrors/` remotes - Remove stale clones (`apple-silicon-detector`, `whisper.cpp`) - Update caddy docs and service-versions tracking ## CVEs Fixed - CVE-2026-27585 through CVE-2026-27590 (path/host bypass, TLS fail-open, FastCGI issues) - Forward auth identity injection (privilege escalation) - `vars_regexp` placeholder secret exposure - Built on Go 1.26.1 (patches Go-level CVEs) ## What was done on indri (not in repo) - `xcaddy build` with Gandi DNS + Layer 4 plugins → `~/code/3rd/caddy/bin/caddy` now v2.11.2 - Remotes updated: caddy, forgejo-runner, zot → `https://forge.ops.eblu.me/mirrors/.git` - Deleted: `~/code/3rd/apple-silicon-detector`, `~/code/3rd/whisper.cpp` ## Deployment and Testing - [x] Ansible dry-run passed (`--tags caddy --check --diff`) - [ ] Restart caddy LaunchAgent to pick up the new binary - [ ] Verify all proxied services respond via `.ops.eblu.me` - [ ] Run `mise run services-check` Reviewed-on: #294	2026-03-15 10:33:48 -07:00
Erich Blume	53d620365a	Bump zot registry to v2.1.15 (#293 ) ## Summary - Upgrade zot OCI registry from v2.1.13 to v2.1.15 on indri - Addresses CVE-2025-30204 (golang-jwt memory) and open redirect via callback_ui - No config template changes needed (externalUrl is auto-allowlisted) - Requires Go 1.25.7 (bump from 1.25.6 via mise) ## Data Safety - Data directory ~/erichblume/zot is NOT touched during build or deploy - No schema migrations in v2.1.14 or v2.1.15 - Storage format remains OCI spec 1.1.0 ## Deployment Steps - [ ] SSH to indri: bump Go to 1.25.7 via `mise use go@1.25.7` - [ ] Fetch and checkout v2.1.15 in ~/code/3rd/zot - [ ] Build: `mise x -- make binary` - [ ] Restart LaunchAgent - [ ] Verify: `curl -s http://localhost:5050/v2/` returns 200 - [ ] Verify: `curl -s https://registry.ops.eblu.me/v2/_catalog` lists repos - [ ] Verify: `mise run services-check` Reviewed-on: #293	2026-03-14 10:00:40 -07:00
Erich Blume	ab8ea6f301	Bump Grafana Alloy to v1.14.0 (#292 ) ## Summary - Bump alloy-k8s, alloy-ringtail, and alloy-tracing-ringtail image tags from v1.13.1 to v1.14.0 - Mark indri alloy (ansible) as reviewed at v1.14.0 — source rebuild from forge mirror needed - Add missing alloy-ringtail entry to service-versions.yaml - Update alloy reference doc ## Breaking changes reviewed - `loki.secretfilter` options removed — not used in our configs - OTel Collector upgraded to v0.142.0 — Kafka receiver changes don't affect us - Exporter queue default changes — our tracing pipeline (Beyla → batch → otlphttp) uses simple config, low risk ## Deployment and Testing - [ ] Sync alloy-k8s: `argocd app set alloy-k8s --revision bump/alloy-v1.14.0 && argocd app sync alloy-k8s` - [ ] Sync alloy-ringtail: `argocd app set alloy-ringtail --revision bump/alloy-v1.14.0 --server ringtail-argocd && argocd app sync alloy-ringtail` - [ ] Sync alloy-tracing-ringtail similarly - [ ] Verify metrics flowing in Grafana - [ ] Verify traces flowing to Tempo (ringtail) - [ ] Rebuild indri alloy from source (`v1.14.0` tag on forge mirror), SCP to indri, restart - [ ] After merge: reset ArgoCD revisions to main, re-sync Reviewed-on: #292	2026-03-13 16:25:27 -07:00
Erich Blume	40f1568088	Remove unused Mosquitto MQTT broker from ringtail Mosquitto has been dormant since frigate-notify switched from MQTT to webapi polling (`529ba10`). Tear down live infra (ArgoCD app, namespace) and remove all manifests, service-versions entry, services-check, and doc references. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 18:37:31 -07:00
Erich Blume	8b9cc4effd	Add how-to card for running 1Password backup Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-11 18:17:45 -07:00
Erich Blume	4f0476a851	Fix spider trap: disable SPA mode, remove index files, relax wiki-links (#290 ) ## Summary Fixes the Facebook crawler spider trap that's been generating infinite recursive URLs like `/how-to/tutorials/tutorials/how-to/explanation/...` for several days. Root cause: Quartz SPA mode + nginx `try_files` fallback to `index.html` meant any fabricated URL returned the root HTML shell with HTTP 200. Crawlers followed relative links from those fake URLs, creating infinite recursion. Fix: - Disable Quartz SPA mode (`enableSPA: false`) — all pages are now fully static HTML - Replace nginx SPA fallback with `=404` + Quartz's static `404.html` - Remove `robots.txt` exclusions (no longer needed) Docs cleanup (Obsidian.nvim compat no longer needed): - Delete hand-curated category index files (`tutorials.md`, `reference.md`, `how-to.md`, `explanation.md`) — Quartz auto-generates folder pages - Delete `postgresql-storage.md` (redirect stub) and `migrate-forgejo-from-brew.md` (stale history) - Drop `docs-check-index` and `docs-check-filenames` prek hooks - Rewrite `docs-check-links` to allow path-based wiki-links (`[[path/to/file]]`) and only error on true ambiguity - Add `ai-docs` doc tree listing to replace index files for AI context - Add natural cross-links from reference cards to fix orphan docs ## Deployment and Testing - [ ] Merge and let the build pipeline run - [ ] Verify docs.eblu.me serves pages correctly with full page loads - [ ] Verify non-existent URLs return 404 - [ ] Monitor crawler traffic — should drop to near zero for fabricated URLs Reviewed-on: #290	2026-03-09 11:59:43 -07:00
Erich Blume	770a7b2d6a	Add JobSync reference card, observability docs, and RAPIDAPI_KEY plumbing (#289 ) ## Summary - Add JobSync service reference card (`docs/reference/services/jobsync.md`) with architecture, secrets, observability, and JSearch API docs - Add JobSync and Ollama to ringtail's workloads table (both were missing) - Add JobSync to the reference index - Wire `RAPIDAPI_KEY` through ExternalSecret and deployment env var for JSearch job search automation - Document Loki log queries for observability (no metrics endpoint exists) - Update deploy-jobsync how-to with new env var, observability section, and reference card link ## Deployment and Testing - [ ] Sign up for RapidAPI JSearch API (free tier: 500 req/month) - [ ] Add `rapidapi_key` field to "JobSync" 1Password item - [ ] Merge PR - [ ] `argocd app sync jobsync` to pick up new env var - [ ] Verify job search works at https://jobsync.ops.eblu.me/dashboard/automations Reviewed-on: #289	2026-03-08 15:06:52 -07:00

102 commits