blumeops

Author	SHA1	Message	Date
Erich Blume	3abe80523a	C0: bump indri heph hub to v1.2.1 (PWA Authentik login + /config) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>	2026-06-05 07:40:51 -07:00
Erich Blume	a2f1e06224	Add hephaestus sync hub to indri (launchagent, PWA, device-code OIDC) (#369 ) Makes indri the canonical heph hub for the hub-and-spoke task/context system, deployed as a self-updating LaunchAgent managed by Ansible. Other devices (gilbert) attach as offline-capable spokes. ## What's here - `ansible/roles/heph` (tag `heph`) — bootstrap `cargo install hephd` (only if absent; `--self-update` keeps it current after), version-pinned `heph-pwa` checkout served via `--web-root`, launchagent `mcquack.eblume.heph`: ``` hephd --mode server --http-addr 0.0.0.0:8787 --db … --web-root … --oidc-issuer …/o/heph/ --oidc-audience heph --self-update --self-update-interval-secs 600 ``` `~/.cargo/bin` is on the agent `PATH` so self-update's `cargo install` works. - Caddy — `heph.ops.eblu.me → localhost:8787` (TLS for the PWA secure context). - Authentik — new `heph` public device-code OIDC app + `default-device-code-flow` bound to the default brand's `flow_device_code` (verified live: brand `authentik-default`, field currently unset → additive). - Docs — `services/hephaestus.md` (Path-A seeding runbook + spoke caveat), `indri.md`, changelog fragment. ## Three features requested - Autoupdate — 10-min interval (`--self-update-interval-secs 600`). - PWA — `--web-root` (confirmed shipped in v1.2.0). - Spoke — gilbert reconfig documented (post-merge step). ## Deploy plan (not done yet — awaiting review) 1. Seed from gilbert (Path A): `heph daemon stop` → copy `heph.db` → `DELETE FROM meta WHERE key='origin'`. 2. Sync Authentik `apps`/blueprint; verify blueprint status via API (not just logs). 3. `provision-indri --tags heph,caddy` from this branch. 4. Point gilbert at the hub + `heph auth login`. ## Known follow-ups (heph-side, tracked in the Hephaestus project) - `heph daemon` can't bake hub/spoke config or pass `--self-update-interval-secs` → worked around by the ansible plist. 
- Path-A seeding lacks a clean `hephd --owner-id`/seed command → manual `meta.origin` reset for now. - Self-update moves hephd ahead of the ansible-pinned PWA shell over time (drift; tolerated by the SW cache, revisit on next release). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #369	2026-06-05 06:46:58 -07:00
Forgejo Actions	8f72f04d5c	Update docs release to v1.17.0 - Built changelog from towncrier fragments [skip ci]	2026-06-03 21:52:22 -07:00
Erich Blume	e0057b46e4	Wire ringtail blumeops-pg into backups + Grafana (#364) Prereq for the wave-1 decommission. The cutover moved paperless+teslamate (postgres) and mealie (SQLite) to ringtail, but borgmatic and the Grafana TeslaMate datasource still pointed at the minikube copies — the migrated live data was unbacked since cutover, and dropping the minikube DBs would break the TeslaMate dashboards. - Tailscale Service `blumeops-pg-ringtail` + Caddy L4 route `pg.ops.eblu.me:5434` - borgmatic: teslamate + paperless postgres → :5434; mealie SQLite → ssh:eblume@ringtail - Grafana TeslaMate datasource → pg.ops.eblu.me:5434 Deploy: sync databases-ringtail (tailscale svc) + grafana from branch; provision-indri --tags caddy,borgmatic; verify a backup run + dashboards. Unblocks the decommission PR. Reviewed-on: #364	2026-06-03 12:25:30 -07:00
Erich Blume	4e25180b0a	C0: clone blumeops via tailnet on ringtail provision Switch ringtail.yml from forge.eblu.me (Fly proxy, WAN) to forge.ops.eblu.me (Caddy on indri, tailnet). Ringtail is always on the tailnet — the WAN round-trip was overhead and made provision-ringtail fail any time Fly was slow or down.	2026-05-28 07:13:40 -07:00
Erich Blume	dc69b8c68b	C1: fix borgmatic shower SQLite dump (ssh to ringtail) (#357 ) ## Summary Nightly borgmatic backups have been failing for 2 days. Root cause: the shower SQLite dump `before_backup` hook (added in PR #349) referenced `kubectl --context=k3s-ringtail`, but indri's kubeconfig deliberately doesn't carry the ringtail credentials. The hook's failure aborted the entire run, taking out both the local sifaka repo and the BorgBase offsite. Verified the last good archive was `indri-2026-05-11T02:00`. ## Approach ssh into ringtail and run `k3s kubectl` there — no indri-side kubeconfig needed. `/etc/rancher/k3s/k3s.yaml` is mode 644 so no sudo required, and the existing ssh access from indri to ringtail works. Inline-shell quoting got hairy fast (fish on ringtail rejected `POD=...` bash syntax; the nix shower image lacks `tar` so `kubectl cp` fails). Pulled the dump logic into `~/bin/borgmatic-k8s-sqlite-dump`, deployed by the ansible role. Each dump entry now declares a `target`: - `local:<context>` — local kubectl with explicit context (mealie) - `ssh:<user@host>` — ssh + `k3s kubectl` on the cluster host (shower) Bytes come back via `kubectl exec ... -- cat` instead of `kubectl cp` since `cp` needs `tar` in the pod (nix-built containers don't bundle it). ## Test plan - [x] `mise run provision-indri -- --tags borgmatic --check --diff` shows expected diff - [x] Apply, helper script deployed at `~/bin/borgmatic-k8s-sqlite-dump` - [x] Helper invoked directly with `ssh:eblume@ringtail` produces a valid 288 KB SQLite file - [x] Full `borgmatic create` completes without errors — both mealie.db (1.7 MB) and shower.db (288 KB) appear in `~/.local/share/borgmatic/k8s-dumps/`, archive `indri-2026-05-13T17:31:02` written to sifaka borg repo 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #357	2026-05-13 18:55:50 -07:00
Erich Blume	292d354902	C1: deploy adelaide-baby-shower-app to ringtail k3s (#349) ## Summary Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app). ### What's included - ArgoCD app + manifests under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer. - Public surface (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett. - fail2ban filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default). - Caddy route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`). - Pulumi Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`. - Grafana APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter. 
- Container `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages. - Docs runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments. ### Defense layers on the public surface 1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list) 2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP) 3. django-axes (5 fails / 1h, keyed on username+ip_address) 4. edge `/admin/` block (returns 403 for anything that isn't login/logout) ## Prerequisites for the user to do (NOT in this PR) Halted on these per request — they touch shared/manual systems: - [x] NFS share on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000` - [ ] 1Password item `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git - [ ] Container build: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix` - [x] DNS: `mise run dns-up` after merge - [x] Fly cert: `fly certs add shower.eblu.me -a blumeops-proxy` - [ ] Caddy push: `mise run provision-indri -- --tags caddy` - [ ] Fly redeploy to pick up the new nginx block + fail2ban jail: `mise run fly-deploy` - [ ] ArgoCD sync: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging ## Test plan - [ ] Container builds successfully on nix-container-builder runner - [ ] Pod starts, migrations run, gunicorn answers on :8000 - [ ] `kubectl --context=k3s-ringtail -n 
shower logs deploy/shower` clean - [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block) - [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response - [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead` - [ ] Grafana dashboard "Shower APM" appears and starts showing traffic - [ ] `mise run services-check` passes Reviewed-on: #349	2026-05-11 13:47:18 -07:00
Erich Blume	2ee53fe375	C0: fix Caddyfile try_html — handle_errors can't nest inside handle{} The kind=static branch added in #342 put handle_errors inside the @host handle{} block. handle_errors is a top-level site-block directive, not an ordered HTTP handler, so Caddy refuses to load the config: parsing caddyfile tokens for 'handle': directive 'handle_errors' is not an ordered HTTP handler This crash-loops the whole reverse proxy and takes down every *.ops.eblu.me service. Tripped today during the live cv/docs cutover. Fix: drop handle_errors and append /404.html as the final try_files candidate. The 404 page is served with status 200 instead of 404, but that's acceptable for a human-facing curated 404 — the page renders correctly. Documented inline. The running Caddy on indri already has the fixed config (deployed manually during the cutover); this lands the fix in main so future provision-indri --tags caddy runs don't re-break it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 15:16:44 -07:00
Erich Blume	8d634861f6	C1: migrate cv + docs from minikube to indri-native (#342 ) ## Summary Replace the cv (`cv.eblu.me`) and docs (`docs.eblu.me`) minikube Deployments with indri-native ansible roles. Caddy serves the extracted release tarballs directly via a new `kind: static` service-block — no daemon, no nginx pod, no ProxyGroup ingress on the request path. Mirrors the rationale of the recent devpi migration; part of the broader minikube wind-down. ## What's in this commit - `ansible/roles/{cv,docs}` — sentinel-gated tarball download + extract into `~/{cv,docs}/content/` - `ansible/roles/caddy/` — new `kind: static` branch in the Caddyfile template (encoded gzip, immutable cache headers for fingerprinted assets, optional `try_html` for Quartz-style clean URLs, optional per-path `download_paths` for the resume PDF's `Content-Disposition`) - `ansible/playbooks/indri.yml` — wires `cv` and `docs` roles before `caddy` - `service-versions.yaml` — both services flip to `type: ansible`. `docs.current-version` stays at `1.28.2` for this commit so `container-version-check` keeps passing while `containers/quartz/Dockerfile` still exists; it moves to the docs release tag in the cleanup commit - `.forgejo/workflows/{cv-deploy,build-blumeops}.yaml` — deploy step now bumps `cv_version`/`docs_version` in the role defaults and pushes; running ansible + purging the Fly cache is manual from gilbert (matches devpi) - Docs: `docs/how-to/operations/{cv,docs}-on-indri.md`, updated `docs/reference/services/{cv,docs}.md`, changelog fragment ## What is not in this commit The dead artifacts. After PR review and successful cutover, a follow-up commit deletes: - `argocd/apps/{cv,docs}.yaml` and `argocd/manifests/{cv,docs}/` - `containers/cv/`, `containers/quartz/` - `CONTAINER_TO_SERVICE['quartz']` mapping in `mise-tasks/container-version-check` - bumps `docs.current-version` in `service-versions.yaml` to the release tag ## Cutover plan (manual, from gilbert, after review) 1. 
Take down old: - Remove the cv and docs Applications: `argocd app delete cv --cascade && argocd app delete docs --cascade` - Verify k8s namespaces gone: `kubectl --context=minikube-indri get ns | grep -E '^(cv|docs)\b'` (should be empty) - Verify tailnet MagicDNS no longer advertises the VIPs: `nslookup cv.tail8d86e.ts.net` and `nslookup docs.tail8d86e.ts.net` should both fail 2. Bring up new: - `mise run provision-indri -- --tags cv,docs,caddy --check --diff` (already validated on branch) - `mise run provision-indri -- --tags cv,docs,caddy` - `fly ssh console -a blumeops-proxy -C "sh -c 'rm -rf /tmp/cache && nginx -s reload'"` 3. Verify: `mise run services-check` and the curl checks listed in `docs/how-to/operations/{cv,docs}-on-indri.md` 4. Cleanup commit + merge. Total expected downtime: minutes (not the few-hour budget you authorized). ## Test plan - [ ] `mise run provision-indri -- --tags cv,docs --check --diff` clean - [ ] `mise run provision-indri -- --tags caddy --check --diff` shows only the cv + docs blocks changing as previewed in the PR thread - [ ] After cutover: `cv.eblu.me`, `cv.ops.eblu.me`, `docs.eblu.me`, `docs.ops.eblu.me` all return 200 - [ ] `cv.eblu.me/resume.pdf` includes `Content-Disposition: attachment` - [ ] A clean Quartz URL (e.g. `docs.eblu.me/explanation/agent-change-process`) resolves to the right page - [ ] `mise run services-check` clean - [ ] `mise run service-review --type ansible` shows cv and docs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #342	2026-04-29 14:55:11 -07:00
Erich Blume	14ca0160ba	Migrate devpi from minikube to indri (launchd) (#341 ) ## Summary Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent. ## What changed - New ansible role `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box). - Caddy: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`. - Playbook: `indri.yml` gains pre_tasks for the root password and the new role. - service-versions.yaml: devpi flipped from `type: argocd` to `type: ansible`. - ArgoCD: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted. - Docs: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list. ## Already deployed (live on indri) - Service running: `launchctl list mcquack.eblume.devpi` → PID 53888 - `curl https://pypi.ops.eblu.me/+api` returns 200 ✅ - `mise run docs-mikado` works again ✅ - 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/` - Minikube namespace and PVC fully reclaimed ## Test plan - [ ] `mise run services-check` (after merge) - [ ] CI workflows that use devpi succeed - [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines) ## Context This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain. 
Reviewed-on: #341	2026-04-29 13:38:36 -07:00
Erich Blume	d7af004842	Add Forgejo metrics + upstream latency histogram to Fly proxy dashboard - Enable Forgejo /metrics endpoint (app.ini [metrics] section) - Add Alloy scrape target for Forgejo metrics on indri - Add upstream_response_time histogram to Fly proxy Alloy config - Replace single p95 panel with p50/p90/p99 + upstream breakdown filtered to forge.eblu.me host Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:05:59 -07:00
Erich Blume	7a42aeb77c	Mitigate Forgejo archive endpoint DoS from crawler abuse Crawlers hitting /archive/ endpoints with unique commit SHAs generated 54GB of git bundles in 2 days, pegging Forgejo at 43% CPU. Fix at multiple layers: - Redirect archive requests to tailnet at Fly proxy (302) - Expand robots.txt: block /users/, //archive/, //releases/download/ - Cache release artifact downloads at nginx (immutable, 7d TTL) - Enable [cron.archive_cleanup] with 2h TTL and run-at-start Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 14:21:22 -07:00
Erich Blume	519175c672	Fix borgmatic LaunchAgent TCC dialog hang by removing mise wrapper LaunchAgents now call borgmatic directly at its mise-installed path instead of routing through `mise x`, which triggered macOS TCC permission dialogs (e.g. "mise wants to access Documents") that hung headless sessions and caused backup failures. Also adds `mise install` to the ansible role so borgmatic installation is fully managed, and pins the version in both mise.toml and the role defaults. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 07:23:46 -07:00
Erich Blume	cd5b6b63f7	Add paperless DB to borgmatic backups Discovered during DR that paperless was the only service DB not backed up by borgmatic. Uses same blumeops-pg cluster on port 5432. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 17:58:06 -07:00
Erich Blume	07f52e9488	Deploy Paperless-ngx document management (#328) ## Summary - Add paperless-ngx (v2.20.13) as a new ArgoCD-managed service on indri - Dockerfile built from forge mirror (`mirrors/paperless-ngx`), multi-stage with s6-overlay - PostgreSQL database via `blumeops-pg` CNPG cluster, Redis sidecar for Celery - NFS document storage on sifaka (`/volume1/paperless`) - Authentik OIDC SSO via baked JSON blob from 1Password - Caddy route at `paperless.ops.eblu.me` - 1Password item "Paperless (blumeops)" created with all secrets ## Files - `containers/paperless/Dockerfile` — multi-stage build - `argocd/manifests/paperless/` — full k8s manifest set - `argocd/apps/paperless.yaml` — ArgoCD application - `argocd/manifests/databases/` — CNPG role + ExternalSecret - `ansible/roles/caddy/defaults/main.yml` — Caddy route - `service-versions.yaml` — version tracking entry - `docs/reference/services/paperless.md` — reference card ## Remaining deploy steps 1. Build container: `mise run container-build-and-release paperless` 2. Update kustomization.yaml `newTag` with actual image tag 3. Create Authentik application/provider for paperless 4. Create `paperless` database on blumeops-pg 5. Sync ArgoCD apps, then sync paperless from branch 6. Provision Caddy: `mise run provision-indri -- --tags caddy` 7. Verify at https://paperless.ops.eblu.me 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #328	2026-04-08 17:54:12 -07:00
Erich Blume	6cab5091ea	Add storage-provisioner health check to minikube Ansible role The storage-provisioner is a bare Pod with no controller. If the node restarts via Docker Desktop (rather than `minikube start`), kubelet restores static pods but bare pods are lost. Detect this and re-run `minikube start` to restore addons. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-04 12:04:25 -07:00
Erich Blume	a18a424866	Pin NixOS service versions via nixpkgs-services overlay (#321 ) ## Summary - Add `nixpkgs-services` flake input pinned to a specific nixpkgs commit, with an overlay that pulls `forgejo-runner`, `snowflake`, and `k3s` from it instead of the rolling `nixpkgs` - Dagger `flake-update` pipeline now excludes `nixpkgs-services` via `--exclude` - Fix stale nix-container-builder version in service-versions.yaml (was 12.6.4, actually running 12.7.2) - Add k3s and minikube to service-versions.yaml tracking - Document the pinning approach in review-services how-to and ringtail reference ## Motivation During service review, discovered that flake updates had silently upgraded forgejo-runner from 12.6.4 → 12.7.2 without updating service-versions.yaml. This "sneak-in upgrade" bypasses the service review process. The overlay ensures these three services only change versions deliberately. ## Test plan - [ ] Verify `nix flake update` from `nixos/ringtail/` does not change `nixpkgs-services` lock entry - [ ] Verify `mise run provision-ringtail` builds successfully with the overlay - [ ] Confirm running service versions unchanged after deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #321	2026-04-01 21:37:57 -07:00
Erich Blume	c069f889d2	Harden borgmatic photos backup: restrict dirs, add keepalives + checkpoints Restrict backup to library/ and upload/ only (skip regenerable encoded-video/, thumbs/, backups/). Add SSH ServerAliveInterval to prevent broken pipe on long transfers, and checkpoint_interval so interrupted backups save progress. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-30 10:30:28 -07:00
Erich Blume	3017f759a7	Migrate Forgejo from Homebrew to source build (#316 ) ## Summary - Migrate Forgejo from Homebrew to source-built binary with mcquack LaunchAgent - Matches the established pattern used by zot, caddy, and alloy - Upgrades to v14.0.3 (7 security fixes: PKCE bypass, OAuth scope bypass, open redirect, and more) ## Changes - Ansible role: Replace brew install/services with binary stat check + LaunchAgent - Paths: `/opt/homebrew/var/forgejo` → `~/forgejo`, binary at `~/code/3rd/forgejo/forgejo` - Run user: `forgejo` → `erichblume` (LaunchAgent user; SSH git user stays `forgejo`) - Docs: Updated Forgejo reference card, restart-indri guide - Service review: Stamped frigate-notify, cloudnative-pg, blumeops-pg as current ## One-time migration steps (manual, on indri) 1. Clone from Codeberg, add forge mirror remote 2. Check out v14.0.3, build with `make build && make forgejo` 3. Stop brew, `cp -a` data to `~/forgejo`, fix ownership 4. Run `provision-indri --tags forgejo` 5. Verify, then `brew uninstall forgejo` ## Data safety - `cp -a` preserves everything (repos, SQLite DB, LFS, sessions, OAuth config) - Brew version stays installed as rollback until verification passes - No schema changes between 14.0.2 → 14.0.3 Reviewed-on: #316	2026-03-28 08:19:23 -07:00
Erich Blume	c78b86c72c	Add offsite backup for immich photo library to BorgBase (#315 ) ## Summary - Adds a second borgmatic config (`photos.yaml`) that backs up `/Volumes/photos` (sifaka SMB mount, ~128 GB) to a dedicated BorgBase repo (`immich-photos`), running daily at 4 AM - Separate launchd agent (`mcquack.eblume.borgmatic-photos`) so photo backups run independently from the main backup - Refactors `borgmatic_metrics` script to support multiple repos with a `repo` Prometheus label - Updates Grafana "Borg Backups" dashboard with a `repo` template variable so you can filter/compare repos - Docs updated: `backups.md`, `borgmatic.md` ## Prerequisites (manual) - [x] Create `immich-photos` repo on BorgBase with same SSH key - [ ] Upgrade BorgBase plan to Small ($24/yr) if currently on free tier (128 GB exceeds 10 GB limit) - [ ] After deploy: `borg init` the new repo (borgmatic does this automatically on first run) ## Test plan - [ ] Dry run: `mise run provision-indri -- --check --diff --tags borgmatic,borgmatic_metrics` - [ ] Deploy borgmatic role and verify both configs deployed - [ ] Run `borgmatic --config ~/.config/borgmatic/photos.yaml create --verbosity 1` manually for first backup (will take hours) - [ ] Verify metrics script collects from both repos: `~/.local/bin/borgmatic-metrics && cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom` - [ ] Sync grafana-config in ArgoCD and verify dashboard repo selector works 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #315	2026-03-27 19:43:05 -07:00
Erich Blume	ca0c9354ee	Add borgmatic backups for authentik and immich databases (#314) ## Summary - Add `authentik` database (blumeops-pg cluster) to borgmatic pg_dump backups - Add `immich` database (immich-pg cluster) to borgmatic pg_dump backups - For immich-pg: new borgmatic managed role with `pg_read_all_data`, ExternalSecret, Tailscale LoadBalancer service, and Caddy L4 TCP proxy on port 5433 - Update backup docs to reflect all four CNPG databases + mealie SQLite ## Deploy plan Deploy order matters — k8s resources must exist before ansible can route to them: 1. ArgoCD (databases app): sync to pick up immich-pg borgmatic role, ExternalSecret, and Tailscale service ``` argocd app set blumeops-pg --revision feature/borgmatic-all-pg-backups argocd app sync blumeops-pg ``` 2. Wait for `immich-pg-tailscale` service to get a Tailscale IP and `immich-pg.tail8d86e.ts.net` to resolve 3. Ansible (caddy): deploy Caddy L4 route for port 5433 ``` mise run provision-indri -- --tags caddy ``` 4. Ansible (borgmatic): deploy updated config and .pgpass ``` mise run provision-indri -- --tags borgmatic ``` 5. Verify: trigger a manual borgmatic run and check all four pg_dump streams succeed ``` borgmatic --verbosity 1 2>&1 | grep -E '(Dumping|ERROR)' ``` ## Test plan - [x] `kubectl kustomize` builds cleanly - [x] `ansible --check --diff` for borgmatic and caddy show expected changes - [ ] ArgoCD sync succeeds for databases app - [ ] `immich-pg.tail8d86e.ts.net` resolves - [ ] `pg.ops.eblu.me:5433` accepts connections - [ ] `borgmatic --verbosity 1` dumps all four databases without errors Reviewed-on: #314	2026-03-27 16:59:58 -07:00
Erich Blume	fc45989a6c	Decommission JobSync service (#308) ## Summary - Remove all JobSync infrastructure: ArgoCD app, k8s manifests, container build (nix), Caddy reverse proxy entry, Homepage dashboard entry, service-versions tracking, and all documentation - Runtime teardown already completed: ArgoCD app cascade-deleted (removes deployment, PVC, service, ingress, external-secret), forge mirror deleted, 1Password item archived, local clone removed ## Motivation Replacing JobSync with a datasette-based job tracking pipeline driven by mise tasks and a Claude agent frontend. JobSync's Next.js server actions don't expose a useful API for automation. ## Remaining manual steps after merge - Provision Caddy to remove the stale proxy route: `mise run provision-indri -- --tags caddy` - Sync Homepage: `argocd app sync homepage` - Verify namespace cleanup on ringtail: `kubectl get ns jobsync --context=k3s-ringtail` (should be gone) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #308	2026-03-24 08:44:23 -07:00
Erich Blume	3e9873d669	Fix borgmatic backup: use correct kubectl context on indri The Mealie SQLite dump hook used `minikube-indri` (the context name on gilbert), but on indri itself the context is just `minikube`. This caused the before_backup hook to fail, aborting all backups since the hook was added. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:07:44 -07:00
Erich Blume	11330ebea0	Deploy Mealie recipe manager (#299) ## Summary - Deploy Mealie (self-hosted recipe manager) on minikube-indri via ArgoCD - Build container from source via forge mirror (`mirrors/mealie`) — multi-stage Dockerfile with Node.js frontend + Python/uv backend - Add Caddy proxy entry for `meals.ops.eblu.me` - Part of a larger meal planning pipeline: Mealie stores categorized recipes, a planner script selects balanced meals, and Ollama generates unified cooking timelines ## Status - [x] Mirror mealie repo on forge - [x] Dockerfile (from-source build) - [x] ArgoCD app + k8s manifests - [x] Caddy proxy entry - [x] Service docs, routing table, app registry - [ ] Local Dagger build test - [ ] Container build + push to registry - [ ] Update kustomization.yaml with real image tag - [ ] Deploy and verify - [ ] Provision Caddy ## Test plan - Build container locally via `dagger call build --src=. --container-name=mealie` - Trigger CI build via `mise run container-build-and-release mealie` - Deploy from branch: `argocd app set mealie --revision deploy-mealie && argocd app sync mealie` - Verify Mealie UI at `https://meals.ops.eblu.me` - Verify API docs at `https://meals.ops.eblu.me/docs` Reviewed-on: #299	2026-03-16 21:59:10 -07:00
Erich Blume	1f0308bbd2	Fix Caddy v2.11 Host header rewrite breaking proxied services Caddy v2.11 (#7454) auto-rewrites the Host header to match the upstream address for HTTPS backends. This causes services behind Tailscale Ingress to see .tail8d86e.ts.net instead of .ops.eblu.me, breaking Authentik OAuth flows, Homepage host validation, and other services that check the Host header. Only apply header_up for HTTPS backends (Tailscale Ingress); HTTP backends (forge, registry, jellyfin, sifaka) are unaffected. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-15 18:28:18 -07:00
Erich Blume	272ea1e767	Upgrade Caddy v2.10.2 → v2.11.2, fix forge mirrors (#294 ) ## Summary - Upgrade Caddy from v2.10.2 to v2.11.2 (7 CVE fixes across v2.11.1 and v2.11.2) - Create `mirrors/caddy-l4` forge mirror for Layer 4 plugin - Migrate all `~/code/3rd` clones on indri from `localhost:3001` to HTTPS `forge.ops.eblu.me/mirrors/` remotes - Remove stale clones (`apple-silicon-detector`, `whisper.cpp`) - Update caddy docs and service-versions tracking ## CVEs Fixed - CVE-2026-27585 through CVE-2026-27590 (path/host bypass, TLS fail-open, FastCGI issues) - Forward auth identity injection (privilege escalation) - `vars_regexp` placeholder secret exposure - Built on Go 1.26.1 (patches Go-level CVEs) ## What was done on indri (not in repo) - `xcaddy build` with Gandi DNS + Layer 4 plugins → `~/code/3rd/caddy/bin/caddy` now v2.11.2 - Remotes updated: caddy, forgejo-runner, zot → `https://forge.ops.eblu.me/mirrors/.git` - Deleted: `~/code/3rd/apple-silicon-detector`, `~/code/3rd/whisper.cpp` ## Deployment and Testing - [x] Ansible dry-run passed (`--tags caddy --check --diff`) - [ ] Restart caddy LaunchAgent to pick up the new binary - [ ] Verify all proxied services respond via `.ops.eblu.me` - [ ] Run `mise run services-check` Reviewed-on: #294	2026-03-15 10:33:48 -07:00
Erich Blume	53d620365a	Bump zot registry to v2.1.15 (#293 ) ## Summary - Upgrade zot OCI registry from v2.1.13 to v2.1.15 on indri - Addresses CVE-2025-30204 (golang-jwt memory) and open redirect via callback_ui - No config template changes needed (externalUrl is auto-allowlisted) - Requires Go 1.25.7 (bump from 1.25.6 via mise) ## Data Safety - Data directory ~/erichblume/zot is NOT touched during build or deploy - No schema migrations in v2.1.14 or v2.1.15 - Storage format remains OCI spec 1.1.0 ## Deployment Steps - [ ] SSH to indri: bump Go to 1.25.7 via `mise use go@1.25.7` - [ ] Fetch and checkout v2.1.15 in ~/code/3rd/zot - [ ] Build: `mise x -- make binary` - [ ] Restart LaunchAgent - [ ] Verify: `curl -s http://localhost:5050/v2/` returns 200 - [ ] Verify: `curl -s https://registry.ops.eblu.me/v2/_catalog` lists repos - [ ] Verify: `mise run services-check` Reviewed-on: #293	2026-03-14 10:00:40 -07:00
Erich Blume	ab8ea6f301	Bump Grafana Alloy to v1.14.0 (#292 ) ## Summary - Bump alloy-k8s, alloy-ringtail, and alloy-tracing-ringtail image tags from v1.13.1 to v1.14.0 - Mark indri alloy (ansible) as reviewed at v1.14.0 — source rebuild from forge mirror needed - Add missing alloy-ringtail entry to service-versions.yaml - Update alloy reference doc ## Breaking changes reviewed - `loki.secretfilter` options removed — not used in our configs - OTel Collector upgraded to v0.142.0 — Kafka receiver changes don't affect us - Exporter queue default changes — our tracing pipeline (Beyla → batch → otlphttp) uses simple config, low risk ## Deployment and Testing - [ ] Sync alloy-k8s: `argocd app set alloy-k8s --revision bump/alloy-v1.14.0 && argocd app sync alloy-k8s` - [ ] Sync alloy-ringtail: `argocd app set alloy-ringtail --revision bump/alloy-v1.14.0 --server ringtail-argocd && argocd app sync alloy-ringtail` - [ ] Sync alloy-tracing-ringtail similarly - [ ] Verify metrics flowing in Grafana - [ ] Verify traces flowing to Tempo (ringtail) - [ ] Rebuild indri alloy from source (`v1.14.0` tag on forge mirror), SCP to indri, restart - [ ] After merge: reset ArgoCD revisions to main, re-sync Reviewed-on: #292	2026-03-13 16:25:27 -07:00
Erich Blume	3a811fb188	Deploy JobSync — job search tracker on ringtail k3s (#288 ) ## Summary C2 Mikado chain to deploy [JobSync](https://github.com/Gsync/jobsync) — a self-hosted job application tracker — to ringtail's k3s cluster. ### Mikado Graph ``` deploy-jobsync (goal) ├── build-jobsync-container │ └── mirror-jobsync └── integrate-jobsync-ollama ``` ### What is JobSync? Next.js app with SQLite for tracking job applications. Features resume management, application pipeline tracking, and AI-powered resume review/job matching. ### Key Decisions - Ringtail k3s (not minikube-indri) — colocates with Ollama for zero-latency AI - Nix container via `buildLayeredImage` — no Dockerfile, mirrors upstream source on forge - Ollama for AI — uses existing deployment, no API keys needed for AI features - No upstream fork — vanilla JobSync, Anthropic AI deferred to future work if needed ### Current Status Planning phase — cards committed, ready for review before implementation begins. Reviewed-on: #288	2026-03-08 11:02:05 -07:00
Erich Blume	c029e5851a	Review migrate-forgejo-from-brew doc, fix stale Phase 3 reference Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 08:29:58 -08:00
Erich Blume	a87c997ee1	Expose Forgejo publicly at forge.eblu.me (#278 ) ## Summary Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service. - Forgejo hardening: Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO) - Tailscale Ingress: ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint - Fly.io proxy: nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit - Authentik: OAuth callback updated to forge.eblu.me - DNS/TLS: CNAME record in Pulumi, cert in fly-setup - Rename: ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is) ## Deployment Order 1. `mise run provision-indri -- --tags forgejo` (config changes) 2. Verify forge.ops.eblu.me still works 3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator` 4. Verify `curl https://forge.tail8d86e.ts.net` 5. `cd fly && fly deploy` 6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/` 7. `fly certs add forge.eblu.me -a blumeops-proxy` 8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik` 9. `mise run dns-preview && mise run dns-up` 10. Full verification (see below) 11. Rehearse `mise run fly-shutoff` 12. After merge: reset ArgoCD revisions to main, re-sync ## Verification Checklist - [ ] forge.eblu.me loads, shows public repos - [ ] forge.ops.eblu.me still works from tailnet - [ ] SSH clone via forge.ops.eblu.me:2222 works - [ ] HTTPS clone via forge.eblu.me works - [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH - [ ] /swagger returns 403 - [ ] Rapid login attempts trigger 429 rate limit - [ ] fail2ban bans after 5 failed logins in 10 minutes - [ ] ArgoCD can still sync (SSH unaffected) - [ ] `mise run fly-shutoff` stops all public traffic - [ ] `mise run services-check` passes Reviewed-on: #278	2026-03-03 08:40:41 -08:00
Erich Blume	31d925814f	Deploy Ollama LLM server on ringtail (#277 ) ## Summary - Deploy Ollama as a new ArgoCD-managed service on ringtail's k3s cluster with GPU acceleration - Declarative model management via `models.txt` + sidecar sync script (mirrors kiwix torrent pattern) - Initial models: `qwen2.5:14b`, `deepseek-r1:14b`, `phi4:14b`, `gemma3:12b` - hostPath PV on `/mnt/storage1/ollama` for fast local model storage (200Gi) - Tailscale ingress at `ollama.ops.eblu.me` for API access from tailnet - Enable GPU time-slicing (`replicas: 2`) on nvidia-device-plugin so Frigate and Ollama share the RTX 4080 ## Deployment and Testing - [ ] Deploy nvidia-device-plugin changes first: `argocd app sync nvidia-device-plugin` - [ ] Verify GPU time-slicing: `kubectl describe node ringtail --context=k3s-ringtail` shows `nvidia.com/gpu: 2` - [ ] Sync `apps` app with `--revision feature/ollama-ringtail` - [ ] Set ollama app to branch: `argocd app set ollama --revision feature/ollama-ringtail && argocd app sync ollama` - [ ] Verify model-sync sidecar pulls models: `kubectl logs -n ollama deploy/ollama -c model-sync --context=k3s-ringtail` - [ ] Test API: `curl https://ollama.ops.eblu.me/api/tags` - [ ] Test inference: `curl https://ollama.ops.eblu.me/api/generate -d '{"model":"qwen2.5:14b","prompt":"Hello"}'` - [ ] Verify Frigate still works after GPU sharing change - [ ] After merge: `argocd app set ollama --revision main && argocd app sync ollama` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/277	2026-03-02 20:39:51 -08:00
Erich Blume	03d71544ec	Add multi-cluster observability with ringtail metrics and dashboards (#270 ) ## Summary - Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs - Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests) - Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki - Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with: - Kubernetes Clusters dashboard — multi-cluster with `cluster` and `namespace` template variables - Ringtail (k3s) dashboard — dedicated ringtail view with GPU usage panels ## Deployment and Testing 1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`) 2. Sync `prometheus` → verify `cluster` label on scraped metrics 3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs 4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs 5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail 6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}` 7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values 8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods ## Notes - Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later - DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve - The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270	2026-02-25 22:01:00 -08:00
Erich Blume	84338c32c2	Add authenticated GitHub PAT for Forgejo mirror sync (#269 ) ## Summary - mirror-create: Auto-includes GitHub PAT from 1Password for authenticated upstream fetches at mirror creation time - mirror-update-pats: New mise task that SSHes into indri and rewrites the git remote URL in every GitHub mirror's bare repo config to embed the PAT. Idempotent, supports `--dry-run` - app.ini.j2: Explicit `[mirror]` section with `DEFAULT_INTERVAL = 8h` and `MIN_INTERVAL = 10m` (bakes in the defaults for visibility) - manage-forgejo-mirrors: New how-to doc covering mirror creation, PAT storage, the `mirror-update-pats` task, and the full 20-day PAT rotation procedure ## Context GitHub tightened unauthenticated rate limits for git clone/fetch in May 2025. With 23 GitHub mirrors syncing every 8 hours, authenticated fetches avoid throttling. The PAT is stored in 1Password (`Forgejo Secrets` → `github-mirror-pat`) and has been applied to all existing mirrors. ## Deployment and Testing - [x] `mirror-update-pats` dry-run verified (23 mirrors detected) - [x] `mirror-update-pats` applied to all 23 GitHub mirrors on indri - [x] Idempotency confirmed (re-run shows 0 updated, 23 skipped) - [ ] Provision indri with `--tags forgejo` to apply `[mirror]` config - [ ] Trigger a manual mirror sync and verify success in Forgejo UI Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/269	2026-02-25 20:20:23 -08:00
Erich Blume	5f9bc20345	Fix mirror org refs in ArgoCD apps and widen credential template (#266 ) ## Summary - Widen `repo-creds-forge` URL prefix from `/eblume/` to host-wide `/` so it matches repos in all forge orgs (fixes `mirrors/` repos not getting SSH credentials) - Update 8 ArgoCD app definitions from `eblume/<mirror>` → `mirrors/<mirror>` (immich-charts, cloudnative-pg-charts, external-secrets, connect-helm-charts) - Fix stale alloy clone comment in Ansible defaults - Bump immich v2.5.2 → v2.5.6 (bug-fix patches only) - Update ArgoCD README bootstrap command and credential docs ## Context Mirrors were migrated from `forge.ops.eblu.me/eblume/` to `forge.ops.eblu.me/mirrors/` in commit ``cd57814``. Container Dockerfiles and image tags were updated, but ArgoCD app definitions and the repo credential template were missed, causing `ComparisonError` on apps that source Helm charts from mirrored repos. ## Deployment 1. Sync the ArgoCD `argocd` app first (picks up the widened credential template) 2. Sync the `apps` app (picks up new repo URLs for all 8 apps) 3. Verify immich resolves its ComparisonError: `argocd app get immich` 4. Sync immich to deploy v2.5.6: `argocd app sync immich` 5. Spot-check: `argocd app get external-secrets`, `argocd app get cloudnative-pg`, `argocd app get 1password-connect` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/266	2026-02-25 06:55:53 -08:00
Erich Blume	2c081eed28	Add Forgejo repository health metrics and Grafana dashboard (#245 ) ## Summary - New `forgejo_metrics` Ansible role that queries the Forgejo REST API every 60s and writes Prometheus textfile metrics (open PRs, issues, languages, releases, commits, Actions runs/duration/success) - Grafana dashboard "Forgejo Repository Health" with 12 panels across 4 rows: overview stats, CI/CD health, repository info, and staleness tracking - Deletes superseded `forgejo-actions-dashboard` plan doc (this implementation covers a broader scope) ## Deployment and Testing - [ ] `mise run provision-indri -- --tags forgejo_metrics` to deploy the collector - [ ] `ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/forgejo.prom'` to verify metrics - [ ] `argocd app sync grafana-config` to deploy the dashboard - [ ] Check Grafana dashboard "Forgejo Repository Health" loads with data - [ ] `mise run services-check` passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/245	2026-02-22 11:16:03 -08:00
Erich Blume	a82c705bf6	Add SSO login button to Jellyfin login page Deploy branding.xml with a "Sign in with Authentik" button in the login disclaimer. Local password login remains available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-21 20:08:57 -08:00
Erich Blume	07fb48626d	Add Authentik SSO integration for Jellyfin (#239 ) ## Summary - Add Authentik OIDC provider + application for Jellyfin via blueprint (all authenticated users allowed, no policy binding) - Wire `jellyfin-client-secret` through ExternalSecret and Authentik worker deployment - Install [jellyfin-plugin-sso](https://github.com/9p4/jellyfin-plugin-sso) v4.0.0.3 via Ansible, with OIDC config template - Authentik `admins` group maps to Jellyfin administrator role - Local login left enabled; SSO is additive ## Deployment and Testing - [ ] Sync ArgoCD `authentik` app on branch — verify provider + application appear in Authentik admin - [ ] `mise run provision-indri -- --tags jellyfin --check --diff` (dry run) - [ ] `mise run provision-indri -- --tags jellyfin` (deploy plugin + config) - [ ] Test SSO flow: `https://jellyfin.ops.eblu.me/sso/OID/start/authentik` - [ ] Verify `eblume` account auto-links via `preferred_username` match - [ ] Verify admins group → Jellyfin admin - [ ] Reset ArgoCD app revision to main after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/239	2026-02-21 20:05:44 -08:00
Erich Blume	8775c3841a	Allow anonymous access to zot /metrics endpoint Add accessControl.metrics.users with empty string to allow unauthenticated Prometheus/Alloy scraping. Zot represents anonymous users with an empty username internally. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-21 12:37:59 -08:00
Erich Blume	ff63679efb	Enable zot registry auth + wire CI credentials (#237 ) ## Summary - Enable OIDC + API key authentication on zot registry with three-tier accessControl - `anonymousPolicy: ["read"]` — anyone can pull - `artifact-workloads` group: `["read", "create"]` — CI push, no overwrite/delete - `admins` group: `["read", "create", "update", "delete"]` — break-glass - Wire both CI push paths (Dagger and Nix/skopeo) with `ZOT_CI_API_KEY` credentials - Add `artifact-workloads` PolicyBinding in Authentik blueprint for zot app access - Add `ZOT_CI_API_KEY` to Forgejo Actions secrets via existing ansible role Completes the `wire-ci-registry-auth` and `harden-zot-registry` Mikado cards. ## Manual Deployment Steps (after merge) 1. Deploy Authentik blueprint: `argocd app sync authentik` 2. In Authentik admin UI: set a password for the `zot-ci` service account 3. Deploy zot config: `mise run provision-indri -- --tags zot` 4. Log in to `https://registry.ops.eblu.me` as `zot-ci` via OIDC → generate API key 5. Store API key in 1Password as `zot-ci-apikey` in blumeops vault 6. Sync Forgejo secrets: `mise run provision-indri -- --tags forgejo_actions_secrets` 7. Trigger a test container build to verify CI push 8. Verify anonymous pull: `curl -sf https://registry.ops.eblu.me/v2/_catalog` ## Uncertainties - Zot `accessControl` group matching with OIDC: Groups from Authentik's `profile` scope claim should map to zot policy groups, but the exact claim-to-group matching needs runtime verification - `http.auth.apikey: true`: This config key is documented but needs verification against the specific zot version built from source on indri - API key permissions: Need to confirm zot API keys inherit the generating user's group for accessControl evaluation ## Test Plan - [ ] `mise run provision-indri -- --check --diff --tags zot` shows expected config changes - [ ] Anonymous pull works after deploy - [ ] Unauthenticated push fails (401) - [ ] OIDC browser login redirects to Authentik and back - [ ] API key push works after key generation - [ ] CI push succeeds with both Dagger and skopeo paths - [ ] `mise run services-check` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/237	2026-02-21 12:20:29 -08:00
Erich Blume	21b6533aea	Register Zot as OIDC client in Authentik (#236 ) ## Summary - Add Authentik blueprint (`zot.yaml`) with OAuth2 provider, application, `artifact-workloads` group, and `zot-ci` service account - Wire `zot-client-secret` through ExternalSecret → worker Deployment env var → blueprint `!Env` - Add Ansible pre_task to fetch OIDC secret from 1Password (item ID `oor7os5kapczgpbwv7obkca4y4`) - Add `oidc-credentials.json.j2` template and deploy task in zot role (with `when` guard) ## Manual Steps Required Before Deploy 1. Generate client secret: `openssl rand -hex 32` 2. Store in 1Password: add field `zot-client-secret` to "Authentik (blumeops)" item in vault `blumeops` ## What This Does NOT Do - Does NOT modify `config.json.j2` (that's the root goal `harden-zot-registry`) - Does NOT wire CI auth (that's `wire-ci-registry-auth`) - Does NOT set service account password or API keys (manual post-deploy) ## Verification After ArgoCD sync: - [ ] Authentik admin UI shows "Zot Registry" application - [ ] OIDC discovery at `https://authentik.ops.eblu.me/application/o/zot/.well-known/openid-configuration` returns valid JSON - [ ] Blueprint status is `successful` - [ ] `artifact-workloads` group exists with `zot-ci` service account 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/236	2026-02-21 08:45:06 -08:00
Erich Blume	cd50c1454a	Integrate Forgejo with Authentik OIDC (#228 ) ## Summary - Refactor Authentik blueprints: extract shared `admins` group into `common.yaml`, add `groups` scope mapping to all providers for group-based admin propagation - Add Forgejo OAuth2 provider and application blueprint (`forgejo.yaml`) - Add `forgejo-client-secret` to ExternalSecret and worker deployment env - Configure Forgejo `[oauth2_client]` with `ACCOUNT_LINKING=login` to safely link existing accounts - Update documentation (forgejo.md, authentik.md, federated-login.md) ## Deployment and Testing After merge, deployment requires these steps in order: 1. Authentik (ArgoCD): - `argocd app set authentik --revision feature/forgejo-authentik-oidc && argocd app sync authentik` - Verify: Forgejo app/provider visible in Authentik admin UI - Verify: Grafana SSO still works (blueprint refactor) 2. Forgejo app.ini (Ansible): - `mise run provision-indri -- --tags forgejo --check --diff` (dry run) - `mise run provision-indri -- --tags forgejo` (apply, restarts Forgejo) 3. Create Forgejo auth source (CLI on indri): ``` ssh indri 'sudo -u forgejo /opt/homebrew/bin/forgejo admin auth add-oauth \ --name authentik \ --provider openidConnect \ --key forgejo \ --secret "$(op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/Authentik (blumeops)/forgejo-client-secret")" \ --auto-discover-url https://authentik.ops.eblu.me/application/o/forgejo/.well-known/openid-configuration \ --scopes "openid email profile groups" \ --group-claim-name groups \ --admin-group admins' ``` 4. Link eblume account: Sign in with Authentik on Forgejo, confirm link with local password 5. Verify: `tea repo list`, Forgejo Actions, local password break-glass After merge: `argocd app set authentik --revision main && argocd app sync authentik` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/228	2026-02-20 17:39:50 -08:00
Erich Blume	71cb256527	Deploy Authentik identity provider (C2 Mikado) (#227 ) ## Summary C2 Mikado chain for deploying Authentik as the SSO identity provider, replacing Dex. This PR will evolve over multiple sessions. Each iteration adds documentation (prerequisite cards) and eventually code as leaf nodes are resolved. ## Current Mikado State - Goal: `deploy-authentik` (active) - Leaf prerequisites: - `build-authentik-container` — Build Nix container image - `provision-authentik-database` — Create PostgreSQL database on CNPG cluster - `create-authentik-secrets` — Create 1Password item with credentials ## Process refinements - Updated agent-change-process with lessons from first attempt: reset code before committing cards, open PRs early ## Test plan - [ ] `mise run docs-mikado` shows correct dependency chain - [ ] Leaf nodes can be worked independently - [ ] Container builds on ringtail - [ ] Authentik starts and reaches healthy state - [ ] Forgejo OAuth2 connector works Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/227	2026-02-20 12:55:59 -08:00
Erich Blume	0cdc143227	Deploy Dex OIDC identity provider with Grafana SSO (#222 ) ## Summary - Deploys Dex OIDC identity provider on ringtail k3s cluster as central authentication service - Integrates Grafana as first SSO client via `auth.generic_oauth` - Uses Kubernetes CRD storage backend (no PVC needed) - All secrets (bcrypt hash, client secrets) injected via ExternalSecrets from 1Password item "Dex (blumeops)" - NixOS-built container image via `containers/dex/default.nix` ## Pre-requisites (manual, before deployment) 1. Create 1Password item "Dex (blumeops)" in `blumeops` vault with fields: - `password`: strong generated password for Dex login - `static-password-hash`: bcrypt hash of above (`htpasswd -BnC 10 eblume`, copy hash after `eblume:`) - `grafana-client-secret`: random 32-char hex (`openssl rand -hex 16`) 2. Build container: `mise run container-tag-and-release dex v1.0.0` ## Deployment sequence 1. Build container: `mise run container-tag-and-release dex v1.0.0` 2. Deploy Caddy: `mise run provision-indri -- --tags caddy` 3. Sync ArgoCD: `argocd app sync apps` → `argocd app sync dex` 4. Verify Dex: `curl https://dex.ops.eblu.me/.well-known/openid-configuration` 5. Sync Grafana: `argocd app sync grafana-config` → `argocd app sync grafana` 6. Test SSO: Visit `https://grafana.ops.eblu.me/login`, click "Sign in with Dex" ## Verification - [ ] Container image exists: `mise run container-list` shows `dex:v1.0.0-nix` - [ ] `curl https://dex.ops.eblu.me/.well-known/openid-configuration` returns valid OIDC discovery - [ ] `curl https://dex.ops.eblu.me/healthz` returns healthy - [ ] Grafana login shows "Sign in with Dex" button alongside local login - [ ] OIDC flow: click Dex → enter credentials → redirect back → logged in as Admin - [ ] Break-glass: local admin login still works - [ ] `mise run services-check` passes ## Files changed | File | Action | Purpose | |------|--------|---------| | `containers/dex/default.nix` | Create | NixOS container build | | `argocd/apps/dex.yaml` | Create | ArgoCD app targeting ringtail | | `argocd/manifests/dex/*` (8 files) | Create | K8s manifests (RBAC, ExternalSecret, Deployment, Service, Ingress) | | `argocd/manifests/grafana-config/external-secret-dex-oauth.yaml` | Create | Grafana OIDC client secret | | `argocd/manifests/grafana-config/kustomization.yaml` | Modify | Add new ExternalSecret resource | | `argocd/manifests/grafana/values.yaml` | Modify | Add `auth.generic_oauth` config + envFromSecrets | | `ansible/roles/caddy/defaults/main.yml` | Modify | Add `dex.ops.eblu.me` reverse proxy entry | | `docs/changelog.d/feature-dex-oidc.feature.md` | Create | Changelog fragment | Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/222	2026-02-19 20:24:24 -08:00
Erich Blume	16a4a9a616	Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216 ) ## Summary - Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired - Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s - Move ntfy ArgoCD app from indri minikube to ringtail k3s - Update Frigate docs to reflect detector removal and planned RTX 4080 migration - Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail ## Deployment After merge: 1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps: ``` argocd app sync apps --prune ``` 2. Sync new ringtail apps: ``` argocd app sync mosquitto-ringtail argocd app sync ntfy-ringtail ``` 3. Manually clean up the detector LaunchAgent on indri: ``` ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ``` ## Notes - Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns) - ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup - Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed - Frigate + frigate-notify will be ported to ringtail in a follow-up PR 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216	2026-02-19 11:22:44 -08:00
Erich Blume	b475a1fcd7	Fix 1Password secret tasks always reporting changed in ringtail playbook (#213 ) ## Summary - Replace `changed_when: true` with `register` + output inspection on the two 1Password secret tasks in `ringtail.yml` - Tasks now correctly report `ok` when the secret content hasn't changed, and `changed` only when `kubectl apply` outputs `configured` or `created` ## Test plan - [ ] Run `mise run provision-ringtail` twice — second run should show both tasks as `ok` not `changed` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/213	2026-02-19 07:25:24 -08:00
Erich Blume	918df9e642	Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209 ) ## Summary Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions runner for building containers with Nix. ### K3s cluster - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint) - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`) - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access - `kubectl` added to system packages ### 1Password Connect + External Secrets Operator - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`, `external-secrets-config-ringtail` - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password` namespace in post_tasks ### Systemd Forgejo Actions runner - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env` - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access ### Nix container build workflow - New `.forgejo/workflows/build-container-nix.yaml` triggers on `-nix-v[0-9]` tags (e.g. `nettest-nix-v1.0.0`) - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy` - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering ### Mise task updates - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`) - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`) ## Post-merge 1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild 2. Register k3s cluster in ArgoCD (first time only): ```fish ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' | \ sed 's|127.0.0.1|ringtail.tail8d86e.ts.net|' > /tmp/k3s-ringtail.yaml set -x KUBECONFIG /tmp/k3s-ringtail.yaml argocd cluster add default --name k3s-ringtail ``` 3. Sync ArgoCD apps in order: 1password-connect-ringtail -> external-secrets-crds-ringtail -> external-secrets-ringtail -> external-secrets-config-ringtail 4. Verify runner: ssh ringtail 'systemctl status gitea-runner-nix-container-builder' 5. Check Forgejo admin panel for ringtail-nix-builder runner online 6. Test: create containers/<name>/default.nix, tag with <name>-nix-v0.1.0 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209	2026-02-18 21:15:30 -08:00
Erich Blume	535f897054	Polish ringtail NixOS config and add documentation (#208 ) ## Summary - Fix Super+Return keybinding to launch wezterm in sway - Set fish as default login shell - Remove `initialPassword` (real password already set) - Add 1Password CLI + GUI, chezmoi, and dev tool packages (neovim, eza, fd, fzf, zoxide, starship, atuin, bat, ripgrep) - Add ringtail reference card, update host inventory and reference index - Changelog fragment ## Post-merge deployment - `mise run provision-ringtail` to rebuild NixOS - On ringtail: launch 1Password GUI, enable CLI integration (Settings > Developer > CLI integration) - Chezmoi needs `.chezmoiignore` updates in the dotfiles repo (separate task) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/208	2026-02-18 17:53:47 -08:00
Erich Blume	b76f2314c2	Add force: true to ringtail git task nixos-rebuild can dirty the tree (e.g. flake.lock updates), which blocks the Ansible git module. Force ensures we always reset to the upstream state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:32:23 -08:00
Erich Blume	b9d813cde1	Add NixOS configuration for ringtail workstation (#207 ) ## Summary - NixOS flake for ringtail (gaming/compute workstation, RTX 4080) in `nixos/ringtail/` - Declarative disk partitioning via disko (GPT, 512M EFI + ext4 root on NVMe) - NVIDIA proprietary drivers, sway/Wayland desktop, greetd, PipeWire, Steam - Tailscale integration for tailnet connectivity - Ansible playbook + `mise run provision-ringtail` for ongoing management - Pulumi auth key (`tag:homelab`, `tag:blumeops`) for tailnet bootstrap ## Deployment Order 1. Merge PR 2. `pulumi up` in tailscale stack → creates auth key 3. Retrieve auth key: `pulumi stack output ringtail_authkey --show-secrets` 4. On ringtail NixOS installer: - `nix run github:nix-community/disko -- --mode disko /tmp/disk-config.nix` (or from cloned repo) - `nixos-install --flake github:eblume/blumeops?dir=nixos/ringtail#ringtail` 5. Reboot, `tailscale up --auth-key=<key>` 6. Verify: `tailscale status`, SSH from gilbert ## Test plan - [ ] Review NixOS configuration for completeness - [ ] Verify disko partition layout matches ringtail hardware - [ ] Run `pulumi preview` for tailscale stack - [ ] Install NixOS on ringtail - [ ] Confirm tailscale connectivity - [ ] Confirm sway desktop works - [ ] Test `mise run provision-ringtail` for ongoing management 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/207	2026-02-18 08:24:25 -08:00


134 commits