Commit graph

143 commits

Author SHA1 Message Date
d0b5423135 C1: pin ringtail wired IP to 192.168.1.21 (static)
Removes DHCP lease renewal as a failure mode on ringtail after an outage
on 2026-05-12 where the IP and routes silently disappeared from enp5s0
without any kernel link event. NetworkManager stays enabled for wireless
fallback but no longer manages the wired interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 09:33:57 -07:00
292d354902 C1: deploy adelaide-baby-shower-app to ringtail k3s (#349)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m12s
## Summary

Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app).

### What's included

- **ArgoCD app + manifests** under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer.
- **Public surface** (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett.
- **fail2ban** filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default).
- **Caddy** route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`).
- **Pulumi** Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`.
- **Grafana** APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter.
- **Container** `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages.
- **Docs** runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments.

### Defense layers on the public surface

1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list)
2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP)
3. django-axes (5 fails / 1h, keyed on username+ip_address)
4. edge `/admin/` block (returns 403 for anything that isn't login/logout)

## Prerequisites for the user to do (NOT in this PR)

Halted on these per request — they touch shared/manual systems:

- [x] **NFS share** on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000`
- [ ] **1Password item** `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git
- [ ] **Container build**: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix`
- [x] **DNS**: `mise run dns-up` after merge
- [x] **Fly cert**: `fly certs add shower.eblu.me -a blumeops-proxy`
- [ ] **Caddy push**: `mise run provision-indri -- --tags caddy`
- [ ] **Fly redeploy** to pick up the new nginx block + fail2ban jail: `mise run fly-deploy`
- [ ] **ArgoCD sync**: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging

## Test plan

- [ ] Container builds successfully on nix-container-builder runner
- [ ] Pod starts, migrations run, gunicorn answers on :8000
- [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean
- [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response
- [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead`
- [ ] Grafana dashboard "Shower APM" appears and starts showing traffic
- [ ] `mise run services-check` passes

Reviewed-on: #349
2026-05-11 13:47:18 -07:00
9fb5442ccd C0: kiwix — doc review, fix Adding Archives source path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:46:16 -07:00
4aa0872949 C0: review ollama doc — refresh image, models, last-reviewed
Bumped documented image tag to 0.20.4 (matches kustomization newTag),
added the two qwen3.5 models from models.txt, and stamped the card.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:42:33 -07:00
8d634861f6 C1: migrate cv + docs from minikube to indri-native (#342)
## Summary

Replace the cv (`cv.eblu.me`) and docs (`docs.eblu.me`) minikube Deployments with indri-native ansible roles. Caddy serves the extracted release tarballs directly via a new `kind: static` service-block — no daemon, no nginx pod, no ProxyGroup ingress on the request path. Mirrors the rationale of the recent devpi migration; part of the broader minikube wind-down.

## What's in this commit

- `ansible/roles/{cv,docs}` — sentinel-gated tarball download + extract into `~/{cv,docs}/content/`
- `ansible/roles/caddy/` — new `kind: static` branch in the Caddyfile template (encoded gzip, immutable cache headers for fingerprinted assets, optional `try_html` for Quartz-style clean URLs, optional per-path `download_paths` for the resume PDF's `Content-Disposition`)
- `ansible/playbooks/indri.yml` — wires `cv` and `docs` roles before `caddy`
- `service-versions.yaml` — both services flip to `type: ansible`. `docs.current-version` stays at `1.28.2` for this commit so `container-version-check` keeps passing while `containers/quartz/Dockerfile` still exists; it moves to the docs release tag in the cleanup commit
- `.forgejo/workflows/{cv-deploy,build-blumeops}.yaml` — deploy step now bumps `cv_version`/`docs_version` in the role defaults and pushes; running ansible + purging the Fly cache is manual from gilbert (matches devpi)
- Docs: `docs/how-to/operations/{cv,docs}-on-indri.md`, updated `docs/reference/services/{cv,docs}.md`, changelog fragment

## What is not in this commit

The dead artifacts. After PR review and successful cutover, a follow-up commit deletes:

- `argocd/apps/{cv,docs}.yaml` and `argocd/manifests/{cv,docs}/`
- `containers/cv/`, `containers/quartz/`
- `CONTAINER_TO_SERVICE['quartz']` mapping in `mise-tasks/container-version-check`
- bumps `docs.current-version` in `service-versions.yaml` to the release tag

## Cutover plan (manual, from gilbert, after review)

1. **Take down old:**
   - Remove the cv and docs Applications: `argocd app delete cv --cascade && argocd app delete docs --cascade`
   - Verify k8s namespaces gone: `kubectl --context=minikube-indri get ns | grep -E '^(cv|docs)\\b'` (should be empty)
   - Verify tailnet MagicDNS no longer advertises the VIPs: `nslookup cv.tail8d86e.ts.net` and `nslookup docs.tail8d86e.ts.net` should both fail
2. **Bring up new:**
   - `mise run provision-indri -- --tags cv,docs,caddy --check --diff` (already validated on branch)
   - `mise run provision-indri -- --tags cv,docs,caddy`
   - `fly ssh console -a blumeops-proxy -C "sh -c 'rm -rf /tmp/cache && nginx -s reload'"`
3. **Verify:** `mise run services-check` and the curl checks listed in `docs/how-to/operations/{cv,docs}-on-indri.md`
4. **Cleanup commit + merge.**

Total expected downtime: minutes (not the few-hour budget you authorized).

## Test plan
- [ ] `mise run provision-indri -- --tags cv,docs --check --diff` clean
- [ ] `mise run provision-indri -- --tags caddy --check --diff` shows only the cv + docs blocks changing as previewed in the PR thread
- [ ] After cutover: `cv.eblu.me`, `cv.ops.eblu.me`, `docs.eblu.me`, `docs.ops.eblu.me` all return 200
- [ ] `cv.eblu.me/resume.pdf` includes `Content-Disposition: attachment`
- [ ] A clean Quartz URL (e.g. `docs.eblu.me/explanation/agent-change-process`) resolves to the right page
- [ ] `mise run services-check` clean
- [ ] `mise run service-review --type ansible` shows cv and docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #342
2026-04-29 14:55:11 -07:00
a529d60f60 C0: remove containers/devpi/ build artifact
Devpi now runs natively on indri (uv venv via ansible role), so the
Dagger container build at containers/devpi/ is unused. Removing it.

Also updated dagger.md examples to use 'miniflux' as the example
container-name argument.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:40:45 -07:00
14ca0160ba Migrate devpi from minikube to indri (launchd) (#341)
## Summary

Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent.

## What changed

- **New ansible role** `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box).
- **Caddy**: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`.
- **Playbook**: `indri.yml` gains pre_tasks for the root password and the new role.
- **service-versions.yaml**: devpi flipped from `type: argocd` to `type: ansible`.
- **ArgoCD**: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted.
- **Docs**: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list.

## Already deployed (live on indri)

- Service running: `launchctl list mcquack.eblume.devpi` → PID 53888
- `curl https://pypi.ops.eblu.me/+api` returns 200 
- `mise run docs-mikado` works again 
- 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/`
- Minikube namespace and PVC fully reclaimed

## Test plan

- [ ] `mise run services-check` (after merge)
- [ ] CI workflows that use devpi succeed
- [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines)

## Context

This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain.

Reviewed-on: #341
2026-04-29 13:38:36 -07:00
817acc5e5e C0: transmission doc — review and correct storage/monitoring details
Marked last-reviewed: 2026-04-29. Fixed the storage layout table —
`/config/` is an emptyDir (ephemeral), not NFS, and the watch directory
is disabled. Documented the transmission-exporter sidecar that exposes
Prometheus metrics on port 19091.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:00:01 -07:00
005e2a03ed C0: split gandi-operations docs; add dns-acme-cleanup mise task
Splits the nebulous gandi-operations how-to into two single-topic cards
(manage-eblu-me-dns, rotate-gandi-pat) and adds a mise task for the
recurring _acme-challenge TXT cleanup needed due to a value-comparison
bug in libdns/gandi v1.1.0 that prevents certmagic's cleanup phase from
removing presented TXT values.

The gandi reference card is updated to drop the false "different
credential from Pulumi PAT" claim — verified during the 2026-04-27
incident that Caddy and Pulumi share a single PAT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 09:48:46 -07:00
72b27b7fd2 C0: docs — add mealie borg restore how-to
Captures the procedure used to restore mealie's SQLite DB from a borgmatic
archive after the post-DR wipe: extract from borg, snapshot the wiped DB,
swap via a helper pod on the ReadWriteOnce PVC, fix UID 911 ownership.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:04:28 -07:00
7d94b9073a C0: docs — default argocd login to --sso; drop extraneous --grpc-web
Now that argocd's Authentik OAuth2 client is public, `argocd login --sso`
works for day-to-day use. Promote it to the default in AGENTS.md,
argocd-cli reference, and troubleshooting; keep the admin/password flow
documented as a break-glass fallback for when Authentik is unavailable.

Also drops --grpc-web from every interactive login command — confirmed
extraneous (login succeeds without it). Left in CI workflows and
`argocd cluster add` untouched; those are different contexts that I
didn't re-test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 10:43:21 -07:00
db8fd946ae Bump Dagger to 0.20.6 and migrate runner-job-image to Alpine container.py
Bumps the Dagger engine/CLI from v0.20.1 to v0.20.6 (mise pin, dagger.json
engineVersion, SDK regen) and rewrites the runner-job-image container as a
native Dagger pipeline on Alpine 3.23 using the shared alpine_runtime helper,
replacing the Debian-based Dockerfile. All Forgejo Actions in this repo use
actions/checkout (a JS action), so musl is not a compatibility concern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 08:28:18 -07:00
1425bf1f5c Upgrade forgejo-runner to v12.8, adopt server.connections, and clean up docs (#338)
## Summary
- consolidate forgejo-runner how-to docs into current cards
- upgrade the k8s forgejo-runner deployment to the latest v12.8.x runner image
- switch the k8s runner from first-boot register flow to declarative server.connections config
- keep the runner image on the native Dagger build path and update the surrounding manifests/secrets

## Notes
- PR opened early for C1 review
- implementation and deployment verification will follow in subsequent commits

Reviewed-on: #338
2026-04-20 09:03:54 -07:00
51a878cddb C0: review navidrome reference doc 2026-04-18 20:25:19 -07:00
d26a6ae3b2 Update docs for Caddy routing and direct WireGuard peering
Comprehensive docs pass reflecting the new Fly proxy architecture:
- Fly proxy routes through Caddy on indri (not per-service TS Ingress)
- Direct WireGuard peering via --port=41641 pinning
- DERP relay performance lesson in Tailscale docs
- Caddy now in public traffic path
- indri tagged as flyio-target
- Removed fly-reload references
- Updated architecture diagrams and per-service setup guide
- Added changelog fragment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 09:57:30 -07:00
fe0e913963 Switch Fly proxy to upstream keepalive pools (#337)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m37s
## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuses TLS connections through the Tailscale tunnel instead of handshaking per request
- Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS)

## Trade-off

DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #337
2026-04-17 16:39:52 -07:00
d7c3c687f4 Document DR rebuild procedure and update restart-indri
- New how-to: rebuild-minikube-cluster with full bootstrap procedure
  validated during 2026-04-13 DR event
- Update restart-indri: warn about minikube delete, macOS permission
  dialog on first Tailscale SSH, forgejo_actions_secrets dep cycle
- Update disaster-recovery reference: link to rebuild procedure
- Update CLAUDE.md: never run minikube delete

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:07:54 -07:00
6e60287e99 Doc review: delete install-dagger-on-nix-runner, add service-versions ref card
Outdated leaf card removed; zot.md now links to new service-versions
reference card instead. Added reverse link from review-services.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 09:52:38 -07:00
8d80a4a3a5 Rewrite runner-logs: API-based log fetching, multi-repo support
Replace broken SSH+filesystem log retrieval with Forgejo web API
endpoint. Fix CLI to use run numbers (not task IDs), add --repo
for querying any forge repo (e.g. sporks), --limit/-n for listing
size. Document runner-logs as the way to verify build success in
CLAUDE.md and container build docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 09:42:58 -07:00
c06eccc61c Review hosts.md: add last-reviewed, normalize links, add reference tag
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 21:06:53 -07:00
c86b5d7772 Native Dagger container builds + Navidrome v0.61.1 (#330)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (navidrome) (push) Successful in 22m26s
## Summary
- Move Dagger module from `.dagger/` to repo root (`src/blumeops/`), rename `blumeops-ci` → `blumeops`
- Replace opaque `docker_build()` with native Dagger pipelines that surface full build errors per step
- Migrate navidrome as the first container (`containers/navidrome/container.py`)
- Upgrade navidrome from v0.60.3 to v0.61.1 (major artwork overhaul, SQLite FTS5 search, server-managed transcoding)
- Add `dagger call container-version` for CI version extraction without Dockerfile parsing
- All mise tasks (`container-list`, `container-version-check`, `container-build-and-release`) updated for hybrid mode
- Legacy `docker_build()` fallback preserved for all other containers

## Motivation
When navidrome v0.61.0 added a new Go build tag (`sqlite_fts5`), `docker_build()` showed only "exit code: 1". We had to run `docker build --progress=plain` manually to find `undefined: buildtags.SQLITE_FTS5`. Native Dagger pipelines show the full error inline.

## Container build dispatch needed
After merge, dispatch container build for navidrome:
```
mise run container-build-and-release navidrome --ref 470b4bd
```

## Deploy steps
1. Wait for container build to complete
2. Back up navidrome-data PVC (non-reversible DB migrations)
3. `argocd app set navidrome --revision main && argocd app sync navidrome`
4. Verify at https://dj.ops.eblu.me

## Future
Remaining containers migrate incrementally in follow-up PRs using the same pattern.

Reviewed-on: #330
2026-04-11 17:11:56 -07:00
40556e5a2d Review gandi.md: add missing forge.eblu.me CNAME record
The Pulumi code has had a forge.eblu.me CNAME since it was added, but the
doc's DNS table only listed docs and cv. Also fixed the __main__.py
description to mention CNAMEs alongside A records.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:54:46 -07:00
07f52e9488 Deploy Paperless-ngx document management (#328)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-dockerfile (paperless) (push) Successful in 9s
## Summary

- Add paperless-ngx (v2.20.13) as a new ArgoCD-managed service on indri
- Dockerfile built from forge mirror (`mirrors/paperless-ngx`), multi-stage with s6-overlay
- PostgreSQL database via `blumeops-pg` CNPG cluster, Redis sidecar for Celery
- NFS document storage on sifaka (`/volume1/paperless`)
- Authentik OIDC SSO via baked JSON blob from 1Password
- Caddy route at `paperless.ops.eblu.me`
- 1Password item "Paperless (blumeops)" created with all secrets

## Files

- `containers/paperless/Dockerfile` — multi-stage build
- `argocd/manifests/paperless/` — full k8s manifest set
- `argocd/apps/paperless.yaml` — ArgoCD application
- `argocd/manifests/databases/` — CNPG role + ExternalSecret
- `ansible/roles/caddy/defaults/main.yml` — Caddy route
- `service-versions.yaml` — version tracking entry
- `docs/reference/services/paperless.md` — reference card

## Remaining deploy steps

1. Build container: `mise run container-build-and-release paperless`
2. Update kustomization.yaml `newTag` with actual image tag
3. Create Authentik application/provider for paperless
4. Create `paperless` database on blumeops-pg
5. Sync ArgoCD apps, then sync paperless from branch
6. Provision Caddy: `mise run provision-indri -- --tags caddy`
7. Verify at https://paperless.ops.eblu.me

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #328
2026-04-08 17:54:12 -07:00
efae404d1e Remove superuser from teslamate PG role, transfer extension ownership
teslamate had superuser on the shared blumeops-pg cluster (which also
hosts miniflux and authentik). Downgraded to plain database owner with
extension ownership (cube, earthdistance) transferred manually so it
can still ALTER EXTENSION UPDATE. earthdistance is untrusted in PG so
DROP+CREATE would need temporary superuser escalation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 15:36:39 -07:00
fc34a7da5b Review postgresql.md: add authentik user/db, immich-pg borgmatic secret
Doc review found the authentik database, user, and external secret were
missing, along with the immich-pg borgmatic secret. Added Cluster column
to Users table for clarity. Set last-reviewed: 2026-04-07.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 15:21:48 -07:00
f42fa2d558 Remove stale Helm chart mirror references from forgejo docs
All Helm chart mirrors (grafana-helm-charts, connect-helm-charts,
cloudnative-pg-charts) have been deleted from forge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 07:37:21 -07:00
64200a55c5 Migrate Immich from Helm chart to kustomize manifests (v2.5.6 → v2.6.3)
Replace the Helm chart deployment with plain kustomize manifests following
the Authentik pattern (separate deployments per component). Consolidate
the immich-storage ArgoCD app into the main immich app. Add no-helm-policy
doc establishing kustomize as the standard deployment mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:42:25 -07:00
08d57ef4d4 Review pulumi reference doc: fix Tailscale auth, add how-to links
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 10:55:57 -07:00
a18a424866 Pin NixOS service versions via nixpkgs-services overlay (#321)
## Summary

- Add `nixpkgs-services` flake input pinned to a specific nixpkgs commit, with an overlay that pulls `forgejo-runner`, `snowflake`, and `k3s` from it instead of the rolling `nixpkgs`
- Dagger `flake-update` pipeline now excludes `nixpkgs-services` via `--exclude`
- Fix stale nix-container-builder version in service-versions.yaml (was 12.6.4, actually running 12.7.2)
- Add k3s and minikube to service-versions.yaml tracking
- Document the pinning approach in review-services how-to and ringtail reference

## Motivation

During service review, discovered that flake updates had silently upgraded forgejo-runner from 12.6.4 → 12.7.2 without updating service-versions.yaml. This "sneak-in upgrade" bypasses the service review process. The overlay ensures these three services only change versions deliberately.

## Test plan

- [ ] Verify `nix flake update` from `nixos/ringtail/` does not change `nixpkgs-services` lock entry
- [ ] Verify `mise run provision-ringtail` builds successfully with the overlay
- [ ] Confirm running service versions unchanged after deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #321
2026-04-01 21:37:57 -07:00
cfbf4cadbd Review argocd-cli reference doc: stamp last-reviewed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 20:35:52 -07:00
4059b3d27b Add compensating controls framework and date-based report dirs (#320)
## Summary

- Add `compensating-controls.yaml` tracking 9 named controls that justify suppressed security findings
- Update all Prowler mutelist descriptions with `CC: <id>` references to named controls
- Add `mise run review-compensating-controls` task — surfaces stalest control with all codebase references
- Add [[review-compensating-controls]] how-to doc
- Organize Prowler and Kingfisher reports into `YYYY-MM-DD` subdirectories

### Compensating controls

| ID | Mitigates |
|----|-----------|
| `single-user-cluster` | Image cache abuse, RBAC breadth, system pod privileges |
| `tailscale-network-isolation` | Profiling endpoints, weak TLS, debug ports |
| `local-registry` | AlwaysPullImages gap |
| `sso-gated-admin-tools` | ArgoCD wildcard RBAC |
| `operator-managed-pods` | Tailscale proxy pod security settings |
| `ephemeral-privileged-jobs` | Prowler hostPID exposure |
| `trusted-ci-only` | Forgejo runner DinD |
| `init-container-isolation` | Grafana root init container |
| `observability-stack-audit` | Missing apiserver audit logging |

## Test plan

- [ ] `mise run review-compensating-controls` shows table and references
- [ ] `kubectl kustomize argocd/manifests/prowler/` renders correctly
- [ ] Sync prowler and kingfisher, verify next scan writes to dated subdirectory
- [ ] Grep for `CC:` in mutelist files — every muted finding should have at least one

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #320
2026-03-30 17:44:11 -07:00
1e391f96bb Upgrade forgejo-runner 12.7.0 → 12.7.3, add service card
Patch upgrade picks up idempotent FetchTask API, offline registration
fix, cloudflare/circl security dep update, and custom gRPC user-agent.
No config defaults changed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 16:31:06 -07:00
77eebe507e Review Ansible reference doc: add missing roles, clarify IaC positioning
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 16:10:24 -07:00
f9206bf10b Build custom Kingfisher container from sporked deploy branch (#318)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 12s
## Summary

- Add Dockerfile for Kingfisher built from source (sporked deploy branch)
- Multi-stage: Rust build with Boost/vectorscan, debian-slim runtime
- Switch CronJob from upstream `ghcr.io/mongodb/kingfisher` to `registry.ops.eblu.me/blumeops/kingfisher`
- Add kingfisher to service-versions.yaml (version tracks upstream main SHA)
- Document spork workflow in CLAUDE.md

## Test plan

- [ ] Build container: `mise run container-build-and-release kingfisher 1d37d29`
- [ ] Verify image on registry: `mise run container-list`
- [ ] Update kustomization newTag
- [ ] Sync ArgoCD kingfisher app from branch
- [ ] Trigger manual CronJob and verify scan completes
- [ ] Verify reports on sifaka

Reviewed-on: #318
2026-03-30 06:34:49 -07:00
bb60369956 Simplify Kingfisher CronJob to HTML-only output
Remove the second scan pass for JSON — one format is enough for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:50:54 -07:00
2808ffd450 Document Kingfisher secret scanner service
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:47:37 -07:00
2bd1611ac1 Document sifaka NFS/Tailscale TUN troubleshooting
Sifaka's Tailscale can revert to userspace networking after package
updates, causing NFS mounts to fail because the NFS daemon sees
127.0.0.1 instead of the client's Tailscale IP. Added troubleshooting
how-to doc and updated sifaka reference card with frigate export and
TUN requirement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 09:12:00 -07:00
3017f759a7 Migrate Forgejo from Homebrew to source build (#316)
## Summary

- Migrate Forgejo from Homebrew to source-built binary with mcquack LaunchAgent
- Matches the established pattern used by zot, caddy, and alloy
- Upgrades to v14.0.3 (7 security fixes: PKCE bypass, OAuth scope bypass, open redirect, and more)

## Changes

- **Ansible role**: Replace brew install/services with binary stat check + LaunchAgent
- **Paths**: `/opt/homebrew/var/forgejo` → `~/forgejo`, binary at `~/code/3rd/forgejo/forgejo`
- **Run user**: `forgejo` → `erichblume` (LaunchAgent user; SSH git user stays `forgejo`)
- **Docs**: Updated Forgejo reference card, restart-indri guide
- **Service review**: Stamped frigate-notify, cloudnative-pg, blumeops-pg as current

## One-time migration steps (manual, on indri)

1. Clone from Codeberg, add forge mirror remote
2. Check out v14.0.3, build with `make build && make forgejo`
3. Stop brew, `cp -a` data to `~/forgejo`, fix ownership
4. Run `provision-indri --tags forgejo`
5. Verify, then `brew uninstall forgejo`

## Data safety

- `cp -a` preserves everything (repos, SQLite DB, LFS, sessions, OAuth config)
- Brew version stays installed as rollback until verification passes
- No schema changes between 14.0.2 → 14.0.3

Reviewed-on: #316
2026-03-28 08:19:23 -07:00
c78b86c72c Add offsite backup for immich photo library to BorgBase (#315)
## Summary

- Adds a second borgmatic config (`photos.yaml`) that backs up `/Volumes/photos` (sifaka SMB mount, ~128 GB) to a dedicated BorgBase repo (`immich-photos`), running daily at 4 AM
- Separate launchd agent (`mcquack.eblume.borgmatic-photos`) so photo backups run independently from the main backup
- Refactors `borgmatic_metrics` script to support multiple repos with a `repo` Prometheus label
- Updates Grafana "Borg Backups" dashboard with a `repo` template variable so you can filter/compare repos
- Docs updated: `backups.md`, `borgmatic.md`

## Prerequisites (manual)

- [x] Create `immich-photos` repo on BorgBase with same SSH key
- [ ] Upgrade BorgBase plan to Small ($24/yr) if currently on free tier (128 GB exceeds 10 GB limit)
- [ ] After deploy: `borg init` the new repo (borgmatic does this automatically on first run)

## Test plan

- [ ] Dry run: `mise run provision-indri -- --check --diff --tags borgmatic,borgmatic_metrics`
- [ ] Deploy borgmatic role and verify both configs deployed
- [ ] Run `borgmatic --config ~/.config/borgmatic/photos.yaml create --verbosity 1` manually for first backup (will take hours)
- [ ] Verify metrics script collects from both repos: `~/.local/bin/borgmatic-metrics && cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom`
- [ ] Sync grafana-config in ArgoCD and verify dashboard repo selector works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #315
2026-03-27 19:43:05 -07:00
ca0c9354ee Add borgmatic backups for authentik and immich databases (#314)
## Summary

- Add `authentik` database (blumeops-pg cluster) to borgmatic pg_dump backups
- Add `immich` database (immich-pg cluster) to borgmatic pg_dump backups
- For immich-pg: new borgmatic managed role with `pg_read_all_data`, ExternalSecret, Tailscale LoadBalancer service, and Caddy L4 TCP proxy on port 5433
- Update backup docs to reflect all four CNPG databases + mealie SQLite

## Deploy plan

Deploy order matters — k8s resources must exist before ansible can route to them:

1. **ArgoCD (databases app):** sync to pick up immich-pg borgmatic role, ExternalSecret, and Tailscale service
   ```
   argocd app set blumeops-pg --revision feature/borgmatic-all-pg-backups
   argocd app sync blumeops-pg
   ```
2. **Wait** for `immich-pg-tailscale` service to get a Tailscale IP and `immich-pg.tail8d86e.ts.net` to resolve
3. **Ansible (caddy):** deploy Caddy L4 route for port 5433
   ```
   mise run provision-indri -- --tags caddy
   ```
4. **Ansible (borgmatic):** deploy updated config and .pgpass
   ```
   mise run provision-indri -- --tags borgmatic
   ```
5. **Verify:** trigger a manual borgmatic run and check all four pg_dump streams succeed
   ```
   borgmatic --verbosity 1 2>&1 | grep -E '(Dumping|ERROR)'
   ```

## Test plan

- [x] `kubectl kustomize` builds cleanly
- [x] `ansible --check --diff` for borgmatic and caddy show expected changes
- [ ] ArgoCD sync succeeds for databases app
- [ ] `immich-pg.tail8d86e.ts.net` resolves
- [ ] `pg.ops.eblu.me:5433` accepts connections
- [ ] `borgmatic --verbosity 1` dumps all four databases without errors

Reviewed-on: #314
2026-03-27 16:59:58 -07:00
33463764d1 Add QArt Tuner: QR code art generator with interactive web UI
Single-file Go tool implementing the QArt technique (Russ Cox, 2012)
using only the public rsc.io/qr API. Generates QR codes whose data
modules form a recognizable image by exploiting error correction
freedom via GF(2) Gaussian elimination.

Includes a web UI with live-updating sliders for version, mask,
rotation, dx/dy offset, and scale. Keyboard shortcuts for rapid
iteration. Also works as a CLI for batch generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 15:33:36 -07:00
831b82950a Upgrade nvidia-device-plugin v0.18.2 → v0.19.0 and add reference card
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:19:24 -07:00
687e972713 Review CV doc and close build-dep review gap
Fix stale CV service doc (URL, forge domain, container tag) and add
guidance for reviewing build-time dependencies in private forge repos
during service reviews.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:12:38 -07:00
fc8d2cdb12 Add preserve/* branch protection and document Pyroscope blocker
branch-cleanup: Add PROTECTED_PREFIXES with preserve/* exclusion so
preserved work-in-progress branches are never deleted.

observability.md: Document Pyroscope profiling work on branch
preserve/pyroscope-profiling/pr-313, blocked on ringtail kernel
sysctl settings (kptr_restrict=0, perf_event_paranoid≤1). Also
document Faro/RUM as future potential with privacy considerations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:32:25 -07:00
b97e37543f Deploy Tor Snowflake proxy on ringtail (#311)
## Summary

- Add Snowflake proxy as a native systemd service on ringtail (NixOS)
- Uses `pkgs.snowflake` from nixpkgs (v2.11.0)
- Hardened systemd unit with DynamicUser, ProtectSystem=strict, 512MB memory limit
- Prometheus metrics enabled on localhost:9999

## What is Snowflake?

A Tor pluggable transport that helps censored users reach the Tor network via WebRTC. **This is NOT a Tor exit node** — traffic exits through Tor exit nodes operated by others. The proxy operator cannot see traffic content (double-encrypted) and destination servers never see the proxy's IP.

## Changes

- `nixos/ringtail/configuration.nix` — new systemd service definition
- `docs/reference/services/snowflake-proxy.md` — service reference card
- `docs/reference/infrastructure/ringtail.md` — updated systemd services section
- `service-versions.yaml` — added entry (type: nixos)

## Deploy plan

After review, deploy via `mise run provision-ringtail`. Service starts automatically.

## Test plan

- [ ] `mise run provision-ringtail` succeeds
- [ ] `ssh ringtail 'systemctl status snowflake-proxy'` shows active
- [ ] `ssh ringtail 'journalctl -u snowflake-proxy --no-pager -n 20'` shows broker connections
- [ ] `ssh ringtail 'curl -s localhost:9999/metrics'` returns Prometheus metrics

Reviewed-on: #311
2026-03-24 20:51:40 -07:00
fe201a495c Add Prowler IaC scanning of blumeops repo (Saturday 2am)
Clone repo in init container, scan Dockerfiles and K8s manifests
with Prowler's IaC provider (Trivy). Reports written to
sifaka:/volume1/reports/prowler-iac/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:49:38 -07:00
696024306c Add Prowler image vulnerability scanning for blumeops containers
All checks were successful
Build Container / detect (push) Successful in 39s
Build Container / build-dockerfile (prowler) (push) Successful in 10m15s
Add Trivy to the Prowler container for image and IaC scanning.
New CronJob (Saturday 3am) scans all blumeops/* images in the
registry for CVEs, embedded secrets, and Dockerfile misconfigs.
Reports written to sifaka:/volume1/reports/prowler-images/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:43:08 -07:00
d021b3534f Deploy Prowler CIS scanner (#310)
All checks were successful
Build Container / detect (push) Successful in 4s
Build Container / build-dockerfile (prowler) (push) Successful in 10s
## Summary
- Deploy Prowler 5 as a weekly CronJob on minikube-indri for CIS Kubernetes Benchmark v1.11 scanning
- Custom slim container build (strips PowerShell, Trivy, and non-K8s providers from upstream)
- Reports (HTML, CSV, JSON-OCSF) written to NFS share on sifaka at `/volume1/reports/prowler/`
- Read-only ClusterRole for pod, RBAC, and control plane inspection
- Host path mounts + hostPID for kubelet file permission checks

## Follow-ups
- Mirror prowler-cloud/prowler on forge for supply chain control
- Build and push container image, update kustomization.yaml newTag
- Consider adding k3s-ringtail scanning (core + RBAC checks only)

## Test plan
- [ ] Build container: `mise run container-release prowler v5.22.0`
- [ ] Update `argocd/manifests/prowler/kustomization.yaml` newTag to built image tag
- [ ] Sync ArgoCD: `argocd app sync apps && argocd app set prowler --revision deploy-prowler && argocd app sync prowler`
- [ ] Trigger manual job: `kubectl create job --from=cronjob/prowler prowler-manual -n prowler --context=minikube-indri`
- [ ] Verify reports appear on sifaka NFS share
- [ ] `mise run services-check`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #310
2026-03-24 16:08:09 -07:00
fc45989a6c Decommission JobSync service (#308)
All checks were successful
Build Container / detect (push) Successful in 3s
## Summary

- Remove all JobSync infrastructure: ArgoCD app, k8s manifests, container build (nix), Caddy reverse proxy entry, Homepage dashboard entry, service-versions tracking, and all documentation
- Runtime teardown already completed: ArgoCD app cascade-deleted (removes deployment, PVC, service, ingress, external-secret), forge mirror deleted, 1Password item archived, local clone removed

## Motivation

Replacing JobSync with a datasette-based job tracking pipeline driven by mise tasks and a Claude agent frontend. JobSync's Next.js server actions don't expose a useful API for automation.

## Remaining manual steps after merge

- Provision Caddy to remove the stale proxy route: `mise run provision-indri -- --tags caddy`
- Sync Homepage: `argocd app sync homepage`
- Verify namespace cleanup on ringtail: `kubectl get ns jobsync --context=k3s-ringtail` (should be gone)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #308
2026-03-24 08:44:23 -07:00
bd0ff30d3f Unify container build workflows (#306)
All checks were successful
Build Container / detect (push) Successful in 3s
## Summary
- Merges `build-container.yaml` and `build-container-nix.yaml` into a single workflow
- Detect job classifies each changed container by presence of `Dockerfile` and/or `default.nix`
- Dockerfile containers build on `k8s` (indri) via Dagger; Nix containers build on `nix-container-builder` (ringtail) via nix-build + skopeo
- Containers with both build files (alloy, nettest, ntfy) get built on both runners

## Test plan
- [ ] Push a change to a Dockerfile-only container (e.g. grafana) — verify it builds on k8s only
- [ ] Push a change to a nix-only container (e.g. jobsync) — verify it builds on nix-container-builder only
- [ ] Push a change to a dual container (e.g. ntfy) — verify it builds on both runners
- [ ] Test workflow_dispatch with a specific container name

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #306
2026-03-23 20:55:50 -07:00