Commit graph

1,035 commits

Author SHA1 Message Date
3660b65981 heph Authentik: register heph-pwa redirect URIs for PKCE login
The heph-pwa browser login (hephaestus PR #9) uses Authorization Code + PKCE,
which redirects back to the app origin. Register https://heph.ops.eblu.me/ (and
http://localhost:8787/ for dev) on the heph provider; Authentik also keys
token-endpoint CORS off these origins.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:27:00 -07:00
a2f1e06224 Add hephaestus sync hub to indri (launchagent, PWA, device-code OIDC) (#369)
Makes indri the canonical **heph** hub for the hub-and-spoke task/context system, deployed as a self-updating LaunchAgent managed by Ansible. Other devices (gilbert) attach as offline-capable spokes.

## What's here
- **`ansible/roles/heph`** (tag `heph`) — bootstrap `cargo install hephd` (only if absent; `--self-update` keeps it current after), version-pinned `heph-pwa` checkout served via `--web-root`, launchagent `mcquack.eblume.heph`:
  ```
  hephd --mode server --http-addr 0.0.0.0:8787 --db … --web-root …
        --oidc-issuer …/o/heph/ --oidc-audience heph
        --self-update --self-update-interval-secs 600
  ```
  `~/.cargo/bin` is on the agent `PATH` so self-update's `cargo install` works.
- **Caddy** — `heph.ops.eblu.me → localhost:8787` (TLS for the PWA secure context).
- **Authentik** — new `heph` **public device-code** OIDC app + `default-device-code-flow` bound to the default brand's `flow_device_code` (verified live: brand `authentik-default`, field currently unset → additive).
- **Docs** — `services/hephaestus.md` (Path-A seeding runbook + spoke caveat), `indri.md`, changelog fragment.

## Three features requested
- **Autoupdate** — 10-min interval (`--self-update-interval-secs 600`).
- **PWA** — `--web-root` (confirmed shipped in v1.2.0).
- **Spoke** — gilbert reconfig documented (post-merge step).

## Deploy plan (not done yet — awaiting review)
1. Seed from gilbert (Path A): `heph daemon stop` → copy `heph.db` → `DELETE FROM meta WHERE key='origin'`.
2. Sync Authentik `apps`/blueprint; verify blueprint status via API (not just logs).
3. `provision-indri --tags heph,caddy` from this branch.
4. Point gilbert at the hub + `heph auth login`.

## Known follow-ups (heph-side, tracked in the Hephaestus project)
- `heph daemon` can't bake hub/spoke config or pass `--self-update-interval-secs` → worked around by the ansible plist.
- Path-A seeding lacks a clean `hephd --owner-id`/seed command → manual `meta.origin` reset for now.
- Self-update moves hephd ahead of the ansible-pinned PWA shell over time (drift; tolerated by the SW cache, revisit on next release).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #369
2026-06-05 06:46:58 -07:00
f6c926f1f5 C0: rebuild external-secrets off main, repoint both clusters to stable tags
indri -> v2.2.0-13895bb (arm64), ringtail -> v2.2.0-13895bb-nix (amd64).
Both deployed images now trace to main commit 13895bb instead of earlier
branch builds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 16:19:20 -07:00
13895bb04a Localize external-secrets on ringtail (amd64 nix build) (#368)
Follow-up to #367. That PR localized external-secrets but the Dagger build (on indri's Apple Silicon runner) only produces an **arm64** image — and external-secrets also runs on **ringtail (amd64)** via the same shared manifest. This completes the localization so both clusters run the local binary on their native arch.

## Approach (matches the kube-state-metrics dual-build pattern)
- **`containers/external-secrets/default.nix`** (new) — builds the **amd64** image on ringtail's nix-container-builder. `buildGoModule` with Go 1.26 (v2.2.0 requires ≥1.26.1; nixpkgs default is 1.25.x) and `-tags all_providers`, faithful to upstream. Same v2.2.0 source from the forge mirror.
- **`argocd/manifests/external-secrets-ringtail/`** (new) — thin kustomize overlay that reuses the shared indri manifest as a base and overrides **only** the image to the `-nix` (amd64) tag. No manifest duplication.
- **`argocd/apps/external-secrets-ringtail.yaml`** — repointed at the new overlay.

Result: indri → `v2.2.0-…` (arm64, Dagger), ringtail → `v2.2.0-…-nix` (amd64, nix).

## Build
Run #581 built both arches at the branch commit. Verified the nix image is `linux/amd64`, entrypoint = the binary, user 65534.

## Deployed from branch & verified on ringtail (k3s, amd64)
- All 3 pods rolled to the nix amd64 image, `1/1 Running` (no exec-format error → arch correct)
- Controller logs clean
- **Live secret fetch proven:** force-synced `homepage/homepage-grafana` → `refreshTime` advanced, `Ready=True`
- **All 20** ringtail ExternalSecrets remain `SecretSynced=True`

## Post-merge
The `external-secrets-ringtail` app is temporarily pointed at this branch + overlay path (apps app left on `main`, manual-sync, untouched). After merge:
```
argocd app sync apps                       # picks up the new Application path on main
argocd app set external-secrets-ringtail --revision main && argocd app sync external-secrets-ringtail
```
I'll also rebuild off `main` so both clusters land on stable main-sha tags (as done for indri in #367).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #368
2026-06-04 15:37:42 -07:00
30c82079b9 C0: rebuild external-secrets image off main (v2.2.0-0e70a1b)
Repoint to the main-branch-built image so the deployed tag traces to a main
commit rather than the merged feature branch. Same v2.2.0 source, stable
provenance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 14:59:17 -07:00
0e70a1b524 Localize external-secrets container (native container.py build) (#367)
Knocks out the weekly "pick one non-local container and make it local" task by moving **external-secrets** off `ghcr.io` onto a locally-built image, under our own supply-chain control. Doubles as its overdue service review.

## What changed
- **`containers/external-secrets/container.py`** (new) — native Dagger build (the Dockerfile→container.py migration pattern). Clones the forge mirror at `v2.2.0` and builds the single `all_providers` static Go binary, faithful to upstream's `make build` (CGO off, no version ldflags upstream). ENTRYPOINT is `/bin/external-secrets` so the controller/webhook/cert-controller Deployments select their role via `args:` exactly as before.
- **`argocd/manifests/external-secrets/kustomization.yaml`** — image swapped to `registry.ops.eblu.me/blumeops/external-secrets:v2.2.0-2985007`. **Like-for-like (v2.2.0)**, not an upgrade.
- **`service-versions.yaml`** — marked reviewed (2026-06-04), noted the local build.

## Build
Built on the indri forge runner (run #579, ~4 min) → pushed to Zot. Image config verified: `Entrypoint=/bin/external-secrets`, `User=65534`, version label `v2.2.0`.

## Deployed from branch & verified
- All 3 pods (controller / webhook / cert-controller) rolled to the local image, `1/1 Running`
- Controller + webhook logs clean (no errors; webhook serving TLS)
- **End-to-end secret fetch proven:** force-synced `monitoring/grafana-admin` → `refreshTime` advanced to now, `Ready=True`
- All 10 ExternalSecrets cluster-wide remain `SecretSynced=True` — no collateral damage
- App `Healthy`

## Post-merge
`external-secrets` currently points at this branch (so `apps` reads OutOfSync — expected). After merge:
```
argocd app set external-secrets --revision main && argocd app sync external-secrets
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #367
2026-06-04 14:55:55 -07:00
bb55fa9566 Recurring review sweep: 4 doc cards + nvidia-device-plugin v0.19.2 (#366)
Knocks out the two daily recurring review tasks (doc review + service review) in one PR.

## Doc review (4 never-reviewed reference cards, `last-reviewed: 2026-06-04`)
- **cluster.md** — Kubernetes version v1.34.0 → **v1.35.0**; refreshed the stale ringtail workload list and noted the in-progress minikube→k3s migration (points to `[[ringtail]]` as the canonical list).
- **ntfy.md / tempo.md / alloy.md** — corrected image references: these are now **locally-built `registry.ops.eblu.me/blumeops/*` nix containers** (ntfy v2.19.2, tempo v2.10.3, alloy-k8s v1.16.0), not upstream Docker Hub. Fly.io alloy binary bumped to v1.16.1.

## Service review
- **nvidia-device-plugin** (ringtail GPU): v0.19.0 → **v0.19.2**. Upstream patch releases — CDI/Tegra fixes + dependency bumps, no breaking changes for our manifest-based CDI + RuntimeClass setup (the service-account change in the notes is helm-only).

## Not in this PR (need container rebuilds, deferred)
The other stale services are locally-built nix images, so upgrading them is a forge-runner rebuild rather than a clean tag bump — left untouched (not date-bumped, so they resurface): **prometheus** (v3.10.0→v3.12.0), **loki** (3.6.7→3.7.2), **kube-state-metrics**, **homepage**. Happy to do these as a follow-up rebuild PR.

## Deploy / verify
Not yet deployed — `nvidia-device-plugin` still points at `main`. After review:
```
argocd app set nvidia-device-plugin --revision reviews-jun4 && argocd app sync nvidia-device-plugin
# after merge:
argocd app set nvidia-device-plugin --revision main && argocd app sync nvidia-device-plugin
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #366
2026-06-04 13:37:02 -07:00
02ea1cc72a C0: point tailscale-operator base mirror fetch at tailnet forge
The public forge.eblu.me now black-holes /mirrors/ at the Fly edge
(AI-scraper mitigation), so the in-cluster ArgoCD repo-server got a 403
fetching the upstream operator manifest — leaving tailscale-operator and
tailscale-operator-ringtail in Unknown sync. Use forge.ops.eblu.me.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 12:40:21 -07:00
Forgejo Actions
8f72f04d5c Update docs release to v1.17.0
- Built changelog from towncrier fragments

[skip ci]
2026-06-03 21:52:22 -07:00
29e0f012cd C0: pin Quartz docs build to v4.5.2 (v5.0.0 broke build) v1.17.0
The Dagger build_docs pipeline cloned Quartz from the default branch
unpinned. Quartz v5.0.0 restructured its config layout (.quartz/plugins,
../quartz imports), breaking the docs build against our existing
quartz.config.ts / quartz.layout.ts. Pin the clone to the last v4
release (v4.5.2) to restore known-good behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 21:39:41 -07:00
2148714584 C0: retire Todoist blumeops-tasks; point task discovery at heph
Replace the Todoist-backed blumeops-tasks mise task with
`heph list --project Blumeops --json` (hephaestus, now at v1 prototype
on gilbert). Update task-discovery, rotation-reminder, and zk
references across docs; note the zk zettelkasten is migrating into
heph docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 21:32:10 -07:00
308c8e3dad C0: drop duplicate Homepage static entries for ringtail-migrated services
Mealie, Paperless, Immich, TeslaMate are now autodiscovered from their ringtail
Ingress gethomepage.dev annotations; the static services.yaml entries (from when
they were on minikube, which homepage-on-ringtail can't autodiscover) were
duplicating them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 15:31:59 -07:00
eaa899cfc6 C0: wave-1 decommission follow-ups (argocd admin RBAC, teslamate probe)
- argocd: grant local break-glass admin the admin role (g, admin, role:admin);
  previously only the Authentik admins group had access, locking out admin
  once its token expired (policy.default is unset).
- alloy-k8s: repoint the teslamate blackbox probe from the deleted minikube
  service to https://tesla.ops.eblu.me/ (Caddy over Tailscale), like immich.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 13:02:05 -07:00
46f0002178 Decommission wave-1 minikube services (paperless, teslamate, mealie) (#365)
Final step of the wave-1 indri-k8s migration. paperless, teslamate, mealie run on ringtail with data migrated, verified, and backed up (local + BorgBase offsite via PR #364).

- Remove minikube paperless/teslamate/mealie manifest dirs + ArgoCD app defs (prunes the parked Deployments/Services + redundant minikube mealie/paperless PVCs)
- Drop paperless/teslamate roles + ExternalSecrets from the minikube blumeops-pg cluster
- miniflux + authentik stay on minikube (later waves)

Finalization after merge: sync apps + databases to prune, then DROP DATABASE paperless/teslamate on indri's blumeops-pg (fresh safety dump taken first).

Reviewed-on: #365
2026-06-03 12:36:06 -07:00
44798a6429 C0: mealie-ringtail image rebuilt from main (e0057b4-nix)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 12:26:55 -07:00
e0057b46e4 Wire ringtail blumeops-pg into backups + Grafana (#364)
Prereq for the wave-1 decommission. The cutover moved paperless+teslamate (postgres) and mealie (SQLite) to ringtail, but borgmatic and the Grafana TeslaMate datasource still pointed at the minikube copies — the migrated live data was unbacked since cutover, and dropping the minikube DBs would break the TeslaMate dashboards.

- Tailscale Service `blumeops-pg-ringtail` + Caddy L4 route `pg.ops.eblu.me:5434`
- borgmatic: teslamate + paperless postgres → :5434; mealie SQLite → ssh:eblume@ringtail
- Grafana TeslaMate datasource → pg.ops.eblu.me:5434

Deploy: sync databases-ringtail (tailscale svc) + grafana from branch; provision-indri --tags caddy,borgmatic; verify a backup run + dashboards. Unblocks the decommission PR.
Reviewed-on: #364
2026-06-03 12:25:30 -07:00
92b54e7ba9 C0: ringtail wave-1 images rebuilt from main (fcac8e5-nix tags)
Post-merge rebuild of paperless/mealie/teslamate Nix images at the main
merge commit, replacing the feature-branch -nix tags. Image content is
identical; only the commit-sha suffix changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:36:15 -07:00
fcac8e5a72 Wave 1 indri→ringtail migration: paperless, teslamate, mealie (#363)
Migrate paperless, teslamate, and mealie off the OOM-saturated minikube-indri node onto ringtail k3s, shedding ~1.1 GiB of resident load. Second chain in the indri-k8s decommission after immich.

**Containers ported to Nix (default.nix), build-verified on ringtail:**
- paperless → wraps nixpkgs paperless-ngx 2.20.15 (pinned unstable); runs as web/worker/beat/consumer
- mealie → wraps nixpkgs mealie 3.16.0 (forward 4-minor bump, breaking-change reviewed); single gunicorn, SQLite
- teslamate → from-scratch beamPackages mixRelease (not in nixpkgs); erlang_27+elixir_1_18, npm assets, ex_cldr locales pre-fetched

**Data:** cold downtime-tolerant cutover. paperless+teslamate postgres dump/restore from quiesced source into a new ringtail blumeops-pg CNPG cluster; mealie SQLite PVC copied. Source DBs untouched until verified (rollback = repoint).

**Also:** ringtail blumeops-pg cluster + ExternalSecrets scaffold; fixes pre-existing shower version-check drift.

Runbook: docs/how-to/ringtail/migrate-wave1-ringtail.md. Deploy-from-branch + cutover happens before merge; container images rebuilt from main after merge.
Reviewed-on: #363
2026-06-03 10:34:00 -07:00
40bd929820 C0: remove visible GNU Terry Pratchett from naughty.html body
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 37s
GNU lives in the overhead — the X-Clacks-Overhead header — never on the
visible page. Keep the header, drop the footer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 20:55:05 -07:00
a36a18aaa6 C0: black-hole /mirrors/* at Fly edge + name-and-shame scrapers
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 35s
A $29.60 Fly bill traced to ~1.25 TB/30d egress on forge.eblu.me (99.95% of
all proxy egress), ~71% of it AI scrapers (Meta meta-externalagent, OpenAI
GPTBot, Amazonbot, Bytespider) crawling the public mirror repos' infinite
git-history URL space and timing out Forgejo. robots.txt already disallowed
/mirrors/ but those agents ignore it, so enforce at the edge: return 403 (^~
to beat the regex asset locations), served as a roll-of-dishonour page with an
X-Naughty-Scrapers header. Mirrors stay reachable on the tailnet via
forge.ops.eblu.me. Tier 2 (UA denylist + Anubis) and the Cloudflare rejection
are documented in docs/explanation/ai-scraper-mitigation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 20:52:20 -07:00
e0064de83d C0: update ringtail flake inputs (nixpkgs, disko)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 15:52:09 -07:00
f588638331 C0: rebuild valkey from squashed main commit
Image tags from PR #362 (v8.1.7-02859c5{,-nix}) referenced a branch
SHA that no longer exists on main after squash-merge. Rebuilt both
the dagger arm64 and nix amd64 variants from the squashed commit
(ecded30) and updated paperless + immich-ringtail to the new tags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 14:53:21 -07:00
ecded30073 Make valkey local on ringtail (nix amd64) + bump to 8.1.7 (#362)
## Summary

Weekly "make one non-local container local" pickup: immich-ringtail still pulled `docker.io/valkey/valkey:8.1.6` because the existing `containers/valkey/container.py` build was arm64-only.

- Adds `containers/valkey/default.nix` — nix-built amd64 valkey image, packaged by the ringtail nix-container-builder runner using `pkgs.dockerTools.buildLayeredImage`. Mirrors the existing `containers/authentik-redis/default.nix` pattern.
- `containers/valkey/container.py` keeps building the Alpine arm64 image for paperless on indri. Bumped both builds to upstream valkey 8.1.7 (Alpine 3.22 now ships `8.1.7-r0`; nixpkgs has 8.1.7).
- Splits `VERSION` (upstream app) from `ALPINE_PIN` (apk pin) in `container.py` so both build files can declare the same upstream version and pass `container-version-check`.
- Updates `service-versions.yaml`: current-version 8.1.7, refreshed last-reviewed, upstream-source now points at the canonical valkey-io releases page.
- Switches kustomizations:
  - `immich-ringtail/kustomization.yaml`: `docker.io/valkey/valkey:8.1.6` → `registry.ops.eblu.me/blumeops/valkey:v8.1.7-02859c5-nix`, comment updated.
  - `paperless/kustomization.yaml`: `v8.1.6-r0-fabca04` → `v8.1.7-02859c5`.

## Build

build-container run #563 — both jobs succeeded after a transient runner crash on the first dispatch (#562 build-nix), which surfaced two separate bugs that landed in a separate C0 on main:

- `runner-logs` silently returned 0 with no output when the log file didn't exist on indri
- `ssh indri` swallowing remote exit codes (fish login shell), which the wrapper now works around via a stdout marker

## Test plan

- [ ] `argocd app set immich-ringtail --revision valkey-nix && argocd app sync immich-ringtail`
- [ ] `argocd app set paperless --revision valkey-nix && argocd app sync paperless`
- [ ] Both valkey pods come Ready and start serving on :6379
- [ ] Immich app + paperless can read/write their respective cache
- [ ] After merge: rebuild from squashed main commit + update kustomization tags (squash-tag follow-up)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #362
2026-05-28 14:51:09 -07:00
1ce381cb6e C0: surface missing-log failures in runner-logs
`mise run runner-logs <run> -j <n>` previously silently succeeded with
no output when forgejo had no log for the task. Two layered causes:

1. zstdcat exits 0 even when the file is missing (writes "can't stat
   … -- ignored" to stderr).
2. ssh to indri runs fish, which silently drops the remote exit code so
   the subprocess returncode is always 0.

Probe `test -f` over SSH and parse a stdout marker (EXISTS / MISSING) to
detect the missing-log case, then report it explicitly with the indri
path and a hint about action_task.log_in_storage = 0 so the operator
knows where to look next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 14:36:33 -07:00
e703d25efe C0: rebuild unpoller container from squashed main commit
Image was previously tagged with the unpoller-v3 branch SHA (1b27242),
which doesn't exist in main's history after squash-merge. Rebuilt from
the squashed commit so the tag references a reachable commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 10:10:21 -07:00
4d1f4af25b Upgrade unpoller v2.34.0 → v3.2.0, migrate to container.py (#361)
## Summary

- Service Review pickup: unpoller (last reviewed 73 days ago).
- Upgrades unpoller from v2.34.0 to v3.2.0 (major version bump).
- Migrates the container build from a Dockerfile to a native Dagger pipeline (`containers/unpoller/container.py`) following the navidrome / miniflux pattern.
- Refreshes `service-versions.yaml` (last-reviewed, current-version).

## Breaking changes (upstream)

- **v3.0.0** — UniFi network API shifts (later 10.x). Some metric / event / log names and labels may have changed. Worth a follow-up sweep of the unpoller Grafana dashboard for missing series.
- **v3.2.0** — defaults to a 60s background poll feeding cached Prometheus scrapes (was on-demand poll per scrape). To restore previous behavior, set `interval = 0` in `up.conf`. Leaving the new default in this PR — every-15s scrapes will simply serve from cache, which is fine for our use.

## Build

- Image: `registry.ops.eblu.me/blumeops/unpoller:v3.2.0-1b27242`
- Built by build-container workflow run #559 from this branch.

## Test plan

- [ ] `argocd app set unpoller --revision unpoller-v3 && argocd app sync unpoller`
- [ ] Pod comes Ready
- [ ] Verify metrics exported (`Site/Client/UAP/USG/USW` counts in logs, `unpoller_*` series in Prometheus)
- [ ] Spot-check unpoller Grafana dashboard for missing series after the v3 API shift
- [ ] After merge: `argocd app set unpoller --revision main && argocd app sync unpoller`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #361
2026-05-28 09:59:46 -07:00
f6febb1f77 C0: switch fly proxy deploy strategy to immediate
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 34s
Bluegreen kept timing out — the new green machine couldn't reach
"started" within Fly's 5-minute deploy budget. The cold-start sequence
(tailscaled → tailscale up → wait-for-MagicDNS → nginx startup) eats
most of that, leaving no headroom for healthcheck propagation.

For a single-machine proxy, bluegreen offers little benefit anyway:
no warm second instance, so trading 5-10s of downtime for predictable
completion is the right call.
2026-05-28 07:59:22 -07:00
4e25180b0a C0: clone blumeops via tailnet on ringtail provision
Switch ringtail.yml from forge.eblu.me (Fly proxy, WAN) to
forge.ops.eblu.me (Caddy on indri, tailnet). Ringtail is always
on the tailnet — the WAN round-trip was overhead and made
provision-ringtail fail any time Fly was slow or down.
2026-05-28 07:13:40 -07:00
c00d7db507 Recurring maintenance batch (2026-05-27) (#360)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 14m10s
Bundle of recurring overdue tasks:

- Ringtail flake update
- Security & compliance report review
- Tooling deps bump (prek, fly, mise, forgejo workflows)
- Top stale doc review
- Top stale service review (if trivial)

Larger items (service version bumps requiring upgrades, non-local container migration) split out as separate PRs.

Reviewed-on: #360
2026-05-28 06:01:57 -07:00
Erich Blume
753fa9cb63 C0: disable VRR on ringtail DP-1 to stop OMEN panel flicker
The OMEN 27i IPS pumps brightness when its refresh swings into the low
VRR range during low-framerate content (game cutscenes), producing a
~20Hz flicker that compounds over a session until a reboot. GPU health
is clean (no Xid/ECC/thermal); pinning fixed 165Hz eliminates it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:59:29 -07:00
Erich Blume
c09bd5b612 C0: cap systemd-coredump on ringtail to stop game-crash lockups
Wine/Proton game segfaults (e.g. Diablo IV) produced multi-GB cores that
systemd-coredump spent minutes compressing to disk, pinning the CPU and
freezing the desktop. Cap ProcessSizeMax/ExternalSizeMax at 1G (oversized
cores logged but skipped) and MaxUse at 2G to bound the store.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:54:32 -07:00
35ae171783 C0: fix sync button location in manage-forgejo-mirrors
The verify step pointed to the main repo page, but the "Synchronize now"
button is in the Mirror settings section of the settings page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:15:07 -07:00
57fd88b269 C0: fix op item edit syntax in zot key rotation
The pbpaste | op item edit ... "field[password]=-" stdin syntax is
rejected by op 2.34 as "invalid JSON" — recent op versions treat
piped input as a full JSON template, not a single field value.
Procedure now uses an inline assignment via a local fish variable.
2026-05-22 21:50:43 -07:00
08a1cb164a C0: fix 1password export filename in backup how-to
1Password's desktop app names exports as
1PasswordExport-<uuid>-<timestamp>.1pux automatically — you can't
choose the name. Procedure now points the task at that glob.
2026-05-22 21:36:13 -07:00
d02bf062af C0: review 1password reference card
Added vault split (blumeops vs Personal), noted onepassword-connect
runs on both indri and ringtail, and lifted op CLI guidance from
agent memory into the card. Bumped last-reviewed.
2026-05-22 21:29:11 -07:00
ee51bcafb4 Rip out compensating-controls framework (#359)
## Summary

Removes the compensating-controls (CC) framework. Prowler and Kingfisher continue to run weekly and produce reports; the Prowler mutelist YAML files stay in place but no longer carry \`CC: <id>\` prefixes — each entry now just keeps a free-form \`Description\` of why it's muted.

The CC review cadence proved to be more process overhead than this single-operator homelab needed.

## What changed

**Deleted**
- \`compensating-controls.yaml\` — the CC registry
- \`mise-tasks/review-compensating-controls\` — the staleness-review task
- \`docs/how-to/operations/review-compensating-controls.md\`
- \`docs/how-to/operations/record-review-evidence.md\` (was aspirational)
- \`docs/explanation/compliance-mute-categories.md\` (proposed-future CC/NA/RA work)
- 5 orphan \`+review-cc-*\` / \`+compliance-mute-categories\` changelog fragments

**Modified**
- 6 mutelist YAML files: stripped \`CC: <id>.\` prefix from every \`Description\` / \`statement\` field, kept the free-form text
- \`mise-tasks/review-compliance-reports\`: removed CC mentions from docstrings, panel text, and the node-verification table title. Node-verification logic itself is unchanged.
- \`docs/reference/operations/security.md\`: removed the "Compensating controls" section
- \`docs/how-to/operations/read-compliance-reports.md\`: rewrote step 3 of "Acting on findings" to point at the mutelist YAML directly
- \`docs/changelog.d/prowler-iac-mutelist.infra.md\`: rewrote to drop the "two new compensating controls" framing

## What did not change

- All Prowler manifests (cronjobs, RBAC, PVs, kustomization) — scans still run on the same schedule
- The Kingfisher deployment
- The trivy-shim in the Prowler container — that's about Trivy ignorefile plumbing, independent of the CC concept
- The mutelist entries themselves — each \`Resources\` list is unchanged; only the prose of \`Description\` was edited
- \`CHANGELOG.md\` — historical releases are left as-is

## Test plan

- [ ] Wait for human review before deploying — once merged, re-point ArgoCD: \`argocd app set prowler --revision main && argocd app sync prowler\` (no manifest changes besides the ConfigMap, so impact is limited to muted-finding descriptions in next week's report)
- [ ] Confirm next weekly Prowler K8s CIS run (Sunday 3am) still completes and produces a report on sifaka
- [ ] Confirm next weekly Prowler IaC run still honors \`trivyignore.yaml\` (the trivy shim is untouched but the ignorefile content was rewritten)
- [ ] \`mise run review-compliance-reports\` — verify node-verification block still runs and prints the renamed table title

Reviewed-on: #359
2026-05-22 21:08:53 -07:00
2fae0f7161 C0: switch grafana deployment to Recreate strategy
Grafana uses an RWO PVC for SQLite + Bleve search index. RollingUpdate
spawns the new pod before terminating the old one, so the new pod
crashloops on the index lock until rollout timeout. Recreate terminates
the old pod first, letting the new pod acquire the lock cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 06:33:26 -07:00
1897eb1c5b C0: move immich blackbox probe to ringtail alloy
Immich migrated to ringtail's k3s cluster but the probe still targeted
the in-cluster service DNS on indri's minikube, firing ServiceProbeFailure
indefinitely. Moved the target into alloy-ringtail's config so the probe
runs in the cluster where immich actually lives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 08:46:22 -07:00
e222d47d45 C0: deploy shower v1.1.3 (kustomize newTag bump)
Image v1.1.3-3645098-nix was built directly on ringtail and pushed via
skopeo, bypassing the Forgejo runner: indri was severely overloaded
(load avg 24.92, minikube VM at 344% CPU) and the workflow-dispatch
endpoint timed out. The image content is identical to what the runner
would have produced — same default.nix at commit 3645098 (on main),
same NIX_PATH (current nixpkgs flake), same skopeo invocation. Tag
short-sha matches the commit that defines the recipe so we aren't
pinning to a ghost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 20:09:54 -07:00
3645098bf1 C0: bump shower to v1.1.3
Wheel/sdist + FOD hashes probed on ringtail. Full nix-build verified
end-to-end before commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 19:57:37 -07:00
Erich Blume
96dbbb3cbe C0: add sn2-prelaunch wrapper to clear SN2 stale lockfiles
UE5 writes Saved/running.dat as a "session in progress" marker. If
the previous session exited uncleanly (SIGKILL, crash), it lingers,
and SN2 pops up an invisible 0×0 Error dialog at next launch that
the GameThread blocks on forever — visible only as a black screen
with a spinning loader. Wrap the Steam command to clear the marker
files before each launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 12:26:10 -07:00
815a0cc6e6 C0: shower — rebuild from main SHA (post-merge retag)
PR #358 was squash-merged so the branch commit b8c7783 baked into the
prior image tag isn't reachable from main's history. Rebuild from main
HEAD (a33fa47) and retag. Image content is byte-identical (FOD is
content-addressed, inputs unchanged); only the SHA in the tag changes
so future provenance tracing stays on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 06:57:24 -07:00
a33fa47b80 C1: deploy shower v1.1.2 (#358)
## Summary

Deploys `adelaide-baby-shower-app` **v1.1.2** to ringtail k3s.

- Bumps `containers/shower/default.nix` `version` to 1.1.2.
- Refreshes sdist + wheel `fetchurl` hashes against the forge PyPI artifacts.
- Re-probed FOD `outputHash` on the nix-container-builder runner (ringtail) and pinned the new closure hash.
- Bumps kustomize `newTag` to `v1.1.2-b8c7783-nix` (built from this branch's tip).
- Bumps `service-versions.yaml` entry for shower to `1.1.2` / `last-reviewed: 2026-05-15`.

## Build provenance

Built by Forgejo Actions run #553 on `nix-container-builder` (ringtail) at commit `b8c7783`. After merge a C0 follow-on will rebuild from main and retag so future provenance points at main history.

## Test plan

- [ ] `argocd app set shower --revision shower-v1.1.2 && argocd app sync shower` deploys cleanly
- [ ] Pod migrates the SQLite PV and serves at `shower.ops.eblu.me` / `shower.eblu.me`
- [ ] No new errors in pod logs after `collectstatic` + gunicorn boot

Reviewed-on: #358
2026-05-15 06:50:46 -07:00
Erich Blume
12314857d8 C0: add GE-Proton to ringtail Steam extraCompatPackages
Lets Subnautica 2 (and any other game) opt into the GE-Proton
build via Steam's per-game compatibility tool override, as a
workaround for the Proton Experimental + DXVK D3D12 Mercuna hang.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 06:27:43 -07:00
4d2bc9975f C0: deploy shower v1.1.1 (kustomize newTag bump)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:51:10 -07:00
4e117dc921 C0: pin shower v1.1.1 FOD outputHash (probed on ringtail)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:40:22 -07:00
6e90c4c363 C0: bump shower to v1.1.1 (probe FOD hash)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:12:00 -07:00
dc69b8c68b C1: fix borgmatic shower SQLite dump (ssh to ringtail) (#357)
## Summary

Nightly borgmatic backups have been failing for 2 days. Root cause: the
shower SQLite dump `before_backup` hook (added in PR #349) referenced
`kubectl --context=k3s-ringtail`, but indri's kubeconfig deliberately
doesn't carry the ringtail credentials. The hook's failure aborted the
entire run, taking out *both* the local sifaka repo and the BorgBase
offsite. Verified the last good archive was `indri-2026-05-11T02:00`.

## Approach

ssh into ringtail and run `k3s kubectl` there — no indri-side
kubeconfig needed. `/etc/rancher/k3s/k3s.yaml` is mode 644 so no sudo
required, and the existing ssh access from indri to ringtail works.

Inline-shell quoting got hairy fast (fish on ringtail rejected `POD=...`
bash syntax; the nix shower image lacks `tar` so `kubectl cp` fails).
Pulled the dump logic into `~/bin/borgmatic-k8s-sqlite-dump`, deployed
by the ansible role. Each dump entry now declares a `target`:

- `local:<context>` — local kubectl with explicit context (mealie)
- `ssh:<user@host>` — ssh + `k3s kubectl` on the cluster host (shower)

Bytes come back via `kubectl exec ... -- cat` instead of `kubectl cp`
since `cp` needs `tar` in the pod (nix-built containers don't bundle it).

## Test plan

- [x] `mise run provision-indri -- --tags borgmatic --check --diff` shows expected diff
- [x] Apply, helper script deployed at `~/bin/borgmatic-k8s-sqlite-dump`
- [x] Helper invoked directly with `ssh:eblume@ringtail` produces a valid 288 KB SQLite file
- [x] Full `borgmatic create` completes without errors — both mealie.db (1.7 MB) and shower.db (288 KB) appear in `~/.local/share/borgmatic/k8s-dumps/`, archive `indri-2026-05-13T17:31:02` written to sifaka borg repo

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #357
2026-05-13 18:55:50 -07:00
947e4310c3 C2: migrate immich from minikube to ringtail (mikado chain) (#356)
## Summary

C2 Mikado chain to move the entire Immich stack (server, ML, valkey,
postgres) off `minikube-indri` and onto `k3s-ringtail`. Immich is the
largest single tenant on minikube (~1.5 GiB resident) and minikube is
currently memory-saturated (97% RAM, swapping). This is the first
concrete chain in the broader indri-k8s decommission effort.

This PR contains the planning layer only — 7 cards (1 goal + 6
prerequisites). Implementation cycles follow per the Mikado Branch
Invariant.

## Goal end-state

- Immich `server`, `machine-learning`, `valkey` on ringtail.
- ML pod uses ringtail's RTX 4080 (performance win — currently
  CPU-only).
- CNPG `immich-pg` (PG17 + VectorChord) runs on ringtail.
- Library still on sifaka NFS — ringtail mounts the same path.
- `photos.ops.eblu.me` reroutes through Caddy → ringtail ingress.
- Minikube `immich` and `immich-pg` are removed.

## Cards

| Card | Depends on |
|---|---|
| `migrate-immich-to-ringtail` (goal) | all six below |
| `cnpg-on-ringtail` | — |
| `immich-pg-on-ringtail` | cnpg-on-ringtail |
| `immich-pg-data-migration` | immich-pg-on-ringtail |
| `sifaka-nfs-from-ringtail` | — |
| `immich-app-on-ringtail` | immich-pg-on-ringtail, sifaka-nfs-from-ringtail |
| `immich-cutover-and-decommission` | immich-pg-data-migration, immich-app-on-ringtail |

## Key constraints

- **No data loss.** Downtime is acceptable; data loss is not. Two
  surfaces matter: postgres (ML embeddings, face data — slow to
  re-derive) and the library files (don't move, but NFS access from
  ringtail must be verified).
- **Migration method:** Option A is a CNPG `externalCluster`
  basebackup → promote. Option B is `pg_dump`/`pg_restore` as a
  documented fallback. Either way, dry-run against a scratch
  cluster first.
- **Why pg moves too** (not cross-cluster): keeping pg on minikube
  would block the whole decommission, and Immich is chatty with pg
  so tailnet round-trips would hurt.

## Test plan

- [ ] Plan review — does the dependency graph make sense?
- [ ] `mise run docs-mikado migrate-immich-to-ringtail` shows the
      chain correctly.
- [ ] Per-card implementation cycles land separately (commit
      convention enforced by hook).

Reviewed-on: #356
2026-05-13 16:46:17 -07:00
bc8ceb502b Merge pull request 'C1: pin ringtail wired IP to 192.168.1.21 (static)' (#355) from ringtail-static-ip into main 2026-05-12 09:59:59 -07:00