Compare commits

...

937 commits

Author SHA1 Message Date
bc34b601be Merge pull request 'heph Authentik: grant offline_access scope (fixes spoke sync refresh-token 400)' (#371) from heph-offline-access into main 2026-06-06 18:29:47 -07:00
50a36ff93a heph Authentik: grant offline_access scope (fixes spoke sync refresh-token 400)
The heph CLI requests scope "openid offline_access", but the Authentik
heph OAuth2 provider only mapped openid/email/profile. Without the
offline_access mapping the issued refresh token is bound to the login
session rather than the 30-day refresh-token window; once the session
lapses, hephd's refresh_token grant returns 400 Bad Request and spoke
sync silently degrades (heph sync --status -> auth_failure: true).

Add the built-in offline_access scope mapping to the provider's
property_mappings and document the requirement in the service reference.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-06 18:07:13 -07:00
cf63fcb5b5 C0: track heph in service-versions (self-updating; note drift task)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 08:22:46 -07:00
3abe80523a C0: bump indri heph hub to v1.2.1 (PWA Authentik login + /config)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-05 07:40:51 -07:00
6576880b0e heph Authentik: register heph-pwa redirect URIs (PKCE login) (#370)
Adds the heph-pwa redirect URIs to the Authentik `heph` OAuth2 provider so the new browser **Login with Authentik** flow (Authorization Code + PKCE, hephaestus PR #9) can redirect back and exchange the code:

- `https://heph.ops.eblu.me/` (the PWA origin)
- `http://localhost:8787/` (local dev: `hephd --web-root`)

Authentik also keys token-endpoint CORS off these origins, so they're required for the browser token exchange. Additive (the provider was `redirect_uris: []`); harmless until the PWA feature deploys.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #370
2026-06-05 07:30:31 -07:00
a2f1e06224 Add hephaestus sync hub to indri (launchagent, PWA, device-code OIDC) (#369)
Makes indri the canonical **heph** hub for the hub-and-spoke task/context system, deployed as a self-updating LaunchAgent managed by Ansible. Other devices (gilbert) attach as offline-capable spokes.

## What's here
- **`ansible/roles/heph`** (tag `heph`) — bootstrap `cargo install hephd` (only if absent; `--self-update` keeps it current after), version-pinned `heph-pwa` checkout served via `--web-root`, launchagent `mcquack.eblume.heph`:
  ```
  hephd --mode server --http-addr 0.0.0.0:8787 --db … --web-root …
        --oidc-issuer …/o/heph/ --oidc-audience heph
        --self-update --self-update-interval-secs 600
  ```
  `~/.cargo/bin` is on the agent `PATH` so self-update's `cargo install` works.
- **Caddy** — `heph.ops.eblu.me → localhost:8787` (TLS for the PWA secure context).
- **Authentik** — new `heph` **public device-code** OIDC app + `default-device-code-flow` bound to the default brand's `flow_device_code` (verified live: brand `authentik-default`, field currently unset → additive).
- **Docs** — `services/hephaestus.md` (Path-A seeding runbook + spoke caveat), `indri.md`, changelog fragment.

## Three features requested
- **Autoupdate** — 10-min interval (`--self-update-interval-secs 600`).
- **PWA** — `--web-root` (confirmed shipped in v1.2.0).
- **Spoke** — gilbert reconfig documented (post-merge step).

## Deploy plan (not done yet — awaiting review)
1. Seed from gilbert (Path A): `heph daemon stop` → copy `heph.db` → `DELETE FROM meta WHERE key='origin'`.
2. Sync Authentik `apps`/blueprint; verify blueprint status via API (not just logs).
3. `provision-indri --tags heph,caddy` from this branch.
4. Point gilbert at the hub + `heph auth login`.

## Known follow-ups (heph-side, tracked in the Hephaestus project)
- `heph daemon` can't bake hub/spoke config or pass `--self-update-interval-secs` → worked around by the ansible plist.
- Path-A seeding lacks a clean `hephd --owner-id`/seed command → manual `meta.origin` reset for now.
- Self-update moves hephd ahead of the ansible-pinned PWA shell over time (drift; tolerated by the SW cache, revisit on next release).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #369
2026-06-05 06:46:58 -07:00
f6c926f1f5 C0: rebuild external-secrets off main, repoint both clusters to stable tags
indri -> v2.2.0-13895bb (arm64), ringtail -> v2.2.0-13895bb-nix (amd64).
Both deployed images now trace to main commit 13895bb instead of earlier
branch builds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 16:19:20 -07:00
13895bb04a Localize external-secrets on ringtail (amd64 nix build) (#368)
Follow-up to #367. That PR localized external-secrets but the Dagger build (on indri's Apple Silicon runner) only produces an **arm64** image — and external-secrets also runs on **ringtail (amd64)** via the same shared manifest. This completes the localization so both clusters run the local binary on their native arch.

## Approach (matches the kube-state-metrics dual-build pattern)
- **`containers/external-secrets/default.nix`** (new) — builds the **amd64** image on ringtail's nix-container-builder. `buildGoModule` with Go 1.26 (v2.2.0 requires ≥1.26.1; nixpkgs default is 1.25.x) and `-tags all_providers`, faithful to upstream. Same v2.2.0 source from the forge mirror.
- **`argocd/manifests/external-secrets-ringtail/`** (new) — thin kustomize overlay that reuses the shared indri manifest as a base and overrides **only** the image to the `-nix` (amd64) tag. No manifest duplication.
- **`argocd/apps/external-secrets-ringtail.yaml`** — repointed at the new overlay.

Result: indri → `v2.2.0-…` (arm64, Dagger), ringtail → `v2.2.0-…-nix` (amd64, nix).

## Build
Run #581 built both arches at the branch commit. Verified the nix image is `linux/amd64`, entrypoint = the binary, user 65534.

## Deployed from branch & verified on ringtail (k3s, amd64)
- All 3 pods rolled to the nix amd64 image, `1/1 Running` (no exec-format error → arch correct)
- Controller logs clean
- **Live secret fetch proven:** force-synced `homepage/homepage-grafana` → `refreshTime` advanced, `Ready=True`
- **All 20** ringtail ExternalSecrets remain `SecretSynced=True`

## Post-merge
The `external-secrets-ringtail` app is temporarily pointed at this branch + overlay path (apps app left on `main`, manual-sync, untouched). After merge:
```
argocd app sync apps                       # picks up the new Application path on main
argocd app set external-secrets-ringtail --revision main && argocd app sync external-secrets-ringtail
```
I'll also rebuild off `main` so both clusters land on stable main-sha tags (as done for indri in #367).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #368
2026-06-04 15:37:42 -07:00
30c82079b9 C0: rebuild external-secrets image off main (v2.2.0-0e70a1b)
Repoint to the main-branch-built image so the deployed tag traces to a main
commit rather than the merged feature branch. Same v2.2.0 source, stable
provenance.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 14:59:17 -07:00
0e70a1b524 Localize external-secrets container (native container.py build) (#367)
Knocks out the weekly "pick one non-local container and make it local" task by moving **external-secrets** off `ghcr.io` onto a locally-built image, under our own supply-chain control. Doubles as its overdue service review.

## What changed
- **`containers/external-secrets/container.py`** (new) — native Dagger build (the Dockerfile→container.py migration pattern). Clones the forge mirror at `v2.2.0` and builds the single `all_providers` static Go binary, faithful to upstream's `make build` (CGO off, no version ldflags upstream). ENTRYPOINT is `/bin/external-secrets` so the controller/webhook/cert-controller Deployments select their role via `args:` exactly as before.
- **`argocd/manifests/external-secrets/kustomization.yaml`** — image swapped to `registry.ops.eblu.me/blumeops/external-secrets:v2.2.0-2985007`. **Like-for-like (v2.2.0)**, not an upgrade.
- **`service-versions.yaml`** — marked reviewed (2026-06-04), noted the local build.

## Build
Built on the indri forge runner (run #579, ~4 min) → pushed to Zot. Image config verified: `Entrypoint=/bin/external-secrets`, `User=65534`, version label `v2.2.0`.

## Deployed from branch & verified
- All 3 pods (controller / webhook / cert-controller) rolled to the local image, `1/1 Running`
- Controller + webhook logs clean (no errors; webhook serving TLS)
- **End-to-end secret fetch proven:** force-synced `monitoring/grafana-admin` → `refreshTime` advanced to now, `Ready=True`
- All 10 ExternalSecrets cluster-wide remain `SecretSynced=True` — no collateral damage
- App `Healthy`

## Post-merge
`external-secrets` currently points at this branch (so `apps` reads OutOfSync — expected). After merge:
```
argocd app set external-secrets --revision main && argocd app sync external-secrets
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #367
2026-06-04 14:55:55 -07:00
bb55fa9566 Recurring review sweep: 4 doc cards + nvidia-device-plugin v0.19.2 (#366)
Knocks out the two daily recurring review tasks (doc review + service review) in one PR.

## Doc review (4 never-reviewed reference cards, `last-reviewed: 2026-06-04`)
- **cluster.md** — Kubernetes version v1.34.0 → **v1.35.0**; refreshed the stale ringtail workload list and noted the in-progress minikube→k3s migration (points to `[[ringtail]]` as the canonical list).
- **ntfy.md / tempo.md / alloy.md** — corrected image references: these are now **locally-built `registry.ops.eblu.me/blumeops/*` nix containers** (ntfy v2.19.2, tempo v2.10.3, alloy-k8s v1.16.0), not upstream Docker Hub. Fly.io alloy binary bumped to v1.16.1.

## Service review
- **nvidia-device-plugin** (ringtail GPU): v0.19.0 → **v0.19.2**. Upstream patch releases — CDI/Tegra fixes + dependency bumps, no breaking changes for our manifest-based CDI + RuntimeClass setup (the service-account change in the notes is helm-only).

## Not in this PR (need container rebuilds, deferred)
The other stale services are locally-built nix images, so upgrading them is a forge-runner rebuild rather than a clean tag bump — left untouched (not date-bumped, so they resurface): **prometheus** (v3.10.0→v3.12.0), **loki** (3.6.7→3.7.2), **kube-state-metrics**, **homepage**. Happy to do these as a follow-up rebuild PR.

## Deploy / verify
Not yet deployed — `nvidia-device-plugin` still points at `main`. After review:
```
argocd app set nvidia-device-plugin --revision reviews-jun4 && argocd app sync nvidia-device-plugin
# after merge:
argocd app set nvidia-device-plugin --revision main && argocd app sync nvidia-device-plugin
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #366
2026-06-04 13:37:02 -07:00
02ea1cc72a C0: point tailscale-operator base mirror fetch at tailnet forge
The public forge.eblu.me now black-holes /mirrors/ at the Fly edge
(AI-scraper mitigation), so the in-cluster ArgoCD repo-server got a 403
fetching the upstream operator manifest — leaving tailscale-operator and
tailscale-operator-ringtail in Unknown sync. Use forge.ops.eblu.me.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-04 12:40:21 -07:00
Forgejo Actions
8f72f04d5c Update docs release to v1.17.0
- Built changelog from towncrier fragments

[skip ci]
2026-06-03 21:52:22 -07:00
29e0f012cd C0: pin Quartz docs build to v4.5.2 (v5.0.0 broke build)
The Dagger build_docs pipeline cloned Quartz from the default branch
unpinned. Quartz v5.0.0 restructured its config layout (.quartz/plugins,
../quartz imports), breaking the docs build against our existing
quartz.config.ts / quartz.layout.ts. Pin the clone to the last v4
release (v4.5.2) to restore known-good behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 21:39:41 -07:00
2148714584 C0: retire Todoist blumeops-tasks; point task discovery at heph
Replace the Todoist-backed blumeops-tasks mise task with
`heph list --project Blumeops --json` (hephaestus, now at v1 prototype
on gilbert). Update task-discovery, rotation-reminder, and zk
references across docs; note the zk zettelkasten is migrating into
heph docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 21:32:10 -07:00
308c8e3dad C0: drop duplicate Homepage static entries for ringtail-migrated services
Mealie, Paperless, Immich, TeslaMate are now autodiscovered from their ringtail
Ingress gethomepage.dev annotations; the static services.yaml entries (from when
they were on minikube, which homepage-on-ringtail can't autodiscover) were
duplicating them.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 15:31:59 -07:00
eaa899cfc6 C0: wave-1 decommission follow-ups (argocd admin RBAC, teslamate probe)
- argocd: grant local break-glass admin the admin role (g, admin, role:admin);
  previously only the Authentik admins group had access, locking out admin
  once its token expired (policy.default is unset).
- alloy-k8s: repoint the teslamate blackbox probe from the deleted minikube
  service to https://tesla.ops.eblu.me/ (Caddy over Tailscale), like immich.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 13:02:05 -07:00
46f0002178 Decommission wave-1 minikube services (paperless, teslamate, mealie) (#365)
Final step of the wave-1 indri-k8s migration. paperless, teslamate, mealie run on ringtail with data migrated, verified, and backed up (local + BorgBase offsite via PR #364).

- Remove minikube paperless/teslamate/mealie manifest dirs + ArgoCD app defs (prunes the parked Deployments/Services + redundant minikube mealie/paperless PVCs)
- Drop paperless/teslamate roles + ExternalSecrets from the minikube blumeops-pg cluster
- miniflux + authentik stay on minikube (later waves)

Finalization after merge: sync apps + databases to prune, then DROP DATABASE paperless/teslamate on indri's blumeops-pg (fresh safety dump taken first).

Reviewed-on: #365
2026-06-03 12:36:06 -07:00
44798a6429 C0: mealie-ringtail image rebuilt from main (e0057b4-nix)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 12:26:55 -07:00
e0057b46e4 Wire ringtail blumeops-pg into backups + Grafana (#364)
Prereq for the wave-1 decommission. The cutover moved paperless+teslamate (postgres) and mealie (SQLite) to ringtail, but borgmatic and the Grafana TeslaMate datasource still pointed at the minikube copies — the migrated live data was unbacked since cutover, and dropping the minikube DBs would break the TeslaMate dashboards.

- Tailscale Service `blumeops-pg-ringtail` + Caddy L4 route `pg.ops.eblu.me:5434`
- borgmatic: teslamate + paperless postgres → :5434; mealie SQLite → ssh:eblume@ringtail
- Grafana TeslaMate datasource → pg.ops.eblu.me:5434

Deploy: sync databases-ringtail (tailscale svc) + grafana from branch; provision-indri --tags caddy,borgmatic; verify a backup run + dashboards. Unblocks the decommission PR.
Reviewed-on: #364
2026-06-03 12:25:30 -07:00
92b54e7ba9 C0: ringtail wave-1 images rebuilt from main (fcac8e5-nix tags)
Post-merge rebuild of paperless/mealie/teslamate Nix images at the main
merge commit, replacing the feature-branch -nix tags. Image content is
identical; only the commit-sha suffix changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-03 10:36:15 -07:00
fcac8e5a72 Wave 1 indri→ringtail migration: paperless, teslamate, mealie (#363)
Migrate paperless, teslamate, and mealie off the OOM-saturated minikube-indri node onto ringtail k3s, shedding ~1.1 GiB of resident load. Second chain in the indri-k8s decommission after immich.

**Containers ported to Nix (default.nix), build-verified on ringtail:**
- paperless → wraps nixpkgs paperless-ngx 2.20.15 (pinned unstable); runs as web/worker/beat/consumer
- mealie → wraps nixpkgs mealie 3.16.0 (forward 4-minor bump, breaking-change reviewed); single gunicorn, SQLite
- teslamate → from-scratch beamPackages mixRelease (not in nixpkgs); erlang_27+elixir_1_18, npm assets, ex_cldr locales pre-fetched

**Data:** cold downtime-tolerant cutover. paperless+teslamate postgres dump/restore from quiesced source into a new ringtail blumeops-pg CNPG cluster; mealie SQLite PVC copied. Source DBs untouched until verified (rollback = repoint).

**Also:** ringtail blumeops-pg cluster + ExternalSecrets scaffold; fixes pre-existing shower version-check drift.

Runbook: docs/how-to/ringtail/migrate-wave1-ringtail.md. Deploy-from-branch + cutover happens before merge; container images rebuilt from main after merge.
Reviewed-on: #363
2026-06-03 10:34:00 -07:00
40bd929820 C0: remove visible GNU Terry Pratchett from naughty.html body
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 37s
GNU lives in the overhead — the X-Clacks-Overhead header — never on the
visible page. Keep the header, drop the footer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 20:55:05 -07:00
a36a18aaa6 C0: black-hole /mirrors/* at Fly edge + name-and-shame scrapers
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 35s
A $29.60 Fly bill traced to ~1.25 TB/30d egress on forge.eblu.me (99.95% of
all proxy egress), ~71% of it AI scrapers (Meta meta-externalagent, OpenAI
GPTBot, Amazonbot, Bytespider) crawling the public mirror repos' infinite
git-history URL space and timing out Forgejo. robots.txt already disallowed
/mirrors/ but those agents ignore it, so enforce at the edge: return 403 (^~
to beat the regex asset locations), served as a roll-of-dishonour page with an
X-Naughty-Scrapers header. Mirrors stay reachable on the tailnet via
forge.ops.eblu.me. Tier 2 (UA denylist + Anubis) and the Cloudflare rejection
are documented in docs/explanation/ai-scraper-mitigation.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 20:52:20 -07:00
e0064de83d C0: update ringtail flake inputs (nixpkgs, disko)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-01 15:52:09 -07:00
f588638331 C0: rebuild valkey from squashed main commit
Image tags from PR #362 (v8.1.7-02859c5{,-nix}) referenced a branch
SHA that no longer exists on main after squash-merge. Rebuilt both
the dagger arm64 and nix amd64 variants from the squashed commit
(ecded30) and updated paperless + immich-ringtail to the new tags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 14:53:21 -07:00
ecded30073 Make valkey local on ringtail (nix amd64) + bump to 8.1.7 (#362)
## Summary

Weekly "make one non-local container local" pickup: immich-ringtail still pulled `docker.io/valkey/valkey:8.1.6` because the existing `containers/valkey/container.py` build was arm64-only.

- Adds `containers/valkey/default.nix` — nix-built amd64 valkey image, packaged by the ringtail nix-container-builder runner using `pkgs.dockerTools.buildLayeredImage`. Mirrors the existing `containers/authentik-redis/default.nix` pattern.
- `containers/valkey/container.py` keeps building the Alpine arm64 image for paperless on indri. Bumped both builds to upstream valkey 8.1.7 (Alpine 3.22 now ships `8.1.7-r0`; nixpkgs has 8.1.7).
- Splits `VERSION` (upstream app) from `ALPINE_PIN` (apk pin) in `container.py` so both build files can declare the same upstream version and pass `container-version-check`.
- Updates `service-versions.yaml`: current-version 8.1.7, refreshed last-reviewed, upstream-source now points at the canonical valkey-io releases page.
- Switches kustomizations:
  - `immich-ringtail/kustomization.yaml`: `docker.io/valkey/valkey:8.1.6` → `registry.ops.eblu.me/blumeops/valkey:v8.1.7-02859c5-nix`, comment updated.
  - `paperless/kustomization.yaml`: `v8.1.6-r0-fabca04` → `v8.1.7-02859c5`.

## Build

build-container run #563 — both jobs succeeded after a transient runner crash on the first dispatch (#562 build-nix), which surfaced two separate bugs that landed in a separate C0 on main:

- `runner-logs` silently returned 0 with no output when the log file didn't exist on indri
- `ssh indri` swallowing remote exit codes (fish login shell), which the wrapper now works around via a stdout marker

## Test plan

- [ ] `argocd app set immich-ringtail --revision valkey-nix && argocd app sync immich-ringtail`
- [ ] `argocd app set paperless --revision valkey-nix && argocd app sync paperless`
- [ ] Both valkey pods come Ready and start serving on :6379
- [ ] Immich app + paperless can read/write their respective cache
- [ ] After merge: rebuild from squashed main commit + update kustomization tags (squash-tag follow-up)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #362
2026-05-28 14:51:09 -07:00
1ce381cb6e C0: surface missing-log failures in runner-logs
`mise run runner-logs <run> -j <n>` previously silently succeeded with
no output when forgejo had no log for the task. Two layered causes:

1. zstdcat exits 0 even when the file is missing (writes "can't stat
   … -- ignored" to stderr).
2. ssh to indri runs fish, which silently drops the remote exit code so
   the subprocess returncode is always 0.

Probe `test -f` over SSH and parse a stdout marker (EXISTS / MISSING) to
detect the missing-log case, then report it explicitly with the indri
path and a hint about action_task.log_in_storage = 0 so the operator
knows where to look next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 14:36:33 -07:00
e703d25efe C0: rebuild unpoller container from squashed main commit
Image was previously tagged with the unpoller-v3 branch SHA (1b27242),
which doesn't exist in main's history after squash-merge. Rebuilt from
the squashed commit so the tag references a reachable commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-28 10:10:21 -07:00
4d1f4af25b Upgrade unpoller v2.34.0 → v3.2.0, migrate to container.py (#361)
## Summary

- Service Review pickup: unpoller (last reviewed 73 days ago).
- Upgrades unpoller from v2.34.0 to v3.2.0 (major version bump).
- Migrates the container build from a Dockerfile to a native Dagger pipeline (`containers/unpoller/container.py`) following the navidrome / miniflux pattern.
- Refreshes `service-versions.yaml` (last-reviewed, current-version).

## Breaking changes (upstream)

- **v3.0.0** — UniFi network API shifts (later 10.x). Some metric / event / log names and labels may have changed. Worth a follow-up sweep of the unpoller Grafana dashboard for missing series.
- **v3.2.0** — defaults to a 60s background poll feeding cached Prometheus scrapes (was on-demand poll per scrape). To restore previous behavior, set `interval = 0` in `up.conf`. Leaving the new default in this PR — every-15s scrapes will simply serve from cache, which is fine for our use.

## Build

- Image: `registry.ops.eblu.me/blumeops/unpoller:v3.2.0-1b27242`
- Built by build-container workflow run #559 from this branch.

## Test plan

- [ ] `argocd app set unpoller --revision unpoller-v3 && argocd app sync unpoller`
- [ ] Pod comes Ready
- [ ] Verify metrics exported (`Site/Client/UAP/USG/USW` counts in logs, `unpoller_*` series in Prometheus)
- [ ] Spot-check unpoller Grafana dashboard for missing series after the v3 API shift
- [ ] After merge: `argocd app set unpoller --revision main && argocd app sync unpoller`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #361
2026-05-28 09:59:46 -07:00
f6febb1f77 C0: switch fly proxy deploy strategy to immediate
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 34s
Bluegreen kept timing out — the new green machine couldn't reach
"started" within Fly's 5-minute deploy budget. The cold-start sequence
(tailscaled → tailscale up → wait-for-MagicDNS → nginx startup) eats
most of that, leaving no headroom for healthcheck propagation.

For a single-machine proxy, bluegreen offers little benefit anyway:
no warm second instance, so trading 5-10s of downtime for predictable
completion is the right call.
2026-05-28 07:59:22 -07:00
4e25180b0a C0: clone blumeops via tailnet on ringtail provision
Switch ringtail.yml from forge.eblu.me (Fly proxy, WAN) to
forge.ops.eblu.me (Caddy on indri, tailnet). Ringtail is always
on the tailnet — the WAN round-trip was overhead and made
provision-ringtail fail any time Fly was slow or down.
2026-05-28 07:13:40 -07:00
c00d7db507 Recurring maintenance batch (2026-05-27) (#360)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 14m10s
Bundle of recurring overdue tasks:

- Ringtail flake update
- Security & compliance report review
- Tooling deps bump (prek, fly, mise, forgejo workflows)
- Top stale doc review
- Top stale service review (if trivial)

Larger items (service version bumps requiring upgrades, non-local container migration) split out as separate PRs.

Reviewed-on: #360
2026-05-28 06:01:57 -07:00
Erich Blume
753fa9cb63 C0: disable VRR on ringtail DP-1 to stop OMEN panel flicker
The OMEN 27i IPS pumps brightness when its refresh swings into the low
VRR range during low-framerate content (game cutscenes), producing a
~20Hz flicker that compounds over a session until a reboot. GPU health
is clean (no Xid/ECC/thermal); pinning fixed 165Hz eliminates it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 12:59:29 -07:00
Erich Blume
c09bd5b612 C0: cap systemd-coredump on ringtail to stop game-crash lockups
Wine/Proton game segfaults (e.g. Diablo IV) produced multi-GB cores that
systemd-coredump spent minutes compressing to disk, pinning the CPU and
freezing the desktop. Cap ProcessSizeMax/ExternalSizeMax at 1G (oversized
cores logged but skipped) and MaxUse at 2G to bound the store.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 11:54:32 -07:00
35ae171783 C0: fix sync button location in manage-forgejo-mirrors
The verify step pointed to the main repo page, but the "Synchronize now"
button is in the Mirror settings section of the settings page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-27 07:15:07 -07:00
57fd88b269 C0: fix op item edit syntax in zot key rotation
The pbpaste | op item edit ... "field[password]=-" stdin syntax is
rejected by op 2.34 as "invalid JSON" — recent op versions treat
piped input as a full JSON template, not a single field value.
Procedure now uses an inline assignment via a local fish variable.
2026-05-22 21:50:43 -07:00
08a1cb164a C0: fix 1password export filename in backup how-to
1Password's desktop app names exports as
1PasswordExport-<uuid>-<timestamp>.1pux automatically — you can't
choose the name. Procedure now points the task at that glob.
2026-05-22 21:36:13 -07:00
d02bf062af C0: review 1password reference card
Added vault split (blumeops vs Personal), noted onepassword-connect
runs on both indri and ringtail, and lifted op CLI guidance from
agent memory into the card. Bumped last-reviewed.
2026-05-22 21:29:11 -07:00
ee51bcafb4 Rip out compensating-controls framework (#359)
## Summary

Removes the compensating-controls (CC) framework. Prowler and Kingfisher continue to run weekly and produce reports; the Prowler mutelist YAML files stay in place but no longer carry \`CC: <id>\` prefixes — each entry now just keeps a free-form \`Description\` of why it's muted.

The CC review cadence proved to be more process overhead than this single-operator homelab needed.

## What changed

**Deleted**
- \`compensating-controls.yaml\` — the CC registry
- \`mise-tasks/review-compensating-controls\` — the staleness-review task
- \`docs/how-to/operations/review-compensating-controls.md\`
- \`docs/how-to/operations/record-review-evidence.md\` (was aspirational)
- \`docs/explanation/compliance-mute-categories.md\` (proposed-future CC/NA/RA work)
- 5 orphan \`+review-cc-*\` / \`+compliance-mute-categories\` changelog fragments

**Modified**
- 6 mutelist YAML files: stripped \`CC: <id>.\` prefix from every \`Description\` / \`statement\` field, kept the free-form text
- \`mise-tasks/review-compliance-reports\`: removed CC mentions from docstrings, panel text, and the node-verification table title. Node-verification logic itself is unchanged.
- \`docs/reference/operations/security.md\`: removed the "Compensating controls" section
- \`docs/how-to/operations/read-compliance-reports.md\`: rewrote step 3 of "Acting on findings" to point at the mutelist YAML directly
- \`docs/changelog.d/prowler-iac-mutelist.infra.md\`: rewrote to drop the "two new compensating controls" framing

## What did not change

- All Prowler manifests (cronjobs, RBAC, PVs, kustomization) — scans still run on the same schedule
- The Kingfisher deployment
- The trivy-shim in the Prowler container — that's about Trivy ignorefile plumbing, independent of the CC concept
- The mutelist entries themselves — each \`Resources\` list is unchanged; only the prose of \`Description\` was edited
- \`CHANGELOG.md\` — historical releases are left as-is

## Test plan

- [ ] Wait for human review before deploying — once merged, re-point ArgoCD: \`argocd app set prowler --revision main && argocd app sync prowler\` (no manifest changes besides the ConfigMap, so impact is limited to muted-finding descriptions in next week's report)
- [ ] Confirm next weekly Prowler K8s CIS run (Sunday 3am) still completes and produces a report on sifaka
- [ ] Confirm next weekly Prowler IaC run still honors \`trivyignore.yaml\` (the trivy shim is untouched but the ignorefile content was rewritten)
- [ ] \`mise run review-compliance-reports\` — verify node-verification block still runs and prints the renamed table title

Reviewed-on: #359
2026-05-22 21:08:53 -07:00
2fae0f7161 C0: switch grafana deployment to Recreate strategy
Grafana uses an RWO PVC for SQLite + Bleve search index. RollingUpdate
spawns the new pod before terminating the old one, so the new pod
crashloops on the index lock until rollout timeout. Recreate terminates
the old pod first, letting the new pod acquire the lock cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-19 06:33:26 -07:00
1897eb1c5b C0: move immich blackbox probe to ringtail alloy
Immich migrated to ringtail's k3s cluster but the probe still targeted
the in-cluster service DNS on indri's minikube, firing ServiceProbeFailure
indefinitely. Moved the target into alloy-ringtail's config so the probe
runs in the cluster where immich actually lives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 08:46:22 -07:00
e222d47d45 C0: deploy shower v1.1.3 (kustomize newTag bump)
Image v1.1.3-3645098-nix was built directly on ringtail and pushed via
skopeo, bypassing the Forgejo runner: indri was severely overloaded
(load avg 24.92, minikube VM at 344% CPU) and the workflow-dispatch
endpoint timed out. The image content is identical to what the runner
would have produced — same default.nix at commit 3645098 (on main),
same NIX_PATH (current nixpkgs flake), same skopeo invocation. Tag
short-sha matches the commit that defines the recipe so we aren't
pinning to a ghost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 20:09:54 -07:00
3645098bf1 C0: bump shower to v1.1.3
Wheel/sdist + FOD hashes probed on ringtail. Full nix-build verified
end-to-end before commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 19:57:37 -07:00
Erich Blume
96dbbb3cbe C0: add sn2-prelaunch wrapper to clear SN2 stale lockfiles
UE5 writes Saved/running.dat as a "session in progress" marker. If
the previous session exited uncleanly (SIGKILL, crash), it lingers,
and SN2 pops up an invisible 0×0 Error dialog at next launch that
the GameThread blocks on forever — visible only as a black screen
with a spinning loader. Wrap the Steam command to clear the marker
files before each launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 12:26:10 -07:00
815a0cc6e6 C0: shower — rebuild from main SHA (post-merge retag)
PR #358 was squash-merged so the branch commit b8c7783 baked into the
prior image tag isn't reachable from main's history. Rebuild from main
HEAD (a33fa47) and retag. Image content is byte-identical (FOD is
content-addressed, inputs unchanged); only the SHA in the tag changes
so future provenance tracing stays on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 06:57:24 -07:00
a33fa47b80 C1: deploy shower v1.1.2 (#358)
## Summary

Deploys `adelaide-baby-shower-app` **v1.1.2** to ringtail k3s.

- Bumps `containers/shower/default.nix` `version` to 1.1.2.
- Refreshes sdist + wheel `fetchurl` hashes against the forge PyPI artifacts.
- Re-probed FOD `outputHash` on the nix-container-builder runner (ringtail) and pinned the new closure hash.
- Bumps kustomize `newTag` to `v1.1.2-b8c7783-nix` (built from this branch's tip).
- Bumps `service-versions.yaml` entry for shower to `1.1.2` / `last-reviewed: 2026-05-15`.

## Build provenance

Built by Forgejo Actions run #553 on `nix-container-builder` (ringtail) at commit `b8c7783`. After merge a C0 follow-on will rebuild from main and retag so future provenance points at main history.

## Test plan

- [ ] `argocd app set shower --revision shower-v1.1.2 && argocd app sync shower` deploys cleanly
- [ ] Pod migrates the SQLite PV and serves at `shower.ops.eblu.me` / `shower.eblu.me`
- [ ] No new errors in pod logs after `collectstatic` + gunicorn boot

Reviewed-on: #358
2026-05-15 06:50:46 -07:00
Erich Blume
12314857d8 C0: add GE-Proton to ringtail Steam extraCompatPackages
Lets Subnautica 2 (and any other game) opt into the GE-Proton
build via Steam's per-game compatibility tool override, as a
workaround for the Proton Experimental + DXVK D3D12 Mercuna hang.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 06:27:43 -07:00
4d2bc9975f C0: deploy shower v1.1.1 (kustomize newTag bump)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:51:10 -07:00
4e117dc921 C0: pin shower v1.1.1 FOD outputHash (probed on ringtail)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:40:22 -07:00
6e90c4c363 C0: bump shower to v1.1.1 (probe FOD hash)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 20:12:00 -07:00
dc69b8c68b C1: fix borgmatic shower SQLite dump (ssh to ringtail) (#357)
## Summary

Nightly borgmatic backups have been failing for 2 days. Root cause: the
shower SQLite dump `before_backup` hook (added in PR #349) referenced
`kubectl --context=k3s-ringtail`, but indri's kubeconfig deliberately
doesn't carry the ringtail credentials. The hook's failure aborted the
entire run, taking out *both* the local sifaka repo and the BorgBase
offsite. Verified the last good archive was `indri-2026-05-11T02:00`.

## Approach

ssh into ringtail and run `k3s kubectl` there — no indri-side
kubeconfig needed. `/etc/rancher/k3s/k3s.yaml` is mode 644 so no sudo
required, and the existing ssh access from indri to ringtail works.

Inline-shell quoting got hairy fast (fish on ringtail rejected `POD=...`
bash syntax; the nix shower image lacks `tar` so `kubectl cp` fails).
Pulled the dump logic into `~/bin/borgmatic-k8s-sqlite-dump`, deployed
by the ansible role. Each dump entry now declares a `target`:

- `local:<context>` — local kubectl with explicit context (mealie)
- `ssh:<user@host>` — ssh + `k3s kubectl` on the cluster host (shower)

Bytes come back via `kubectl exec ... -- cat` instead of `kubectl cp`
since `cp` needs `tar` in the pod (nix-built containers don't bundle it).

## Test plan

- [x] `mise run provision-indri -- --tags borgmatic --check --diff` shows expected diff
- [x] Apply, helper script deployed at `~/bin/borgmatic-k8s-sqlite-dump`
- [x] Helper invoked directly with `ssh:eblume@ringtail` produces a valid 288 KB SQLite file
- [x] Full `borgmatic create` completes without errors — both mealie.db (1.7 MB) and shower.db (288 KB) appear in `~/.local/share/borgmatic/k8s-dumps/`, archive `indri-2026-05-13T17:31:02` written to sifaka borg repo

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #357
2026-05-13 18:55:50 -07:00
947e4310c3 C2: migrate immich from minikube to ringtail (mikado chain) (#356)
## Summary

C2 Mikado chain to move the entire Immich stack (server, ML, valkey,
postgres) off `minikube-indri` and onto `k3s-ringtail`. Immich is the
largest single tenant on minikube (~1.5 GiB resident) and minikube is
currently memory-saturated (97% RAM, swapping). This is the first
concrete chain in the broader indri-k8s decommission effort.

This PR contains the planning layer only — 7 cards (1 goal + 6
prerequisites). Implementation cycles follow per the Mikado Branch
Invariant.

## Goal end-state

- Immich `server`, `machine-learning`, `valkey` on ringtail.
- ML pod uses ringtail's RTX 4080 (performance win — currently
  CPU-only).
- CNPG `immich-pg` (PG17 + VectorChord) runs on ringtail.
- Library still on sifaka NFS — ringtail mounts the same path.
- `photos.ops.eblu.me` reroutes through Caddy → ringtail ingress.
- Minikube `immich` and `immich-pg` are removed.

## Cards

| Card | Depends on |
|---|---|
| `migrate-immich-to-ringtail` (goal) | all six below |
| `cnpg-on-ringtail` | — |
| `immich-pg-on-ringtail` | cnpg-on-ringtail |
| `immich-pg-data-migration` | immich-pg-on-ringtail |
| `sifaka-nfs-from-ringtail` | — |
| `immich-app-on-ringtail` | immich-pg-on-ringtail, sifaka-nfs-from-ringtail |
| `immich-cutover-and-decommission` | immich-pg-data-migration, immich-app-on-ringtail |

## Key constraints

- **No data loss.** Downtime is acceptable; data loss is not. Two
  surfaces matter: postgres (ML embeddings, face data — slow to
  re-derive) and the library files (don't move, but NFS access from
  ringtail must be verified).
- **Migration method:** Option A is a CNPG `externalCluster`
  basebackup → promote. Option B is `pg_dump`/`pg_restore` as a
  documented fallback. Either way, dry-run against a scratch
  cluster first.
- **Why pg moves too** (not cross-cluster): keeping pg on minikube
  would block the whole decommission, and Immich is chatty with pg
  so tailnet round-trips would hurt.

## Test plan

- [ ] Plan review — does the dependency graph make sense?
- [ ] `mise run docs-mikado migrate-immich-to-ringtail` shows the
      chain correctly.
- [ ] Per-card implementation cycles land separately (commit
      convention enforced by hook).

Reviewed-on: #356
2026-05-13 16:46:17 -07:00
bc8ceb502b Merge pull request 'C1: pin ringtail wired IP to 192.168.1.21 (static)' (#355) from ringtail-static-ip into main 2026-05-12 09:59:59 -07:00
a4a30aad44 fix(ringtail): explicitly enable net.ipv4.ip_forward
After the static IP change, k3s/flannel pod networking broke because
ip_forward was 0. NixOS doesn't enable IP forwarding by default — it
was previously being set implicitly somewhere in the NM-managed /
scripted-DHCP path. With static networking we have to set it ourselves.

Verified at runtime via sysctl -w before adding here; pod outbound
came back immediately and Tailscale VIP services recovered without
any pod restarts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 09:51:16 -07:00
d0b5423135 C1: pin ringtail wired IP to 192.168.1.21 (static)
Removes DHCP lease renewal as a failure mode on ringtail after an outage
on 2026-05-12 where the IP and routes silently disappeared from enp5s0
without any kernel link event. NetworkManager stays enabled for wireless
fallback but no longer manages the wired interface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 09:33:57 -07:00
dc0916a548 C0: shower — rebuild from main SHA (post-merge retag)
PR #354 was squash-merged so the branch commit 444ff91 baked into the
prior image tag isn't reachable from main's history. Rebuild from main
HEAD (3c7967e) and retag. Image content is byte-identical (FOD is
content-addressed, inputs unchanged); only the SHA in the tag changes
so future provenance tracing stays on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 20:20:39 -07:00
3c7967e445 C1: deploy shower v1.1.0 (phases + guest memories) (#354)
## Summary

Deploys `adelaide-baby-shower-app` **v1.1.0** to ringtail k3s.

### App changes (since v1.0.2)

- **Four-phase `ShowerState`** replaces the boolean `locked` flag — `pre_event` → `party` → `prizes_locked` → `event_locked` — with a backfill migration that maps `locked=True → pre_event`, `locked=False → party`.
- **Guest memories**: append-only photos + comments panel where guests can leave notes for the baby. Adds `GuestPhoto` + `GuestComment` models with file-extension validators and a max-size validator; new `shower.imaging` module for thumbnail generation.
- **Admin + QR polish**: configurable host link, fixed "View Site" URL, guest-facing QR copy improvements, contest tweaks.

Three Django migrations run automatically in the entrypoint against the SQLite PV:
- `0009_shower_phase`
- `0010_guest_memories`
- `0011_book_description`

No ConfigMap / env-var changes. The deploy uses `strategy: Recreate` with a single replica, so the old pod releases the data PVC before the new one mounts it and runs migrations.

### Container build changes

The v1.1.0 tag exposed a latent issue with the Forgejo PyPI install path:

- The recent commit [2d38418e](2d38418e) closed the forge package leak at the Fly edge by blocking `/api/packages/*` publicly.
- Forgejo's PyPI simple index returns absolute file URLs hardcoded to its public `ROOT_URL` (`forge.eblu.me`), so pip-installing from the tailnet index URL still tries to download from `forge.eblu.me` → 403.
- Previous shower builds escaped this because their FOD outputs were already in the nix store; bumping to a new version forced a fresh pip run that hit the block.

Fix mirrors what we already do for the sdist: both wheel and sdist are pulled via direct `fetchurl` against `forge.ops.eblu.me`, then the wheel is copied to TMPDIR under its clean filename (nix store path's hash prefix breaks pip's wheel-filename parser) and handed to pip as a local path. The forge `--extra-index-url` is no longer needed.

FOD outputHash pinned to `sha256-kTNOswobtkgyQmmqbQM8XO4vvaGg57nCuuZGbNXb0NM=` from run 547. Image: `registry.ops.eblu.me/blumeops/shower:v1.1.0-444ff91-nix`.

### Adjacent finding (already handled)

The ringtail `gitea-runner-nix_container_builder` systemd unit was left `inactive` after the recent `provision-ringtail` (matches the known `sshd-restart-hangs-mux` lesson — the rebuild changed the unit's PATH closure + config.yaml, systemd stopped it, then the playbook hung before the activation could restart it). Manually started; the existing memory `lesson_provision_ringtail_ssh_hang.md` was extended to mention the runner as the canary service to check after provisions.

## Test plan

- [ ] `argocd app diff shower --revision shower-v1.1.0` — review the manifest change
- [ ] `argocd app set shower --revision shower-v1.1.0 && argocd app sync shower`
- [ ] `kubectl --context=k3s-ringtail logs -n shower deploy/shower` — confirm migrations 0009/0010/0011 applied, no errors
- [ ] Hit `https://shower.ops.eblu.me/` (tailnet) — splash page renders, phase indicator visible
- [ ] Hit `https://shower.ops.eblu.me/host/` — host console loads, phase dropdown shows the four states
- [ ] Hit `https://shower.eblu.me/` (public via Fly) — splash page still served
- [ ] After merge: `argocd app set shower --revision main && argocd app sync shower`

Reviewed-on: #354
2026-05-11 20:08:03 -07:00
fbc1f7720e C0: gitignore .claude/scheduled_tasks.lock
Transient lock file written by the ScheduleWakeup harness tool when
Claude paces its own work between long-running operations. Not config,
not state worth checking in.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 18:37:29 -07:00
4133785119 C1: ringtail — weekly flake.lock update (#352)
## Summary
- Recurring weekly lockfile refresh for `nixos/ringtail/flake.lock`.
- Inputs updated: `disko`, `home-manager`, `nixpkgs`.
- `nixpkgs-services` was deliberately skipped (per overlay convention — pinned services bump only on intentional update).
- Generated via `dagger call flake-update --src=. --flake-path=nixos/ringtail`.

## Test plan
- [x] `prek` hooks pass
- [ ] After merge: `mise run provision-ringtail` to deploy
- [ ] Then check for kernel update per [[manage-lockfile]]

## Notes
- Not deployed from this PR — provisioning is a follow-up.

Reviewed-on: #352
2026-05-11 16:13:07 -07:00
145df76d06 C1: service review — mealie (v3.12.0 deployed; upstream v3.17.0) (#351)
## Summary
- Recurring service review for `mealie`.
- Upstream is at **v3.17.0** (released 2026-05-06); deployed image is **v3.12.0** — 5 minor versions behind.
- Container is built locally from the forge mirror (`containers/mealie/Dockerfile`), so upgrade requires a fresh build + changelog review for breaking changes between v3.12 and v3.17.
- Deferring the actual upgrade to a separate task; this PR just refreshes `last-reviewed` and captures the gap in `notes`.

## Test plan
- [x] `prek` hooks pass
- [ ] Follow-up: open task to bump `containers/mealie/Dockerfile` `CONTAINER_APP_VERSION`, build, and update kustomization tag

## Notes
- No deployment changes in this PR.

Reviewed-on: #351
2026-05-11 16:12:36 -07:00
bb7efa850a C1: doc review — replicating-blumeops tutorial (#350)
## Summary
- Periodic doc review of `tutorials/replicating-blumeops.md` (was never reviewed).
- Fixed 4 instances of "BluemeOps" → "BlumeOps" (also caught 1 in `contributing.md`).
- Added `last-reviewed: 2026-05-11` and bumped `modified`.
- Verified all wiki-link targets resolve.

## Test plan
- [x] `prek` hooks pass (link checker, frontmatter checker)
- [ ] Optional: `mise run docs-preview docs/tutorials/replicating-blumeops.md`

Reviewed-on: #350
2026-05-11 16:11:35 -07:00
f83be3bf37 C1: review CC observability-stack-audit (extend to k3s) (#353)
## Summary
- Recurring compensating-control review (oldest stale control: 42 days).
- Verified the control is in effect on both clusters:
  - `alloy-k8s` on minikube-indri — Synced/Healthy, DaemonSet 1/1 ready
  - `alloy-ringtail` on k3s-ringtail — Synced/Healthy
  - `loki` (`monitoring/loki-0`) — Running, receiving logs (52 restarts in 18h is worth watching but not blocking review)
- Generalized the description: previously named only minikube, but the indri→ringtail migration means we now operate two clusters and both rely on this control.
- Added a follow-up note: enabling native apiserver audit logging is far more tractable on k3s (`--audit-log-path` / `--audit-policy-file`) than it was on minikube — worth revisiting once the migration concludes.

## Test plan
- [x] `prek` hooks pass
- [x] Verified alloy + loki status via `kubectl --context=minikube-indri` and `argocd app get`

## Notes
- No deployment changes.

Reviewed-on: #353
2026-05-11 16:10:39 -07:00
40d9a1ef9e C0: shower — rebuild from main SHA (post-PR-349 retag)
Standard squash-merge dance per
docs/how-to/deployment/build-container-image.md#Squash-merge-and-container-tags
— retags from v1.0.2-039d9b9-nix (branch SHA) to v1.0.2-292d354-nix
([main] tag from run 544 built off the merge commit). Functionally
identical; preserves source traceability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-11 13:55:25 -07:00
292d354902 C1: deploy adelaide-baby-shower-app to ringtail k3s (#349)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m12s
## Summary

Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app).

### What's included

- **ArgoCD app + manifests** under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer.
- **Public surface** (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett.
- **fail2ban** filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default).
- **Caddy** route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`).
- **Pulumi** Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`.
- **Grafana** APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter.
- **Container** `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages.
- **Docs** runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments.

### Defense layers on the public surface

1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list)
2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP)
3. django-axes (5 fails / 1h, keyed on username+ip_address)
4. edge `/admin/` block (returns 403 for anything that isn't login/logout)

## Prerequisites for the user to do (NOT in this PR)

Halted on these per request — they touch shared/manual systems:

- [x] **NFS share** on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000`
- [ ] **1Password item** `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git
- [ ] **Container build**: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix`
- [x] **DNS**: `mise run dns-up` after merge
- [x] **Fly cert**: `fly certs add shower.eblu.me -a blumeops-proxy`
- [ ] **Caddy push**: `mise run provision-indri -- --tags caddy`
- [ ] **Fly redeploy** to pick up the new nginx block + fail2ban jail: `mise run fly-deploy`
- [ ] **ArgoCD sync**: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging

## Test plan

- [ ] Container builds successfully on nix-container-builder runner
- [ ] Pod starts, migrations run, gunicorn answers on :8000
- [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean
- [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response
- [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead`
- [ ] Grafana dashboard "Shower APM" appears and starts showing traffic
- [ ] `mise run services-check` passes

Reviewed-on: #349
2026-05-11 13:47:18 -07:00
eceb2b99ce C0: bump homepage image to fixed-perms build (v1.11.0-678f26b-nix)
Pulls in 678f26b0 (chowned /app/config). Resolves the EACCES crash loop
on ringtail.
2026-05-10 21:16:34 -07:00
678f26b0e7 C0: fix homepage container /app/config write permissions
The previous Dockerfile chowned /app/config to 1000:1000 so the runtime
user could seed missing skeleton configs (e.g. proxmox.yaml) and write
/app/config/logs. The nix derivation didn't replicate that, so the new
amd64 image crashed with EACCES on cold start (fixed-forward — caught
during ringtail cutover, ArgoCD #348).

Add fakeRootCommands to dockerTools to create /app and /app/config and
chown them at build time. The deployment's ConfigMap subPath mounts
leave the parent directory as image filesystem, so its ownership has to
be set at build time, not at runtime.
2026-05-10 20:49:22 -07:00
ad7a0ed105 Merge pull request 'C1: migrate homepage dashboard from minikube to ringtail (nix-built amd64)' (#348) from homepage-to-ringtail into main 2026-05-10 20:40:33 -07:00
be54cc3411 C1: migrate homepage dashboard to ringtail k3s
Repoint the ArgoCD Application destination from minikube to ringtail and
bump the image tag to the new amd64 nix-built v1.11.0-b87f62e-nix.

Rework services.yaml for the autodiscovery shift: 11 services that
previously auto-populated via minikube Ingress annotations (ArgoCD,
Immich, Kiwix, Mealie, Miniflux, Grafana, Prometheus, Navidrome,
Paperless, TeslaMate, Transmission) become explicit static entries with
their widget configs preserved. Conversely, the ringtail services that
will now auto-populate (Frigate/NVR, Authentik, Ntfy) are removed from
the static list to avoid duplicates; Ollama becomes newly visible.

Add a Content group for Immich/Kiwix/Miniflux which previously lived
under the autodiscovered "Content" group from annotations.
2026-05-10 20:37:03 -07:00
b87f62e0f5 C1: nix-build homepage container for amd64 ringtail migration
Replace Dockerfile (arm64-only, indri-built) with a nix derivation
adapted from nixpkgs pkgs/by-name/ho/homepage-dashboard. Built via the
nix-container-builder runner on ringtail, producing an amd64 image
suitable for k3s.

Includes the upstream Next.js file-system-cache patch to avoid
prerender cache write failures on a read-only nix store path
(nixpkgs issues #328621 and #458494).

Pinned to v1.11.0 (current production version).
2026-05-10 20:32:38 -07:00
8bc19fa460 C0: tailscale main-SHA rebuild for ringtail proxyclass
Routine post-squash-merge cleanup. Bumps the ProxyClass image tag from
the now-orphaned PR branch SHA (67af7a8) to the merge commit SHA
(0108b68) so the deployed image stays traceable after branch cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 06:52:39 -07:00
0108b68769 C1: mirror tailscale container locally for ringtail proxyclass (#347)
## Summary

Adds the first cut of a local nix build for `docker.io/tailscale/tailscale` and rewires only the ringtail tailscale-operator overlay to use it. Indri's overlay continues pulling upstream — minikube on indri is being decommissioned in favor of ringtail's k3s, so investing in dual-cluster routing here would be wasted churn.

## Changes

- `containers/tailscale/default.nix` — `buildGoModule` over `cmd/tailscale`, `cmd/tailscaled`, `cmd/containerboot`; packaged via `dockerTools.buildLayeredImage` with `cacert`, `iptables` (legacy symlink to match upstream Synology compat), `iproute2`, `tzdata`, `busybox`.
- `argocd/manifests/tailscale-operator-ringtail/kustomization.yaml` — kustomize `images:` rewrite swapping `docker.io/tailscale/tailscale` → `registry.ops.eblu.me/blumeops/tailscale:v1.94.2-67af7a8-nix`.
- `docs/changelog.d/mirror-tailscale-container.infra.md` — fragment.

## Pin rationale

v1.94.2 matches `service-versions.yaml:96` and the current ProxyClass exactly — this PR is "make it local," not "upgrade tailscale." Version bumps come as follow-up C0/C1 changes once we decide to test newer (v1.96.x had a Fly-side MagicDNS regression; v1.98.0 is current upstream stable).

## Test plan

- [x] Image built successfully on ringtail nix-container-builder (run #528).
- [x] Image visible in registry: `registry.ops.eblu.me/blumeops/tailscale:v1.94.2-67af7a8-nix`.
- [ ] Deploy from branch: `argocd app set tailscale-operator-ringtail --revision mirror-tailscale-container && argocd app sync tailscale-operator-ringtail`.
- [ ] Verify proxy pods restart with new image and existing tailnet ingresses (e.g., authentik, immich, tempo) keep resolving.
- [ ] After merge: rebuild on main SHA, update kustomization, run `services-check`.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #347
2026-05-06 06:50:31 -07:00
6f0d80ca1e C0: doc review — index.md, add ringtail to infra overview
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 06:14:40 -07:00
39b042e638 C0: service review — caddy v2.11.2 (current latest, healthy)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 06:11:15 -07:00
24e5490259 C0: review CC init-container-isolation — defer retirement to post-ringtail
Runtime grafana pod matches the manifest and the CC's claim; bumped
last-reviewed. Noted that retiring init-chown-data in favor of fsGroup
alone should wait until grafana migrates to ringtail's k3s, since the
storage backend will change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:31:13 -07:00
074887cd57 C0: docs — explanation article on compliance mute categories
Captures the CC vs NA vs RA distinction surfaced during the 2026-05-03
weekly compliance review (CVE-2026-31789), and the image-scan mutelist
gap that blocks acting on it. Links the new article from the
review-compensating-controls how-to so it isn't orphaned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 18:19:53 -07:00
9fb5442ccd C0: kiwix — doc review, fix Adding Archives source path
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:46:16 -07:00
f16e1c81f1 C0: zot — upgrade indri registry to v2.1.16
Security fixes only (TLS verification on metrics client, CORS
Allow-Credentials suppression on wildcard origin, manifest/API-key
body-size limits, dependabot bumps). No config changes required;
re-built from source on indri and bounced launchagent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 17:41:07 -07:00
a2c61b625d C0: rotate-fly-deploy-token — fish+bash one-shot, op validator gotcha
Combine mint+store into a single command with both fish and bash
forms (the doc previously only showed manual paste). Document the
1Password CLI "Password item requires ps value" validator error and
the placeholder-password workaround for Password-category items with
empty primary password fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-04 13:42:57 -07:00
2c0917b266 C0: valkey — bump kustomization tags to main-branch SHA
Routine post-merge follow-up after #346. Branch SHA tag (946fa75) replaced
with the main-SHA-built tag (fabca04) so paperless and immich reference an
image traceable to a commit on main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 17:47:16 -07:00
fabca04771 Mirror valkey 8.1 locally for paperless and immich (#346)
## Summary

- Add native Dagger build of valkey 8.1.6-r0 on Alpine 3.22 at `containers/valkey/`
- Swap paperless redis sidecar and immich-valkey from `docker.io/valkey/valkey:8.1-alpine` to `registry.ops.eblu.me/blumeops/valkey:v8.1.6-r0-946fa75`
- Resolves the DR-2026-04 TODO in paperless kustomization about multi-arch redis

## Why

Move toward fully locally-built containers for supply chain control. Paperless and immich both pulled the same upstream tag — one mirror serves both. Authentik's nix-built Redis stays separate (different image entirely).

## Risk

Low. Both sidecars are stateless caches:
- paperless redis: no volumeMount (in-pod localhost, pure memory)
- immich-valkey: `emptyDir` (cache only)

Pod restart rebuilds the cache. Smoke-tested locally (PING/SET/GET roundtrip on `valkey 8.1.6` with `--bind 0.0.0.0 --protected-mode no`).

## Test plan

- [ ] After merge: `mise run container-build-and-release valkey` to rebuild with main SHA
- [ ] Update kustomizations to the `[main]` SHA tag (C0 follow-up)
- [ ] `argocd app sync paperless` and `argocd app sync immich`
- [ ] Verify pods come up healthy (paperless OCR queue functional, immich job queue functional)
- [ ] `mise run services-check`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #346
2026-05-01 17:40:03 -07:00
f84f5f02b3 C0: review compensating control trusted-ci-only
Verified Forgejo runner is registered only to forge.ops.eblu.me and the
forge has registration disabled, so no untrusted users can trigger
privileged CI. Tightened notes to reflect the closed-forge mechanism
(not a per-repo allow-list).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:49:22 -07:00
4aa0872949 C0: review ollama doc — refresh image, models, last-reviewed
Bumped documented image tag to 0.20.4 (matches kustomization newTag),
added the two qwen3.5 models from models.txt, and stamped the card.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:42:33 -07:00
2d55303213 C0: alloy native macOS on indri — upgrade to v1.16.0
Completes the v1.16.0 fleet upgrade for the fourth alloy service
(type: ansible, built from source on indri). Binary built on gilbert
with Go 1.26.2 + CGO, scp'd to indri, codesigned, LaunchAgent reloaded.
Service reports clean WAL replay and resumed metric/log shipping.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 10:36:38 -07:00
55563afc7e C0: alloy — bump kustomization tags to main-branch SHA
Per the build-container-image squash-merge convention, rebuild alloy v1.16.0
container images from the main SHA (9564435) and update the three alloy
kustomizations to reference :v1.16.0-9564435[-nix] instead of the branch
SHA :v1.16.0-26a3ab5[-nix] left over from #345.

Both images were rebuilt locally on gilbert (dagger) and ringtail (nix)
because indri is still under heavy macOS memory-compressor pressure (see
separate ticket); CI on indri can't reliably run the dagger publish step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:31:27 -07:00
9564435b11 Alloy V1.16.0 (#345)
Bump Grafana Alloy v1.14.0 → v1.16.0 across all four services (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail; alloy native ansible). Also migrate the indri build path from `Dockerfile` to a native Dagger `container.py` per the build-container-image migration playbook.

## Highlights from upstream
- v1.15: database observability promoted to stable, OTel Collector → v0.147.0
- v1.16: clustering for `loki.source.kubernetes_events`, MySQL exporter 0.19.0
- One pre-existing breaking change in v1.15 (`loki.source.awsfirehose` undocumented metric prefix rename) — not used here.

## Build infra
Alloy v1.16.0's go.mod requires Go 1.26.2. The nix derivation now uses `pkgs.go_1_26` with `GOTOOLCHAIN=local` to avoid auto-downloading a toolchain blob that violated the fixed-output rule.

## Test plan
- [ ] CI: `mise run container-build-and-release alloy --ref alloy-v1.16.0` (dispatched as run 522; nix job to be re-triggered with the v1.16.0 goModules outputHash once the local ringtail build surfaces it)
- [ ] After CI green, bump `images[].newTag` in three kustomizations to the new `-<sha>` and `-<sha>-nix` tags, deploy from this branch via `argocd app set <app> --revision alloy-v1.16.0 && argocd app sync <app>`
- [ ] Manual rebuild of macOS native binary on gilbert (per ansible/roles/alloy README) and `mise run provision-indri -- --tags alloy --check --diff`
- [ ] `mise run services-check` after merge & redeploy

Reviewed-on: #345
2026-05-01 08:05:37 -07:00
7fed166c18 Update ringtail flake inputs
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-30 16:55:08 -07:00
f6e392b80c C1: SHA-pin tooling dependencies (2026-04 cycle) (#344)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m45s
## Summary

Monthly tooling dependency refresh, with a one-time conversion from version-tag pins (`rev = "vX.Y.Z"`, `image:tag`, `>=`) to SHA / digest pins everywhere.

## Changes

- **prek hooks**: all `rev = "vX.Y.Z"` → commit SHA + `# vX.Y.Z` comment. Bumped trufflehog (3.94.0→3.95.2), kingfisher (1.91.0→1.97.0), ruff (0.15.7→0.15.12), shfmt (3.13.0→3.13.1), prettier (3.8.1→3.8.3), actionlint (1.7.11→1.7.12).
- **fly/Dockerfile**: tag pins → `image@sha256:...` digest pins. Bumped nginx (1.29.6→1.30.0-alpine), tailscale (v1.94.1→v1.94.2 — still inside the safe pre-1.96.5 range), alloy (v1.14.1→v1.16.0).
- **mise-tasks**: PEP 723 inline deps converted from `>=` to `==` (PEP 508 doesn't support hashes inline). All scripts pinned to current latest: rich 15.0.0, typer 0.25.0, pyyaml 6.0.3, httpx 0.28.1.
- **prek `additional_dependencies`**: ansible-lint==26.4.0, ansible-core==2.20.5.
- **taplo-lint**: pass `--no-schema`. Upstream's `--default-schema-catalogs` returns a format taplo v0.9.3 can't parse — we don't validate against TOML schemas anyway, so this turns off the broken catalog fetch.
- **docs/update-tooling-dependencies**: documents the SHA-pin convention, `docker buildx imagetools inspect` for digest lookup, and `prek clean` before re-verifying (cache grows to several GiB).

Forgejo workflow `actions/checkout@v6.0.2` was already at the latest SHA — no change.

## Test plan

- [x] `prek run --all-files` passes after `prek clean`
- [x] `deploy-fly` workflow builds and deploys the new fly image on merge
- [x] `fly status -a blumeops-proxy` healthy after deploy
- [x] Spot-check a few mise tasks (`mise run blumeops-tasks`, `mise run docs-check-links`) to confirm pinned deps resolve cleanly

Reviewed-on: #344
2026-04-30 16:51:43 -07:00
5096223b48 C1: clean up cv + docs minikube artifacts (#343)
## Summary

Follow-up to #342. The cv and docs services are now live on indri (Caddy file_server backed by ansible-managed tarball extraction) and verified working. This PR removes the dead minikube artifacts and the tooling shims that referenced them.

## Changes

**Deletions:**
- ``argocd/apps/{cv,docs}.yaml``
- ``argocd/manifests/{cv,docs}/`` (deployment, service, ingress, pdb, kustomization)
- ``containers/{cv,quartz}/`` (Dockerfiles + start scripts)

**Tooling:**
- ``mise-tasks/container-version-check``: remove the ``quartz``→``docs`` CONTAINER_TO_SERVICE mapping (containers/quartz no longer exists)
- ``service-versions.yaml``: bump ``docs.current-version`` to ``v1.16.0`` (the blumeops docs release tag) and trim the migration-window comment

## Live state context

The argocd Applications ``cv`` and ``docs`` were already deleted from the cluster manually as part of the cutover; this PR just removes the YAML files that the ``apps`` app-of-apps was still ingesting. After merge, ``argocd app sync apps`` will reconcile and the ``apps`` Application returns to Synced.

The Caddyfile ``handle_errors`` bug that briefly crashed all ``*.ops.eblu.me`` services during cutover is fixed in a separate C0 (``2ee53fe``) on main, not here.

## Test plan
- [x] ``mise run container-version-check --all-files`` clean
- [x] ``mise run service-review --type ansible`` shows cv at 1.0.3, docs at v1.16.0
- [ ] After merge: ``argocd app sync apps`` returns clean (cv/docs entries gone, no children to reconcile)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #343
2026-04-29 15:18:39 -07:00
2ee53fe375 C0: fix Caddyfile try_html — handle_errors can't nest inside handle{}
The kind=static branch added in #342 put handle_errors inside the
@host handle{} block. handle_errors is a top-level site-block directive,
not an ordered HTTP handler, so Caddy refuses to load the config:

  parsing caddyfile tokens for 'handle': directive 'handle_errors'
  is not an ordered HTTP handler

This crash-loops the whole reverse proxy and takes down every
*.ops.eblu.me service. Tripped today during the live cv/docs cutover.

Fix: drop handle_errors and append /404.html as the final try_files
candidate. The 404 page is served with status 200 instead of 404, but
that's acceptable for a human-facing curated 404 — the page renders
correctly. Documented inline.

The running Caddy on indri already has the fixed config (deployed
manually during the cutover); this lands the fix in main so future
provision-indri --tags caddy runs don't re-break it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 15:16:44 -07:00
8d634861f6 C1: migrate cv + docs from minikube to indri-native (#342)
## Summary

Replace the cv (`cv.eblu.me`) and docs (`docs.eblu.me`) minikube Deployments with indri-native ansible roles. Caddy serves the extracted release tarballs directly via a new `kind: static` service-block — no daemon, no nginx pod, no ProxyGroup ingress on the request path. Mirrors the rationale of the recent devpi migration; part of the broader minikube wind-down.

## What's in this commit

- `ansible/roles/{cv,docs}` — sentinel-gated tarball download + extract into `~/{cv,docs}/content/`
- `ansible/roles/caddy/` — new `kind: static` branch in the Caddyfile template (encoded gzip, immutable cache headers for fingerprinted assets, optional `try_html` for Quartz-style clean URLs, optional per-path `download_paths` for the resume PDF's `Content-Disposition`)
- `ansible/playbooks/indri.yml` — wires `cv` and `docs` roles before `caddy`
- `service-versions.yaml` — both services flip to `type: ansible`. `docs.current-version` stays at `1.28.2` for this commit so `container-version-check` keeps passing while `containers/quartz/Dockerfile` still exists; it moves to the docs release tag in the cleanup commit
- `.forgejo/workflows/{cv-deploy,build-blumeops}.yaml` — deploy step now bumps `cv_version`/`docs_version` in the role defaults and pushes; running ansible + purging the Fly cache is manual from gilbert (matches devpi)
- Docs: `docs/how-to/operations/{cv,docs}-on-indri.md`, updated `docs/reference/services/{cv,docs}.md`, changelog fragment

## What is not in this commit

The dead artifacts. After PR review and successful cutover, a follow-up commit deletes:

- `argocd/apps/{cv,docs}.yaml` and `argocd/manifests/{cv,docs}/`
- `containers/cv/`, `containers/quartz/`
- `CONTAINER_TO_SERVICE['quartz']` mapping in `mise-tasks/container-version-check`
- bumps `docs.current-version` in `service-versions.yaml` to the release tag

## Cutover plan (manual, from gilbert, after review)

1. **Take down old:**
   - Remove the cv and docs Applications: `argocd app delete cv --cascade && argocd app delete docs --cascade`
   - Verify k8s namespaces gone: `kubectl --context=minikube-indri get ns | grep -E '^(cv|docs)\\b'` (should be empty)
   - Verify tailnet MagicDNS no longer advertises the VIPs: `nslookup cv.tail8d86e.ts.net` and `nslookup docs.tail8d86e.ts.net` should both fail
2. **Bring up new:**
   - `mise run provision-indri -- --tags cv,docs,caddy --check --diff` (already validated on branch)
   - `mise run provision-indri -- --tags cv,docs,caddy`
   - `fly ssh console -a blumeops-proxy -C "sh -c 'rm -rf /tmp/cache && nginx -s reload'"`
3. **Verify:** `mise run services-check` and the curl checks listed in `docs/how-to/operations/{cv,docs}-on-indri.md`
4. **Cleanup commit + merge.**

Total expected downtime: minutes (not the few-hour budget you authorized).

## Test plan
- [ ] `mise run provision-indri -- --tags cv,docs --check --diff` clean
- [ ] `mise run provision-indri -- --tags caddy --check --diff` shows only the cv + docs blocks changing as previewed in the PR thread
- [ ] After cutover: `cv.eblu.me`, `cv.ops.eblu.me`, `docs.eblu.me`, `docs.ops.eblu.me` all return 200
- [ ] `cv.eblu.me/resume.pdf` includes `Content-Disposition: attachment`
- [ ] A clean Quartz URL (e.g. `docs.eblu.me/explanation/agent-change-process`) resolves to the right page
- [ ] `mise run services-check` clean
- [ ] `mise run service-review --type ansible` shows cv and docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #342
2026-04-29 14:55:11 -07:00
a529d60f60 C0: remove containers/devpi/ build artifact
Devpi now runs natively on indri (uv venv via ansible role), so the
Dagger container build at containers/devpi/ is unused. Removing it.

Also updated dagger.md examples to use 'miniflux' as the example
container-name argument.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 13:40:45 -07:00
14ca0160ba Migrate devpi from minikube to indri (launchd) (#341)
## Summary

Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent.

## What changed

- **New ansible role** `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box).
- **Caddy**: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`.
- **Playbook**: `indri.yml` gains pre_tasks for the root password and the new role.
- **service-versions.yaml**: devpi flipped from `type: argocd` to `type: ansible`.
- **ArgoCD**: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted.
- **Docs**: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list.

## Already deployed (live on indri)

- Service running: `launchctl list mcquack.eblume.devpi` → PID 53888
- `curl https://pypi.ops.eblu.me/+api` returns 200 
- `mise run docs-mikado` works again 
- 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/`
- Minikube namespace and PVC fully reclaimed

## Test plan

- [ ] `mise run services-check` (after merge)
- [ ] CI workflows that use devpi succeed
- [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines)

## Context

This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain.

Reviewed-on: #341
2026-04-29 13:38:36 -07:00
f4a24595b1 C0: review CC ephemeral-privileged-jobs
Verified TTL=604800s and hostPID limited to ephemeral Prowler CronJob
on indri. Noted that alloy-tracing on ringtail also uses hostPID but
is out of scope until Prowler scans ringtail (tracked in Todoist).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:09:34 -07:00
817acc5e5e C0: transmission doc — review and correct storage/monitoring details
Marked last-reviewed: 2026-04-29. Fixed the storage layout table —
`/config/` is an emptyDir (ephemeral), not NFS, and the watch directory
is disabled. Documented the transmission-exporter sidecar that exposes
Prometheus metrics on port 19091.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 11:00:01 -07:00
4d76fd5de5 C0: prowler — rebuild image against main HEAD
Squash-merge of #340 changed the SHA. Bump prowler tag from
v5.23.0-2daf629 (PR branch) to v5.23.0-495e45d (main HEAD) so the
Dockerfile changes are present in the image deployed off main.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:49:27 -07:00
495e45d01d Address 6 critical Prowler IaC findings (mute + grafana RBAC tighten) (#340)
## Summary

The weekly Prowler IaC scan reported 6 critical findings against `argocd/manifests/`. They split cleanly into two patterns:

- **Legitimate-by-design RBAC → mute with new compensating controls**
  - `external-secrets-controller`, `external-secrets-cert-controller` manage `secrets` (KSV-0041) and the cert-controller mutates its own webhook configurations (KSV-0114). This is what the operator is *for*. New CC: `operator-purpose-bound-rbac`.
  - `kube-state-metrics` (both `minikube-indri` and `k3s-ringtail`) holds `list/watch` on secrets to expose `kube_secret_info` and `kube_secret_labels` metrics. KSM's metric schema only reads metadata, never the `data:` field. New CC: `kube-state-metrics-metadata-only`.

- **Over-broad RBAC → fix**
  - `grafana-clusterrole` had `get/watch/list` on `secrets` because the dashboard-sidecar config used `RESOURCE=both` (ConfigMaps + Secrets). Nothing in the cluster labels Secrets with `grafana_dashboard=1`, so this was unused power. Switched both sidecar instances to `RESOURCE=configmap` and removed `secrets` from the ClusterRole.

The IaC cronjob also did not previously pass `--mutelist-file`, which is why every IaC finding reported as unmuted regardless of mutelist configuration. The new `mutelist/iac.yaml` is bundled into the existing `prowler-mutelist` ConfigMap and mounted via `items:` selector.

## Test plan

- [ ] `kubectl --context=minikube-indri kustomize argocd/manifests/prowler/` — already passes locally
- [ ] `kubectl --context=minikube-indri kustomize argocd/manifests/grafana/` — already passes locally
- [ ] Deploy from this branch via `argocd app set prowler --revision prowler-iac-mutelist && argocd app sync prowler` and same for `grafana`
- [ ] Manually trigger the IaC cronjob and verify `MUTED=True` on the 6 critical findings (`kubectl --context=minikube-indri -n prowler create job --from=cronjob/prowler-iac-scan prowler-iac-test`)
- [ ] Restart grafana pod and confirm dashboards still render (sidecar still finds them via ConfigMap watch)
- [ ] After verify, `argocd app set <app> --revision main && argocd app sync <app>` post-merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #340
2026-04-29 10:43:32 -07:00
718e0a0043 C0: review-compliance-reports — summarize image and IaC scans
Previously only the K8s CIS in-cluster scan was processed; the weekly
container-image and IaC Prowler scans were running on schedule but never
reviewed. Now each scan gets its own status / severity / week-over-week
delta, with top-N grouped tables (by check ID and resource) for the
high-volume image and IaC outputs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 12:18:06 -07:00
cfb6d7a7aa C0: service-review — mark cv reviewed 2026-04-27
No version bump; build deps (jinja2, pyyaml) still loose-pinned and fine.
Known issue: deployed v1.0.3 package predates phone-hide commit; tracked
separately in Todoist by user.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 11:57:33 -07:00
c9eb188e05 C0: blumeops-tasks — replace ambiguous due:+N with "Nd overdue"
The signed offset format read as "due in 5 days" rather than
"5 days overdue", causing misreads. Switch to self-explanatory
text: "5d overdue" / "due in 2d" / "due today".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 11:49:46 -07:00
4a37ffcdc2 C0: CLAUDE.md — import AGENTS.md instead of redirecting to it
Claude Code only auto-loads CLAUDE.md. The prose shim told agents to go
read AGENTS.md, which is easy to skip. Replacing the shim with
`@AGENTS.md` inlines AGENTS.md content into the session prompt, so the
startup rules (ai-docs, blumeops-tasks, change classification) land in
context unconditionally.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 11:41:13 -07:00
f9d9e00057 C0: blumeops-tasks — show due offset + recurrence, sort by overdue-ness
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 11:18:16 -07:00
005e2a03ed C0: split gandi-operations docs; add dns-acme-cleanup mise task
Splits the nebulous gandi-operations how-to into two single-topic cards
(manage-eblu-me-dns, rotate-gandi-pat) and adds a mise task for the
recurring _acme-challenge TXT cleanup needed due to a value-comparison
bug in libdns/gandi v1.1.0 that prevents certmagic's cleanup phase from
removing presented TXT values.

The gandi reference card is updated to drop the false "different
credential from Pulumi PAT" claim — verified during the 2026-04-27
incident that Caddy and Pulumi share a single PAT.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-27 09:48:46 -07:00
72b27b7fd2 C0: docs — add mealie borg restore how-to
Captures the procedure used to restore mealie's SQLite DB from a borgmatic
archive after the post-DR wipe: extract from borg, snapshot the wiped DB,
swap via a helper pod on the ReadWriteOnce PVC, fix UID 911 ownership.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-24 19:04:28 -07:00
Erich Blume
34fa2ef28a C0: ringtail — restore sway default keybindings, fix fuzzel border config
Extend (not replace) home-manager's default sway keybindings via
lib.mkOptionDefault, with lib.mkForce on the custom overrides that
conflict with defaults. Add Mod+F1 cheatsheet binding (fuzzel-filterable).

Move fuzzel's border-radius/border-width out of [main] into a proper
[border] section with the expected short names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-23 12:16:02 -07:00
Erich Blume
88eabc3de6 Disable Xalia 2026-04-21 14:47:13 -07:00
7d94b9073a C0: docs — default argocd login to --sso; drop extraneous --grpc-web
Now that argocd's Authentik OAuth2 client is public, `argocd login --sso`
works for day-to-day use. Promote it to the default in AGENTS.md,
argocd-cli reference, and troubleshooting; keep the admin/password flow
documented as a break-glass fallback for when Authentik is unavailable.

Also drops --grpc-web from every interactive login command — confirmed
extraneous (login succeeds without it). Left in CI workflows and
`argocd cluster add` untouched; those are different contexts that I
didn't re-test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 10:43:21 -07:00
86317315ed C0: remove argocd OIDC client_secret wiring
Now that argocd's Authentik OAuth2 client is public (PKCE-only), the
client_secret plumbing is dead code:

- delete argocd-oidc-authentik ExternalSecret and drop it from kustomization
- remove AUTHENTIK_ARGOCD_CLIENT_SECRET env from authentik-worker
- remove argocd-client-secret mapping from authentik-config ExternalSecret

The argocd-client-secret field in the 1Password "Authentik (blumeops)"
item is now unreferenced and can be deleted there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 10:38:26 -07:00
0e62ad5596 C0: argocd OIDC — switch to public client for CLI SSO
Changes argocd's Authentik OAuth2 client from confidential to public and
drops the clientSecret from argocd-cm. Public + PKCE works for both the
web UI (argocd-server backend) and the argocd CLI (`argocd login --sso`)
without a shared secret, matching OAuth 2.1 guidance.

Confidential → public was needed because the CLI can't hold a client
secret; Authentik's per-app issuer model made the alternative
("cliClientID" pattern with separate public client) awkward since it
requires a shared issuer across apps which Authentik doesn't serve.

Follow-up: deadcode AUTHENTIK_ARGOCD_CLIENT_SECRET env wiring and the
argocd-oidc-authentik ExternalSecret once verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 10:34:39 -07:00
225b0e7008 C0: allow argocd CLI --sso localhost callback
Adds http://localhost:8085/auth/callback to the ArgoCD OAuth2 provider's
redirect_uris so `argocd login --sso` works. Loopback redirect is the
RFC 8252 pattern for native CLI apps; PKCE (already enabled) covers the
code-interception risk.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 10:18:08 -07:00
e6a6a6042e C0: suggest mise run runner-logs in container-build-and-release
After dispatching, poll the Forgejo API for the run matching our
head_sha and print `mise run runner-logs <N>` so the suggested monitor
command is one copy-paste away. Falls back to the bare command if the
poll times out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 10:12:00 -07:00
0ceafc374d C0: review operator-managed-pods CC (2026-04-21)
Tailscale operator still defaults to privileged proxy pods with no
seccomp profile (issue #7359 open upstream). Control remains valid.
Added note about ProxyClass + device plugin remediation path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 09:59:48 -07:00
a9ef02a602 C0: bump frigate-notify to v0.5.4-e928054-nix (workdir fix)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 09:44:24 -07:00
e92805409e fix(frigate-notify): set WorkingDir=/app and create writable /app
The upstream binary expects CWD=/app (relative config.yml lookup,
lumberjack logfile at ./log/app.log). Without this, the pod crashed on
startup — the ConfigMap-mounted /app/config.yml wasn't found and zerolog
spammed "mkdir log: permission denied" as it tried to create ./log at
/ as nonroot.

Creates /app as 1777 (tmp-style) so nonroot can write logs; WorkingDir
set to /app so the default config path resolves correctly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 09:43:00 -07:00
c88b6d773c C0: point frigate-notify at local registry tag v0.5.4-fb4bf5a-nix
Built from main in run #516 after #339 merged. Follows the navidrome
kustomization convention (deployment image = local ref + :kustomized,
kustomization override = newTag only).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 09:31:29 -07:00
fb4bf5a7a3 Add frigate-notify nix container build (#339)
## Summary
- Mirrors `github.com/0x2142/frigate-notify` at `v0.5.4` to `forge.ops.eblu.me/mirrors/frigate-notify`.
- Adds `containers/frigate-notify/default.nix` — `buildGoModule` + `dockerTools.buildLayeredImage`, following the `ntfy` pattern.
- Uses `-tags goolm` to avoid the libolm CGO dependency (matrix notifier is imported unconditionally in the upstream but we only use ntfy alerts).
- Runs as nonroot (UID 65534), exposes port 8000, bundles `cacert`/`tzdata`.

## Why
Move `ghcr.io/0x2142/frigate-notify:v0.5.4` (ringtail-deployed) under local control. Aligns with the [[indri → ringtail migration plan]] and the `default.nix` convention for ringtail-targeted containers documented in [[build-container-image]].

## Verification
- `dagger call build-nix --src=. --container-name=frigate-notify export --path=./out.tar.gz` produces a valid 20MB docker archive (10 layers) with `blumeops/frigate-notify` tag locally.
- Hashes pinned for `fetchgit` (src) and `vendorHash` (go modules).

## Follow-up (post-merge)
1. `mise run container-build-and-release frigate-notify` — release from main SHA.
2. C0 follow-up: update `argocd/manifests/frigate/kustomization.yaml` image ref to `registry.ops.eblu.me/blumeops/frigate-notify:v0.5.4-<sha>-nix`.
3. ArgoCD auto-syncs the deployment.

## Test plan
- [ ] `dagger call build-nix` succeeds from a clean checkout.
- [ ] `mise run container-build-and-release frigate-notify --dry-run` looks correct.
- [ ] After release + kustomization swap: frigate-notify pod comes up healthy on ringtail; ntfy alerts still fire on Frigate events.

Reviewed-on: #339
2026-04-21 09:28:02 -07:00
30f39ae050 Review contributing tutorial: add last-reviewed, .ai.md fragment type, prek provenance
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 08:53:41 -07:00
fb32cc07c4 chore: repoint runner-job-image tag at CI-built v0.20.6-50f8c2a
Swaps the k8s runner label from the local bootstrap tag (v0.20.6-9b6be09)
to the equivalent image rebuilt by CI from main. Functionally identical;
closes the bootstrap loop.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 08:38:33 -07:00
50f8c2a33f Roll k8s runner to runner-job-image v0.20.6-9b6be09
Points the k8s Forgejo runner label at the locally-bootstrapped
runner-job-image built from the Alpine container.py on this branch.
Once merged, CI will rebuild the same image from the same SHA.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 08:28:18 -07:00
db8fd946ae Bump Dagger to 0.20.6 and migrate runner-job-image to Alpine container.py
Bumps the Dagger engine/CLI from v0.20.1 to v0.20.6 (mise pin, dagger.json
engineVersion, SDK regen) and rewrites the runner-job-image container as a
native Dagger pipeline on Alpine 3.23 using the shared alpine_runtime helper,
replacing the Debian-based Dockerfile. All Forgejo Actions in this repo use
actions/checkout (a JS action), so musl is not a compatibility concern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-21 08:28:18 -07:00
Erich Blume
58fe4f0073 ty 2026-04-20 15:48:15 -07:00
54841dbf70 Update ringtail flake inputs 2026-04-20 10:09:16 -07:00
d6ad8e8e59 chore: refresh forgejo-runner review date 2026-04-20 09:15:35 -07:00
21177ff47f chore: update forgejo-runner image tag 2026-04-20 09:11:37 -07:00
1425bf1f5c Upgrade forgejo-runner to v12.8, adopt server.connections, and clean up docs (#338)
## Summary
- consolidate forgejo-runner how-to docs into current cards
- upgrade the k8s forgejo-runner deployment to the latest v12.8.x runner image
- switch the k8s runner from first-boot register flow to declarative server.connections config
- keep the runner image on the native Dagger build path and update the surrounding manifests/secrets

## Notes
- PR opened early for C1 review
- implementation and deployment verification will follow in subsequent commits

Reviewed-on: #338
2026-04-20 09:03:54 -07:00
353e2785c3 docs: review zot oidc client card 2026-04-20 07:55:25 -07:00
53a7374ac1 C0: drop fix-ntfy-nix-version mikado card
Historical one-shot fix from the zot hardening chain — knowledge is
self-evident in containers/ntfy/default.nix and container-version-check
regex. Should have been removed at mikado finalization. Scrubbed the two
wiki-link references in add-container-version-sync-check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-20 07:26:53 -07:00
51a878cddb C0: review navidrome reference doc 2026-04-18 20:25:19 -07:00
deedeecef9 C0: adopt AGENTS.md as canonical agent config 2026-04-18 20:15:30 -07:00
71c1c453d6 Fetch job logs via SSH to indri instead of Forgejo web endpoint
Forgejo's web action routes don't support API token auth for private
repos (only session cookies or public access). Switch log fetching to
read the zstd-compressed log files directly from indri via SSH —
Forgejo stores all runner logs on disk regardless of which runner
executed the job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 17:08:46 -07:00
4f5a963ef6 Add API token auth and git remote detection to runner-logs
runner-logs now always authenticates with the Forgejo API token
(via --token flag, FORGEJO_TOKEN env, or 1Password) so it works on
private repos. The --repo default is auto-detected from the git
remote origin URL instead of being hardcoded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 16:39:21 -07:00
1d62653871 Fix forge.eblu.me static assets by adding missing Host header
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m26s
The static asset cache block (css/js/png/etc) was missing
proxy_set_header Host, so Caddy received "forge.eblu.me" instead of
"forge.ops.eblu.me" and couldn't route the request. HTML loaded fine
because the main location / block had the header.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 16:00:56 -07:00
55abb17f50 Add resource limits to ArgoCD pods to prevent unbounded consumption
All 7 ArgoCD containers had no resource limits, allowing them to consume
unlimited CPU/memory during node pressure events. This contributed to
cluster-wide probe timeout cascades on minikube-indri.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 13:04:27 -07:00
Forgejo Actions
bdfcb4b677 Update docs release to v1.16.0
- Built changelog from towncrier fragments

[skip ci]
2026-04-18 10:00:54 -07:00
d26a6ae3b2 Update docs for Caddy routing and direct WireGuard peering
Comprehensive docs pass reflecting the new Fly proxy architecture:
- Fly proxy routes through Caddy on indri (not per-service TS Ingress)
- Direct WireGuard peering via --port=41641 pinning
- DERP relay performance lesson in Tailscale docs
- Caddy now in public traffic path
- indri tagged as flyio-target
- Removed fly-reload references
- Updated architecture diagrams and per-service setup guide
- Added changelog fragment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 09:57:30 -07:00
12b2786ca2 Route Fly proxy through Caddy on indri for direct WireGuard peering
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m59s
Tailscale Ingress pods in k8s can't establish direct WireGuard
connections (stuck behind pod-network NAT → DERP relay → 20s latency).
Indri's host-level Tailscale CAN peer directly with Fly.

Change all nginx upstreams to route through Caddy on indri instead of
per-service Tailscale Ingress endpoints. Tag indri as flyio-target in
the Tailscale ACL so the Fly proxy can reach it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 09:40:20 -07:00
bca4c2bede Expose Tailscale WireGuard UDP port on Fly proxy
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 1m33s
Enable direct peer-to-peer WireGuard connections by pinning tailscaled
to port 41641 and exposing it as a UDP service. Without this, all
traffic routes through Tailscale DERP relays causing 20+ second
latency. Requires dedicated IPv4 (allocated: 168.220.82.221).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 09:17:03 -07:00
c8da243663 Run alloy-tracing as root for eBPF capabilities
The nix-built Alloy image sets User=65534 (nobody). Even with
privileged: true, a non-root user gets no effective capabilities
(CapEff=0). Override with runAsUser: 0 so Beyla gets CAP_BPF and
CAP_SYS_ADMIN needed for eBPF instrumentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:42:26 -07:00
3a2913ba1f Allow BPF in privileged containers on ringtail
NixOS defaults kernel.unprivileged_bpf_disabled=2, which blocks BPF
syscalls outside the init namespace even with CAP_BPF. Set to 1 so
privileged containers (Beyla/Alloy tracing) can create BPF maps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:32:30 -07:00
24f3f9b24a Raise k3s memlock rlimit for eBPF tracing on ringtail
Beyla (alloy-tracing) has been failing since April 13 with
"failed to set memlock rlimit: operation not permitted" because k3s
inherits the default 8MB memlock limit. Set LimitMEMLOCK=infinity on
the k3s systemd service so privileged containers can use eBPF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:27:39 -07:00
Forgejo Actions
a72a2c2bd4 Update docs release to v1.15.7
- Built changelog from towncrier fragments

[skip ci]
2026-04-18 08:14:58 -07:00
9bafe85b2b Add teslamate extensions to DR restore procedure
The earthdistance extension (depends on cube) must be created before
restoring the teslamate database — discovered missing after 2026-04-13 DR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:12:26 -07:00
b4472c7849 Deploy devpi 6.19.3
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:04:23 -07:00
4dab6d11bb Add remote commit check to container-build-and-release
Queries the Forgejo API to verify the target commit exists on the remote
before dispatching a build, preventing wasted CI runs on unpushed commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:03:21 -07:00
37b8a21524 Migrate devpi to Dagger build and bump to 6.19.3
Replace Dockerfile with container.py for native Dagger builds.
Bump devpi-server 6.19.1→6.19.3, devpi-web 5.0.1→5.0.2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 07:57:05 -07:00
fe0e913963 Switch Fly proxy to upstream keepalive pools (#337)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m37s
## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuses TLS connections through the Tailscale tunnel instead of handshaking per request
- Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS)

## Trade-off

DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #337
2026-04-17 16:39:52 -07:00
54b1cee950 Fix Connection header: only send 'upgrade' for WebSocket requests
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 1m35s
Was sending Connection: upgrade on every proxied request, which is
semantically wrong for normal HTTP traffic. Use a map to conditionally
send 'upgrade' only when the client requests a WebSocket switch,
'close' otherwise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:27:40 -07:00
1c0ee099fb Move forge-specific latency panels to Forgejo dashboard
Fly.io dashboard keeps aggregate all-hosts p50/p90/p99. Forge-filtered
upstream response time panel moves to Forgejo's "Public Proxy" section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:13:40 -07:00
d7af004842 Add Forgejo metrics + upstream latency histogram to Fly proxy dashboard
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m53s
- Enable Forgejo /metrics endpoint (app.ini [metrics] section)
- Add Alloy scrape target for Forgejo metrics on indri
- Add upstream_response_time histogram to Fly proxy Alloy config
- Replace single p95 panel with p50/p90/p99 + upstream breakdown
  filtered to forge.eblu.me host

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:05:59 -07:00
8fccbda573 Extend Fly proxy latency histogram buckets to 60s
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m29s
Previous max bucket was 10s — all slower requests collapsed into +Inf,
making p50/p90/p99 unreadable during the Forgejo archive DoS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:50:28 -07:00
1631e11137 Add /user/ to forge robots.txt exclusion
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m47s
Crawlers follow auth redirects to /user/login which is pointless for them.
Saves round-trips for both sides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:34:24 -07:00
0a98f76068 Update kiwix-serve to Dagger-built container (Alpine 3.23)
Points kustomization at v3.8.2-7a42aeb, the first image built from the
new container.py (replacing the Dockerfile).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:27:42 -07:00
65bc21b162 Add op-based auth to fly-deploy mise task
The task was missing FLY_API_TOKEN injection, requiring manual fly auth
login. Now uses op read to fetch the deploy token from 1Password.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:26:29 -07:00
7a42aeb77c Mitigate Forgejo archive endpoint DoS from crawler abuse
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m35s
Crawlers hitting /archive/ endpoints with unique commit SHAs generated 54GB
of git bundles in 2 days, pegging Forgejo at 43% CPU. Fix at multiple layers:

- Redirect archive requests to tailnet at Fly proxy (302)
- Expand robots.txt: block /users/, /*/archive/, /*/releases/download/
- Cache release artifact downloads at nginx (immutable, 7d TTL)
- Enable [cron.archive_cleanup] with 2h TTL and run-at-start

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:21:22 -07:00
5f38779d52 Migrate kiwix-serve container from Dockerfile to native Dagger build
Replaces the hand-written Dockerfile with container.py using the shared
alpine_runtime helper, which bumps the base image from Alpine 3.22 to 3.23.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 13:56:32 -07:00
fd9e1ac93b Remove nativeMessagingHosts.packages — breaks Firefox wrapper build
The _1password-gui package doesn't export native messaging manifests
in the format the Firefox wrapper expects. The 1Password NixOS module
already handles native messaging host registration separately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:57:44 -07:00
e60e3d5fc7 Use programs.firefox module with 1Password extension via policy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:54:33 -07:00
f283f9453d Set Firefox as default browser via home-manager xdg.mimeApps
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:43:35 -07:00
50dfdba4e6 Add Firefox, remove claude-cli:// handler workarounds
The xdg desktop entry and Librewolf user.js prefs didn't fix the
OAuth callback hang. Try stock Firefox instead as a simpler path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:42:36 -07:00
dd1cf4f198 Configure Librewolf to delegate claude-cli:// URIs to xdg-open
The xdg desktop entry and mimeapps were already registered but
Librewolf doesn't delegate unknown URI schemes to the system
handler by default. This adds user.js prefs to complete the chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:37:16 -07:00
68f845e773 Add changelog fragment for forge robots.txt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:40:34 -07:00
7f6bbdc82c Add robots.txt to forge.eblu.me blocking crawlers from /mirrors/
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 2m19s
Facebook has been scraping forge mirror repos at ~3-4 req/s, slowing
down the Forgejo instance. Serve robots.txt directly from nginx to
disallow /mirrors/ while leaving eblume/* accessible to crawlers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:39:48 -07:00
5ec2411e20 Update navidrome, miniflux, forgejo-runner image tags to Alpine 3.23 builds [main]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:37:30 -07:00
3ecd888537 Switch container builds to manual-only workflow dispatch
Shared Dagger helpers (src/blumeops/) affect all Dagger-built containers,
making path-based auto-triggers unreliable. All builds now go through
`mise run container-build-and-release <name>`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 14:25:14 -07:00
352b95c141 Refactor Dagger go_build() helper and standardize Alpine 3.23
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (miniflux) (push) Successful in 10m2s
Build Container / build-dagger (forgejo-runner) (push) Successful in 10m2s
Extend go_build() with buildmode and extra_env params, migrate miniflux
and forgejo-runner to use it, and bump all Alpine bases from 3.22 to 3.23.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:10:46 -07:00
99f78c8745 Register claude-cli:// URI handler on ringtail for Claude Code OAuth
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 07:42:52 -07:00
fb1e8ff672 Deploy transmission containers from Dagger builds
Update kustomization image tags to the new container.py-built images
(v4.1.1-r1-2c483ce, v1.0.1-2c483ce).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:34:28 -07:00
2c483cefff Migrate transmission containers from Dockerfile to Dagger builds
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (transmission-exporter) (push) Successful in 2m29s
Build Container / build-dagger (transmission) (push) Successful in 2m29s
Replace Dockerfiles with native container.py for both transmission and
transmission-exporter. Updates base images (Alpine 3.23, Python 3.14),
pins uv to 0.11.6 instead of :latest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:26:00 -07:00
519175c672 Fix borgmatic LaunchAgent TCC dialog hang by removing mise wrapper
LaunchAgents now call borgmatic directly at its mise-installed path
instead of routing through `mise x`, which triggered macOS TCC
permission dialogs (e.g. "mise wants to access Documents") that hung
headless sessions and caused backup failures.

Also adds `mise install` to the ansible role so borgmatic installation
is fully managed, and pins the version in both mise.toml and the role
defaults.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 07:23:46 -07:00
30ed018fd8 Update prowler image tag to v5.23.0-7c1cd11 [main]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 13:51:26 -07:00
7c1cd11e45 Upgrade Prowler to 5.23.0, remove registry workaround (#336)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (prowler) (push) Successful in 36s
## Summary

- Upgrade Prowler from 5.22.0 to 5.23.0
- Remove the `enumerate-images` init container workaround from `cronjob-image-scan.yaml`
- Use native `--registry` and `--image-filter` flags now that upstream fix (PR prowler-cloud/prowler#10470) is released

The init container was a workaround for prowler-cloud/prowler#10457 where `--registry` args weren't forwarded to the provider constructor. We wrote the fix, it was merged, and v5.23.0 includes it.

## Test plan

- [ ] Build new container (`mise run container-release prowler 5.23.0`)
- [ ] Update kustomization.yaml with new image tag
- [ ] Sync prowler ArgoCD app from branch
- [ ] Manually trigger image scan job and verify `--registry` works natively
- [ ] Verify CIS and IaC scan cronjobs still work

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #336
2026-04-14 13:45:28 -07:00
6b690eb033 Review CC sso-gated-admin-tools: scope to ArgoCD only
Removed Grafana from the control description — no Prowler finding
references it. Tightened scope to match actual usage (ArgoCD wildcard
RBAC mute). Added workflow-bot scoping note.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 13:07:52 -07:00
be30668eef Automate Prowler MANUAL finding verification (#335)
## Summary
- Adds automated node-level verification to `review-compliance-reports`: kubelet file perms/ownership, kubelet config args, etcd CA separation, RBAC cluster-admin bindings
- Mutes the 14 MANUAL Prowler findings via new `manual-node-checks.yaml` mutelist file
- New `node-config-automated-verification` compensating control documents the approach
- Script fails loudly (red FAIL + verdict panel) if any check deviates from expected values

## Test plan
- [x] `mise run review-compliance-reports` — all 12 node checks PASS
- [x] Injected bad expected value (perms 400 vs actual 600) — FAIL rendered correctly
- [x] Fixed colon-in-binding-name bug (kubeadm:cluster-admins) with tab-separated jsonpath
- [ ] After merge: sync prowler mutelist ConfigMap and verify next scan shows 0 MANUAL findings

## Note
Prowler coverage is minikube-indri only — ringtail/k3s is a known gap tracked separately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #335
2026-04-14 13:00:44 -07:00
Forgejo Actions
8c2f035e6d Update docs release to v1.15.6
- Built changelog from towncrier fragments

[skip ci]
2026-04-14 11:46:42 -07:00
04b44b350b Add changelog for ArgoCD token rotation after DR
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:45:00 -07:00
Forgejo Actions
f2514a6f02 Update docs release to v1.15.5
- Built changelog from towncrier fragments

[skip ci]
2026-04-14 11:29:27 -07:00
9d85c97b9b Update forgejo-runner kustomization tag to main-branch image
C0 follow-up: switch from branch-built tag to main-built v12.7.3-0e93cc0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:10:36 -07:00
0e93cc08b4 Build forgejo-runner container locally (#334)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-dagger (forgejo-runner) (push) Successful in 1m21s
## Summary
- Add native Dagger `container.py` for forgejo-runner (Go + Alpine runtime, static binary with CGO for SQLite)
- Update kustomization to point to local registry image (tag is placeholder until CI builds)
- Uses existing `clone_from_forge("forgejo-runner", ...)` mirror

## Test plan
- [x] `dagger call build --src=. --container-name=forgejo-runner` passes locally
- [ ] CI container build from branch succeeds
- [ ] Update kustomization tag to built image, deploy from branch via ArgoCD `--revision`
- [ ] Verify runner registers and picks up jobs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #334
2026-04-14 11:06:36 -07:00
223b134776 Document uv.lock as the source of devpi dependency in Dagger builds
The lockfile bakes in devpi URLs — Dagger does a locked install, not
fresh resolution. This is the mechanism behind the cold-cache failure.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 07:41:45 -07:00
ccaef4c1a7 Document devpi cold cache failure mode and deploy teslamate v3.0.0-08c698e
After a DR rebuild, devpi's empty cache causes race conditions under
concurrent load — metadata is served but wheel files 404. Also deploys
the first container.py-built teslamate image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 07:38:06 -07:00
08c698e833 Migrate teslamate to native Dagger container.py (#333)
Some checks failed
Build Container / detect (push) Successful in 2s
Build Container / build-dagger (teslamate) (push) Failing after 6s
## Summary
- Replace legacy Dockerfile with native Dagger `container.py` build
- Two-stage pipeline: Elixir+Node builder, Debian slim runtime
- Uses shared helpers (`clone_from_forge`, `oci_labels`)
- Delete old Dockerfile (pipeline auto-discovers container.py)
- Update build-container-image docs and mark service reviewed

## Test plan
- [x] `dagger call build --src=. --container-name=teslamate` succeeds locally
- [ ] CI container build passes
- [ ] Deploy from branch and verify teslamate starts cleanly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #333
2026-04-14 07:20:52 -07:00
4ca0630d76 Review enforce-tag-immutability doc: add review date and zot reference link
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 07:00:55 -07:00
d7c3c687f4 Document DR rebuild procedure and update restart-indri
- New how-to: rebuild-minikube-cluster with full bootstrap procedure
  validated during 2026-04-13 DR event
- Update restart-indri: warn about minikube delete, macOS permission
  dialog on first Tailscale SSH, forgejo_actions_secrets dep cycle
- Update disaster-recovery reference: link to rebuild procedure
- Update CLAUDE.md: never run minikube delete

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 18:07:54 -07:00
405dab8b59 Add changelog fragments for DR recovery work
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 17:59:16 -07:00
cd5b6b63f7 Add paperless DB to borgmatic backups
Discovered during DR that paperless was the only service DB not backed
up by borgmatic. Uses same blumeops-pg cluster on port 5432.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 17:58:06 -07:00
2d2d495f95 Fix paperless redis: use upstream valkey instead of amd64-only nix image
The authentik-redis image is nix-built on ringtail (amd64 only) and was
previously running under QEMU emulation on arm64 minikube. Discovered
during DR recovery when fresh minikube lacked binfmt registration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 17:48:20 -07:00
fca3010042 Hints about service version tracking 2026-04-13 08:40:49 -07:00
22a417ac3c Oops, looks like a log file got lost, nbd 2026-04-13 08:36:20 -07:00
f61bb4f2e7 Add uv.lock for version pinning of dagger pipeline 2026-04-13 08:35:01 -07:00
b5551e227e Route Dagger build telemetry to Tempo
The Dagger engine's internal OTLP proxy returns 500 on /v1/metrics when
there's no real backend, causing ~9s retry warnings per pipeline step.
Point OTEL_EXPORTER_OTLP_ENDPOINT at Tempo to give it a real endpoint.
Also removes the stale os.environ workaround from main.py (the SDK
initializes telemetry before our module loads, so it had no effect).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 08:27:12 -07:00
ab834b641a Fix OTEL metrics exporter warnings in Dagger builds
The Dagger engine shim sets OTEL_METRICS_EXPORTER before our module
loads, so os.environ.setdefault was a no-op. Switch to a hard override.
Remove the redundant workflow-level env var since the fix belongs in
the module.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 08:11:15 -07:00
db6d8af8b1 Update grafana-sidecar image tag to v2.6.0-61fcd5d (merge build)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-13 08:02:39 -07:00
61fcd5d70a Upgrade grafana-sidecar 1.28.0 → 2.6.0 + container.py port (#332)
All checks were successful
Build Container / detect (push) Successful in 4s
Build Container / build-dagger (grafana-sidecar) (push) Successful in 1m50s
## Summary

- Upgrade grafana-sidecar from 1.28.0 to 2.6.0 (the 2.x memory regression #462 is resolved; ~35MB static overhead is acceptable)
- Port build from Dockerfile to native Dagger container.py
- Add liveness/readiness probes using the new /healthz endpoint on port 8080
- Update docs to reflect container.py migration and remove stale pin note

## Test plan

- [ ] Build container: `mise run container-build-and-release grafana-sidecar`
- [ ] Update kustomization tag with new image tag
- [ ] Deploy from branch: `argocd app set grafana --revision grafana-sidecar-2.6.0 && argocd app sync grafana`
- [ ] Verify sidecar health endpoint: `kubectl exec -n monitoring <pod> -c grafana-sc-dashboard -- wget -qO- http://localhost:8080/healthz`
- [ ] Verify dashboards load in Grafana UI
- [ ] `mise run services-check`

Reviewed-on: #332
2026-04-13 07:57:13 -07:00
6455d93cb3 Review local-registry control: fix inaccurate description, enumerate exceptions
The control claimed all images came from the private registry, but 12+
services pull from external public registries. Updated description to
reflect reality and catalogued external-image categories in notes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 09:59:37 -07:00
6e60287e99 Doc review: delete install-dagger-on-nix-runner, add service-versions ref card
Outdated leaf card removed; zot.md now links to new service-versions
reference card instead. Added reverse link from review-services.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 09:52:38 -07:00
8d80a4a3a5 Rewrite runner-logs: API-based log fetching, multi-repo support
Replace broken SSH+filesystem log retrieval with Forgejo web API
endpoint. Fix CLI to use run numbers (not task IDs), add --repo
for querying any forge repo (e.g. sporks), --limit/-n for listing
size. Document runner-logs as the way to verify build success in
CLAUDE.md and container build docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 09:42:58 -07:00
a18ec9d958 Update miniflux to main image tag, disable OTEL metrics in Dagger module
Point miniflux kustomization at the main-built v2.2.19-138e23d image
(replacing the branch tag). Disable the OTLP metrics exporter at module
import time to prevent ~11s retry delays in CI — the env var must be set
inside the module, not the runner shell, because the SDK runs inside the
Dagger engine container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-12 08:59:32 -07:00
138e23d525 Miniflux 2.2.19 + container.py migration + ty typechecker (#331)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (miniflux) (push) Successful in 1m3s
## Summary

- Upgrade miniflux from 2.2.17 to 2.2.19 (security hardening, performance improvements)
- Migrate miniflux from Dockerfile to native Dagger container.py build
- Refactor `alpine_runtime()` helper to support existing users (nobody/65534)
- Add `ty` (Astral) Python typechecker to prek hooks

## Test plan

- [ ] `dagger call build --src=. --container-name=miniflux` succeeds
- [ ] `dagger call container-version --container-name=miniflux` returns 2.2.19
- [ ] `mise run container-version-check` passes
- [ ] `ty check` passes cleanly
- [ ] `prek run --all-files` passes
- [ ] CI builds container successfully
- [ ] Miniflux healthcheck passes after deploy from branch

Reviewed-on: #331
2026-04-12 08:54:32 -07:00
dc5bffdd97 Update ringtail flake inputs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 21:14:46 -07:00
c06eccc61c Review hosts.md: add last-reviewed, normalize links, add reference tag
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 21:06:53 -07:00
94c937d588 Disable OTLP metrics exporter in CI, update navidrome to main tag
The Dagger Python SDK's OTLP metrics exporter hits a non-functional
local endpoint (500s), burning ~9s per retry cycle. Set
OTEL_METRICS_EXPORTER=none in the build-dagger CI job.

Also update navidrome kustomization to the main-SHA tag (c86b5d7).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-11 17:26:25 -07:00
c86b5d7772 Native Dagger container builds + Navidrome v0.61.1 (#330)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (navidrome) (push) Successful in 22m26s
## Summary
- Move Dagger module from `.dagger/` to repo root (`src/blumeops/`), rename `blumeops-ci` → `blumeops`
- Replace opaque `docker_build()` with native Dagger pipelines that surface full build errors per step
- Migrate navidrome as the first container (`containers/navidrome/container.py`)
- Upgrade navidrome from v0.60.3 to v0.61.1 (major artwork overhaul, SQLite FTS5 search, server-managed transcoding)
- Add `dagger call container-version` for CI version extraction without Dockerfile parsing
- All mise tasks (`container-list`, `container-version-check`, `container-build-and-release`) updated for hybrid mode
- Legacy `docker_build()` fallback preserved for all other containers

## Motivation
When navidrome v0.61.0 added a new Go build tag (`sqlite_fts5`), `docker_build()` showed only "exit code: 1". We had to run `docker build --progress=plain` manually to find `undefined: buildtags.SQLITE_FTS5`. Native Dagger pipelines show the full error inline.

## Container build dispatch needed
After merge, dispatch container build for navidrome:
```
mise run container-build-and-release navidrome --ref 470b4bd
```

## Deploy steps
1. Wait for container build to complete
2. Back up navidrome-data PVC (non-reversible DB migrations)
3. `argocd app set navidrome --revision main && argocd app sync navidrome`
4. Verify at https://dj.ops.eblu.me

## Future
Remaining containers migrate incrementally in follow-up PRs using the same pattern.

Reviewed-on: #330
2026-04-11 17:11:56 -07:00
4fc0192731 Track Fly.io proxy component versions in service-versions.yaml
Add flyio-tailscale (v1.94.1), flyio-nginx (1.29.6-alpine), and
flyio-alloy (v1.14.1) entries with new `fly` service type so future
upgrades go through the service-review workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:40:57 -07:00
e02305e72d Pin Fly.io Tailscale to v1.94.1 to fix MagicDNS regression in v1.96.5
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 2m20s
Tailscale :stable pulled v1.96.5 during last deploy, which returns
SERVFAIL for tailnet DNS names (no upstream resolvers set). This broke
all public routing (forge/docs/cv.eblu.me) through the Fly proxy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:32:38 -07:00
b08b1a833f Fix services-check to show all firing alerts per alert name
check_alert() used head -1 to display only the first firing instance,
silently swallowing additional alerts (e.g. frigate pod-not-ready was
hidden behind ollama).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:10:09 -07:00
a75f28e073 Fix fly.io proxy rate limit to key on real client IP
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 2m24s
The general rate limit zone used $binary_remote_addr (Fly's internal
proxy IP), causing all external clients to share one bucket. Switch to
$http_fly_client_ip to match forge_auth's correct behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-10 19:00:33 -07:00
40556e5a2d Review gandi.md: add missing forge.eblu.me CNAME record
The Pulumi code has had a forge.eblu.me CNAME since it was added, but the
doc's DNS table only listed docs and cv. Also fixed the __main__.py
description to mention CNAMEs alongside A records.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 09:54:46 -07:00
5757df115d Upgrade ollama from 0.17.5 to 0.20.4
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-09 06:42:05 -07:00
22fc615a28 Update paperless image tag to main build
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 19:01:02 -07:00
07f52e9488 Deploy Paperless-ngx document management (#328)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-dockerfile (paperless) (push) Successful in 9s
## Summary

- Add paperless-ngx (v2.20.13) as a new ArgoCD-managed service on indri
- Dockerfile built from forge mirror (`mirrors/paperless-ngx`), multi-stage with s6-overlay
- PostgreSQL database via `blumeops-pg` CNPG cluster, Redis sidecar for Celery
- NFS document storage on sifaka (`/volume1/paperless`)
- Authentik OIDC SSO via baked JSON blob from 1Password
- Caddy route at `paperless.ops.eblu.me`
- 1Password item "Paperless (blumeops)" created with all secrets

## Files

- `containers/paperless/Dockerfile` — multi-stage build
- `argocd/manifests/paperless/` — full k8s manifest set
- `argocd/apps/paperless.yaml` — ArgoCD application
- `argocd/manifests/databases/` — CNPG role + ExternalSecret
- `ansible/roles/caddy/defaults/main.yml` — Caddy route
- `service-versions.yaml` — version tracking entry
- `docs/reference/services/paperless.md` — reference card

## Remaining deploy steps

1. Build container: `mise run container-build-and-release paperless`
2. Update kustomization.yaml `newTag` with actual image tag
3. Create Authentik application/provider for paperless
4. Create `paperless` database on blumeops-pg
5. Sync ArgoCD apps, then sync paperless from branch
6. Provision Caddy: `mise run provision-indri -- --tags caddy`
7. Verify at https://paperless.ops.eblu.me

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #328
2026-04-08 17:54:12 -07:00
e04455c911 Add changelog fragment for adding-a-service tutorial review
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:29:54 -07:00
d3235c5ca9 Review adding-a-service tutorial: fix ingress, repoURL, add kustomize and reference card steps
- Fix Tailscale Ingress: move hostname to tls.hosts, remove from rules (ProxyGroup compat)
- Update ArgoCD repoURL to forge.ops.eblu.me:2222
- Add kustomization.yaml section with :kustomized sentinel tag pattern
- Add Step 5: Create a Reference Card (keep under 30s reading time)
- Set last-reviewed: 2026-04-08

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:28:46 -07:00
22b77ac141 Fix Frigate preview config and services-check NoData detection
preview.quality was at the top level (invalid); moved under record
with a valid preset (very_low). Also fix services-check to catch
Grafana "Alerting (NoData)" state which was silently passing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 11:12:42 -07:00
ec63d560f3 Deploy authentik 2026.2.2 container to ringtail
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 10:56:50 -07:00
2eb28301e4 Upgrade authentik 2026.2.0 → 2026.2.2 (patch release)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (authentik) (push) Successful in 1m6s
Bug-fix release with web UI fixes, LDAP page size, and SAML SLO
redirect. Also bumps client-go to v3.2026.2.1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 10:53:03 -07:00
0366a0346b Set Frigate preview quality to CRF 8 for faster timeline loading
Previews are ~4MB/hour at default quality (CRF 1), served over NFS from
sifaka. Reducing to CRF 8 shrinks preview files to improve review page
load times when scrubbing older footage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 08:43:43 -07:00
936d29bbe1 Fix UnPoller dashboard UIDs exceeding Grafana 12's 40-char limit
Strip redundant "unifi-poller-" prefix from generated slugs, bringing
UIDs from 45-48 chars down to 32-35 chars.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-08 07:03:39 -07:00
3c894e659d Pin kube-state-metrics to main-SHA container tags
C0 follow-up to #327: update from branch-SHA tags to main-SHA tags
after squash-merge rebuild.

indri: v2.18.0-f59f885
ringtail: v2.18.0-f59f885-nix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 16:10:14 -07:00
f59f8859dc Localize kube-state-metrics container (Dockerfile + nix) (#327)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-dockerfile (kube-state-metrics) (push) Successful in 5s
Build Container / build-nix (kube-state-metrics) (push) Successful in 7s
## Summary

- Build kube-state-metrics v2.18.0 locally from forge mirror, replacing upstream `registry.k8s.io` image
- Dockerfile (two-stage Go build) for indri/minikube
- default.nix (buildGoModule + buildLayeredImage) for ringtail/k3s
- Both kustomization files updated with `newName` pointing to local registry

## Verification

- [x] Nix build succeeded on ringtail (`nix-build` → 10-layer image)
- [x] Dockerfile build succeeded locally (`dagger call build` → ~2min)
- [x] `container-version-check --all-files` passes (2.18.0 consistent across Dockerfile, nix, service-versions.yaml)
- [ ] CI builds container images from this branch
- [ ] Update kustomization `newTag` with SHA-tagged version from CI
- [ ] ArgoCD sync on both clusters

## Test plan

- Trigger CI build: `mise run container-build-and-release kube-state-metrics`
- Verify tags: `mise run container-list kube-state-metrics`
- Update newTag in kustomization files with CI-produced tag
- Sync ArgoCD on indri: `argocd app sync kube-state-metrics`
- Sync ArgoCD on ringtail: `argocd app sync kube-state-metrics --context=k3s-ringtail` (note: argocd uses its own auth, not kubectl context)
- Verify metrics still flowing to Prometheus

Reviewed-on: #327
2026-04-07 16:09:25 -07:00
84eda0301f Bump authentik worker memory limit 1Gi → 2Gi (OOMKilled after ringtail restart)
Worker forks 4 Dramatiq processes each loading the full Django app
(~250MB each), hitting the 1Gi limit on startup. Ringtail has ample
RAM headroom.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 15:39:29 -07:00
efae404d1e Remove superuser from teslamate PG role, transfer extension ownership
teslamate had superuser on the shared blumeops-pg cluster (which also
hosts miniflux and authentik). Downgraded to plain database owner with
extension ownership (cube, earthdistance) transferred manually so it
can still ALTER EXTENSION UPDATE. earthdistance is untrusted in PG so
DROP+CREATE would need temporary superuser escalation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 15:36:39 -07:00
fc34a7da5b Review postgresql.md: add authentik user/db, immich-pg borgmatic secret
Doc review found the authentik database, user, and external secret were
missing, along with the immich-pg borgmatic secret. Added Cluster column
to Users table for clarity. Set last-reviewed: 2026-04-07.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 15:21:48 -07:00
1fd8aae8f6 Upgrade ArgoCD v3.3.2 → v3.3.6, SHA-pin install manifest
Patch upgrade with bug fixes (diff normalization, installation ID cache).
Pin the upstream manifest URL to commit SHA for supply chain integrity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 08:21:11 -07:00
e85c71e73f Add changelog fragments for seccomp hardening and bracket fix
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:38:41 -07:00
d3d67272a7 Fix blumeops-tasks swallowing bracket content in descriptions
Rich markup parser interprets [text] as style tags, stripping
wiki-links like [[review-compensating-controls]] to empty [].
Escape description lines with rich.markup.escape().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:37:40 -07:00
59f3422d3e Review compensating control: tailscale-network-isolation
Verified: tailscale serve status shows only svc:k8s, ACLs restrict
tag:flyio-target to port 443 with admin/operator ownership only,
indri has no flyio-target tag. All 10 muted findings remain valid.

Noted gap: no automated alerting on new flyio-target devices.
Tracked in Todoist as MC4 (Manual Compliance Control Check CronJob).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:35:13 -07:00
18fe172a54 Add seccomp RuntimeDefault profiles to alloy-k8s and immich pods
Resolves 4 unmuted Prowler core_seccomp_profile_docker_default
findings on alloy, immich-server, immich-machine-learning, and
immich-valkey.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:21:23 -07:00
a059d81314 Add review-compliance-reports task and reorganize report storage
New mise task fetches Prowler reports from sifaka, parses with proper
muted/unmuted distinction, shows week-over-week delta, and includes
a scaffold for Kingfisher once JSON/CSV output is available upstream.

Moved all legacy top-level reports on sifaka into date subdirectories
to match the current CronJob output structure. Updated
read-compliance-reports doc with task reference and links.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:16:46 -07:00
54213ab810 Fix flake-update pipeline and update ringtail flake inputs
The `--exclude` flag added in #321 never existed in nix — it was
introduced broken and never tested. Replace with dynamic input
discovery: query `nix flake metadata --json` for all input names,
filter out skip_inputs (default: nixpkgs-services), pass the rest
as positional args. Also bump NIX_IMAGE 2.33.3 → 2.34.4.

Updated inputs: nixpkgs, home-manager, disko.
nixpkgs-services stays pinned.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 08:27:31 -07:00
Forgejo Actions
370a3574b2 Update docs release to v1.15.4
- Built changelog from towncrier fragments

[skip ci]
2026-04-06 07:53:54 -07:00
0eaf8680fd Rewrite observability stack tutorial to match actual practices
Replace generic Helm install instructions with kustomize/ArgoCD patterns
that reflect how BlumeOps actually deploys Prometheus, Loki, Grafana, and
Alloy. Fix "BluemeOps" typos, document Alloy as a core (not optional)
component, remove hardcoded admin password, add proper prerequisites and
cross-references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 07:52:35 -07:00
f42fa2d558 Remove stale Helm chart mirror references from forgejo docs
All Helm chart mirrors (grafana-helm-charts, connect-helm-charts,
cloudnative-pg-charts) have been deleted from forge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 07:37:21 -07:00
c7e5af6d51 Migrate 1Password Connect from Helm to kustomize (1.8.1 → 1.8.2) (#326)
## Summary

- Renders manifests from `connect-helm-charts v2.4.1` as plain kustomize (deployment + service)
- Bumps 1Password Connect from 1.8.1 → 1.8.2
- Completes the no-helm-policy migration — all services now use kustomize
- Retains all production hardening from the Helm chart (securityContext, runAsNonRoot, drop ALL, seccomp, resource limits)

## Changes

- **New:** `deployment.yaml`, `service.yaml`, `kustomization.yaml` in `argocd/manifests/1password-connect/`
- **Rewritten:** Both ArgoCD app definitions (indri + ringtail) — single source kustomize instead of multi-source Helm
- **Deleted:** `values.yaml` (Helm values no longer needed)
- **Updated:** `no-helm-policy.md`, `service-versions.yaml`, `README.md`

## Deployment plan

1. Sync `apps` app to pick up the new app definitions
2. `argocd app set 1password-connect --revision 1password-connect-kustomize`
3. `argocd app sync 1password-connect` — verify on indri
4. Repeat for ringtail
5. After merge: reset revision to main, re-sync both

## Test plan

- [ ] `kubectl kustomize` renders cleanly (verified locally)
- [ ] ArgoCD diff shows expected changes (Helm labels removed, images bumped)
- [ ] Pods come up healthy on indri
- [ ] External Secrets still resolves 1Password items
- [ ] Repeat on ringtail

Reviewed-on: #326
2026-04-06 07:31:40 -07:00
Forgejo Actions
facb803010 Update docs release to v1.15.3
- Built changelog from towncrier fragments

[skip ci]
2026-04-05 21:24:25 -07:00
f9397b7fa0 Review core-services tutorial: add SSH rationale, runner example, TODO notice
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 21:22:00 -07:00
5597e02467 Fix Homepage pod-selector for Immich (Helm labels → kustomize labels)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 12:12:48 -07:00
0f1143a5bd Merge pull request 'Migrate Immich from Helm to kustomize (v2.5.6 → v2.6.3)' (#324) from immich-kustomize-v2.6.3 into main 2026-04-04 12:09:40 -07:00
6cab5091ea Add storage-provisioner health check to minikube Ansible role
The storage-provisioner is a bare Pod with no controller. If the node
restarts via Docker Desktop (rather than `minikube start`), kubelet
restores static pods but bare pods are lost. Detect this and re-run
`minikube start` to restore addons.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 12:04:25 -07:00
3c819cf16e Fix Homepage migration date: 2025-12 → 2026-02 (per git history)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:53:48 -07:00
6e06efa6d0 Add AI-drafted disclaimer to no-helm-policy explanation doc
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:50:53 -07:00
64200a55c5 Migrate Immich from Helm chart to kustomize manifests (v2.5.6 → v2.6.3)
Replace the Helm chart deployment with plain kustomize manifests following
the Authentik pattern (separate deployments per component). Consolidate
the immich-storage ArgoCD app into the main immich app. Add no-helm-policy
doc establishing kustomize as the standard deployment mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:42:25 -07:00
464e3222d2 Document upstream fix for Prowler --registry bug (pending release)
PR #10470 merged 2026-03-30; initContainer workaround stays until a
Prowler release includes the fix (latest is 5.22.0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 20:21:19 -07:00
afb184fefc Add Sway fullscreen rule for RDR2 (gamescope broken on NVIDIA 580.x)
Gamescope 3.16.17 segfaults on NVIDIA 580.x in nested Wayland/Sway due
to explicit sync issues (ValveSoftware/gamescope#1662). Use a Sway
window rule to force RDR2 fullscreen instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:04:29 -07:00
5de2ed9f96 Add gaming.nix for ringtail: gamescope + consolidate Steam config
Move Steam config from configuration.nix to a dedicated gaming.nix module
and add gamescope for fullscreen/resolution management with Proton games.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 15:48:36 -07:00
306f580bdb Point Tempo at main-built container v2.10.3-75f9ba4
C0 follow-up: update tag from branch-built image to main-SHA image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:45:57 -07:00
75f9ba4943 Build Tempo container from source (2.10.3) (#323)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-dockerfile (tempo) (push) Successful in 6s
## Summary
- Add `containers/tempo/Dockerfile` — two-stage Go build from forge mirror, modeled on loki
- Switch kustomization from upstream `grafana/tempo` to `registry.ops.eblu.me/blumeops/tempo`
- Bump Tempo 2.10.1 → 2.10.3

## Test plan
- [ ] Kick off container build via `mise run container-build-and-release tempo`
- [ ] Update kustomization `newTag` with built image tag
- [ ] Deploy from branch: `argocd app set tempo --revision local-tempo-container && argocd app sync tempo`
- [ ] Verify Tempo health: `curl tempo.ops.eblu.me/ready`
- [ ] Verify traces flowing in Grafana Tempo datasource

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #323
2026-04-02 13:45:02 -07:00
b1e2811077 Upgrade Grafana 12.3.3 → 12.4.2 (#322)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-dockerfile (grafana) (push) Successful in 7s
## Summary

- Bumps Grafana from 12.3.3 to 12.4.2
- Patches 7 CVEs, notably CVE-2026-27880 (unauthenticated OOM DoS, CVSS 7.5) and CVE-2026-27879 (authenticated OOM via resample queries)
- No config changes required — reviewed alerting, datasources, OIDC, and feature toggles against 12.4.x breaking changes

## Breaking changes reviewed

| Change | Impact |
|--------|--------|
| Alerting: pending period applies to NoData/Error | Net positive — reduces noise from transient blips |
| Default notification uses empty receiver | No impact — we explicitly set `ntfy-infra` |
| Removed feature toggles (4) | No impact — none configured |
| OAuth ID token signature validation | Low risk — verify OIDC login post-deploy |
| OpsGenie deprecated | No impact — using webhook |

## Test plan

- [ ] Container build completes at forge
- [ ] Update kustomization.yaml with new image tag
- [ ] `argocd app set grafana --revision upgrade/grafana-12.4.2 && argocd app sync grafana`
- [ ] Verify Grafana UI loads at grafana.ops.eblu.me
- [ ] Verify OIDC login via Authentik
- [ ] Verify dashboards and datasources load
- [ ] Check alerting rules are intact

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #322
2026-04-02 11:33:19 -07:00
08d57ef4d4 Review pulumi reference doc: fix Tailscale auth, add how-to links
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 10:55:57 -07:00
1e67975acb Add changelog fragment for compensating control review
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 22:04:38 -07:00
baee7ae54b Review single-user-cluster control and add evidence collection card
Stamp single-user-cluster last-reviewed to 2026-04-01 after verifying
Tailscale ACLs and kubeconfig distribution. Add aspirational how-to card
documenting what PCI DSS evidence collection would look like (CCW,
artifacts, Drata workflow). Link from existing review process card.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 22:01:57 -07:00
a18a424866 Pin NixOS service versions via nixpkgs-services overlay (#321)
## Summary

- Add `nixpkgs-services` flake input pinned to a specific nixpkgs commit, with an overlay that pulls `forgejo-runner`, `snowflake`, and `k3s` from it instead of the rolling `nixpkgs`
- Dagger `flake-update` pipeline now excludes `nixpkgs-services` via `--exclude`
- Fix stale nix-container-builder version in service-versions.yaml (was 12.6.4, actually running 12.7.2)
- Add k3s and minikube to service-versions.yaml tracking
- Document the pinning approach in review-services how-to and ringtail reference

## Motivation

During service review, discovered that flake updates had silently upgraded forgejo-runner from 12.6.4 → 12.7.2 without updating service-versions.yaml. This "sneak-in upgrade" bypasses the service review process. The overlay ensures these three services only change versions deliberately.

## Test plan

- [ ] Verify `nix flake update` from `nixos/ringtail/` does not change `nixpkgs-services` lock entry
- [ ] Verify `mise run provision-ringtail` builds successfully with the overlay
- [ ] Confirm running service versions unchanged after deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #321
2026-04-01 21:37:57 -07:00
cfbf4cadbd Review argocd-cli reference doc: stamp last-reviewed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 20:35:52 -07:00
Forgejo Actions
2b7b21dc9b Update docs release to v1.15.2
- Built changelog from towncrier fragments

[skip ci]
2026-03-30 17:48:40 -07:00
4059b3d27b Add compensating controls framework and date-based report dirs (#320)
## Summary

- Add `compensating-controls.yaml` tracking 9 named controls that justify suppressed security findings
- Update all Prowler mutelist descriptions with `CC: <id>` references to named controls
- Add `mise run review-compensating-controls` task — surfaces stalest control with all codebase references
- Add [[review-compensating-controls]] how-to doc
- Organize Prowler and Kingfisher reports into `YYYY-MM-DD` subdirectories

### Compensating controls

| ID | Mitigates |
|----|-----------|
| `single-user-cluster` | Image cache abuse, RBAC breadth, system pod privileges |
| `tailscale-network-isolation` | Profiling endpoints, weak TLS, debug ports |
| `local-registry` | AlwaysPullImages gap |
| `sso-gated-admin-tools` | ArgoCD wildcard RBAC |
| `operator-managed-pods` | Tailscale proxy pod security settings |
| `ephemeral-privileged-jobs` | Prowler hostPID exposure |
| `trusted-ci-only` | Forgejo runner DinD |
| `init-container-isolation` | Grafana root init container |
| `observability-stack-audit` | Missing apiserver audit logging |

## Test plan

- [ ] `mise run review-compensating-controls` shows table and references
- [ ] `kubectl kustomize argocd/manifests/prowler/` renders correctly
- [ ] Sync prowler and kingfisher, verify next scan writes to dated subdirectory
- [ ] Grep for `CC:` in mutelist files — every muted finding should have at least one

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #320
2026-03-30 17:44:11 -07:00
a76e471d54 Add Prowler mutelist and fix kube-state-metrics seccomp (#319)
## Summary

- Add mutelist files to suppress expected/accepted Prowler CIS findings from components we don't control
- Mutelist files stored in `mutelist/` directory, grouped by category, merged at runtime via initContainer
- Fix missing seccomp `RuntimeDefault` profile on kube-state-metrics deployment

### Mutelist categories

| File | Checks | Covers |
|------|--------|--------|
| `apiserver.yaml` | 12 | Minikube apiserver flags |
| `control-plane.yaml` | 3 | Scheduler, controller-manager, kubelet |
| `core-pod-security.yaml` | 7 | System pods, Tailscale operator, Grafana init, Prowler hostPID, forgejo-runner |
| `rbac.yaml` | 3 | Built-in K8s roles, ArgoCD, CNPG |

Muted findings appear as `status=MUTED` in reports (not hidden), preserving audit trail.

### Not muted (follow-up)

- Alloy, Immich pods missing seccomp — need separate investigation (Helm/operator-managed)

## Test plan

- [ ] `kubectl kustomize argocd/manifests/prowler/` renders cleanly
- [ ] Trigger manual scan: `kubectl --context=minikube-indri -n prowler create job prowler-mutelist-test --from=cronjob/prowler`
- [ ] Verify initContainer merges successfully (check pod logs)
- [ ] Verify muted findings show as `MUTED` in report
- [ ] Sync kube-state-metrics and verify pod starts with seccomp profile

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #319
2026-03-30 17:22:31 -07:00
1e391f96bb Upgrade forgejo-runner 12.7.0 → 12.7.3, add service card
Patch upgrade picks up idempotent FetchTask API, offline registration
fix, cloudflare/circl security dep update, and custom gRPC user-agent.
No config defaults changed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 16:31:06 -07:00
0ec0246847 Remove doc-reviewer agent 2026-03-30 16:12:48 -07:00
77eebe507e Review Ansible reference doc: add missing roles, clarify IaC positioning
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 16:10:24 -07:00
c069f889d2 Harden borgmatic photos backup: restrict dirs, add keepalives + checkpoints
Restrict backup to library/ and upload/ only (skip regenerable encoded-video/,
thumbs/, backups/). Add SSH ServerAliveInterval to prevent broken pipe on long
transfers, and checkpoint_interval so interrupted backups save progress.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 10:30:28 -07:00
b000efd6c3 Fix Kingfisher CronJob exit code handling
Kingfisher exits 200 (findings) or 205 (validated findings) on success.
Normalize these to 0 so the CronJob completes instead of restarting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 07:16:02 -07:00
457ab19416 Scope Kingfisher scan to eblume user only on ringtail
Mirror repos cause scan failures (likely ephemeral storage or timeout).
Scan only eblume/ repos until we investigate the root cause.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 07:11:52 -07:00
2c1f0abefc Deploy Kingfisher v165768b-0fe0eed-nix (tmp permissions fix)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:54:57 -07:00
0fe0eed35a Fix Kingfisher container: make /tmp world-writable
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 22s
Container runs as user 65534 (nobody) but /tmp was owned by root.
Set sticky bit + world-writable (1777) like a standard /tmp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:53:34 -07:00
14f366f993 Deploy Kingfisher v165768b-c494b62-nix (/tmp fix)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:51:17 -07:00
c494b62713 Fix Kingfisher container: add /tmp directory
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 24s
Kingfisher needs a writable temp directory for git clones and scanning.
Nix containers don't create /tmp by default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:49:59 -07:00
b01afb1c1d Deploy Kingfisher v165768b-aa9cc70-nix (bash fix)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:47:26 -07:00
aa9cc709ec Fix Kingfisher container: add bash and coreutils for CronJob shell
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 22s
Nix containers don't include a shell by default. The CronJob needs
/bin/bash for the inline script that generates timestamped filenames.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:45:39 -07:00
f0c6845f0f Deploy custom Kingfisher container v165768b-f9206bf-nix
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:42:24 -07:00
f9206bf10b Build custom Kingfisher container from sporked deploy branch (#318)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 12s
## Summary

- Add Dockerfile for Kingfisher built from source (sporked deploy branch)
- Multi-stage: Rust build with Boost/vectorscan, debian-slim runtime
- Switch CronJob from upstream `ghcr.io/mongodb/kingfisher` to `registry.ops.eblu.me/blumeops/kingfisher`
- Add kingfisher to service-versions.yaml (version tracks upstream main SHA)
- Document spork workflow in CLAUDE.md

## Test plan

- [ ] Build container: `mise run container-build-and-release kingfisher 1d37d29`
- [ ] Verify image on registry: `mise run container-list`
- [ ] Update kustomization newTag
- [ ] Sync ArgoCD kingfisher app from branch
- [ ] Trigger manual CronJob and verify scan completes
- [ ] Verify reports on sifaka

Reviewed-on: #318
2026-03-30 06:34:49 -07:00
99a1a49175 Revert kingfisher skip in container build workflow
Kingfisher will build via Nix on ringtail instead of Dockerfile on
indri, so the skip is no longer needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:42:38 -07:00
a842b9c1e8 Skip kingfisher in CI container builds
Kingfisher's Rust + Boost/vectorscan build exhausts indri's memory
(aws-sdk-ec2 alone needs 2-3GB for rustc). Build locally on Gilbert
and push manually until we have a beefier build host.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:52:07 -07:00
924325ebd5 Fix DinD seccomp profile broken by RuntimeDefault rollout
The pod-level RuntimeDefault seccomp profile (07e9c81) overrides the
DinD sidecar's privileged flag in newer Kubernetes versions, blocking
Docker daemon syscalls. Set Unconfined explicitly on the DinD container
while keeping RuntimeDefault on the runner container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 17:09:57 -07:00
9115044219 spork-create: check for conflicting branch names before sporking
Bail if upstream already has branches named 'blumeops' or 'deploy',
which would conflict with the spork branch naming strategy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 09:36:53 -07:00
99df78664e Note upstream history rewrite as a spork sync failure mode
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 08:22:11 -07:00
e1429fc3e7 Document Spork Attack supply-chain risk
Upstream can push workflows (in .github/ or .forgejo/) that execute
on our runners via any trigger mechanism including cron. Runner label
mismatch is the current defense but is fragile. No complete fix exists
short of disabling Actions entirely.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 08:16:09 -07:00
4e09aed9d8 spork-create: fix ambiguous main branch checkout in mirror-sync template
git checkout <branch> is ambiguous when both origin and mirror remotes
have the same branch name. Use -B to explicitly create from origin.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:28:00 -07:00
21b465bf18 spork-create: enable Actions on fork after creation
Forks from mirror repos have has_actions disabled by default.
PATCH the repo settings to enable it so the mirror-sync workflow runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 00:26:25 -07:00
007b4fecdd spork-create: preflight all checks before any mutations
Check local path, mirror existence, and fork absence upfront.
Fail fast with clear error messages before touching forge or disk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:06:10 -07:00
ee6f516b2b spork-create: bail if local clone already exists
Trying to add remotes to an existing clone gets the origin wrong.
Better to error out and let the user handle it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 23:03:38 -07:00
6ecfaf02b6 Add spork strategy: tooling and documentation
Spork-create mise task sets up a floating-branch soft-fork of a
mirrored upstream project with daily mirror-sync via Forgejo Actions.
Includes explanation card, how-to guides for setup and branch
management, and the spork-create uv script.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 22:58:10 -07:00
bb60369956 Simplify Kingfisher CronJob to HTML-only output
Remove the second scan pass for JSON — one format is enough for now.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:50:54 -07:00
2808ffd450 Document Kingfisher secret scanner service
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:47:37 -07:00
35705faca2 Add Kingfisher secret scanner CronJob (#317)
## Summary

- Deploys MongoDB Kingfisher as a weekly CronJob on minikube-indri
- Scans all Forgejo repos (eblume + all orgs) for leaked secrets with live validation
- Produces timestamped HTML and JSON reports on sifaka NFS (`/volume1/reports/kingfisher/`)
- Forgejo API token sourced from 1Password via ExternalSecret
- Uses official `ghcr.io/mongodb/kingfisher:1.91.0` container image
- Runs Sunday 4am (after Prowler's 3am k8s scan)

## Resources

- CronJob, PV/PVC (sifaka NFS), ExternalSecret
- ArgoCD Application with manual sync + CreateNamespace

## Test plan

- [x] Sync ArgoCD `apps` app to pick up new kingfisher Application
- [x] Set `--revision feature/kingfisher-cronjob` on kingfisher app
- [x] Verify ExternalSecret creates the `kingfisher-forgejo-token` Secret
- [x] Trigger manual job: `kubectl create job --from=cronjob/kingfisher kingfisher-manual -n kingfisher --context=minikube-indri`
- [ ] Verify reports appear on sifaka at `/volume1/reports/kingfisher/`
- [ ] After merge: set `--revision main` and re-sync

Reviewed-on: #317
2026-03-28 21:39:55 -07:00
6b1717bf28 Add Kingfisher secret scanner to prek hooks
Running alongside TruffleHog to compare coverage. Kingfisher uses
staged-only mode with validation disabled for fast, offline-safe
pre-commit checks. Validation will be enabled in the planned cron job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:07:07 -07:00
Forgejo Actions
7fb6eff388 Update docs release to v1.15.1
- Built changelog from towncrier fragments

[skip ci]
2026-03-28 09:15:21 -07:00
2bd1611ac1 Document sifaka NFS/Tailscale TUN troubleshooting
Sifaka's Tailscale can revert to userspace networking after package
updates, causing NFS mounts to fail because the NFS daemon sees
127.0.0.1 instead of the client's Tailscale IP. Added troubleshooting
how-to doc and updated sifaka reference card with frigate export and
TUN requirement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 09:12:00 -07:00
8cbd412380 Update services-check: forgejo uses launchctl not brew
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 08:21:51 -07:00
3017f759a7 Migrate Forgejo from Homebrew to source build (#316)
## Summary

- Migrate Forgejo from Homebrew to source-built binary with mcquack LaunchAgent
- Matches the established pattern used by zot, caddy, and alloy
- Upgrades to v14.0.3 (7 security fixes: PKCE bypass, OAuth scope bypass, open redirect, and more)

## Changes

- **Ansible role**: Replace brew install/services with binary stat check + LaunchAgent
- **Paths**: `/opt/homebrew/var/forgejo` → `~/forgejo`, binary at `~/code/3rd/forgejo/forgejo`
- **Run user**: `forgejo` → `erichblume` (LaunchAgent user; SSH git user stays `forgejo`)
- **Docs**: Updated Forgejo reference card, restart-indri guide
- **Service review**: Stamped frigate-notify, cloudnative-pg, blumeops-pg as current

## One-time migration steps (manual, on indri)

1. Clone from Codeberg, add forge mirror remote
2. Check out v14.0.3, build with `make build && make forgejo`
3. Stop brew, `cp -a` data to `~/forgejo`, fix ownership
4. Run `provision-indri --tags forgejo`
5. Verify, then `brew uninstall forgejo`

## Data safety

- `cp -a` preserves everything (repos, SQLite DB, LFS, sessions, OAuth config)
- Brew version stays installed as rollback until verification passes
- No schema changes between 14.0.2 → 14.0.3

Reviewed-on: #316
2026-03-28 08:19:23 -07:00
3cb4303a54 Add changelog for Immich resource/probe fix
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 22:37:15 -07:00
b632cd9ffb Fix Immich resource limits and probe timeouts
Resources were under wrong Helm value keys (server.resources,
machine-learning.resources) and never applied to pods. Move to correct
bjw-s chart paths (*.controllers.main.containers.main.resources).

Increase liveness/readiness probe timeouts from 1s to 5s to prevent
kubelet from killing healthy-but-busy pods during ML inference load.
Remove CPU limits (keep requests only) to avoid throttling.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 22:36:32 -07:00
c78b86c72c Add offsite backup for immich photo library to BorgBase (#315)
## Summary

- Adds a second borgmatic config (`photos.yaml`) that backs up `/Volumes/photos` (sifaka SMB mount, ~128 GB) to a dedicated BorgBase repo (`immich-photos`), running daily at 4 AM
- Separate launchd agent (`mcquack.eblume.borgmatic-photos`) so photo backups run independently from the main backup
- Refactors `borgmatic_metrics` script to support multiple repos with a `repo` Prometheus label
- Updates Grafana "Borg Backups" dashboard with a `repo` template variable so you can filter/compare repos
- Docs updated: `backups.md`, `borgmatic.md`

## Prerequisites (manual)

- [x] Create `immich-photos` repo on BorgBase with same SSH key
- [ ] Upgrade BorgBase plan to Small ($24/yr) if currently on free tier (128 GB exceeds 10 GB limit)
- [ ] After deploy: `borg init` the new repo (borgmatic does this automatically on first run)

## Test plan

- [ ] Dry run: `mise run provision-indri -- --check --diff --tags borgmatic,borgmatic_metrics`
- [ ] Deploy borgmatic role and verify both configs deployed
- [ ] Run `borgmatic --config ~/.config/borgmatic/photos.yaml create --verbosity 1` manually for first backup (will take hours)
- [ ] Verify metrics script collects from both repos: `~/.local/bin/borgmatic-metrics && cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom`
- [ ] Sync grafana-config in ArgoCD and verify dashboard repo selector works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #315
2026-03-27 19:43:05 -07:00
ca0c9354ee Add borgmatic backups for authentik and immich databases (#314)
## Summary

- Add `authentik` database (blumeops-pg cluster) to borgmatic pg_dump backups
- Add `immich` database (immich-pg cluster) to borgmatic pg_dump backups
- For immich-pg: new borgmatic managed role with `pg_read_all_data`, ExternalSecret, Tailscale LoadBalancer service, and Caddy L4 TCP proxy on port 5433
- Update backup docs to reflect all four CNPG databases + mealie SQLite

## Deploy plan

Deploy order matters — k8s resources must exist before ansible can route to them:

1. **ArgoCD (databases app):** sync to pick up immich-pg borgmatic role, ExternalSecret, and Tailscale service
   ```
   argocd app set blumeops-pg --revision feature/borgmatic-all-pg-backups
   argocd app sync blumeops-pg
   ```
2. **Wait** for `immich-pg-tailscale` service to get a Tailscale IP and `immich-pg.tail8d86e.ts.net` to resolve
3. **Ansible (caddy):** deploy Caddy L4 route for port 5433
   ```
   mise run provision-indri -- --tags caddy
   ```
4. **Ansible (borgmatic):** deploy updated config and .pgpass
   ```
   mise run provision-indri -- --tags borgmatic
   ```
5. **Verify:** trigger a manual borgmatic run and check all four pg_dump streams succeed
   ```
   borgmatic --verbosity 1 2>&1 | grep -E '(Dumping|ERROR)'
   ```

## Test plan

- [x] `kubectl kustomize` builds cleanly
- [x] `ansible --check --diff` for borgmatic and caddy show expected changes
- [ ] ArgoCD sync succeeds for databases app
- [ ] `immich-pg.tail8d86e.ts.net` resolves
- [ ] `pg.ops.eblu.me:5433` accepts connections
- [ ] `borgmatic --verbosity 1` dumps all four databases without errors

Reviewed-on: #314
2026-03-27 16:59:58 -07:00
33463764d1 Add QArt Tuner: QR code art generator with interactive web UI
Single-file Go tool implementing the QArt technique (Russ Cox, 2012)
using only the public rsc.io/qr API. Generates QR codes whose data
modules form a recognizable image by exploiting error correction
freedom via GF(2) Gaussian elimination.

Includes a web UI with live-updating sliders for version, mask,
rotation, dx/dy offset, and scale. Keyboard shortcuts for rapid
iteration. Also works as a CLI for batch generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 15:33:36 -07:00
66a47738dd Add ringtail post-deploy maintenance: kernel check, generation pruning, GC
Update manage-lockfile doc with post-deploy steps (kernel update detection,
reboot guidance, generation pruning). Add prune-ringtail-generations mise
task that keeps the 5 most recent generations plus the most recent one
matching the booted kernel for safe rollback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:55:45 -07:00
a5b33591d3 Update ringtail flake inputs (nixpkgs, home-manager)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:37:43 -07:00
831b82950a Upgrade nvidia-device-plugin v0.18.2 → v0.19.0 and add reference card
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:19:24 -07:00
687e972713 Review CV doc and close build-dep review gap
Fix stale CV service doc (URL, forge domain, container tag) and add
guidance for reviewing build-time dependencies in private forge repos
during service reviews.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:12:38 -07:00
2c1652604b Reduce PodNotReady alert lookback from 5m to 60s
The 5-minute lookback window kept stale data from terminated pods
visible during rollouts, causing the alert to sit in Pending for
~5 minutes after every routine deployment. 60s still covers two
scrape cycles (30s interval) while clearing stale data much faster.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:48:37 -07:00
a37012385f Tighten ArgoCDAppOutOfSync alert timing to clear faster after sync
Reduced `for` from 30m to 5m and lookback window from 5m to 1m. The old
values caused alerts to linger long after apps returned to Synced state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:44:09 -07:00
fc8d2cdb12 Add preserve/* branch protection and document Pyroscope blocker
branch-cleanup: Add PROTECTED_PREFIXES with preserve/* exclusion so
preserved work-in-progress branches are never deleted.

observability.md: Document Pyroscope profiling work on branch
preserve/pyroscope-profiling/pr-313, blocked on ringtail kernel
sysctl settings (kptr_restrict=0, perf_event_paranoid≤1). Also
document Faro/RUM as future potential with privacy considerations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:32:25 -07:00
f97b5c9d5d Deploy Homepage v1.11.0-e375859
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:25:07 -07:00
e375859221 Upgrade Homepage container to v1.11.0
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dockerfile (homepage) (push) Successful in 5m47s
Minor release with new widgets (Tracearr, SparklyFitness), Seerr rename,
and dependency bumps. No breaking changes for our config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:17:36 -07:00
a5e51bd600 Review tailscale-setup tutorial: fix inaccuracies
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 07:44:36 -07:00
4ae55f9bf4 Review kubernetes-bootstrap tutorial: fix inaccuracies
- Fix k3s table entry (BlumeOps uses k3s on ringtail)
- Fix broken tailscale serve command (minikube ip returns IP, not port)
- Rewrite NFS section to match actual static PV/PVC binding pattern
- Fix "BluemeOps" typo
- Add last-reviewed frontmatter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 16:16:33 -07:00
796baaa41a Upgrade External Secrets Operator v2.2.0 + migrate Helm to kustomize (#312)
## Summary

- Upgrade External Secrets Operator from v1.3.2 (helm-chart-2.0.0) to v2.2.0
- Migrate from Helm chart deployment to static kustomize manifests, matching the repo's kustomize-first pattern
- Merge separate `-config` ArgoCD apps into the main operator apps (6 → 4 apps)
- Clean up Helm-specific labels (`helm.sh/chart`, `managed-by: Helm`)
- Update README example from v1beta1 to v1 API

## Breaking changes assessment

Low risk — v2.0.0 removed Alibaba and Device42 providers (we use neither). No templating changes affect us. All ExternalSecrets already use v1 API.

## Deployment steps

1. Sync CRDs first on both clusters (new CRD version)
2. Sync operator apps (now kustomize-based)
3. Verify ClusterSecretStore and all ExternalSecrets are healthy
4. Delete orphaned config apps: `argocd app delete external-secrets-config` and `-config-ringtail`
5. `mise run services-check`

Reviewed-on: #312
2026-03-25 15:56:41 -07:00
b97e37543f Deploy Tor Snowflake proxy on ringtail (#311)
## Summary

- Add Snowflake proxy as a native systemd service on ringtail (NixOS)
- Uses `pkgs.snowflake` from nixpkgs (v2.11.0)
- Hardened systemd unit with DynamicUser, ProtectSystem=strict, 512MB memory limit
- Prometheus metrics enabled on localhost:9999

## What is Snowflake?

A Tor pluggable transport that helps censored users reach the Tor network via WebRTC. **This is NOT a Tor exit node** — traffic exits through Tor exit nodes operated by others. The proxy operator cannot see traffic content (double-encrypted) and destination servers never see the proxy's IP.

## Changes

- `nixos/ringtail/configuration.nix` — new systemd service definition
- `docs/reference/services/snowflake-proxy.md` — service reference card
- `docs/reference/infrastructure/ringtail.md` — updated systemd services section
- `service-versions.yaml` — added entry (type: nixos)

## Deploy plan

After review, deploy via `mise run provision-ringtail`. Service starts automatically.

## Test plan

- [ ] `mise run provision-ringtail` succeeds
- [ ] `ssh ringtail 'systemctl status snowflake-proxy'` shows active
- [ ] `ssh ringtail 'journalctl -u snowflake-proxy --no-pager -n 20'` shows broker connections
- [ ] `ssh ringtail 'curl -s localhost:9999/metrics'` returns Prometheus metrics

Reviewed-on: #311
2026-03-24 20:51:40 -07:00
Forgejo Actions
243a862901 Update docs release to v1.15.0
- Built changelog from towncrier fragments

[skip ci]
2026-03-24 19:51:17 -07:00
a1c2e0833d Include link to upstream prowler issue 2026-03-24 19:48:43 -07:00
75fd5b029d Use prowler image for registry enumeration init container
The kubectl image lacks curl/python3. Use the prowler image
(which has Python) with a pure-Python urllib script instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:36:02 -07:00
d365e79068 Add kubectl image tag to prowler kustomization
The image scan init container uses the kubectl image for curl/python.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:34:24 -07:00
d90be355dd Work around Prowler --registry bug with init container
Prowler's --registry flag doesn't work (registry args not passed
to ImageProvider constructor, prowler-cloud/prowler PR #10128
regression). Use an init container to enumerate images from the
zot catalog API and generate an image list file instead.

See: https://github.com/eblume/prowler/tree/fix/image-provider-registry-args

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 17:29:25 -07:00
7d1ae1a57e Fix prowler image and IaC scan arguments
Image scan: add https:// scheme to registry URL.
IaC scan: use --scan-repository-url (Prowler clones the repo
itself), removing the need for an init container. The flag
is --scan-path for local dirs, --scan-repository-url for git.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:58:33 -07:00
7f2d53bc77 Fix prowler image scan registry URL (add https:// scheme)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:57:05 -07:00
38281a35fd Update prowler container tag to 6960243 (with Trivy)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:54:36 -07:00
fe201a495c Add Prowler IaC scanning of blumeops repo (Saturday 2am)
Clone repo in init container, scan Dockerfiles and K8s manifests
with Prowler's IaC provider (Trivy). Reports written to
sifaka:/volume1/reports/prowler-iac/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:49:38 -07:00
696024306c Add Prowler image vulnerability scanning for blumeops containers
All checks were successful
Build Container / detect (push) Successful in 39s
Build Container / build-dockerfile (prowler) (push) Successful in 10m15s
Add Trivy to the Prowler container for image and IaC scanning.
New CronJob (Saturday 3am) scans all blumeops/* images in the
registry for CVEs, embedded secrets, and Dockerfile misconfigs.
Reports written to sifaka:/volume1/reports/prowler-images/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:43:08 -07:00
07e9c810ca Add RuntimeDefault seccomp profiles to all managed workloads
Addresses 32 CIS Kubernetes Benchmark failures from Prowler scan
(core_seccomp_profile_docker_default). Applied pod-level seccomp
RuntimeDefault to 18 deployments/statefulsets and 2 cronjobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:19:40 -07:00
87f56f78b3 Update container tags to d021b35 (post-merge rebuild)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:09:07 -07:00
d021b3534f Deploy Prowler CIS scanner (#310)
All checks were successful
Build Container / detect (push) Successful in 4s
Build Container / build-dockerfile (prowler) (push) Successful in 10s
## Summary
- Deploy Prowler 5 as a weekly CronJob on minikube-indri for CIS Kubernetes Benchmark v1.11 scanning
- Custom slim container build (strips PowerShell, Trivy, and non-K8s providers from upstream)
- Reports (HTML, CSV, JSON-OCSF) written to NFS share on sifaka at `/volume1/reports/prowler/`
- Read-only ClusterRole for pod, RBAC, and control plane inspection
- Host path mounts + hostPID for kubelet file permission checks

## Follow-ups
- Mirror prowler-cloud/prowler on forge for supply chain control
- Build and push container image, update kustomization.yaml newTag
- Consider adding k3s-ringtail scanning (core + RBAC checks only)

## Test plan
- [ ] Build container: `mise run container-release prowler v5.22.0`
- [ ] Update `argocd/manifests/prowler/kustomization.yaml` newTag to built image tag
- [ ] Sync ArgoCD: `argocd app sync apps && argocd app set prowler --revision deploy-prowler && argocd app sync prowler`
- [ ] Trigger manual job: `kubectl create job --from=cronjob/prowler prowler-manual -n prowler --context=minikube-indri`
- [ ] Verify reports appear on sifaka NFS share
- [ ] `mise run services-check`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #310
2026-03-24 16:08:09 -07:00
3b7abbd689 Update container tags to fd0bebb (post-merge rebuild)
C0 follow-up to #309: update kustomization newTag for all containers
rebuilt by the merge (authentik, authentik-redis, ntfy, alloy).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 13:39:26 -07:00
fd0bebb0fc Localize authentik-redis container (#309)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dockerfile (alloy) (push) Successful in 12s
Build Container / build-dockerfile (ntfy) (push) Successful in 11s
Build Container / build-nix (alloy) (push) Successful in 20s
Build Container / build-nix (authentik) (push) Successful in 6m10s
Build Container / build-nix (authentik-redis) (push) Successful in 20s
Build Container / build-nix (ntfy) (push) Successful in 6s
## Summary

- Replace upstream `docker.io/library/redis:7-alpine` (Redis 7.4.8) with a nix-built container using Redis 8.2.3 from nixpkgs
- Introduce **attached service pattern**: `parent` field in service-versions.yaml, `<parent>-<component>` naming convention, and `assert pkgs.redis.version == version` in default.nix to prevent silent version drift on `flake.lock` updates
- Document the pattern in [[review-services]] so future attached services slot in cleanly
- Backfill `parent: grafana` on existing `grafana-sidecar` entry

## Version drift protection

1. `flake.lock` update bumps nixpkgs redis → `assert` in `default.nix` breaks `nix-build`
2. Developer updates `version` in `default.nix` → prek's `container-version-check` demands matching `service-versions.yaml` update
3. Both must agree before commit succeeds

## Test plan

- [ ] Build container from branch on ringtail (`mise run container-build-and-release authentik-redis`)
- [ ] Update kustomization `newTag` to branch-built image tag
- [ ] Sync authentik ArgoCD app from branch (`argocd app set authentik --revision localize-redis && argocd app sync authentik`)
- [ ] Verify Authentik login, session persistence, and task queue still work
- [ ] After merge: C0 follow-up to update `newTag` to the main-built image tag

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #309
2026-03-24 13:27:36 -07:00
fc45989a6c Decommission JobSync service (#308)
All checks were successful
Build Container / detect (push) Successful in 3s
## Summary

- Remove all JobSync infrastructure: ArgoCD app, k8s manifests, container build (nix), Caddy reverse proxy entry, Homepage dashboard entry, service-versions tracking, and all documentation
- Runtime teardown already completed: ArgoCD app cascade-deleted (removes deployment, PVC, service, ingress, external-secret), forge mirror deleted, 1Password item archived, local clone removed

## Motivation

Replacing JobSync with a datasette-based job tracking pipeline driven by mise tasks and a Claude agent frontend. JobSync's Next.js server actions don't expose a useful API for automation.

## Remaining manual steps after merge

- Provision Caddy to remove the stale proxy route: `mise run provision-indri -- --tags caddy`
- Sync Homepage: `argocd app sync homepage`
- Verify namespace cleanup on ringtail: `kubectl get ns jobsync --context=k3s-ringtail` (should be gone)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #308
2026-03-24 08:44:23 -07:00
0d422f5234 Update tooling dependencies (March 2026) (#307)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 2m51s
## Summary

Monthly tooling dependency update per [[update-tooling-dependencies]].

- **Prek hooks:** trufflehog v3.93.4→v3.94.0, ruff v0.15.2→v0.15.7, shfmt v3.12.0-2→v3.13.0-1, ansible-lint floor→26.3.0, ansible-core floor→2.18
- **Fly.io proxy:** nginx 1.28.2→1.29.6, Grafana Alloy v1.13.1→v1.14.1
- **Forgejo workflows:** actions/checkout v4.3.1→v6.0.2 (SHA-pinned across all 5 workflows)
- **Mise tasks:** tightened Python lower bounds — rich≥14.0.0, typer≥0.24.0, httpx≥0.28.1, pyyaml≥6.0.2

## Test plan

- [x] `prek run --all-files` passes
- [ ] Verify Fly.io deploy succeeds after merge (nginx minor bump + Alloy bump)
- [ ] Spot-check a workflow run with the new actions/checkout v6

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #307
2026-03-24 08:11:46 -07:00
0698013355 Review ArgoCD config tutorial: fix sync policy, typo, add cross-references
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 07:55:00 -07:00
bec554110a Upgrade Frigate 0.17.0-rc2 → 0.17.1, add motion retention tier
Bump from RC to latest stable (security fixes for config endpoint and
cross-camera auth). Add new 0.17 motion retention tier at 365 days,
reduce continuous from 180 to 30 days.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 07:30:18 -07:00
b96b4dad47 Move Alerts dashboard into Infrastructure Alerts folder
Uses the grafana_folder annotation to place the dashboard in the
existing folder created by alert rule provisioning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 21:20:14 -07:00
9024d41230 Add Grafana alerts dashboard for mobile-friendly alert overview
Two panels: currently firing alerts (firing/pending/noData/error) and
recent state changes. Refreshes every 30s. Uses Grafana's built-in
alertlist panel type — no datasource needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 21:16:54 -07:00
9efe5c97fe Fix authentik worker OOMKill: limit concurrency to 2
Dramatiq defaults to one worker process per CPU core. On ringtail (16 cores)
this spawned 16 processes, each loading the full Django app, exceeding the
1Gi memory limit and causing a crash loop (228 restarts over 7 days).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 21:05:16 -07:00
bd0ff30d3f Unify container build workflows (#306)
All checks were successful
Build Container / detect (push) Successful in 3s
## Summary
- Merges `build-container.yaml` and `build-container-nix.yaml` into a single workflow
- Detect job classifies each changed container by presence of `Dockerfile` and/or `default.nix`
- Dockerfile containers build on `k8s` (indri) via Dagger; Nix containers build on `nix-container-builder` (ringtail) via nix-build + skopeo
- Containers with both build files (alloy, nettest, ntfy) get built on both runners

## Test plan
- [ ] Push a change to a Dockerfile-only container (e.g. grafana) — verify it builds on k8s only
- [ ] Push a change to a nix-only container (e.g. jobsync) — verify it builds on nix-container-builder only
- [ ] Push a change to a dual container (e.g. ntfy) — verify it builds on both runners
- [ ] Test workflow_dispatch with a specific container name

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #306
2026-03-23 20:55:50 -07:00
4cc26ed5eb Update ntfy tag to main build v2.19.2-d1dac0c-nix
C0 fix-forward: switch from branch-built image to main-built image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 10:36:34 -07:00
d1dac0c241 Upgrade ntfy v2.17.0 → v2.19.2 (#305)
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 3s
Build Container (Nix) / build (ntfy) (push) Successful in 4s
Build Container / build (ntfy) (push) Successful in 11s
## Summary
- Upgrade ntfy from v2.17.0 to v2.19.2
- Update Dockerfile and Nix build definitions with new version, commit SHA, and hashes
- Add `subPackages = [ "." ]` to Nix build to handle new `tools/loadtest` module in upstream

## Upstream changes (no breaking changes)
- **v2.18.0:** Experimental PostgreSQL backend support
- **v2.19.0:** PostgreSQL read replica support, notification sound throttling
- **v2.19.1-2:** PostgreSQL bug fixes, web push race condition fix

## Test plan
- [ ] Container builds complete on Forgejo Actions (both Dockerfile and Nix)
- [ ] Update kustomization.yaml `newTag` to the built nix image tag
- [ ] `argocd app set ntfy --revision upgrade/ntfy-v2.19.2 && argocd app sync ntfy`
- [ ] Verify ntfy health: `curl https://ntfy.ops.eblu.me/v1/health`
- [ ] Send a test notification

Reviewed-on: #305
2026-03-23 10:32:06 -07:00
06e721841c Review 12 reference docs: fix stale image refs, expand stubs, add cross-refs
Replace hardcoded image tags in Quick Reference tables with pointers to
kustomization manifests (tags drift with every container release). Fix
Prometheus CNPG scrape target, remove misleading .ts.net URLs, expand
external-secrets stub, add backup/disaster-recovery cross-references.
Limit doc-reviewer agent to one doc per cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:51:57 -07:00
3750428b58 Fix ArgoCD apps app permanent OutOfSync
Remove `group: ""` from ignoreDifferences in tailscale-operator and
tailscale-operator-ringtail — ArgoCD normalizes away the empty string
field, so the live state never matches git.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 20:42:37 -07:00
044ad7dad7 Revert fly/start.sh to polling loop — tailscale wait needs v1.96.2+
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m33s
The Fly container pulls from tailscale/tailscale:stable which is still
v1.94.2. The `tailscale wait` command doesn't exist until v1.96.2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 19:44:47 -07:00
e9b8e3d80b Revert Tailscale operator to v1.94.2 — images not yet published
v1.96.3 exists as a GitHub release but Docker Hub images for both
tailscale/tailscale and tailscale/k8s-operator haven't been published
yet (v1.94.2 is still latest). Revert the image tags; the fly/start.sh
`tailscale wait` improvement and review date stamps are retained.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 19:41:40 -07:00
2e46f99820 Upgrade Tailscale operator v1.94.2 → v1.96.3 (#304)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 7m0s
## Summary

- Bump Tailscale operator, proxy containers, and init containers from v1.94.2 to v1.96.3 across both clusters (indri + ringtail via shared base kustomization)
- Replace hand-rolled `until tailscale status` polling loop in `fly/start.sh` with `tailscale wait --timeout 60s` (new in v1.96.2)
- Stamp kube-state-metrics review date (already current at v2.18.0)

## Notable upstream changes (v1.94.2 → v1.96.3)

- Go upgraded from 1.25 to 1.26
- `tailscale wait` command — blocks until daemon is running + interface has IP
- AuthKey policy now applies only when users are not logged in (behavioral change)
- Peer Relay improvements (metrics, EC2 IMDS, UDP socket scaling)
- UPnP stability fixes

## Deploy plan

1. Merge PR
2. Sync tailscale-operator on indri: `argocd app sync tailscale-operator`
3. Sync tailscale-operator on ringtail: `argocd app sync tailscale-operator-ringtail --server ringtail...`
4. Verify proxy pods roll with new image: `kubectl --context=minikube-indri -n tailscale get pods`
5. Verify ingress connectivity (spot-check a few `*.tail8d86e.ts.net` services)
6. Rebuild + deploy Fly proxy container (separate step, picks up `tailscale wait` change)

## Test plan

- [ ] ArgoCD diff looks clean for both apps before sync
- [ ] Proxy pods on indri come up healthy with v1.96.3 images
- [ ] Proxy pods on ringtail come up healthy with v1.96.3 images
- [ ] Tailscale ingress services remain reachable (e.g., grafana, prometheus)
- [ ] Fly proxy rebuild deploys successfully with `tailscale wait`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #304
2026-03-22 19:31:22 -07:00
Forgejo Actions
262299c82a Update docs release to v1.14.3
- Built changelog from towncrier fragments

[skip ci]
2026-03-22 18:20:41 -07:00
2cb0ce4289 Review and correct Tailscale reference doc
Fix wrong ACL path, add missing device tags (ringtail, per-service tags,
ci-gateway, flyio-proxy), correct access matrix (PyPI→DevPI, homelab
grants), add homelab→homelab SSH rule, document auto approvers section,
and add last-reviewed frontmatter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 18:18:45 -07:00
f7221f3990 Update ringtail flake inputs
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 17:57:54 -07:00
6d65e6928c C2: Deploy infrastructure alerting pipeline (#303)
## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303
2026-03-22 14:52:56 -07:00
f1620abb17 Improve Frigate health checks to catch NFS and camera failures
Replace single aggregate camera_fps check with per-camera FPS validation
and NFS storage accessibility check. Motivated by an outage where Frigate
API responded OK but NFS mount was inaccessible, causing "no frames" in UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 09:55:53 -07:00
dcab489b60 agent memory ignore 2026-03-21 19:03:21 -07:00
810340a328 Update service-versions.yaml for loki 2026-03-20 16:10:19 -07:00
531a49abeb C0 update deployment for loki to 3.6.7 2026-03-20 16:06:29 -07:00
f9426b734c Update loki to 3.6.7 (#302)
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (loki) (push) Successful in 1s
Build Container / build (loki) (push) Successful in 6s
Reviewed-on: #302
2026-03-20 16:02:28 -07:00
0f0ee2a319 Update docs and kiwix kustomization tags to 613f05d builds
Also catches kiwix's transmission sidecar up from v4.0.6-r4 to
v4.1.1-r1, matching the torrent service (upgraded in PR #282 but
the kiwix sidecar was missed). No breaking changes — old RPC
protocol is supported through 4.x.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 06:40:49 -07:00
3d2a97aaf9 Update kustomization tags to OCI-labeled builds (613f05d)
Point all services at the 613f05d images which carry the new
consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4
version, no matching build) and docs/quartz (no 613f05d build).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 06:34:12 -07:00
613f05dfde Add consistent OCI labels to all container Dockerfiles
All checks were successful
Build Container (Nix) / build (miniflux) (push) Successful in 2s
Build Container (Nix) / build (navidrome) (push) Successful in 2s
Build Container / build (devpi) (push) Successful in 41s
Build Container (Nix) / build (nettest) (push) Successful in 15s
Build Container / build (grafana-sidecar) (push) Successful in 1m27s
Build Container / build (grafana) (push) Successful in 3m23s
Build Container (Nix) / build (ntfy) (push) Successful in 3m19s
Build Container (Nix) / build (prometheus) (push) Successful in 1s
Build Container (Nix) / build (quartz) (push) Successful in 1s
Build Container (Nix) / build (runner-job-image) (push) Successful in 1s
Build Container (Nix) / build (teslamate) (push) Successful in 2s
Build Container (Nix) / build (transmission) (push) Successful in 2s
Build Container (Nix) / build (transmission-exporter) (push) Successful in 1s
Build Container (Nix) / build (unpoller) (push) Successful in 1s
Build Container / build (kiwix-serve) (push) Successful in 1m17s
Build Container / build (kubectl) (push) Successful in 41s
Build Container / build (homepage) (push) Successful in 8m21s
Build Container / build (mealie) (push) Successful in 1m1s
Build Container / build (loki) (push) Successful in 8m21s
Build Container / build (miniflux) (push) Successful in 2m24s
Build Container / build (nettest) (push) Successful in 14s
Build Container / build (ntfy) (push) Successful in 8m33s
Build Container / build (prometheus) (push) Successful in 37s
Build Container / build (quartz) (push) Successful in 19s
Build Container / build (navidrome) (push) Successful in 10m36s
Build Container / build (runner-job-image) (push) Successful in 3m18s
Build Container / build (transmission) (push) Successful in 20s
Build Container / build (transmission-exporter) (push) Successful in 21s
Build Container / build (unpoller) (push) Successful in 11s
Build Container / build (teslamate) (push) Successful in 4m42s
Every container now carries title, description, version, source, and
vendor labels per the OCI image spec. Version is derived from the
existing CONTAINER_APP_VERSION ARG at build time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 20:42:00 -07:00
c92b949a20 Fix UID sed to target root-level dashboard uid only
The top-level "uid" in Grafana dashboard JSON is at 2-space indent
near the end of the file, not the first occurrence. Match on ^  "uid"
to avoid clobbering nested datasource uid references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 12:56:50 -07:00
334fbbb9e3 Fix TeslaMate/UnPoller dashboard UID sed clobbering datasource refs
The previous sed replaced ALL "uid" fields in dashboard JSON files,
including datasource references inside panels, causing dashboards to
go dark. Scope the replacement to only the first occurrence (the
top-level dashboard UID) using GNU sed 0,/pattern/ addressing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 12:53:00 -07:00
6f88baeb91 Fix Grafana starred dashboards lost on pod restart
Add init container to pre-populate ConfigMap dashboards before Grafana
starts, eliminating the race between the sidecar and the provisioner
that caused dashboard DB records to be deleted and re-created with new
IDs. Also stamp stable UIDs on TeslaMate and UnPoller dashboards
fetched from upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 12:40:44 -07:00
0dffdb9974 Add Claude Code subagents for infrastructure workflows
Four project-scoped subagents that formalize existing mise task
workflows as constrained, specialized AI agents:
- infra-health: background health monitor (wraps services-check)
- doc-reviewer: persistent-memory documentation reviewer
- change-classifier: C0/C1/C2 triage before work begins
- mikado-navigator: C2 chain state advisor (wraps docs-mikado)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:57:36 -07:00
ef8c2118a1 Standardize USAGE pragmas and typer parsing across mise tasks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:42:01 -07:00
86220b7b88 Update Prometheus deployment to v3.10.0-0d27797
C0 fix-forward: update kustomization newTag and mark service reviewed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 08:46:07 -07:00
0d2779762a Upgrade Prometheus to v3.10.0 (#301)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (prometheus) (push) Successful in 2s
Build Container / build (prometheus) (push) Successful in 44m11s
## Summary
- Bump Prometheus from v3.9.1 to v3.10.0 in custom container Dockerfile
- v3.10.0 adds distroless Docker image variants, new PromQL `fill` operators, and performance improvements
- Dagger build tested locally — builds cleanly

## Remaining after merge
- Update `kustomization.yaml` newTag with the auto-built image tag
- Update `service-versions.yaml` (last-reviewed + current-version)
- ArgoCD sync

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #301
2026-03-18 07:47:46 -07:00
528d3da327 Review power.md: add ringtail, mark reviewed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 07:37:31 -07:00
e0dbcbd997 Update retention changelog to reflect final PVC decision
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:46:55 -07:00
21ddc74cdc Revert PVC size changes, add hostpath comment
StatefulSet volumeClaimTemplates are immutable and minikube's hostpath
provisioner doesn't enforce PVC size limits anyway. Add comments noting
the data grows freely on the 1.8TB backing disk.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:46:17 -07:00
ef199b70f0 Increase Prometheus and Loki data retention
Prometheus: 15d → 10y (3650d), PVC 20Gi → 200Gi
Loki: 31d (744h) → 365d (8760h), PVC 20Gi → 50Gi

Indri has 1.6 TB free on the minikube backing disk — the previous
15-day Prometheus retention was losing valuable long-term metrics data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:44:00 -07:00
50d3b3b21e Rename Borgmatic to Borg Backups on Homepage
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:34:13 -07:00
e8bdecdb11 Rename Borgmatic dashboard to Borg Backups, add duration graph
Rename dashboard title since borgmatic is just the execution layer.
Add Backup Duration Over Time panel next to New Data Per Backup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:33:27 -07:00
8425f56dc3 Add Fly.io dashboard to Homepage admin bookmarks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:29:44 -07:00
64afd40a29 Fix Grafana widget fields (lowercase) and hide Miniflux read count
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:28:41 -07:00
98584d0d67 Trim Homepage widget metrics for cleaner layout
- Forgejo: show only notifications and pull requests
- Jellyfin: show only movies/series/episodes, hide now playing
- Grafana: hide data sources, show dashboards and alerts only

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:26:15 -07:00
443e090ec6 Enable equal height tiles in Homepage groups
Add useEqualHeights: true so service tiles within each row expand to
match the tallest tile, fixing uneven layout from widget metrics.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:23:53 -07:00
b0ce9be30b Fix Homepage layout: use row style with columns for full-width groups
style: row makes each group span the full page width (one per row),
while columns: 4 tiles services horizontally within each group.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:21:44 -07:00
816fd552f0 Set Homepage to single-column group layout
Add maxGroupColumns: 1 so each category gets its own full-width row,
with service tiles arranged side-by-side within each group.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:19:40 -07:00
96d0f668fd Reorganize Homepage groups: add Home, move Grafana to Infrastructure
Move NVR, Jellyfin, and DJ to new Home group. Move Grafana from Content
to Infrastructure. Switch all layout groups from column to row style.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:15:52 -07:00
3e9873d669 Fix borgmatic backup: use correct kubectl context on indri
The Mealie SQLite dump hook used `minikube-indri` (the context name on
gilbert), but on indri itself the context is just `minikube`. This caused
the before_backup hook to fail, aborting all backups since the hook was added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:07:44 -07:00
cfe3391f1a Bump Frigate retention and add recording health check
Increase retention: continuous 3→180d, detections 14→30d, alerts 30→730d.
Plenty of NFS headroom (~9.4 TiB free, ~6.6 GB/day for one camera).

Add frigate-recording check to services-check that verifies camera_fps > 0,
which would have caught the 6-day outage from the mqtt config removal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 18:24:11 -07:00
6617e44e5b Fix Frigate crash: re-add required mqtt config section
Frigate's config schema requires an `mqtt` field even when MQTT isn't
used. Commit 40f1568 removed it along with Mosquitto, causing Frigate
to fail validation on startup. Add `mqtt.enabled: false` to satisfy
the schema without needing a broker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 18:10:23 -07:00
4f99b7edaa Update alloy kustomizations to local container tags
Point alloy-k8s at v1.14.0-61f02a0 (Dockerfile) and both ringtail
deployments at v1.14.0-61f02a0-nix (Nix build).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 16:55:55 -07:00
61f02a0335 Localize Alloy container image (#300)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (alloy) (push) Successful in 14s
Build Container / build (alloy) (push) Successful in 38m34s
## Summary

- Add `containers/alloy/` with dual Dockerfile + Nix build files for Grafana Alloy v1.14.0
- Both builds fetch source from forge mirror (`forge.ops.eblu.me/mirrors/alloy.git`), build the web UI (Node), then compile the Go binary with `netgo embedalloyui` tags
- Update all three alloy deployments (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail) to use `registry.ops.eblu.me/blumeops/alloy`
- `promtail_journal_enabled` tag omitted — requires systemd headers and none of our configs use `loki.source.journal`

## Build verification

- **Dockerfile:** Tested locally via `docker build`, binary reports `v1.14.0` with correct tags
- **Nix:** Tested on ringtail via `nix-build`, all three hashes (fetchgit, npmDeps, goModules) resolved and build succeeds

## Post-merge steps

1. Wait for CI to build the container from main (both Dockerfile and Nix workflows)
2. `mise run container-list alloy` to find the `[main]` tagged image
3. C0 follow-up to update `newTag` in all three kustomizations from `v1.14.0-placeholder` to the real tag
4. Sync ArgoCD apps and verify pods come up healthy

Reviewed-on: #300
2026-03-17 16:42:53 -07:00
Forgejo Actions
cdba9dca96 Update docs release to v1.14.2
- Built changelog from towncrier fragments

[skip ci]
2026-03-17 13:24:13 -07:00
1f000c8e39 Add last-updated subsort to docs-review, review gilbert card
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 13:22:01 -07:00
995478b91f Review jellyfin and automounter services
Both services current: jellyfin 10.11.6 (latest upstream),
automounter 1.11.0 (Mac App Store). Add missing frigate share
to automounter docs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 13:06:23 -07:00
e5ce510fdc Fix plan-a-meal random recipe API queries
Mealie's orderBy=random requires a paginationSeed parameter, otherwise
the API returns 422. Added the seed to all random query examples.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:10:48 -07:00
6d7597670e Add plan-a-meal how-to for Mealie cooking timelines
Agent-facing guide for generating unified cooking timelines from
Mealie meal plans. Covers querying the API, picking balanced meals
(protein/carb/vegetable), and interleaving recipe steps into a
relative timeline so everything finishes together.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 11:07:16 -07:00
3602ed7781 Add OpenAI integration to Mealie
Enable recipe parsing from images/photos, ingredient extraction, and
URL scraping via OpenAI API (gpt-4o). Rename ExternalSecret from
mealie-oidc to mealie-secrets to hold both OIDC and OpenAI credentials.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 07:12:51 -07:00
c2a1e168bd Update Mealie container tag to main build
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 21:59:48 -07:00
11330ebea0 Deploy Mealie recipe manager (#299)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (mealie) (push) Successful in 2s
Build Container / build (mealie) (push) Successful in 8s
## Summary

- Deploy Mealie (self-hosted recipe manager) on minikube-indri via ArgoCD
- Build container from source via forge mirror (`mirrors/mealie`) — multi-stage Dockerfile with Node.js frontend + Python/uv backend
- Add Caddy proxy entry for `meals.ops.eblu.me`
- Part of a larger meal planning pipeline: Mealie stores categorized recipes, a planner script selects balanced meals, and Ollama generates unified cooking timelines

## Status

- [x] Mirror mealie repo on forge
- [x] Dockerfile (from-source build)
- [x] ArgoCD app + k8s manifests
- [x] Caddy proxy entry
- [x] Service docs, routing table, app registry
- [ ] Local Dagger build test
- [ ] Container build + push to registry
- [ ] Update kustomization.yaml with real image tag
- [ ] Deploy and verify
- [ ] Provision Caddy

## Test plan

- Build container locally via `dagger call build --src=. --container-name=mealie`
- Trigger CI build via `mise run container-build-and-release mealie`
- Deploy from branch: `argocd app set mealie --revision deploy-mealie && argocd app sync mealie`
- Verify Mealie UI at `https://meals.ops.eblu.me`
- Verify API docs at `https://meals.ops.eblu.me/docs`

Reviewed-on: #299
2026-03-16 21:59:10 -07:00
b54d87e071 Fix shell syntax error in unpoller dashboard initcontainer
Comments can't appear inside a for-in list in sh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 15:59:03 -07:00
b0846ab5fa Update unpoller container tag to main build
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 15:54:03 -07:00
4dc3e5cae2 Add UnPoller for UniFi network metrics (#298)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (unpoller) (push) Successful in 2s
Build Container / build (unpoller) (push) Successful in 7s
## Summary
- Deploy UnPoller as a k8s service on indri to export UniFi controller metrics to Prometheus
- Custom-built container from forge mirror (`containers/unpoller/Dockerfile`)
- Credentials pulled from 1Password via external-secrets
- Prometheus scrape job added, docs and service-versions updated

## Test plan
- [ ] Build container: `mise run container-release unpoller v2.34.0`
- [ ] Update kustomization tag with built image tag
- [ ] Deploy from branch: `argocd app set unpoller --revision feature/unpoller && argocd app sync unpoller`
- [ ] Verify pod connects to UX7 controller (check logs)
- [ ] Confirm `unpoller` target appears in Prometheus
- [ ] Query `unifi_` metrics in Grafana

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #298
2026-03-16 15:52:45 -07:00
a29ced71b5 Upgrade borgmatic 2.0.13 → 2.1.3 (#297)
## Summary
- Upgraded borgmatic from 2.0.13 to 2.1.3 on indri (via mise/pipx)
- Key changes: improved borg warning handling, memory/performance improvements, `source_directories_must_exist` now defaults to true (already set in our config)
- Verified: config validates, dry-run passed against both sifaka (local) and borgbase (offsite) repos

## Borg Warnings Investigation
The main concern was 2.1.0's change to treat borg warnings as errors. In 2.1.3 this was partially reverted — "file not found" warnings (exit code 107) are back to being warnings. Our config already sets `source_directories_must_exist: true`, and all four source directories were verified present on indri.

## Test plan
- [x] `borgmatic --version` confirms 2.1.3
- [x] `borgmatic config validate` passes
- [x] `borgmatic create --dry-run` succeeds against both repositories
- [x] All source directories verified present on indri
- [ ] Verify next scheduled backup (2:00 AM) completes successfully

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #297
2026-03-16 11:05:24 -07:00
0f5377568d Review operations docs: add last-reviewed dates and improve troubleshooting
Mark run-1password-backup and troubleshooting as reviewed. Troubleshooting
gets inline wiki-links for all referenced services, a new ringtail/k3s
section, and a cross-reference to restart-indri.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 07:38:02 -07:00
f46a04b902 Restructure docs: consolidate, recategorize, and extract
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
- Consolidate 4 Authentik Nix derivation docs into one card
  (authentik-nix-build-components.md)
- Merge build-grafana-container + build-grafana-sidecar into
  build-grafana-images.md
- Move agent-change-process from how-to/ to explanation/ (it's a
  methodology doc, not a task guide)
- Extract Caddy custom build section from reference card into
  how-to/deployment/build-caddy-with-plugins.md
- Move expose-service-publicly from how-to/ to tutorials/ (it's a
  comprehensive walkthrough, not a quick task reference)
- Update all wiki-link references across affected docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:55:59 -07:00
ac01c2d6e2 Fix stale docs and shell quoting in devpi start script
- ArgoCD ref: correct Git Source URL to forge.ops.eblu.me:2222
- Authentik ref: add Zot as active OIDC client, blueprint, and secret
- Federated login: remove Zot from Future Work (completed in PR #236)
- devpi/start.sh: use bash array for command building (proper quoting)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 19:25:27 -07:00
5b008a6ab6 Document ai-sources in AI guide, change process, and mise-tasks ref
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 18:43:39 -07:00
d121006086 Exclude docs from ai-sources, mention ai-sources in CLAUDE.md
ai-sources now skips docs/ to avoid duplicating ai-docs output.
CLAUDE.md notes ai-sources as available for deep context on problems
with a large surface area.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 18:40:35 -07:00
9566d8fec9 Add ai-sources task, update ai-docs to include all docs
ai-sources: concatenates all git-tracked files (excluding lock files
and other non-useful artifacts) for full codebase AI context (~353K
tokens).

ai-docs: now concatenates all docs instead of a curated subset (~85K
tokens).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 18:37:50 -07:00
4ca2e39901 Externalize TeslaMate dashboards to forge mirror (#296)
## Summary
- Replaces 18 TeslaMate dashboard ConfigMaps (713 KB / 22,080 lines) with a Grafana init container
- Init container fetches dashboard JSON directly from `mirrors/teslamate` on forge, pinned to `v3.0.0`
- Grafana's file provider picks them up from `/tmp/dashboards/TeslaMate/` via `foldersFromFilesStructure`
- Non-TeslaMate dashboards remain as ConfigMaps (unchanged)

## How it works
- New `init-teslamate-dashboards` init container uses busybox `wget` to fetch each JSON file from `https://forge.eblu.me/mirrors/teslamate/raw/tag/v3.0.0/grafana/dashboards/`
- Files land in `/tmp/dashboards/TeslaMate/`, same emptyDir volume the sidecar uses
- The sidecar continues to handle ConfigMap-based dashboards; the init container handles TeslaMate
- Version pin is in the init container args (TESLAMATE_VERSION)

## Deployment and Testing
- [ ] Sync `grafana` app from branch — verify init container runs and fetches dashboards
- [ ] Sync `grafana-config` app from branch — verify TeslaMate ConfigMaps are pruned
- [ ] Check Grafana UI: TeslaMate folder should still contain all 18 dashboards
- [ ] Verify non-TeslaMate dashboards are unaffected
- [ ] After merge: sync both apps from main

Reviewed-on: #296
2026-03-15 18:31:19 -07:00
1f0308bbd2 Fix Caddy v2.11 Host header rewrite breaking proxied services
Caddy v2.11 (#7454) auto-rewrites the Host header to match the
upstream address for HTTPS backends. This causes services behind
Tailscale Ingress to see *.tail8d86e.ts.net instead of *.ops.eblu.me,
breaking Authentik OAuth flows, Homepage host validation, and other
services that check the Host header.

Only apply header_up for HTTPS backends (Tailscale Ingress); HTTP
backends (forge, registry, jellyfin, sifaka) are unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 18:28:18 -07:00
2bea048dbf Externalize Tailscale operator to forge mirror (#295)
## Summary
- Mirrors `tailscale/tailscale` on forge (`mirrors/tailscale`)
- Replaces vendored `operator.yaml` (495 KB / 5,386 lines) with ArgoCD apps sourcing the upstream static manifest, pinned via `targetRevision: v1.94.2`
- Adds `tailscale-operator-base` app for indri and `tailscale-operator-base-ringtail` for ringtail
- Local kustomization retains only ProxyClass and DNSConfig custom resources
- Updates `[[tailscale-operator]]` doc to reflect new sourcing

## Deployment and Testing
- [ ] Register `mirrors/tailscale` repo in ArgoCD (it needs to know about the new repo)
- [ ] Sync `apps` app to pick up the new `tailscale-operator-base` app definitions
- [ ] Sync `tailscale-operator-base` — verify CRDs, RBAC, operator Deployment come up
- [ ] Sync `tailscale-operator` — verify ProxyClass, DNSConfig still apply cleanly
- [ ] Verify existing Tailscale Ingresses still work (ProxyGroup pods healthy)
- [ ] Repeat for ringtail cluster
- [ ] After merge: apps already point at tags, no revision reset needed

Reviewed-on: #295
2026-03-15 17:44:35 -07:00
272ea1e767 Upgrade Caddy v2.10.2 → v2.11.2, fix forge mirrors (#294)
## Summary
- Upgrade Caddy from v2.10.2 to v2.11.2 (7 CVE fixes across v2.11.1 and v2.11.2)
- Create `mirrors/caddy-l4` forge mirror for Layer 4 plugin
- Migrate all `~/code/3rd` clones on indri from `localhost:3001` to HTTPS `forge.ops.eblu.me/mirrors/` remotes
- Remove stale clones (`apple-silicon-detector`, `whisper.cpp`)
- Update caddy docs and service-versions tracking

## CVEs Fixed
- CVE-2026-27585 through CVE-2026-27590 (path/host bypass, TLS fail-open, FastCGI issues)
- Forward auth identity injection (privilege escalation)
- `vars_regexp` placeholder secret exposure
- Built on Go 1.26.1 (patches Go-level CVEs)

## What was done on indri (not in repo)
- `xcaddy build` with Gandi DNS + Layer 4 plugins → `~/code/3rd/caddy/bin/caddy` now v2.11.2
- Remotes updated: caddy, forgejo-runner, zot → `https://forge.ops.eblu.me/mirrors/*.git`
- Deleted: `~/code/3rd/apple-silicon-detector`, `~/code/3rd/whisper.cpp`

## Deployment and Testing
- [x] Ansible dry-run passed (`--tags caddy --check --diff`)
- [ ] Restart caddy LaunchAgent to pick up the new binary
- [ ] Verify all proxied services respond via `*.ops.eblu.me`
- [ ] Run `mise run services-check`

Reviewed-on: #294
2026-03-15 10:33:48 -07:00
4d195f7fb4 Review restore-1password-backup doc: fix offsite TBD, clarify archive name, add BorgBase to backups
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-15 10:13:07 -07:00
Forgejo Actions
cb95db0bc9 Update docs release to v1.14.1
- Built changelog from towncrier fragments

[skip ci]
2026-03-14 10:11:06 -07:00
8b3b17d555 Review restart-indri doc: fix Caddy/Jellyfin service management, fix docs-preview path handling
- Caddy is now a mcquack LaunchAgent, not brew services
- Add missing Jellyfin and Caddy to shutdown commands and autostart list
- docs-preview: accept paths with or without docs/ prefix

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-14 10:09:38 -07:00
53d620365a Bump zot registry to v2.1.15 (#293)
## Summary
- Upgrade zot OCI registry from v2.1.13 to v2.1.15 on indri
- Addresses CVE-2025-30204 (golang-jwt memory) and open redirect via callback_ui
- No config template changes needed (externalUrl is auto-allowlisted)
- Requires Go 1.25.7 (bump from 1.25.6 via mise)

## Data Safety
- Data directory ~/erichblume/zot is NOT touched during build or deploy
- No schema migrations in v2.1.14 or v2.1.15
- Storage format remains OCI spec 1.1.0

## Deployment Steps
- [ ] SSH to indri: bump Go to 1.25.7 via `mise use go@1.25.7`
- [ ] Fetch and checkout v2.1.15 in ~/code/3rd/zot
- [ ] Build: `mise x -- make binary`
- [ ] Restart LaunchAgent
- [ ] Verify: `curl -s http://localhost:5050/v2/` returns 200
- [ ] Verify: `curl -s https://registry.ops.eblu.me/v2/_catalog` lists repos
- [ ] Verify: `mise run services-check`

Reviewed-on: #293
2026-03-14 10:00:40 -07:00
ab8ea6f301 Bump Grafana Alloy to v1.14.0 (#292)
## Summary
- Bump alloy-k8s, alloy-ringtail, and alloy-tracing-ringtail image tags from v1.13.1 to v1.14.0
- Mark indri alloy (ansible) as reviewed at v1.14.0 — source rebuild from forge mirror needed
- Add missing alloy-ringtail entry to service-versions.yaml
- Update alloy reference doc

## Breaking changes reviewed
- `loki.secretfilter` options removed — not used in our configs
- OTel Collector upgraded to v0.142.0 — Kafka receiver changes don't affect us
- Exporter queue default changes — our tracing pipeline (Beyla → batch → otlphttp) uses simple config, low risk

## Deployment and Testing
- [ ] Sync alloy-k8s: `argocd app set alloy-k8s --revision bump/alloy-v1.14.0 && argocd app sync alloy-k8s`
- [ ] Sync alloy-ringtail: `argocd app set alloy-ringtail --revision bump/alloy-v1.14.0 --server ringtail-argocd && argocd app sync alloy-ringtail`
- [ ] Sync alloy-tracing-ringtail similarly
- [ ] Verify metrics flowing in Grafana
- [ ] Verify traces flowing to Tempo (ringtail)
- [ ] Rebuild indri alloy from source (`v1.14.0` tag on forge mirror), SCP to indri, restart
- [ ] After merge: reset ArgoCD revisions to main, re-sync

Reviewed-on: #292
2026-03-13 16:25:27 -07:00
4c5e7d763d Review deploy-jobsync doc: add missing env var, update tag example
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 15:45:07 -07:00
30fbb87c22 Update ringtail flake inputs (nixpkgs, home-manager, disko)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 15:32:09 -07:00
c26026f4e9 Bump Ollama memory to 24Gi and enable flash attention
The 27B Q4_K_M model needs ~7.3 GiB system RAM for CPU-offloaded layers
but only 6.8 GiB was available within the 22Gi cgroup. Bumping to 24Gi
and enabling flash attention (reduces KV cache memory) should provide
enough headroom.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 20:33:22 -07:00
6d4929a66c Add qwen3.5:27b to Ollama and bump memory limit to 22Gi
The 27B Q4_K_M model is ~17 GB, exceeding the 16 GB VRAM on the RTX 4080
by ~1 GB. Ollama will offload a few layers to CPU RAM, so the pod memory
limit needs headroom beyond the previous 16Gi.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:55:51 -07:00
40f1568088 Remove unused Mosquitto MQTT broker from ringtail
Mosquitto has been dormant since frigate-notify switched from MQTT to
webapi polling (529ba10). Tear down live infra (ArgoCD app, namespace)
and remove all manifests, service-versions entry, services-check, and
doc references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:37:31 -07:00
009196f6c1 Fix op-backup: auto-detect .1pux exports with suffixed filenames
1Password adds account ID and timestamp to export filenames. The script
now globs ~/Documents for .1pux files instead of expecting a fixed name.
Also fixes a Rich markup error with bracket characters in the prompt.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:23:24 -07:00
8b9cc4effd Add how-to card for running 1Password backup
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:17:45 -07:00
d5a92fead8 Review build-jobsync-container, refine docs-preview tooling
- Review build-jobsync-container.md: fix nonexistent `mirror-sync` task
  reference (Forgejo mirrors sync automatically), mark reviewed
- Remove bat hint from docs-review checklist (output not visible in
  agent sessions), keep docs-preview hint as user-facing step
- Simplify review-documentation.md visual preview section
- Fix Python 3.14 tarfile deprecation warning in docs-preview

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:11:34 -07:00
d01a165b91 Add docs-preview task and visual preview step to doc review
New `mise run docs-preview <card>` task builds docs via Dagger and serves
them locally in the production quartz container (image parsed from ArgoCD
kustomization), opening the browser directly to the specified card.
Container auto-cleans after 1 hour.

Also updates docs-review checklist and review-documentation how-to to
reference the visual preview workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:04:01 -07:00
87d4de244b Review jobsync: add to services-check and homepage (#291)
## Summary
- Add jobsync pod check (ringtail k3s) and HTTP endpoint to `services-check`
- Add JobSync entry to homepage dashboard under new "Apps" group
- Mark jobsync as reviewed at v1.1.4 (current with upstream)
- Changelog fragment added

## Deployment and Testing
- [ ] Sync homepage app from branch: `argocd app set homepage --revision review/jobsync && argocd app sync homepage`
- [ ] Verify JobSync appears on go.ops.eblu.me dashboard
- [ ] Run `mise run services-check` to verify new checks pass
- [ ] After merge: `argocd app set homepage --revision main && argocd app sync homepage`

Reviewed-on: #291
2026-03-11 17:36:51 -07:00
Forgejo Actions
ebba3d6e5b Update docs release to v1.14.0
- Built changelog from towncrier fragments

[skip ci]
2026-03-09 12:03:30 -07:00
0ef5fe5792 Update docs container to v1.28.2-4f0476a (SPA disabled)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 12:00:54 -07:00
4f0476a851 Fix spider trap: disable SPA mode, remove index files, relax wiki-links (#290)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container (Nix) / detect (push) Successful in 1s
Build Container (Nix) / build (quartz) (push) Successful in 1s
Build Container / build (quartz) (push) Successful in 10s
## Summary

Fixes the Facebook crawler spider trap that's been generating infinite recursive URLs like `/how-to/tutorials/tutorials/how-to/explanation/...` for several days.

**Root cause:** Quartz SPA mode + nginx `try_files` fallback to `index.html` meant any fabricated URL returned the root HTML shell with HTTP 200. Crawlers followed relative links from those fake URLs, creating infinite recursion.

**Fix:**
- Disable Quartz SPA mode (`enableSPA: false`) — all pages are now fully static HTML
- Replace nginx SPA fallback with `=404` + Quartz's static `404.html`
- Remove `robots.txt` exclusions (no longer needed)

**Docs cleanup (Obsidian.nvim compat no longer needed):**
- Delete hand-curated category index files (`tutorials.md`, `reference.md`, `how-to.md`, `explanation.md`) — Quartz auto-generates folder pages
- Delete `postgresql-storage.md` (redirect stub) and `migrate-forgejo-from-brew.md` (stale history)
- Drop `docs-check-index` and `docs-check-filenames` prek hooks
- Rewrite `docs-check-links` to allow path-based wiki-links (`[[path/to/file]]`) and only error on true ambiguity
- Add `ai-docs` doc tree listing to replace index files for AI context
- Add natural cross-links from reference cards to fix orphan docs

## Deployment and Testing

- [ ] Merge and let the build pipeline run
- [ ] Verify docs.eblu.me serves pages correctly with full page loads
- [ ] Verify non-existent URLs return 404
- [ ] Monitor crawler traffic — should drop to near zero for fabricated URLs

Reviewed-on: #290
2026-03-09 11:59:43 -07:00
953640d2b7 Deploy docs with fixed robots.txt (v1.28.2-ede9a51)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 20:21:05 -07:00
ede9a51394 Fix robots.txt: block /explore/ and /tags/ (was /explorer/)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 3s
Build Container (Nix) / build (quartz) (push) Successful in 1s
Build Container / build (quartz) (push) Successful in 16s
The previous robots.txt had a typo blocking /explorer/ instead of
/explore/, allowing Facebook's crawler to hit the spider trap.
Also block /tags/ which has the same infinite relative-link issue.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 19:57:45 -07:00
770a7b2d6a Add JobSync reference card, observability docs, and RAPIDAPI_KEY plumbing (#289)
## Summary
- Add JobSync service reference card (`docs/reference/services/jobsync.md`) with architecture, secrets, observability, and JSearch API docs
- Add JobSync and Ollama to ringtail's workloads table (both were missing)
- Add JobSync to the reference index
- Wire `RAPIDAPI_KEY` through ExternalSecret and deployment env var for JSearch job search automation
- Document Loki log queries for observability (no metrics endpoint exists)
- Update deploy-jobsync how-to with new env var, observability section, and reference card link

## Deployment and Testing
- [ ] Sign up for RapidAPI JSearch API (free tier: 500 req/month)
- [ ] Add `rapidapi_key` field to "JobSync" 1Password item
- [ ] Merge PR
- [ ] `argocd app sync jobsync` to pick up new env var
- [ ] Verify job search works at https://jobsync.ops.eblu.me/dashboard/automations

Reviewed-on: #289
2026-03-08 15:06:52 -07:00
c9270c7645 Update jobsync image to v1.1.4-3a811fb-nix (main build)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 11:13:34 -07:00
3a811fb188 Deploy JobSync — job search tracker on ringtail k3s (#288)
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 2s
Build Container / build (jobsync) (push) Successful in 2s
Build Container (Nix) / build (jobsync) (push) Successful in 8s
## Summary

C2 Mikado chain to deploy [JobSync](https://github.com/Gsync/jobsync) — a self-hosted job application tracker — to ringtail's k3s cluster.

### Mikado Graph

```
deploy-jobsync (goal)
├── build-jobsync-container
│   └── mirror-jobsync
└── integrate-jobsync-ollama
```

### What is JobSync?

Next.js app with SQLite for tracking job applications. Features resume management, application pipeline tracking, and AI-powered resume review/job matching.

### Key Decisions

- **Ringtail k3s** (not minikube-indri) — colocates with Ollama for zero-latency AI
- **Nix container** via `buildLayeredImage` — no Dockerfile, mirrors upstream source on forge
- **Ollama for AI** — uses existing deployment, no API keys needed for AI features
- **No upstream fork** — vanilla JobSync, Anthropic AI deferred to future work if needed

### Current Status

Planning phase — cards committed, ready for review before implementation begins.

Reviewed-on: #288
2026-03-08 11:02:05 -07:00
1c3bf35dad Fix mikado invariant check rejecting close without impl
A close commit with zero preceding impl commits is valid — some leaf
nodes involve operational steps (e.g., creating a mirror) with no code
changes. Removed the false-positive check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 20:41:03 -08:00
14e931591b Fix 1Password Connect numeric log levels misclassified in Grafana (#287)
## Summary
- 1Password Connect uses non-standard numeric log levels (`1`=error, `2`=warn, `3`=info, `4`=debug, `5`=trace) per [1Password/connect#44](https://github.com/1Password/connect/issues/44)
- Alloy extracts the `level` JSON field as-is, so info-level health checks get `level="3"` in Loki
- Grafana expects string level labels — numeric values are unrecognized, causing misclassified log severity/coloring
- Adds a `stage.match` + `stage.template` in the Alloy pipeline scoped to `{namespace="1password"}` to normalize numeric levels to standard strings
- Other services are completely unaffected (scoped by namespace, not global)

## Deployment and Testing
- [ ] Sync alloy-k8s from branch: `argocd app set alloy-k8s --revision fix/onepassword-numeric-log-levels && argocd app sync alloy-k8s`
- [ ] Wait ~2 minutes for new logs to flow
- [ ] Verify level labels: `curl -sG "http://localhost:3100/loki/api/v1/label/level/values" --data-urlencode 'query={namespace="1password"}'` should show `"info"` and `"warn"` instead of `"3"` and `"2"`
- [ ] Check Grafana log panel for 1password namespace — logs should no longer appear as errors
- [ ] After merge: `argocd app set alloy-k8s --revision main && argocd app sync alloy-k8s`

Reviewed-on: #287
2026-03-07 13:57:04 -08:00
d3f9699c41 Review cv and docs services — both healthy, no upgrades needed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 09:10:16 -08:00
0e09521ce3 Review manage-flyio-proxy.md — no issues found
Add last-reviewed date. Content is accurate and complete.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 09:03:46 -08:00
6a033d55be Review and update review-services.md
- Add last-reviewed date
- Align service type sections with actual types (argocd/ansible/nixos)
- Remove nonexistent "Helm Chart" and "Hybrid" sections
- Fold custom container guidance into ArgoCD section
- Reference kustomization.yaml for image tags instead of Helm charts

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 09:03:08 -08:00
e47a3b2ebb Review and update review-documentation.md
- Add last-reviewed date
- Replace raw pulumi commands with mise task equivalents
- Reference C0/C1/C2 change classification for making changes
- Note that prek handles link validation automatically

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:59:51 -08:00
590cb1d25d Document required preview directory for Frigate NFS volume
Frigate 0.17 does not auto-create clips/previews/<camera>/, causing
review page previews to silently fail with 500 errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:46:23 -08:00
Forgejo Actions
2809ba6f50 Update docs release to v1.13.3
- Built changelog from towncrier fragments

[skip ci]
2026-03-06 20:49:01 -08:00
55013db124 Add changelog fragment for Dagger v0.20.1 upgrade
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 20:42:00 -08:00
b793299d6d Upgrade Dagger engine from v0.20.0 to v0.20.1
Phase 2 of Dagger upgrade: bump engine version, update runner
deployment to v0.20.1-24f7512, and fix docs reference card version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 20:41:02 -08:00
24f7512d59 Bump runner-job-image Dagger CLI from 0.20.0 to 0.20.1
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 3s
Build Container (Nix) / build (runner-job-image) (push) Successful in 2s
Build Container / build (runner-job-image) (push) Successful in 2m28s
Phase 1 of Dagger upgrade: update the CLI in the runner container first
so CI can build the new image with the old engine version. See
[[upgrade-dagger]] for the full procedure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 20:32:05 -08:00
ba7236ade0 Add how-to guide for upgrading Dagger
Documents the correct two-phase upgrade procedure to avoid the
chicken-and-egg problem where CI can't build its own replacement.
Also fixes outdated version references in the Dagger reference card.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 20:31:30 -08:00
Forgejo Actions
e95fb9a555 Update docs release to v1.13.2
- Built changelog from towncrier fragments

[skip ci]
2026-03-06 19:03:24 -08:00
a7c21bd8a6 Update docs quartz container to v1.28.2-b64010b
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 18:58:40 -08:00
b64010b3c7 Replace spider-trap nginx 404s with robots.txt disallowing /explorer/
All checks were successful
Build Container (Nix) / detect (push) Successful in 3s
Build Container / detect (push) Successful in 3s
Build Container (Nix) / build (quartz) (push) Successful in 2s
Build Container / build (quartz) (push) Successful in 9s
The /explorer/ SPA endpoints were the source of all spider-trap traffic.
A robots.txt Disallow is a better fix than serving 404s — it prevents
crawlers from entering the infinite URL tree in the first place, avoids
serving large numbers of 404s that hurt SEO, and doesn't break legitimate
deep links.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 18:34:37 -08:00
Forgejo Actions
8b0ff3d7a5 Update docs release to v1.13.1
- Built changelog from towncrier fragments

[skip ci]
2026-03-06 10:00:42 -08:00
1537412c09 Update docs quartz container to v1.28.2-6636576
Picks up spider-trap nginx guards from 6636576.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 09:52:31 -08:00
6636576cdc Add spider-trap guards to docs.eblu.me Quartz nginx config
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (quartz) (push) Successful in 1s
Build Container / build (quartz) (push) Successful in 12s
Block recursive crawler paths caused by SPA fallback + relative links:
/tags/ depth >1 returns 404, global depth ≥5 returns 404.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 09:43:41 -08:00
6e8d11c6bb Add :kustomized sentinel tag to manifest images, review devpi
Bare image references in manifests were ambiguous — unclear whether the
tag was intentionally omitted or managed by kustomize. Add :kustomized
sentinel to all 37 image refs overridden by kustomize images transformer.
Add sync notes for tailscale-operator proxyclass (CRD fields not processed
by kustomize). Mark devpi reviewed (6.19.1 is current).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 08:15:06 -08:00
2ac1a1abc2 Update ringtail flake inputs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 07:54:51 -08:00
6d84fcfb05 Review how-to index: strip prose, add last-reviewed
Removed descriptions, table formatting, and Mikado chain commentary
from the how-to index — it should be links only. Added last-reviewed
date.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 07:52:06 -08:00
Forgejo Actions
d98ef984ea Update docs release to v1.13.0
- Built changelog from towncrier fragments

[skip ci]
2026-03-05 11:11:38 -08:00
46cc3fbc2e Update forgejo-runner job image to v0.20.0-448689b
Built locally to break the chicken-and-egg: the old runner couldn't
build its own replacement because it needed Dagger 0.20.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 11:05:21 -08:00
448689bf2a Bump runner-job-image Dagger CLI from 0.19.11 to 0.20.0
Some checks failed
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 1s
Build Container (Nix) / build (runner-job-image) (push) Successful in 2s
Build Container / build (runner-job-image) (push) Failing after 2s
The Dagger module was upgraded to v0.20.0 in d15071a but the runner job
image still had the old CLI, causing build-blumeops to fail with a
version mismatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 10:58:38 -08:00
c281fb5403 Add OpenTelemetry distributed tracing (Tempo + Beyla eBPF) (#286)
## Summary

Adds the third observability pillar — **distributed tracing** — alongside existing metrics (Prometheus) and logs (Loki).

- **Grafana Tempo 2.10.1** on minikube-indri for trace storage with 7d retention, OTLP receivers, and `metrics_generator` that remote-writes span-metrics (RED) to Prometheus
- **Beyla eBPF auto-instrumentation** via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes
- **Grafana integration** — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking
- **Prometheus** scrapes Tempo operational metrics

### Architecture

```
ringtail (k3s)                                indri (minikube)
┌──────────────────────┐                      ┌─────────────────────┐
│ Alloy+Beyla (eBPF)   │──OTLP HTTP────────→ │ Tempo               │
│  ↳ Frigate, ntfy,    │  via tailnet         │  ↳ trace storage    │
│    Ollama, Immich     │                      │  ↳ RED → Prometheus │
└──────────────────────┘                      │                     │
                                              │ Grafana             │
                                              │  ↳ Tempo datasource │
                                              └─────────────────────┘
```

### New files (12)
- `docs/reference/services/tempo.md` — reference doc
- `docs/changelog.d/feature-otel-tracing.feature.md`
- `argocd/apps/tempo.yaml` + `argocd/manifests/tempo/` (6 files)
- `argocd/apps/alloy-tracing-ringtail.yaml` + `argocd/manifests/alloy-tracing-ringtail/` (4 files)

### Modified files (6)
- `argocd/manifests/grafana/datasources.yaml` — Tempo datasource + Loki derivedFields
- `argocd/manifests/prometheus/prometheus.yml` — Tempo scrape target
- `service-versions.yaml` — tempo + alloy-tracing-ringtail entries
- `docs/reference/services/grafana.md` — Tempo in datasources table
- `docs/reference/reference.md` — Tempo in services index
- `docs/reference/operations/observability.md` — Tempo in components list

## Deployment and Testing

- [ ] Sync `apps` app to pick up new Application definitions
- [ ] `argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo`
- [ ] Verify Tempo pod: `kubectl --context=minikube-indri get pods -n monitoring -l app=tempo`
- [ ] Verify Tempo ready: port-forward 3200 and `curl localhost:3200/ready`
- [ ] Verify Tailscale ingresses: `kubectl --context=minikube-indri get ingress -n monitoring`
- [ ] `argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail`
- [ ] Check Beyla discovery in alloy-tracing logs on ringtail
- [ ] Sync grafana-config for updated datasources
- [ ] Sync prometheus for updated scrape config
- [ ] Test Grafana Tempo datasource connection
- [ ] Generate test traffic and search traces in Grafana Explore → Tempo
- [ ] After merge: reset all ArgoCD app revisions back to main

Reviewed-on: #286
2026-03-05 10:51:07 -08:00
d15071aaf9 Upgrade Dagger from v0.19.11 to v0.20.0 (#285)
## Summary
- Bump Dagger engine version from v0.19.11 to v0.20.0 in `dagger.json`
- Pin dagger CLI to `0.20.0` in `mise.toml` (was `"latest"`)
- Regenerated `.dagger/uv.lock` (new SDK deps: httpcore, beartype bump)

## Testing
- [x] `dagger call validate-workflows --src=.` passes on v0.20.0
- [ ] CI build workflow passes

Reviewed-on: #285
2026-03-05 09:32:13 -08:00
7bddc78c8a Add ExternalSecret default fields to prevent ArgoCD drift
The external-secrets operator adds conversionStrategy, decodingStrategy,
and metadataPolicy defaults to the live object, causing perpetual
OutOfSync in ArgoCD. Declare them explicitly to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 09:11:23 -08:00
405fc59c12 Add Authentik OIDC login for ArgoCD (#284)
## Summary
- Add Authentik OAuth2 provider + application blueprint for ArgoCD (ringtail side)
- Add OIDC config to ArgoCD ConfigMap with Authentik as identity provider (indri side)
- Map Authentik `admins` group to ArgoCD `role:admin` via RBAC policy
- ExternalSecrets on both sides pull `argocd-client-secret` from 1Password
- Local admin password remains as break-glass — both login methods coexist

## Pre-deployment manual step
Add `argocd-client-secret` field to "Authentik (blumeops)" in 1Password with a random value (e.g., `openssl rand -hex 32`).

## Deployment order
1. Sync Authentik app on ringtail first (blueprint + secret + worker env var)
2. Sync ArgoCD app on indri second (cm, rbac, ExternalSecret)

## Verification
- [ ] `argocd-client-secret` field added to 1Password
- [ ] Authentik app synced on ringtail — blueprint applied, provider created
- [ ] ArgoCD app synced on indri — OIDC config applied
- [ ] SSO login works: visit `https://argocd.ops.eblu.me` → "Log in via Authentik" → admin access
- [ ] Break-glass: local admin/password login still works

Reviewed-on: #284
2026-03-05 09:07:25 -08:00
c029e5851a Review migrate-forgejo-from-brew doc, fix stale Phase 3 reference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:29:58 -08:00
91c755ddd6 Pin kiwix-serve image tag to v3.8.2-f6f0f79
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:17:40 -08:00
92364f7305 Remove suggestion to run prek manually from README
Hooks run automatically on git commit; no need to invoke separately.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:15:25 -08:00
f6f0f79a5b Bump kiwix-serve from 3.8.1 to 3.8.2
All checks were successful
Build Container (Nix) / detect (push) Successful in 4s
Build Container (Nix) / build (kiwix-serve) (push) Successful in 3s
Build Container / detect (push) Successful in 1m57s
Build Container / build (kiwix-serve) (push) Successful in 1m15s
Minor upstream release with doc and CI fixes. Also corrects kiwix.md
to reference the actual custom registry image and torrents.txt path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:12:32 -08:00
75814e032c Pin transmission-exporter image tag to v1.0.1-c93448f
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:05:17 -08:00
c93448f951 Bump transmission-exporter to v1.0.1
All checks were successful
Build Container / detect (push) Successful in 1s
Build Container (Nix) / detect (push) Successful in 2s
Build Container (Nix) / build (transmission-exporter) (push) Successful in 2s
Build Container / build (transmission-exporter) (push) Successful in 6s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:04:26 -08:00
797133b28e Fix per-torrent rate panels showing cumulative bytes instead of rates
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (transmission-exporter) (push) Successful in 2s
Build Container / build (transmission-exporter) (push) Successful in 38s
Dashboard "Download/Upload Rate by Torrent" panels were querying
transmission_torrent_download_bytes (total_size * percent_done) and
transmission_torrent_upload_bytes (uploaded_ever) — cumulative byte
gauges, not rates. Added new metrics using Transmission's native
rate_download/rate_upload and updated dashboard queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:01:37 -08:00
6ae18cde1e Pin transmission-exporter image tag to v1.0.0-f2704b2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 21:55:59 -08:00
f2704b26da Replace transmission-exporter with homegrown Python exporter (#283)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (transmission-exporter) (push) Successful in 2s
Build Container / build (transmission-exporter) (push) Successful in 19s
## Summary
- Replace unmaintained `metalmatze/transmission-exporter:master` sidecar with a homegrown Python exporter
- Uses `prometheus_client` + `transmission-rpc` with collect-on-scrape pattern (fresh metrics per scrape, no stale labels)
- Same metric names so existing Grafana Transmission dashboard works unchanged
- Container built with `uv` for dependency management, follows `grafana-sidecar` Dockerfile pattern

## Changes
- **New:** `containers/transmission-exporter/exporter.py` — single-file exporter (~130 lines)
- **New:** `containers/transmission-exporter/Dockerfile` — multi-stage Alpine build with uv
- **Modified:** `argocd/manifests/torrent/deployment.yaml` — swap sidecar image reference
- **Modified:** `argocd/manifests/torrent/kustomization.yaml` — add image tag entry
- **Modified:** `service-versions.yaml` — add transmission-exporter entry

## Deployment and Testing
- [ ] Build container: `mise run container-build-and-release transmission-exporter`
- [ ] Update kustomization.yaml newTag with build SHA
- [ ] Branch deploy: `argocd app set torrent --revision feature/transmission-exporter-python && argocd app sync torrent`
- [ ] Verify metrics: `kubectl -n torrent --context=minikube-indri port-forward svc/transmission 19091:19091` then `curl localhost:19091/metrics | grep transmission_`
- [ ] Verify Grafana Transmission dashboard panels populate
- [ ] After merge: `argocd app set torrent --revision main && argocd app sync torrent`

Reviewed-on: #283
2026-03-04 21:55:00 -08:00
91d84e54d5 Replace OOMKilled stat with detail table, shrink waiting reason panel
The count-only stat wasn't actionable. New table shows pod name, container,
restart count, and memory limit for each OOMKilled container. Waiting reason
panel narrowed to make room.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 20:58:11 -08:00
008da43736 Add OOMKill observability to Kubernetes Clusters dashboard
OOMKilled containers previously only appeared briefly in "Unhealthy Pods"
while dying, then vanished on restart. New panels use persistent metrics
(last_terminated_reason) and restart rate tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 20:53:07 -08:00
77a1ea15d2 Remove mikado frontmatter from closed chains, clarify finalization rules
During finalization, all mikado frontmatter (requires, status, branch) should
be removed — cards become plain documentation linked via wiki-links. Updated
agent-change-process docs and cleaned up 10 cards from closed chains. Also
fixed ai-docs referencing deleted plans/ files.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 20:43:19 -08:00
55a846eb25 Retire plans directory, convert migrate-forgejo-from-brew to mikado card
The plans/ directory predated the mikado method approach. Deleted all
completed and abandoned plans, converted the still-relevant
migrate-forgejo-from-brew into a lean mikado chain root card under
how-to/forgejo/, cleaned up dangling wiki-links across docs, and
fixed a stale "pre-commit" reference to "prek".

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 20:28:14 -08:00
e90c287504 Add qwen3.5:9b to Ollama model list
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 19:49:39 -08:00
6ca3c67705 Add Ollama reference card and update indexes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 19:43:14 -08:00
5ddb47de1c Review upgrade-grafana doc: fix image tag ref, add sidecar link
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 07:53:22 -08:00
b460333da0 Upgrade Transmission to 4.1.1 (#282)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container (Nix) / detect (push) Successful in 2s
Build Container (Nix) / build (transmission) (push) Successful in 2s
Build Container / build (transmission) (push) Successful in 6s
## Summary
- Upgrade Transmission from 4.0.6-r4 to 4.1.1-r1
- Uses Alpine edge community repo for transmission packages, keeping stable alpine:3.22 base
- Fix stale image reference in service doc (was linuxserver, now custom registry image)
- Mark transmission as reviewed in service-versions.yaml

## Context
Service review found Transmission two minor versions behind (4.0.6 → 4.1.1). Alpine 3.22 only packages 4.0.6, so transmission is installed from edge's community repo with an exact version pin.

4.1.0 added improved µTP performance, IPv6/dual-stack UDP tracker, JSON-RPC 2.0 API. 4.1.1 is a bugfix release (20+ fixes).

Dagger test build passed locally.

## Deployment and Testing
- [ ] Build container via Forgejo workflow (`mise run container-build-and-release transmission`)
- [ ] Update kustomization.yaml with new image tag
- [ ] `argocd app set torrent --revision feature/transmission-review && argocd app sync torrent`
- [ ] Verify web UI at https://torrent.ops.eblu.me
- [ ] Check Grafana Transmission dashboard still receives metrics
- [ ] After merge: `argocd app set torrent --revision main && argocd app sync torrent`

## Note
The transmission-exporter sidecar (OOMKilling every ~30min, 294 restarts) is being tracked separately as a future replacement project.

Reviewed-on: #282
2026-03-04 07:44:33 -08:00
b3d5478020 Use towncrier orphan fragment naming for C0 changes
C0 changes have no branch name, so `main.<type>.md` fragments collide.
Switch to towncrier's `+<slug>.<type>.md` orphan convention and rename
existing `main.*` fragments.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 15:30:00 -08:00
d7f0aa6f96 Fix Frigate database path to use persistent volume
The database was at /config/frigate.db (emptyDir, ephemeral) instead of
/db/frigate.db (PVC, persistent). Every pod restart wiped the database,
losing all recording history and leaving orphaned files on NFS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 15:18:16 -08:00
a4f5f7ce09 Add changelog fragment for Frigate memory limit bump
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:58:35 -08:00
135883079c Bump frigate memory limit from 2Gi to 3Gi
ONNX detector + CUDA ffmpeg + workers consume ~1.9Gi at steady state,
causing intermittent OOMKills at the 2Gi limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:57:15 -08:00
3d065b94f9 Pin grafana-sidecar to main build tag
v1.28.0-a2bb9ab (built from merge commit on main).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:51:01 -08:00
a2bb9abbdb Home-build grafana-sidecar container (#281)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (grafana-sidecar) (push) Successful in 2s
Build Container / build (grafana-sidecar) (push) Successful in 6s
## Summary
- Home-build the k8s-sidecar container (`grafana-sidecar`) from forge mirror, replacing upstream `quay.io/kiwigrid/k8s-sidecar:1.28.0`
- Pinned to v1.28.0 — v2.x deferred due to 135% memory regression and readOnlyRootFilesystem crashloop
- Adds Dockerfile, service-versions entry, docs, and changelog fragment
- Manifest switch to home-built image pending container build

## Deployment and Testing
- [ ] `mise run container-build-and-release grafana-sidecar`
- [ ] Update kustomization.yaml with built image tag
- [ ] `argocd app set grafana --revision feature/grafana-sidecar && argocd app sync grafana`
- [ ] Verify sidecar logs and dashboards at https://grafana.ops.eblu.me
- [ ] Post-merge: `argocd app set grafana --revision main && argocd app sync grafana`

Reviewed-on: #281
2026-03-03 13:48:24 -08:00
81a8ca24b9 Clarify that changelog fragments apply to all change levels (C0–C2)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:15:06 -08:00
876e51dd77 Allow implicit octals in yamllint and normalize k8s mode values
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:10:44 -08:00
4518fa3ac3 Add changelog fragment for Gandi bookmark
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:06:02 -08:00
eceea2126b Add Gandi bookmark to homepage dashboard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:05:50 -08:00
51626e6630 Update Loki to v3.6.5-3dc4ed7 container image
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:01:49 -08:00
3dc4ed730b Build Loki container image locally (#280)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (loki) (push) Successful in 2s
Build Container / build (loki) (push) Successful in 7s
## Summary
- Add two-stage Dockerfile for Loki (Go build → Alpine runtime) in `containers/loki/`
- Rewrite kustomize image to `registry.ops.eblu.me/blumeops/loki`
- Tag is `v3.6.5-placeholder` until first CI build; will be updated post-build

## Details
- UID 10001 matches existing StatefulSet `securityContext` (runAsUser/fsGroup)
- CGO_ENABLED=0, ldflags embed version via `github.com/grafana/loki/v3/pkg/util/build`
- Clones from `forge.ops.eblu.me/mirrors/loki` (mirror created this session)
- Pattern follows miniflux (two-stage Go) + prometheus (ldflags)

## Deployment and Testing
- [ ] Trigger container build: `mise run container-build-and-release loki`
- [ ] Update kustomize tag to actual build tag
- [ ] Deploy from branch: `argocd app set loki --revision feature/loki-container && argocd app sync loki`
- [ ] Verify `/ready` endpoint and log ingestion
- [ ] After merge: update to `[main]` tag (C0 follow-up)

Reviewed-on: #280
2026-03-03 13:00:43 -08:00
f914a14653 Update teslamate to v3.0.0-eb9bc57 container image
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 12:02:26 -08:00
eb9bc57351 Upgrade TeslaMate v2.2.0 → v3.0.0 (#279)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container (Nix) / build (teslamate) (push) Successful in 2s
Build Container / detect (push) Successful in 26s
Build Container / build (teslamate) (push) Successful in 4m22s
## Summary
- Upgrade TeslaMate from v2.2.0 to v3.0.0 (first service review)
- Elixir 1.18 → 1.19.5, runtime base bookworm → trixie
- Adds zstd/brotli build deps for new static asset compression
- DB migration (BTREE → BRIN indexes) runs automatically via entrypoint

## Deployment and Testing
- [ ] Trigger container build: `mise run container-build-and-release teslamate`
- [ ] Update kustomization.yaml with new image tag
- [ ] Deploy from branch: `argocd app set teslamate --revision upgrade/teslamate-v3.0.0 && argocd app sync teslamate`
- [ ] Verify TeslaMate UI loads and data is intact
- [ ] Check logs for migration errors
- [ ] After merge: reset ArgoCD to main, update kustomization tag to `[main]` image

Reviewed-on: #279
2026-03-03 11:56:40 -08:00
f91b8d0d98 Fix mirror-update-pats corrupting all GitHub mirror URLs
The bash parameter expansion `${var/pat/rep}` treats `\/` in the
replacement as a literal backslash-slash, not an escaped delimiter.
This produced URLs like `https:\/\/eblume:...` instead of
`https://eblume:...`, breaking Forgejo's URL parser (500 on mirror
settings pages) and preventing mirror syncs.

Use prefix stripping (`${var#prefix}`) instead.

All 22 corrupted mirrors have been repaired on indri.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 11:46:41 -08:00
823a35bb9a Add changelog fragment for ringtail firewall fix
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 11:17:48 -08:00
c5d82b0942 Trust k3s CNI interfaces in ringtail NixOS firewall
The NixOS firewall was blocking pod-to-host TCP traffic because only
tailscale0 was trusted. Pods could ping the host but not reach the
API server (port 6443), breaking Tailscale Ingress TLS cert refresh
and all ringtail services (authentik, frigate, ntfy, ollama).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 11:15:02 -08:00
6c5a99883f Add pre-commit check for changelog fragment placement
Misfiled fragment from feature/ branch created a subdirectory under
changelog.d/ which towncrier doesn't support. Move the fragment to the
correct flat location and add a changelog-check mise task + prek hook
to prevent this from happening again.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:49:01 -08:00
01d3b4d1c7 Switch forgejo-runner ArgoCD app to internal SSH repo URL
Was the only app still using https://forge.eblu.me (public proxy) for
git polling. All other apps already use the internal SSH endpoint at
forge.ops.eblu.me.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:43:01 -08:00
82884436df Route runner polling through internal forge.ops.eblu.me
The k8s and ringtail runners were hitting forge.eblu.me (fly.io proxy)
for every FetchTask poll (~every 2s), round-tripping through the public
internet unnecessarily. Use forge.ops.eblu.me (Caddy on indri, tailnet)
for infrastructure workloads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:33:40 -08:00
7b68be2e80 Add fly.io proxy observability and app logs to Forgejo dashboard
Rename "Forgejo Repository Health" to "Forgejo" and add proxy metrics
(request rate, error rate, RPS, latency, bandwidth), proxy access logs,
and Forgejo application logs from Loki.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:24:53 -08:00
cf8736c73b Review kustomize-grafana-deployment: fix manifest table to match reality
The doc listed a nonexistent configmap.yaml instead of the actual raw
config files (grafana.ini, datasources.yaml, provider.yaml) consumed
by kustomization.yaml's configMapGenerator. Added last-reviewed date.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:14:41 -08:00
86a0dee000 Remove ollama LAN NodePort service
The sanctioned ingress is ollama.ops.eblu.me via tailnet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:00:05 -08:00
3af346f1cd Move ollama LAN NodePort to port 80
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 09:37:50 -08:00
2d5b78a4d7 Add Forge (public) to services-check
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 08:48:15 -08:00
a87c997ee1 Expose Forgejo publicly at forge.eblu.me (#278)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m28s
## Summary

Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service.

- **Forgejo hardening:** Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO)
- **Tailscale Ingress:** ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint
- **Fly.io proxy:** nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit
- **Authentik:** OAuth callback updated to forge.eblu.me
- **DNS/TLS:** CNAME record in Pulumi, cert in fly-setup
- **Rename:** ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is)

## Deployment Order

1. `mise run provision-indri -- --tags forgejo` (config changes)
2. Verify forge.ops.eblu.me still works
3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator`
4. Verify `curl https://forge.tail8d86e.ts.net`
5. `cd fly && fly deploy`
6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/`
7. `fly certs add forge.eblu.me -a blumeops-proxy`
8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik`
9. `mise run dns-preview && mise run dns-up`
10. Full verification (see below)
11. Rehearse `mise run fly-shutoff`
12. After merge: reset ArgoCD revisions to main, re-sync

## Verification Checklist

- [ ] forge.eblu.me loads, shows public repos
- [ ] forge.ops.eblu.me still works from tailnet
- [ ] SSH clone via forge.ops.eblu.me:2222 works
- [ ] HTTPS clone via forge.eblu.me works
- [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH
- [ ] /swagger returns 403
- [ ] Rapid login attempts trigger 429 rate limit
- [ ] fail2ban bans after 5 failed logins in 10 minutes
- [ ] ArgoCD can still sync (SSH unaffected)
- [ ] `mise run fly-shutoff` stops all public traffic
- [ ] `mise run services-check` passes

Reviewed-on: #278
2026-03-03 08:40:41 -08:00
a32c99a252 Limit ollama to one loaded model and one parallel request
Prevents OOM when switching between models — only one 14B model
fits in 16GB VRAM at a time with KV cache for context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 21:23:12 -08:00
203e3cd567 Add NodePort service for ollama LAN access
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 20:57:18 -08:00
31d925814f Deploy Ollama LLM server on ringtail (#277)
## Summary
- Deploy Ollama as a new ArgoCD-managed service on ringtail's k3s cluster with GPU acceleration
- Declarative model management via `models.txt` + sidecar sync script (mirrors kiwix torrent pattern)
- Initial models: `qwen2.5:14b`, `deepseek-r1:14b`, `phi4:14b`, `gemma3:12b`
- hostPath PV on `/mnt/storage1/ollama` for fast local model storage (200Gi)
- Tailscale ingress at `ollama.ops.eblu.me` for API access from tailnet
- Enable GPU time-slicing (`replicas: 2`) on nvidia-device-plugin so Frigate and Ollama share the RTX 4080

## Deployment and Testing
- [ ] Deploy nvidia-device-plugin changes first: `argocd app sync nvidia-device-plugin`
- [ ] Verify GPU time-slicing: `kubectl describe node ringtail --context=k3s-ringtail` shows `nvidia.com/gpu: 2`
- [ ] Sync `apps` app with `--revision feature/ollama-ringtail`
- [ ] Set ollama app to branch: `argocd app set ollama --revision feature/ollama-ringtail && argocd app sync ollama`
- [ ] Verify model-sync sidecar pulls models: `kubectl logs -n ollama deploy/ollama -c model-sync --context=k3s-ringtail`
- [ ] Test API: `curl https://ollama.ops.eblu.me/api/tags`
- [ ] Test inference: `curl https://ollama.ops.eblu.me/api/generate -d '{"model":"qwen2.5:14b","prompt":"Hello"}'`
- [ ] Verify Frigate still works after GPU sharing change
- [ ] After merge: `argocd app set ollama --revision main && argocd app sync ollama`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/277
2026-03-02 20:39:51 -08:00
Forgejo Actions
0f79c61c42 Update docs release to v1.12.1
- Built changelog from towncrier fragments

[skip ci]
2026-03-02 18:17:07 -08:00
7a1875936c Switch git hooks from pre-commit to prek (#276)
## Summary

- Replace pre-commit with [prek](https://github.com/j178/prek), a faster Rust-native drop-in alternative
- Migrate config from `.pre-commit-config.yaml` (YAML) to `prek.toml` (TOML)
- Add new built-in checks: case conflicts, private key detection, executable shebangs
- Install prek via mise native registry (`aqua:j178/prek`) instead of pipx
- Update all doc references across README, contributing guide, and how-to docs

## Notes

- `check-yaml` still uses the remote `pre-commit-hooks` repo because prek's builtin fast path doesn't support `--unsafe` yet (needed for Ansible custom YAML tags)
- All existing custom hooks (docs validation, container version check, mikado invariant, workflow validation) work unchanged
- Tested: all hooks pass on clean tree, deliberate doc link breakage is caught

## Test plan

- [x] `prek run --all-files` passes all checks
- [x] Broken wiki-link correctly caught by `docs-check-links`
- [x] taplo-format auto-fixes TOML formatting on commit
- [x] commit-msg hook (mikado invariant) fires correctly

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/276
2026-03-02 18:15:23 -08:00
940219338a Slightly less confusing output when mikado cards share a prereq 2026-03-02 17:58:18 -08:00
2d54f93c68 Add changelog for impl-card-guard feature
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 17:45:01 -08:00
6131f881a8 Enforce impl commits can't modify Mikado card files
The mikado-branch-invariant-check hook now inspects staged files (commit-msg
hook) and historical commit files (standalone mode) to reject impl commits
that touch markdown files with Mikado frontmatter (requires:, status:, or
branch: mikado/). Cards should only be modified by plan, close, or finalize.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 17:44:34 -08:00
10d0636cee Review navidrome and miniflux: both at latest versions
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 07:32:17 -08:00
9465b75815 Add changelog for authentik source chain doc review
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 07:28:09 -08:00
08b9570ac7 Review build-authentik-from-source Mikado chain docs
Fix go-server-derivation: wrong path target (webui not authentik-django)
and missing internal/web/static.go patch. Remove stale DRF fork content
from mirror-build-deps (no longer needed as of 2026.2.0). Add
last-reviewed to all 5 cards without it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 07:28:09 -08:00
Forgejo Actions
847e47eaf3 Update docs release to v1.12.0
- Built changelog from towncrier fragments

[skip ci]
2026-03-01 17:24:09 -08:00
2a2811d7a5 Review authentik-api-client-generation doc: fix stale content
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:21:46 -08:00
c9d273dc81 Update authentik changelog fragment to mention version upgrade
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 17:11:41 -08:00
503775085d Deploy authentik 2026.2.0 with migration ordering fix
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 16:32:10 -08:00
2d4098e480 Fix authentik 2026.2.0 migration ordering bug (#275)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container (Nix) / detect (push) Successful in 1s
Build Container / build (authentik) (push) Successful in 1s
Build Container (Nix) / build (authentik) (push) Successful in 3m6s
## Summary

- Patch `authentik_rbac/0010` migration to depend on `authentik_core/0056`, fixing non-deterministic ordering that crashes startup with `FieldError: Cannot resolve keyword 'group_id'`
- Upstream bug: goauthentik/authentik#19616, #20634 — no fix released yet
- Document the issue in the lessons-learned table

## Deployment and Testing

- [ ] CI builds container image
- [ ] Deploy from branch: `argocd app set authentik --revision fix/authentik-migration-ordering && argocd app sync authentik`
- [ ] Pods reach Running/Ready without crash-looping
- [ ] `kubectl logs` show 0056 migrating before 0010
- [ ] authentik UI loads at authentik.ops.eblu.me
- [ ] `mise run services-check`
- [ ] After merge: `argocd app set authentik --revision main && argocd app sync authentik`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/275
2026-03-01 16:28:36 -08:00
90621e4155 Deploy authentik 2026.2.0 with entry_points fix
Update image tag to v2026.2.0-78027eb-nix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 16:04:29 -08:00
78027eb783 Fix authentik: entry_points API for Python 3.14
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 2s
Build Container / build (authentik) (push) Successful in 2s
Build Container (Nix) / build (authentik) (push) Successful in 3m7s
Python 3.14's EntryPoints uses string keys, not integer indices.
eps[0] raises KeyError(0); use next(iter(eps)) instead.
Verified on ringtail with the actual venv python.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 16:02:49 -08:00
e2c650b027 Deploy authentik 2026.2.0 with BASE_DIR fix
Update image tag to v2026.2.0-e49d966-nix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:55:50 -08:00
e49d966229 Fix authentik: symlink authentik/ for BASE_DIR resolution
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 2s
Build Container / build (authentik) (push) Successful in 2s
Build Container (Nix) / build (authentik) (push) Successful in 3m4s
Django's BASE_DIR is $out but source lives in site-packages. Code
like BASE_DIR / "authentik" / "sources" / "scim" / "schemas" / ...
needs a top-level symlink to find data files alongside Python source.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:55:26 -08:00
c0e29476f3 Deploy authentik 2026.2.0 with TMPDIR fix
Update image tag to v2026.2.0-b7bfb0b-nix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:53:09 -08:00
b7bfb0bfae Fix authentik container: set TMPDIR=/tmp
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 2s
Build Container / build (authentik) (push) Successful in 1s
Build Container (Nix) / build (authentik) (push) Successful in 45s
lifecycle/ak uses ${TMPDIR}/authentik-mode — without TMPDIR set it
tries to write /authentik-mode in root, which user 65534 can't do.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:52:36 -08:00
38da372f94 Deploy authentik 2026.2.0 with /tmp fix
Update image tag to v2026.2.0-2ac353b-nix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:51:17 -08:00
2ac353b7bf Fix authentik container: create /tmp for unprivileged user
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 2s
Build Container / build (authentik) (push) Successful in 1s
Build Container (Nix) / build (authentik) (push) Successful in 54s
buildLayeredImage doesn't create /tmp by default. The container runs
as user 65534 (nobody) which can't mkdir /tmp at runtime.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:48:05 -08:00
098f3e517c Deploy authentik 2026.2.0 (source-built) to ArgoCD
Update image tag to v2026.2.0-efa9806-nix — the first source-built
authentik container from the build-authentik-from-source chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:44:35 -08:00
efa9806bfa C2: Build authentik from source (Mikado chain) (#274)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container (Nix) / detect (push) Successful in 1s
Build Container / build (authentik) (push) Successful in 2s
Build Container (Nix) / build (authentik) (push) Successful in 22s
## Mikado Chain: build-authentik-from-source

Replace `pkgs.authentik` from nixpkgs with a custom Nix derivation built from source.
This removes the dependency on the nixpkgs packaging timeline and gives full version control.

Target version: **2025.12.4** (nixpkgs reference, upgrading from deployed 2025.10.1).

### Dependency Graph

```
build-authentik-from-source (goal)
├── authentik-go-server-derivation
│   ├── authentik-api-client-generation  ← IN PROGRESS
│   └── authentik-python-backend-derivation
├── authentik-web-ui-derivation
│   └── authentik-api-client-generation  ← IN PROGRESS
└── authentik-python-backend-derivation
```

### Ready Leaves
- `authentik-api-client-generation` — Go + TypeScript client generation from OpenAPI schema
- `authentik-python-backend-derivation` — Django backend with 60+ deps, 4 in-tree packages

### Architecture
Ported from [nixpkgs `pkgs/by-name/au/authentik/package.nix`](https://github.com/NixOS/nixpkgs/tree/master/pkgs/by-name/au/authentik):
- `source.nix` — shared version/source fetch
- `client-go.nix` — Go API client generation
- `client-ts.nix` — TypeScript API client generation
- `api-go-vendor-hook.nix` — Go vendor directory injection hook
- (more components to follow as leaves are closed)

### Related Cards
- [[build-authentik-from-source]] — Goal card
- [[authentik-api-client-generation]]
- [[authentik-python-backend-derivation]]
- [[authentik-web-ui-derivation]]
- [[authentik-go-server-derivation]]

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/274
2026-03-01 13:45:00 -08:00
0aaf9bb8b2 Add Dagger local build step to authentik source build goal
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 08:39:25 -08:00
7094ea7d3e Start C2 Mikado chain: build authentik from source
Create goal card and 4 prerequisite cards for building authentik from a
custom Nix derivation instead of using pkgs.authentik from nixpkgs. This
removes the dependency on the nixpkgs packaging timeline and gives full
version control over authentik releases.

Chain: mikado/authentik-source-build
Leaf nodes: authentik-api-client-generation, authentik-python-backend-derivation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 08:20:17 -08:00
922265c88f Add changelog fragment for grafana doc review
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 07:28:47 -08:00
8d1e98617b Review build-grafana-container docs: stamp reviewed, fix cross-links
Also fix stale grafana.md reference card (Helm → Kustomize).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 07:28:06 -08:00
02eb169403 Pin blumeops-pg to PostgreSQL 18.3
Replace floating :18 tag with pinned :18.3 (upstream out-of-cycle
release fixing 18.2 regressions). Stamps service as reviewed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 16:25:32 -08:00
2312e5fbf8 Update ringtail flake inputs
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 15:22:29 -08:00
7cecaf0471 Review forgejo-runner docs: stamp reviewed, fix cross-links
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 15:10:20 -08:00
776caa87f5 Sync Frigate zone coordinates from live API
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 07:52:09 -08:00
Forgejo Actions
fa223f8e3b Update docs release to v1.11.5
- Built changelog from towncrier fragments

[skip ci]
2026-02-26 07:56:02 -08:00
be3cdad1cb Add HA for CV and Docs: zero-downtime deploys (#273)
## Summary
- Set `replicas: 2` with `maxUnavailable: 0` / `maxSurge: 1` on CV and Docs deployments so rolling updates never drop below 2 ready pods
- Add PodDisruptionBudgets (`minAvailable: 1`) to protect against node drains and cluster maintenance
- Add Fly.io cache purge step to `cv-deploy.yaml` workflow (docs already had this) so CV deploys don't serve stale cached content

## Deployment and Testing
- [ ] `argocd app diff cv` / `argocd app diff docs` from branch
- [ ] Deploy from branch: `argocd app set cv --revision feature/ha-cv-docs-zero-downtime && argocd app sync cv`
- [ ] Verify 2 pods running: `kubectl get pods -n cv --context=minikube-indri`
- [ ] Test rolling restart: `kubectl rollout restart deployment/cv -n cv --context=minikube-indri`
- [ ] During rollout, confirm continuous availability via `curl -I https://cv.eblu.me`
- [ ] After merge: reset ArgoCD to main, re-sync both apps

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/273
2026-02-26 07:53:21 -08:00
8e9d89ca73 docs-review: print file path instead of content for LLM usage
The LLM should read the file itself using its tools rather than
receiving it inline in the task output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 07:24:37 -08:00
9a7acffa26 Review manage-forgejo-mirrors doc: clarify cron default, stamp reviewed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 07:17:18 -08:00
fb83c5c577 Add explicit ExternalSecret defaults for SSA sync parity
The external-secrets webhook injects conversionStrategy, decodingStrategy,
and metadataPolicy defaults on admission. Declaring them explicitly prevents
ArgoCD SSA from flagging the resource as OutOfSync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 07:02:54 -08:00
db561c6b0e Upgrade ArgoCD v3.2.6 → v3.3.2 with Server-Side Apply (#272)
## Summary
- Upgrade ArgoCD from v3.2.6 to v3.3.2
- Enable `ServerSideApply=true` sync option (required by v3.3 — ApplicationSet CRD exceeds client-side apply annotation limit)
- Update service-versions.yaml with review for argocd and 1password-connect

## Breaking changes reviewed
- **Server-Side Apply required**: Added to syncOptions 
- **Source Hydrator git notes**: Not used — N/A
- **Application path cleaning removed**: Not used — N/A
- **Settings API field restriction**: Authenticated access only — N/A

## Deployment and Testing
- [ ] Sync the `apps` app first (picks up SSA syncOption change)
- [ ] `argocd app set argocd --revision feature/argocd-v3.3.2`
- [ ] `argocd app sync argocd`
- [ ] Verify all argocd pods running with v3.3.2 images
- [ ] Verify other apps still sync correctly
- [ ] After merge: `argocd app set argocd --revision main && argocd app sync argocd`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/272
2026-02-26 06:51:50 -08:00
95c8424e62 Add Transmission metrics exporter and Grafana dashboard (#271)
## Summary
- Add `metalmatze/transmission-exporter` as a sidecar container in the torrent deployment, exposing Prometheus metrics on port 19091
- Add metrics port to the torrent service for Prometheus scraping
- Add Prometheus scrape job targeting the transmission exporter
- Create Grafana dashboard with:
  - Overview stats (download/upload speed, active/total torrents)
  - Transfer speed timeseries (download + upload over time)
  - Transfer volume stats (total downloaded/uploaded in selected range)
  - Per-torrent download and upload rate timeseries
  - Per-torrent details table (ratio, uploaded, percent done)

## Deployment and Testing
- [ ] Sync ArgoCD `torrent` app from branch — verify exporter sidecar starts
- [ ] Verify exporter metrics: `kubectl exec` into pod, `curl localhost:19091/metrics`
- [ ] Verify Prometheus scrapes it: check targets at prometheus.ops.eblu.me
- [ ] Open Grafana, find "Transmission" dashboard, verify panels populate
- [ ] Sync ArgoCD `prometheus` app from branch
- [ ] Sync ArgoCD `grafana-config` app from branch

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/271
2026-02-25 22:23:33 -08:00
03d71544ec Add multi-cluster observability with ringtail metrics and dashboards (#270)
## Summary
- Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs
- Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests)
- Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki
- Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with:
  - **Kubernetes Clusters** dashboard — multi-cluster with `cluster` and `namespace` template variables
  - **Ringtail (k3s)** dashboard — dedicated ringtail view with GPU usage panels

## Deployment and Testing
1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`)
2. Sync `prometheus` → verify `cluster` label on scraped metrics
3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs
4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs
5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail
6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}`
7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values
8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods

## Notes
- Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later
- DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve
- The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270
2026-02-25 22:01:00 -08:00
2243f2e0a1 Filter driveway zone to person/dog/cat only in Frigate
Parked car was being re-detected every few minutes at night due to IR
illumination noise triggering motion detection. Restrict the driveway
zone to [person, dog, cat] so cars and birds no longer create events
there. Cars still alert via the driveway_entrance zone.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 20:45:07 -08:00
84338c32c2 Add authenticated GitHub PAT for Forgejo mirror sync (#269)
## Summary

- **mirror-create**: Auto-includes GitHub PAT from 1Password for authenticated upstream fetches at mirror creation time
- **mirror-update-pats**: New mise task that SSHes into indri and rewrites the git remote URL in every GitHub mirror's bare repo config to embed the PAT. Idempotent, supports `--dry-run`
- **app.ini.j2**: Explicit `[mirror]` section with `DEFAULT_INTERVAL = 8h` and `MIN_INTERVAL = 10m` (bakes in the defaults for visibility)
- **manage-forgejo-mirrors**: New how-to doc covering mirror creation, PAT storage, the `mirror-update-pats` task, and the full 20-day PAT rotation procedure

## Context

GitHub tightened unauthenticated rate limits for git clone/fetch in May 2025. With 23 GitHub mirrors syncing every 8 hours, authenticated fetches avoid throttling. The PAT is stored in 1Password (`Forgejo Secrets` → `github-mirror-pat`) and has been applied to all existing mirrors.

## Deployment and Testing

- [x] `mirror-update-pats` dry-run verified (23 mirrors detected)
- [x] `mirror-update-pats` applied to all 23 GitHub mirrors on indri
- [x] Idempotency confirmed (re-run shows 0 updated, 23 skipped)
- [ ] Provision indri with `--tags forgejo` to apply `[mirror]` config
- [ ] Trigger a manual mirror sync and verify success in Forgejo UI

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/269
2026-02-25 20:20:23 -08:00
23dc79058e Bake default display options into ai-docs mise task
The --style=header --color=never --decorations=always flags are now built
into the script so callers can just run `mise run ai-docs`. Also adds a
note to CLAUDE.md to never truncate the output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 17:42:47 -08:00
de54b4e33d Port CloudNative-PG off Helm to direct release manifest (#268)
## Summary
- Point ArgoCD app directly at forge-mirrored upstream repo (`mirrors/cloudnative-pg`) instead of the Helm charts repo
- Use `directory.include` to select the specific release manifest (`cnpg-1.27.1.yaml`) from the `releases/` directory
- No vendored files, no Helm — upgrades are a two-line change (`targetRevision` + `directory.include`)
- Delete unused `values.yaml` (was empty, all Helm defaults)

## Deployment and Testing
- [ ] Register mirror repo in ArgoCD: `argocd repo add ssh://forgejo@forge.ops.eblu.me:2222/mirrors/cloudnative-pg.git --ssh-private-key-path <key>`
- [ ] `argocd app set cloudnative-pg --revision feature/cnpg-direct-source && argocd app sync cloudnative-pg`
- [ ] Verify operator pod running: `kubectl get pods -n cnpg-system --context=minikube-indri`
- [ ] Verify CRDs exist: `kubectl get crd --context=minikube-indri | grep cnpg`
- [ ] Verify existing clusters healthy: `kubectl get clusters -A --context=minikube-indri`
- [ ] After merge: `argocd app set cloudnative-pg --revision main && argocd app sync cloudnative-pg`

## Notes
- The forge mirror was created via `mise run mirror-create` from `https://github.com/cloudnative-pg/cloudnative-pg.git`
- ArgoCD may need the mirror repo added to its known repositories if the credential template doesn't already match `mirrors/*`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/268
2026-02-25 17:37:53 -08:00
285ad4141f Fix Frigate detection events rate metric name in Grafana dashboard
The panel queried frigate_camera_events but the actual metric exposed
by Frigate is frigate_camera_events_total with a "camera" label
(not "camera_name").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 16:51:57 -08:00
Forgejo Actions
4736c7e9bd Update docs release to v1.11.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-25 07:04:23 -08:00
e273f399ea Review 3 how-to docs and fix update-tailscale-acls inaccuracies
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 07:02:49 -08:00
5f9bc20345 Fix mirror org refs in ArgoCD apps and widen credential template (#266)
## Summary

- Widen `repo-creds-forge` URL prefix from `/eblume/` to host-wide `/` so it matches repos in all forge orgs (fixes `mirrors/` repos not getting SSH credentials)
- Update 8 ArgoCD app definitions from `eblume/<mirror>` → `mirrors/<mirror>` (immich-charts, cloudnative-pg-charts, external-secrets, connect-helm-charts)
- Fix stale alloy clone comment in Ansible defaults
- Bump immich v2.5.2 → v2.5.6 (bug-fix patches only)
- Update ArgoCD README bootstrap command and credential docs

## Context

Mirrors were migrated from `forge.ops.eblu.me/eblume/` to `forge.ops.eblu.me/mirrors/` in commit `cd57814`. Container Dockerfiles and image tags were updated, but ArgoCD app definitions and the repo credential template were missed, causing `ComparisonError` on apps that source Helm charts from mirrored repos.

## Deployment

1. Sync the ArgoCD `argocd` app first (picks up the widened credential template)
2. Sync the `apps` app (picks up new repo URLs for all 8 apps)
3. Verify immich resolves its ComparisonError: `argocd app get immich`
4. Sync immich to deploy v2.5.6: `argocd app sync immich`
5. Spot-check: `argocd app get external-secrets`, `argocd app get cloudnative-pg`, `argocd app get 1password-connect`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/266
2026-02-25 06:55:53 -08:00
5c31b6b42a Add changelog fragments for C0 commits in this session
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 21:21:00 -08:00
4f8f2985c1 Update prometheus and teslamate image tags after mirror migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 21:18:15 -08:00
c1f4f0169b Add mise-tasks reference card and include in ai-docs
Categorized reference of all mise tasks with descriptions. Added to
the tools section of the reference index and to the ai-docs context
priming script.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 21:17:34 -08:00
33e3e4e098 Add mirror-create mise task for upstream mirrors
Creates mirrors in the mirrors/ Forgejo org via API. Supports
GitHub, Codeberg, and generic git URLs with auto-detection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 21:15:17 -08:00
e0f9ebebdf Update homepage, navidrome, ntfy, miniflux image tags after mirror migration
Prometheus and teslamate builds still in progress — will update in a
follow-up commit once their 33b7f0f tags land.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 21:06:08 -08:00
33b7f0f353 Switch prometheus, teslamate, miniflux to forge mirrors
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container (Nix) / detect (push) Successful in 2s
Build Container (Nix) / build (miniflux) (push) Successful in 3s
Build Container (Nix) / build (prometheus) (push) Successful in 3s
Build Container (Nix) / build (teslamate) (push) Successful in 2s
Build Container / build (miniflux) (push) Successful in 1m14s
Build Container / build (teslamate) (push) Successful in 13m42s
Build Container / build (prometheus) (push) Successful in 15m20s
Created miniflux mirror at mirrors/miniflux. All three containers
now clone from forge.ops.eblu.me/mirrors/ instead of GitHub directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 21:01:08 -08:00
34a1314f8d Document AirPlay cross-VLAN firewall rules and fix rule ordering
AirPlay from Main to IoT VLAN (Samsung Frame TV) required adding
established/related, AirPlay port, and dynamic reverse port rules —
but the root cause was rule ordering (allows appended after blocks).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 20:49:31 -08:00
cd578144f7 Migrate upstream mirrors to mirrors/ Forgejo org (#265)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container (Nix) / build (homepage) (push) Successful in 3s
Build Container (Nix) / build (navidrome) (push) Successful in 3s
Build Container (Nix) / build (ntfy) (push) Successful in 8s
Build Container / detect (push) Successful in 42s
Build Container / build (navidrome) (push) Successful in 9m37s
Build Container / build (homepage) (push) Successful in 9m56s
Build Container / build (ntfy) (push) Successful in 2m35s
## Summary

- Created `mirrors` Forgejo organization for upstream mirror repos
- Transferred 22 mirror repos from `eblume/` to `mirrors/` (mirror sync config preserved)
- Deleted unused repos: hajimari, hister
- Updated all container build URLs (homepage, navidrome, ntfy Dockerfiles + nix)
- Updated documentation references (migrate-forgejo-from-brew, upstream-fork-strategy, fix-ntfy-nix-version)
- `dotfiles` intentionally kept under `eblume/` per user request
- `devpi` transferred to `mirrors/`

Repos remaining under `eblume/`: blumeops, cv, mcquack, dotfiles

## Cleanup TODO

- [ ] Delete temp Forgejo API token "claude-migration-temp" (Settings > Applications)

## Test Plan

- [x] Verified mirror config (mirror=true, original_url) survived transfer on test repo (tesla_auth)
- [x] All pre-commit hooks pass (including container-version-check, docs-check-links)
- [ ] Verify a mirror repo sync runs successfully after transfer (check mirrors/authentik or similar)
- [ ] Rebuild containers from branch to verify Dockerfile URLs resolve

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/265
2026-02-24 20:43:14 -08:00
5c75419c85 Reworked README 2026-02-24 15:20:26 -08:00
61ee6a4d38 Fix Grafana ConfigMap labels lost in configMapGenerator migration
The hand-written configmap.yaml had app.kubernetes.io/name and
app.kubernetes.io/instance labels; configMapGenerator dropped them.
Add options.labels to both generator entries to restore parity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 14:46:20 -08:00
9b44a8ec51 Add kustomize images: and configMapGenerator: across services (#264)
## Summary

- Move hardcoded image tags to kustomization.yaml `images:` transformer across **22 services** — image names in manifests become version-agnostic templates, with tags centralized in one place per service
- Replace hand-written ConfigMap manifests with `configMapGenerator:` in **12 services** — config data extracted to standalone files, generated ConfigMaps include content hashes that trigger automatic pod rollouts on changes
- Create new `kustomization.yaml` for **forgejo-runner** and **nvidia-device-plugin** (switches ArgoCD from directory mode to kustomize mode, rendered output identical)

### Services modified

**Images only (8):** cv, devpi, docs, kube-state-metrics, miniflux, navidrome, teslamate, torrent

**Images + configMapGenerator (10):** alloy-k8s, forgejo-runner, frigate, grafana, homepage, kiwix, loki, mosquitto, ntfy, prometheus

**Images only, no configMapGenerator (4):** authentik (skip blueprints — special YAML tags), tailscale-operator-base (Deployment only, CRD image fields left as-is)

**Skipped entirely (6):** argocd (remote upstream), databases (no image fields), external-secrets, grafana-config (cross-kustomization dashboards), immich (Helm-managed), 1password-connect/cloudnative-pg (no kustomization.yaml)

### What changes at deploy time

- **images:** — no functional diff, `kustomize build` produces identical output with tags
- **configMapGenerator:** — ConfigMap names gain hash suffixes (e.g., `prometheus-config` → `prometheus-config-6f42fhctcb`) and all Deployment/StatefulSet/DaemonSet references are updated automatically. Pods will restart once per service on first sync due to the name change

## Test plan

- [x] `kubectl kustomize` builds all 30 service directories successfully
- [x] Image tags verified in rendered output for all modified services
- [x] ConfigMap hash suffixes verified in rendered output
- [x] ConfigMap references in Deployments/StatefulSets confirmed to use hashed names
- [x] All pre-commit hooks pass (yamllint, shellcheck, prettier, etc.)
- [ ] `argocd app diff` each service to confirm only expected ConfigMap name changes
- [ ] Deploy from branch starting with a low-risk service (e.g., mosquitto)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/264
2026-02-24 14:25:19 -08:00
86aeb60ec9 Fix TeslaMate dashboards: add database to PostgreSQL jsonData
Grafana 12.x's grafana-postgresql-datasource plugin requires the
database name in jsonData, not just the top-level database field.
Without it, the frontend blocks all queries with "no default database
configured", causing all TeslaMate panels to show "No Data."

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 13:49:07 -08:00
495c3e8496 Fix Grafana OAuth role mapping from Authentik groups
The INI parser was stripping outer single quotes from
role_attribute_path = 'Admin', causing Grafana to evaluate 'Admin'
as a JMESPath field identifier instead of a string literal. This
resulted in all OAuth users getting the default Viewer role.

Replaced with a proper group-based expression that checks for the
'admins' Authentik group and maps to Admin/Viewer accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 13:41:08 -08:00
4acd2e58d4 Update prometheus and grafana to main-SHA container tags
Prometheus: v3.9.1-74029e1 [branch] -> v3.9.1-2ba5d8a [main]
Grafana: v12.3.3-09ac36b [branch] -> v12.3.3-d05d2fb [main]

These images were built during PR development and referenced branch
commits that won't survive branch cleanup. The [main] tags are
identical rebuilds from the squash-merge commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 09:58:09 -08:00
1b9f706a30 Document container tag provenance and enhance container-list (#263)
## Summary

After investigating deployed container images, confirmed that squash-merging PRs orphans the commit SHAs embedded in container image tags. Two of our currently deployed images (prometheus, grafana) reference branch commits not on main.

This PR:

- Documents the squash-merge SHA orphan problem and the post-merge workflow in [[build-container-image]]
- Adds step 9 to the C1 process: after merging a PR that changes `containers/`, do a follow-up C0 to point manifests at the rebuilt `[main]` tag
- Rewrites `container-list` as a `uv run --script` (typer + rich + httpx)
- Adds optional container name filter (`mise run container-list prometheus` shows 10 tags instead of 4)
- Annotates every tag with `[main]` or `[branch]` based on git commit ancestry

## Test plan

- [x] `mise run container-list` — all containers shown with `[main]`/`[branch]` hints
- [x] `mise run container-list prometheus` — filtered view, more tags, correctly shows `[main]` and `[branch]`
- [x] `mise run container-list nonexistent` — error message with exit code 1
- [x] Pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/263
2026-02-24 09:54:58 -08:00
2ba5d8a8aa Port Prometheus to local container build (#262)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (prometheus) (push) Successful in 2s
Build Container / build (prometheus) (push) Successful in 7s
## Summary
- Add three-stage Dockerfile for Prometheus v3.9.1 (Node UI → Go binaries → Alpine runtime)
- Produces `prometheus` and `promtool` binaries with embedded web UI assets
- Follows navidrome/ntfy pattern for supply chain control via Zot registry

## Deployment and Testing
- [ ] `dagger call build --src=. --container-name=prometheus` succeeds
- [ ] Container reports correct version via `prometheus --version`
- [ ] `promtool --version` works
- [ ] Update statefulset image reference after successful build
- [ ] Deploy from branch: `argocd app set prometheus --revision <branch> && argocd app sync prometheus`
- [ ] Health probes pass (`/-/healthy`, `/-/ready`)
- [ ] Web UI loads, scrape targets work, remote write functions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/262
2026-02-24 09:15:57 -08:00
b1ba96f6d6 Review migrate-grafana-to-authentik: fix file paths, add last-reviewed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 07:29:41 -08:00
588da6bbcb Review cloudnative-pg: v1.28.1 is current, no upgrade needed
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 07:27:25 -08:00
Forgejo Actions
2f78d180e8 Update docs release to v1.11.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-23 21:04:33 -08:00
9b4951bf94 Improve Mikado process: cycle discipline, reset rigor, --resume enhancements (#261)
## Summary
- **End-of-cycle prompting:** After closing a leaf node and pushing, the agent should prompt the user to review and suggest ending the session rather than rushing into the next leaf
- **Reset rigor:** Reinforced that errors during impl should trigger a branch reset + plan update (not fix-forward). Documented the `git log --oneline --not main` → `git reset --hard` → `git cherry-pick` pattern with clear threshold guidance
- **`--resume` shows PR number:** Queries the Forgejo API for open PRs matching the branch, displays number/title/URL and a hint to run `pr-comments`
- **`--resume` checks git stash:** Shows stash entries as a non-presumptive hint — informs without assuming they apply

## Test plan
- [ ] `mise run docs-mikado --resume` runs without errors (no active chains case)
- [ ] On a mikado branch with an open PR, verify PR info is shown
- [ ] With stashed work, verify stash entries are displayed
- [ ] Review agent-change-process.md for clarity

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/261
2026-02-23 21:03:27 -08:00
d05d2fbaff C2: Upgrade Grafana to 12.x with Nix container and Kustomize (#260)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 1s
Build Container (Nix) / build (grafana) (push) Successful in 2s
Build Container / build (grafana) (push) Successful in 7s
## Summary

Mikado chain to upgrade Grafana from 11.4.0 (Helm chart) to 12.x with:
- Home-built Nix container image (`forge.ops.eblu.me/eblume/grafana`)
- Kustomize manifests replacing the Helm chart
- Single-source ArgoCD app

## Chain

Goal: `upgrade-grafana`
Leaves: `build-grafana-container`, `kustomize-grafana-deployment`

Track with: `mise run docs-mikado upgrade-grafana`

## Test plan
- [ ] Container builds successfully via Nix
- [ ] Container pushed to registry
- [ ] Kustomize manifests produce equivalent resources to current Helm
- [ ] Pod runs, UI loads, OIDC works, datasources healthy
- [ ] `mise run services-check` passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/260
2026-02-23 18:07:18 -08:00
9b419abf24 Update RUNNER_LABELS to use runner-job-image:v0.19.11-4c5e0f0
Now that the image is built under the new name, point the forgejo
runner at it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 17:47:14 -08:00
4c5e0f0d16 Rename containers/forgejo-runner to runner-job-image
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (runner-job-image) (push) Successful in 2s
Build Container / build (runner-job-image) (push) Successful in 1m42s
The forgejo-runner container is the CI job execution environment (Dagger,
ArgoCD CLI, etc.), not the runner daemon itself. Rename to runner-job-image
to fix the version-check false positive (Dagger 0.19.11 vs daemon 12.7.0)
and clarify the distinction.

RUNNER_LABELS still references the old image name — will update after
building the image under the new name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 17:44:51 -08:00
179eca2070 Fix container-build-and-release to resolve refs to full SHA
actions/checkout treats short SHAs as branch name patterns, causing
fetch failures. Always resolve --ref to a full 40-char SHA before
dispatching the workflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 17:27:18 -08:00
7641018c6a Fix container build workflows to checkout dispatch ref
When manually dispatching a container build with --ref, the build job
now checks out the specified commit instead of the branch HEAD. This
allows building containers from feature branches before merging.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 17:24:32 -08:00
c70aff256a Fix mikado-branch-invariant-check not validating incoming commits
The commit-msg hook had `pass_filenames: false`, which prevented
pre-commit from passing the commit message file path. Without it,
the hook only validated existing branch history and never checked
the incoming commit against ordering rules — plan-after-impl was
silently accepted.

Remove `pass_filenames: false` so the pending commit is included
in invariant validation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 16:32:34 -08:00
66b5b32f1d Formalize C0/C1/C2 change classification (#259)
## Summary
- **C0 (Quick Fix):** Now explicitly allows direct-to-main commits with no PR required — for low-risk, fix-forward-safe changes
- **C1 (Human Review):** New docs-first workflow with branch deployment (ArgoCD `--revision`, Ansible from checkout). Includes upgrade criteria for escalation to C2
- **C2 (Mikado Chain):** Introduces the **Mikado Branch Invariant** — strict commit ordering where card-introducing commits come first, followed by code progress, followed by card closures. Branch resets required when new prerequisites are discovered

Updates CLAUDE.md rules (3, 4, 8, 9) to reflect that C0 bypasses branching/PR requirements. Also updates ai-assistance-guide, how-to index, and docs-mikado task description.

## Files changed
- `CLAUDE.md` — rules and classification table
- `docs/how-to/agent-change-process.md` — full process rewrite
- `docs/tutorials/ai-assistance-guide.md` — branching and pitfalls sections
- `docs/how-to/how-to.md` — index description
- `mise-tasks/docs-mikado` — task description
- `docs/changelog.d/formalize-change-classification.doc.md` — changelog fragment

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/259
2026-02-23 16:19:54 -08:00
f05e5cccdf Review Grafana: replace Helm upgrade plan with C2 Mikado chain (#258)
## Summary
- Delete the old 3-phase Helm chart upgrade plan (predates Mikado system)
- Create C2 Mikado chain with goal card `upgrade-grafana` and two leaf prereqs:
  - `kustomize-grafana-deployment` — convert Helm to kustomize manifests
  - `build-grafana-container` — home-built Grafana 12.x image (no upstream containers)
- Record first-ever Grafana review: currently at v11.4.0 on Helm chart 8.8.2
- Update service-versions.yaml, how-to index, and plans index

## Service Review Findings
- Grafana is healthy and synced in ArgoCD
- Running v11.4.0, latest upstream is 12.3.3
- Breaking changes for 12.x are low-risk (React panels only, UIDs compliant)
- PVC is disposable — dashboards and datasources are all config-provisioned

## Deployment and Testing
- [ ] No deployment needed — documentation-only change
- [ ] `docs-check-links` passes
- [ ] `docs-check-index` passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/258
2026-02-23 15:06:00 -08:00
2865bf5c27 Review deploy-authentik: rewrite as process guide (#257)
## Summary
- Rewrites deploy-authentik from a historical changelog into a reproducible process guide
- Removes stale version info (`v1.1.2-nix`) and future work section (Forgejo federation is done, rest belongs elsewhere)
- Marks deploy-authentik as completed in plans index and completed archive
- Removes hardcoded image tag from authentik reference card (use `service-versions.yaml`)
- Adds `last-reviewed: 2026-02-23` frontmatter

## Test plan
- [x] All pre-commit hooks pass (docs-check-links, docs-check-index, etc.)
- [x] ArgoCD app verified synced and healthy
- [x] All wiki-links validated

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/257
2026-02-23 14:35:39 -08:00
260b7ca520 Fix dagger call hanging in mise tasks on interactive terminals (#256)
## Summary

- Add `--progress=plain` to all `dagger call` invocations in mise tasks to prevent SIGTTOU hangs

## Root cause

Mise runs task scripts in a child process group that is not the terminal's foreground group. When `dagger call` detects a TTY (inherited from the interactive shell), it tries to render its TUI progress display, which requires terminal ioctls. Since the process is not in the foreground group, the kernel sends SIGTTOU, stopping the process indefinitely.

This only manifests when running from an interactive terminal (e.g. `pre-commit run --all-files` in fish/wezterm). CI and piped contexts are unaffected since there's no TTY.

## Changes

- `mise-tasks/validate-workflows` — add `--progress=plain`
- `mise-tasks/frigate-export-model` — add `--progress=plain`
- `mise-tasks/provision-ringtail` — add `--progress=plain`

## Test plan

- [x] `pre-commit run --all-files` completes without hanging
- [ ] Verify in interactive fish/wezterm terminal

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/256
2026-02-23 14:15:58 -08:00
84d2cdcf14 Update tooling dependencies (Feb 2026 cycle)
Pre-commit: trufflehog v3.93.4, ruff v0.15.2, shellcheck v0.11.0.1,
prettier v3.8.1, actionlint v1.7.11

Fly.io: pin nginx 1.28.2-alpine, bump alloy v1.5.1 -> v1.13.1

Forgejo workflows: pin actions/checkout to SHA (v4.3.1)

Mise tasks: normalize httpx>=0.28.0, typer>=0.15.0 across all scripts

Add how-to doc for the monthly tooling dependency update cycle.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 13:22:09 -08:00
cb9a06bb75 Update tooling dependencies (Feb 2026 cycle) (#254)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m30s
## Summary

Monthly tooling dependency update cycle:

- **Pre-commit hooks**: trufflehog v3.92.5→v3.93.4, ruff v0.14.13→v0.15.2, shellcheck v0.10.0.1→v0.11.0.1, prettier v3.8.0→v3.8.1, actionlint v1.7.10→v1.7.11
- **Fly.io Dockerfile**: pin nginx to 1.28.2-alpine (was unpinned), bump alloy v1.5.1→v1.13.1
- **Mise tasks**: normalize httpx lower bound to >=0.28.0 and typer to >=0.15.0 across all scripts
- **Forgejo workflows**: actions/checkout@v4 is current, no changes needed
- **New how-to doc**: [[update-tooling-dependencies]] documenting this monthly cycle

## No changes needed

- pre-commit-hooks v6.0.0, yamllint v1.38.0, shfmt v3.12.0-2, taplo v0.9.3, ansible-lint 26.1.1 — all already at latest

## Test plan

- [x] `uvx pre-commit run --all-files` — all 24 hooks pass
- [ ] Fly.io deploy (triggered automatically on merge to main via deploy-fly workflow)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/254
2026-02-23 13:08:41 -08:00
75fde54355 Fix Grafana TeslaMate dashboard folder provisioning (#253)
## Summary
- `foldersFromFilesStructure` was `false` in Grafana's sidecar provider config, causing Grafana to ignore the subdirectory structure the sidecar creates from `grafana_folder` annotations
- All 18 TeslaMate dashboards were appearing in the root "Dashboards" folder despite having `grafana_folder: "TeslaMate"` annotations on their ConfigMaps
- Flipping to `true` makes Grafana replicate the sidecar's directory structure as UI folders

## Deployment and Testing
- [ ] Sync `grafana` app: `argocd app sync grafana`
- [ ] Verify TeslaMate dashboards appear under a "TeslaMate" folder in Grafana's dashboard list
- [ ] Verify other dashboards remain in the root "Dashboards" folder

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/253
2026-02-22 18:38:51 -08:00
6871bf32a8 Remove unused gpu panel in frigate dashboard 2026-02-22 18:26:47 -08:00
2c6c6a244a Fix Frigate Prometheus metrics & rebuild Grafana dashboard (#252)
## Summary

- **Prometheus scrape target:** Changed from `frigate.frigate.svc.cluster.local:5000` (broken after ringtail migration) to `nvr.ops.eblu.me` via HTTPS through Caddy on indri
- **Grafana dashboard:** Rebuilt for Frigate 0.17 metrics — 12 panels total:
  - Row 1 (stats): Uptime, Inference Speed, Camera FPS, Detection FPS, GPU Usage, GPU Temp
  - Row 2 (timeseries): CPU Usage, Memory Usage
  - Row 3 (timeseries): Camera FPS + Skipped FPS, GPU Usage + Memory over time
  - Row 4 (timeseries): Storage Usage, Detection Events (rate by camera/label)

## Deployment and Testing

1. Sync prometheus app on branch:
   ```
   argocd app set prometheus --revision fix/frigate-metrics-dashboard && argocd app sync prometheus
   ```
2. Check `prometheus.ops.eblu.me/targets` — frigate job should show UP
3. Sync grafana-config:
   ```
   argocd app sync grafana-config
   ```
4. Check `grafana.ops.eblu.me` — Frigate NVR dashboard should show live data
5. After merge: reset both apps to `--revision main` and sync

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/252
2026-02-22 18:14:17 -08:00
Forgejo Actions
dda7d719b3 Update docs release to v1.11.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-22 17:52:05 -08:00
e655f4556e Upgrade k8s forgejo-runner from v6.3.1 to v12.7.0 (#251)
## Summary

Completes the `upgrade-k8s-runner` mikado chain. Both prerequisites (workflow validation in Dagger, config review against v12 defaults) were resolved in #250.

- Bump runner image `code.forgejo.org/forgejo/runner:6.3.1` → `12.7.0`
- Update `service-versions.yaml` to track new version
- Mark goal card complete (remove `status: active`)

## Deployment and Testing

After merge:
1. `argocd app sync forgejo-runner`
2. Verify runner registers in Forgejo admin → runners
3. Trigger a test workflow (e.g. `branch-cleanup.yaml` manual dispatch)

Rollback: revert image tag to `6.3.1`, push, sync.
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/251
2026-02-22 17:43:39 -08:00
0f6a1898f0 Prepare forgejo-runner v12 upgrade (leaf nodes) (#250)
## Summary
- Review runner config against v12.7.0 defaults — added `shutdown_timeout: 3h`, no breaking changes found
- Add `validate_workflows` Dagger function using `forgejo-runner validate --directory .` inside upstream container
- All 6 workflows pass v12.7.0 schema validation
- Wire `mise run validate-workflows` task and pre-commit hook on `.forgejo/workflows/` changes
- Mark both leaf Mikado cards (`review-runner-config-v12`, `validate-workflows-against-v12`) complete

## Mikado State
After merge, `upgrade-k8s-runner` goal card has no unmet dependencies — ready to execute the actual image bump in a follow-up PR.

## Test Plan
- [x] `dagger call validate-workflows --src=.` passes (all 6 workflows OK)
- [x] Pre-commit hooks pass
- [ ] Reviewer: confirm `shutdown_timeout: 3h` addition to ConfigMap looks reasonable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/250
2026-02-22 17:38:32 -08:00
00b0287bcc Upgrade k8s forgejo-runner from v6.3.1 to v12.x (#249)
## Summary
- C2 Mikado chain for upgrading the k8s forgejo-runner daemon (6 major versions behind)
- Root goal card with two leaf prerequisites: workflow validation and config review
- Ringtail runner is already at ~v12.6.4 via nixpkgs, no work needed there

## Mikado Chain

```
upgrade-k8s-runner (goal)
├── validate-workflows-against-v12 (leaf)
└── review-runner-config-v12 (leaf)
```

Both leaves are actionable now. The biggest risk is workflow schema validation
(introduced in v8/v9) rejecting our existing workflows.

## Next Steps
Work the leaf nodes in a follow-up session, then attempt the goal.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/249
2026-02-22 17:12:45 -08:00
198c810e86 Fix branch-cleanup: fall back to head.label for deleted branches (#248)
## Summary
- Forgejo rewrites `head.ref` to `refs/pull/N/head` once a PR's source branch is deleted from the remote
- The original branch name is preserved in `head.label`
- This was causing 188 out of 246 merged PRs to go undetected by the cleanup script
- Fix: fall back to `head.label` when `head.ref` starts with `refs/pull/`

## Test plan
- [x] Dry run correctly identifies 18 previously-missed local branches
- [x] Live run successfully deleted all 18

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/248
2026-02-22 16:15:51 -08:00
abb7e6fe88 Add branch-cleanup mise task (#247)
## Summary
- New `mise run branch-cleanup` task that finds branches merged into main and deletes them locally and on the Forgejo remote
- Configurable `--cutoff` (default 30 days) skips branches with recent HEAD commits
- Supports `--dry-run`, `--local-only`, `--remote-only` flags
- Interactive confirmation before any deletion

## Test plan
- [x] `mise run branch-cleanup -- --dry-run` shows correct table of candidates
- [ ] Run without `--dry-run` to confirm actual deletion works

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/247
2026-02-22 16:00:09 -08:00
d51c180fe6 Switch Frigate detection model from YOLO-NAS-S to YOLOv9-c (#246)
## Summary
- Replace abandoned YOLO-NAS-S (320x320, `yolonas`) with YOLOv9-c (640x640, `yolo-generic`)
- YOLOv9-c benefits from CUDA Graphs in Frigate 0.17 on the RTX 4080
- Add `export_yolov9` Dagger pipeline and `frigate-export-model` mise task for reproducible model exports
- Model already deployed to `sifaka:/volume1/frigate/models/yolov9-c-640.onnx`

## Config changes
- `model_type: yolonas` → `yolo-generic`
- `input_dtype: int` → `float`
- `width/height: 320` → `640`
- `path:` → `yolov9-c-640.onnx`

## Deployment and Testing
- [ ] Merge and sync Frigate ArgoCD app: `argocd app sync frigate`
- [ ] Verify Frigate starts and detects objects at https://nvr.ops.eblu.me
- [ ] Confirm GPU inference via Frigate system metrics

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/246
2026-02-22 15:14:45 -08:00
2c081eed28 Add Forgejo repository health metrics and Grafana dashboard (#245)
## Summary
- New `forgejo_metrics` Ansible role that queries the Forgejo REST API every 60s and writes Prometheus textfile metrics (open PRs, issues, languages, releases, commits, Actions runs/duration/success)
- Grafana dashboard "Forgejo Repository Health" with 12 panels across 4 rows: overview stats, CI/CD health, repository info, and staleness tracking
- Deletes superseded `forgejo-actions-dashboard` plan doc (this implementation covers a broader scope)

## Deployment and Testing
- [ ] `mise run provision-indri -- --tags forgejo_metrics` to deploy the collector
- [ ] `ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/forgejo.prom'` to verify metrics
- [ ] `argocd app sync grafana-config` to deploy the dashboard
- [ ] Check Grafana dashboard "Forgejo Repository Health" loads with data
- [ ] `mise run services-check` passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/245
2026-02-22 11:16:03 -08:00
Forgejo Actions
c21cf54847 Update docs release to v1.11.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-22 10:21:19 -08:00
e41c28ed90 Replace indri-runner-logs with general-purpose runner-logs Typer CLI (#244)
## Summary
- Replace bash `indri-runner-logs` with a Python Typer CLI `runner-logs` that supports filtering by runner host (`indri`, `ringtail`, or `all`) with rich table output
- Add missing `#USAGE` declarations to `docs-review`, `docs-review-stale`, and `service-review` so flags work without the `--` separator
- Update docs references in `review-documentation.md` and `review-services.md` to use the new flag syntax

## Test plan
- [x] `mise run runner-logs all` lists runs from both runners
- [x] `mise run runner-logs ringtail` filters to ringtail-only runs
- [x] `mise run docs-review-stale --threshold 90` works without `--`
- [x] `mise run docs-review --limit 5` works without `--`
- [x] `mise run service-review --limit 3` works without `--`
- [x] Pre-commit hooks pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/244
2026-02-22 10:20:11 -08:00
c897fc8e1f Use Zot registry icon on homepage dashboard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 09:19:46 -08:00
Forgejo Actions
627caeb61f Update docs release to v1.11.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-22 09:16:00 -08:00
c427f04ec4 Review 3 docs: agent-change-process, build-authentik-container, create-authentik-secrets (#243)
## Summary
- Stamped `last-reviewed: 2026-02-22` on three never-reviewed docs
- `agent-change-process.md`: accurate, no content changes
- `build-authentik-container.md`: accurate, container image verified in registry
- `create-authentik-secrets.md`: added note about additional OIDC client secret fields added since original card was written

## Changelog
- `docs/changelog.d/doc-review/agent-change-process.doc.md` (not added — stamp-only, no user-visible change)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/243
2026-02-22 09:12:31 -08:00
529ba10939 Fix frigate-notify: webapi polling, dedup, hi-res snapshots (#242)
## Summary
- Switch from MQTT to webapi polling (v0.5.4 requires only one method)
- Poll every 15s for responsive alerts
- **`notify_once: true`** — one notification per event instead of repeats as object changes zones
- **`nosnap: drop`** — skip events without snapshots (was causing all events to be dropped on v0.3.5)
- **`snap_hires: true`** — use recording stream for higher quality snapshot images

## Deployment and Testing
- [ ] Sync: `argocd app set frigate --revision fix/frigate-notify-config && argocd app sync frigate`
- [ ] Verify pod starts: `kubectl --context=k3s-ringtail -n frigate get pods -l app=frigate-notify`
- [ ] Check logs for successful startup and event processing (no "No snapshot" drops)
- [ ] Wait for a motion event and confirm single ntfy notification with hi-res snapshot
- [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/242
2026-02-22 09:05:45 -08:00
7dcab826fa Upgrade frigate-notify from v0.3.5 to v0.5.4 (#241)
## Summary
- Service review: upgrade frigate-notify from v0.3.5 to v0.5.4
- No breaking changes for current MQTT + ntfy config
- Notable additions: high-res snapshots, MQTT topic parsing fixes, env var parsing fixes

## Deployment and Testing
- [ ] Sync frigate app on ringtail: `argocd app set frigate --revision review/frigate-notify-v0.5.4 && argocd app sync frigate`
- [ ] Verify pod starts cleanly: `kubectl --context=k3s-ringtail -n frigate get pods`
- [ ] Trigger a test alert (motion event) and confirm ntfy notification arrives
- [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/241
2026-02-22 08:42:47 -08:00
a5429d5a34 Update ringtail flake inputs, add flake-update pipeline (#240)
## Summary
- Update all ringtail NixOS flake inputs (nixpkgs, disko, home-manager) to latest
- Add `flake_update` Dagger function (`nix flake update`) alongside existing `flake_lock` (`nix flake lock`)
- Add how-to guide for managing the ringtail lockfile
- Update dagger and ringtail reference cards

## Deployment and Testing
- [x] `mise run provision-ringtail` — deployed successfully, `changed=2` (repo + rebuild)
- [x] `mise run services-check` — all services healthy
- [x] Doc link and index checks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/240
2026-02-22 08:17:52 -08:00
b4015153c6 No navidrome authentikation 2026-02-21 20:33:48 -08:00
a82c705bf6 Add SSO login button to Jellyfin login page
Deploy branding.xml with a "Sign in with Authentik" button in the
login disclaimer. Local password login remains available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 20:08:57 -08:00
07fb48626d Add Authentik SSO integration for Jellyfin (#239)
## Summary
- Add Authentik OIDC provider + application for Jellyfin via blueprint (all authenticated users allowed, no policy binding)
- Wire `jellyfin-client-secret` through ExternalSecret and Authentik worker deployment
- Install [jellyfin-plugin-sso](https://github.com/9p4/jellyfin-plugin-sso) v4.0.0.3 via Ansible, with OIDC config template
- Authentik `admins` group maps to Jellyfin administrator role
- Local login left enabled; SSO is additive

## Deployment and Testing
- [ ] Sync ArgoCD `authentik` app on branch — verify provider + application appear in Authentik admin
- [ ] `mise run provision-indri -- --tags jellyfin --check --diff` (dry run)
- [ ] `mise run provision-indri -- --tags jellyfin` (deploy plugin + config)
- [ ] Test SSO flow: `https://jellyfin.ops.eblu.me/sso/OID/start/authentik`
- [ ] Verify `eblume` account auto-links via `preferred_username` match
- [ ] Verify admins group → Jellyfin admin
- [ ] Reset ArgoCD app revision to main after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/239
2026-02-21 20:05:44 -08:00
e1c2892878 Fix container tags deleted during old-tag cleanup
Five container manifests were removed when deleting old-style tags
(shared digests). Rebuild on a72a0d8 and update references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 16:26:29 -08:00
a72a0d8e8e Update all container images to new upstream-version tagging scheme (#238)
## Summary
- Updates all 15 container image references across 14 ArgoCD manifest files
- Migrates from old internal `vX.Y.Z` tags to new `v<upstream-version>-<sha>` format
- Covers: authentik, cv, devpi, forgejo-runner, homepage, kiwix-serve, kubectl, miniflux, navidrome, ntfy, quartz, teslamate, transmission

## Deployment and Testing
- [ ] Sync all ArgoCD apps on branch revision
- [ ] Verify all services come up healthy
- [ ] Merge and re-sync on main
- [ ] Clean up old-style tags from zot registry

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/238
2026-02-21 15:58:11 -08:00
55d31c9c0b Docs pass: update zot Mikado chain for completion
- harden-zot-registry: fix Authentik hostname, check off all
  verified items, add metrics config to "what was done"
- enforce-tag-immutability: fix admins permissions (was missing
  update)
- agent-change-process: clarify that requires: is permanent and
  status: active is the only completion marker
- zot reference: update modified date
- wire-ci-registry-auth fragment: add metrics fix
- Remove stale harden-zot-mikado-cards.ai.md planning fragment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 15:32:34 -08:00
8775c3841a Allow anonymous access to zot /metrics endpoint
Add accessControl.metrics.users with empty string to allow
unauthenticated Prometheus/Alloy scraping. Zot represents
anonymous users with an empty username internally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 12:37:59 -08:00
ff63679efb Enable zot registry auth + wire CI credentials (#237)
## Summary

- Enable OIDC + API key authentication on zot registry with three-tier accessControl
  - `anonymousPolicy: ["read"]` — anyone can pull
  - `artifact-workloads` group: `["read", "create"]` — CI push, no overwrite/delete
  - `admins` group: `["read", "create", "update", "delete"]` — break-glass
- Wire both CI push paths (Dagger and Nix/skopeo) with `ZOT_CI_API_KEY` credentials
- Add `artifact-workloads` PolicyBinding in Authentik blueprint for zot app access
- Add `ZOT_CI_API_KEY` to Forgejo Actions secrets via existing ansible role

Completes the `wire-ci-registry-auth` and `harden-zot-registry` Mikado cards.

## Manual Deployment Steps (after merge)

1. Deploy Authentik blueprint: `argocd app sync authentik`
2. In Authentik admin UI: set a password for the `zot-ci` service account
3. Deploy zot config: `mise run provision-indri -- --tags zot`
4. Log in to `https://registry.ops.eblu.me` as `zot-ci` via OIDC → generate API key
5. Store API key in 1Password as `zot-ci-apikey` in blumeops vault
6. Sync Forgejo secrets: `mise run provision-indri -- --tags forgejo_actions_secrets`
7. Trigger a test container build to verify CI push
8. Verify anonymous pull: `curl -sf https://registry.ops.eblu.me/v2/_catalog`

## Uncertainties

- **Zot `accessControl` group matching with OIDC:** Groups from Authentik's `profile` scope claim should map to zot policy groups, but the exact claim-to-group matching needs runtime verification
- **`http.auth.apikey: true`:** This config key is documented but needs verification against the specific zot version built from source on indri
- **API key permissions:** Need to confirm zot API keys inherit the generating user's group for accessControl evaluation

## Test Plan

- [ ] `mise run provision-indri -- --check --diff --tags zot` shows expected config changes
- [ ] Anonymous pull works after deploy
- [ ] Unauthenticated push fails (401)
- [ ] OIDC browser login redirects to Authentik and back
- [ ] API key push works after key generation
- [ ] CI push succeeds with both Dagger and skopeo paths
- [ ] `mise run services-check` passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/237
2026-02-21 12:20:29 -08:00
30a7c4de9b Close register-zot-oidc-client Mikado card
Completed in PR #236. Updated card to reflect what was actually
implemented, including deviations (worker env var wiring, manual
service account setup).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-21 08:49:32 -08:00
21b6533aea Register Zot as OIDC client in Authentik (#236)
## Summary
- Add Authentik blueprint (`zot.yaml`) with OAuth2 provider, application, `artifact-workloads` group, and `zot-ci` service account
- Wire `zot-client-secret` through ExternalSecret → worker Deployment env var → blueprint `!Env`
- Add Ansible pre_task to fetch OIDC secret from 1Password (item ID `oor7os5kapczgpbwv7obkca4y4`)
- Add `oidc-credentials.json.j2` template and deploy task in zot role (with `when` guard)

## Manual Steps Required Before Deploy
1. Generate client secret: `openssl rand -hex 32`
2. Store in 1Password: add field `zot-client-secret` to "Authentik (blumeops)" item in vault `blumeops`

## What This Does NOT Do
- Does NOT modify `config.json.j2` (that's the root goal `harden-zot-registry`)
- Does NOT wire CI auth (that's `wire-ci-registry-auth`)
- Does NOT set service account password or API keys (manual post-deploy)

## Verification
After ArgoCD sync:
- [ ] Authentik admin UI shows "Zot Registry" application
- [ ] OIDC discovery at `https://authentik.ops.eblu.me/application/o/zot/.well-known/openid-configuration` returns valid JSON
- [ ] Blueprint status is `successful`
- [ ] `artifact-workloads` group exists with `zot-ci` service account

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/236
2026-02-21 08:45:06 -08:00
04e036c603 Fold enforce-tag-immutability into harden-zot-registry (#235)
## Summary

- Removed `status: active` from `enforce-tag-immutability` card — its requirements are folded into the parent `harden-zot-registry` goal's `accessControl` configuration
- Updated `harden-zot-registry` with three-tier access control spec (anonymous read, artifact-workloads read+create, admins full)
- Added `artifact-workloads` group creation step to `register-zot-oidc-client`
- Added service account context to `wire-ci-registry-auth`

## Rationale

Tag immutability requires authentication to be meaningful. Without auth, everyone is anonymous and gets the same policy. Rather than client-side push checks, the registry enforces immutability server-side: CI gets `["read", "create"]` (no update/delete), so pushing an existing tag is rejected by zot itself.

## Test plan

- [ ] `mise run docs-check-links` passes
- [ ] `mise run docs-mikado` shows enforce-tag-immutability as resolved

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/235
2026-02-21 08:05:16 -08:00
64691da4fb Complete adopt-commit-based-container-tags Mikado card
All 14 containers now have commit-SHA-based tags in the registry.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 23:28:45 -08:00
96a2d420fb Update install-dagger-on-nix-runner card with actual resolution
Dagger can't run on the bare nix runner (needs container runtime).
Used nix eval directly instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 23:23:06 -08:00
b8bc0bffd8 Update ringtail flake.lock (remove dagger input)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 23:21:36 -08:00
02f2be5523 Use nix eval instead of dagger for nix runner version extraction
Dagger CLI needs a container runtime (Docker/containerd) to start its
engine, which the bare nix runner doesn't have. Use nix eval directly
instead — it's already available and more appropriate for a nix host.

Reverts the dagger flake input since it's not usable here.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 23:21:16 -08:00
de037ae51f Update ringtail flake.lock with dagger/nix input
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 23:15:19 -08:00
a2d2808270 Add dagger CLI to nix runner via github:dagger/nix flake
Dagger was removed from nixpkgs due to trademark concerns. Use the
official dagger/nix flake as a flake input instead, passing the package
through via specialArgs.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 23:14:46 -08:00
e0d5f28147 Add dagger to nix-container-builder runner (#234)
## Summary
- Add `dagger` to `hostPackages` for the ringtail nix-container-builder runner
- Needed for `dagger call nix-version` fallback in the nix build workflow (authentik)
- `hostPackages` is scoped to the runner's systemd unit PATH, not system-wide
- Marks `install-dagger-on-nix-runner` Mikado card complete

## Deployment and Testing
- [ ] Merge, then `mise run provision-ringtail`
- [ ] `mise run container-build-and-release authentik` to verify nix build succeeds

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/234
2026-02-20 23:09:01 -08:00
a68a170a10 Add install-dagger-on-nix-runner Mikado card (#233)
## Summary
- New Mikado card: the ringtail nix-container-builder runner lacks dagger, which the nix workflow needs for `dagger call nix-version` (authentik version extraction fallback)
- Re-opens `adopt-commit-based-container-tags` with this new prerequisite
- All other containers (11 Dockerfile-only, nettest + ntfy with nix) build fine — only authentik's nix build is blocked

## Deployment and Testing
- Docs only, no deployment needed

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/233
2026-02-20 23:03:12 -08:00
ffa8727660 Adopt commit-based container tags (#232)
## Summary
- Replace git-tag-triggered container builds with path-based triggers on main and workflow_dispatch
- Image tags now encode upstream app version + commit SHA (`vX.Y.Z-<sha>`) for full traceability
- Replace `container-tag-and-release` task with `container-build-and-release` (dispatches workflows via Forgejo API)
- Update dagger `publish()` to accept `commit_sha` parameter
- Update all docs and references to the new workflow

## Deployment and Testing
- [ ] Merge to main
- [ ] `mise run container-build-and-release <name>` for each container to populate new-format tags
- [ ] Verify tags in registry via `mise run container-list`
- [ ] Existing images untouched — old tags remain available

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/232
2026-02-20 22:56:20 -08:00
0e2c10176d Harden zot registry, pt 1 (#231)
## Summary
- Enable OIDC + API key authentication on zot with anonymous pull preserved
- Enforce tag immutability for version tags
- Adopt commit-SHA-based container image tagging

Details in the [[harden-zot-registry]] Mikado chain (`mise run docs-mikado harden-zot-registry`).

## Test plan
- [ ] Anonymous pull still works
- [ ] Unauthenticated push fails (401)
- [ ] CI container builds pass with new auth and tagging
- [ ] `mise run services-check` passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/231
2026-02-20 22:50:01 -08:00
6d7071e5ec Add commit-based container tagging prereq to harden-zot-registry chain (#230)
## Summary

- New Mikado card: `adopt-commit-based-container-tags` — replaces git-tag-triggered container builds with path-based main-branch triggers and manual workflow dispatch
- Image tags become `vX.Y.Z-<sha>` (with `-main` suffix for main branch builds, `-nix` for Nix builds), tying versions to the actual bundled app version and exact source commit
- `container-tag-and-release` mise task to be renamed to `container-build-and-release`, triggering workflow dispatch with the current HEAD SHA
- Added as soft prereq to `harden-zot-registry` Mikado chain

## Test plan

- [x] Pre-commit hooks pass (docs-check-index, docs-check-links, etc.)
- [ ] Review card content for completeness

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/230
2026-02-20 18:26:27 -08:00
379bcb98af Create C2 Mikado cards for harden-zot-registry (#229)
## Summary
- Replace the old pre-Mikado plan doc (`docs/how-to/plans/harden-zot-registry.md`) with a proper C2 Mikado chain in `docs/how-to/zot/`
- Root goal: `harden-zot-registry` — enable OIDC + API key auth on zot with anonymous pull preserved
- Three leaf prereqs: `register-zot-oidc-client`, `wire-ci-registry-auth`, `enforce-tag-immutability`
- Add Zot section to `how-to.md` index, remove plan entry from plans index
- All doc checks pass (`docs-check-links`, `docs-check-index`, `docs-mikado`)

## Changes
- **New:** `docs/how-to/zot/harden-zot-registry.md` — C2 Mikado root goal
- **New:** `docs/how-to/zot/register-zot-oidc-client.md` — Register OIDC client in Authentik
- **New:** `docs/how-to/zot/wire-ci-registry-auth.md` — Wire CI push paths with registry auth
- **New:** `docs/how-to/zot/enforce-tag-immutability.md` — Prevent version tag overwrites
- **Deleted:** `docs/how-to/plans/harden-zot-registry.md` — Old plan doc (content absorbed into Mikado cards)
- **Updated:** `docs/how-to/how-to.md` — Add Zot section, remove plan entry
- **Updated:** `docs/how-to/plans/plans.md` — Remove plan entry

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/229
2026-02-20 17:56:25 -08:00
cd50c1454a Integrate Forgejo with Authentik OIDC (#228)
## Summary

- Refactor Authentik blueprints: extract shared `admins` group into `common.yaml`, add `groups` scope mapping to all providers for group-based admin propagation
- Add Forgejo OAuth2 provider and application blueprint (`forgejo.yaml`)
- Add `forgejo-client-secret` to ExternalSecret and worker deployment env
- Configure Forgejo `[oauth2_client]` with `ACCOUNT_LINKING=login` to safely link existing accounts
- Update documentation (forgejo.md, authentik.md, federated-login.md)

## Deployment and Testing

After merge, deployment requires these steps in order:

1. **Authentik (ArgoCD):**
   - `argocd app set authentik --revision feature/forgejo-authentik-oidc && argocd app sync authentik`
   - Verify: Forgejo app/provider visible in Authentik admin UI
   - Verify: Grafana SSO still works (blueprint refactor)

2. **Forgejo app.ini (Ansible):**
   - `mise run provision-indri -- --tags forgejo --check --diff` (dry run)
   - `mise run provision-indri -- --tags forgejo` (apply, restarts Forgejo)

3. **Create Forgejo auth source (CLI on indri):**
   ```
   ssh indri 'sudo -u forgejo /opt/homebrew/bin/forgejo admin auth add-oauth \
     --name authentik \
     --provider openidConnect \
     --key forgejo \
     --secret "$(op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/Authentik (blumeops)/forgejo-client-secret")" \
     --auto-discover-url https://authentik.ops.eblu.me/application/o/forgejo/.well-known/openid-configuration \
     --scopes "openid email profile groups" \
     --group-claim-name groups \
     --admin-group admins'
   ```

4. **Link eblume account:** Sign in with Authentik on Forgejo, confirm link with local password

5. **Verify:** `tea repo list`, Forgejo Actions, local password break-glass

After merge: `argocd app set authentik --revision main && argocd app sync authentik`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/228
2026-02-20 17:39:50 -08:00
e0c6b7df99 Add Authentik to homepage dashboard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 13:03:48 -08:00
71cb256527 Deploy Authentik identity provider (C2 Mikado) (#227)
## Summary
C2 Mikado chain for deploying Authentik as the SSO identity provider, replacing Dex.

This PR will evolve over multiple sessions. Each iteration adds documentation (prerequisite cards) and eventually code as leaf nodes are resolved.

## Current Mikado State
- **Goal:** `deploy-authentik` (active)
- **Leaf prerequisites:**
  - `build-authentik-container` — Build Nix container image
  - `provision-authentik-database` — Create PostgreSQL database on CNPG cluster
  - `create-authentik-secrets` — Create 1Password item with credentials

## Process refinements
- Updated agent-change-process with lessons from first attempt: reset code before committing cards, open PRs early

## Test plan
- [ ] `mise run docs-mikado` shows correct dependency chain
- [ ] Leaf nodes can be worked independently
- [ ] Container builds on ringtail
- [ ] Authentik starts and reaches healthy state
- [ ] Forgejo OAuth2 connector works

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/227
2026-02-20 12:55:59 -08:00
174c6414ac Convert deploy-authentik plan to C2 Mikado chain (#226)
## Summary
- Strip detailed phase instructions from deploy-authentik plan (400→50 lines)
- Retain architecture decisions (ringtail, CNPG on indri, Nix containers, kustomize, Tailscale+Caddy) and open questions
- Add `status: active` frontmatter — now visible as a root goal in `mise run docs-mikado`
- Update plans index to reflect Active (C2) status

This is the first real use of the C2 Mikado chain system from #225. Future sessions will discover prerequisites, create sub-cards with `requires`, and work leaf nodes first.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/226
2026-02-20 08:22:19 -08:00
c1f7b2a9a3 Add agent change process (C0/C1/C2) and docs-mikado tool (#225)
## Summary
- Introduce C0/C1/C2 change classification based on the Mikado method, where documentation cards serve as persistent context for agents across sessions
- Add `docs-mikado` mise task to visualize active Mikado dependency chains from `status: active` and `requires` frontmatter fields
- Rename `zk-docs` task to `ai-docs`

## Changes
- **New:** `docs/how-to/agent-change-process.md` — methodology card
- **New:** `mise-tasks/docs-mikado` — Python uv script for dependency graph visualization
- **Renamed:** `mise-tasks/zk-docs` → `mise-tasks/ai-docs`
- **Updated:** `CLAUDE.md` — added Change Classification section, updated references
- **Updated:** `ai-assistance-guide.md`, `exploring-the-docs.md`, `how-to.md` — updated references and index

## Verification
- [x] `mise run ai-docs` works
- [x] `mise run docs-mikado` runs (no active chains yet, as expected)
- [x] `docs-check-links` — all valid
- [x] `docs-check-index` — all indexed
- [x] `docs-check-frontmatter` — all valid
- [x] All pre-commit hooks pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/225
2026-02-20 08:15:20 -08:00
c3748a0638 Add Authentik deployment plan (#224)
## Summary
- New plan document at `docs/how-to/plans/deploy-authentik.md` covering full Authentik deployment
- 6 phases: Helm analysis, prerequisites (CNPG/Redis/1Password), Nix containers, kustomize manifests, networking, monitoring
- Authentik replaces Dex as the identity provider for central user management and multi-protocol SSO
- Updated plans index and how-to index

## Deployment and Testing
- [x] Pre-commit hooks pass (docs-check-links, docs-check-index, docs-check-frontmatter)
- [ ] Review plan content for accuracy and completeness
- No deployment needed — documentation only

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/224
2026-02-20 07:06:56 -08:00
Forgejo Actions
18f1ac61fc Update docs release to v1.10.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-19 20:45:43 -08:00
d21798b1f3 Document Dex OIDC and add services-check integration (#223)
## Summary
- Create Dex reference card (`docs/reference/services/dex.md`) with quick reference, architecture, identity source, storage, OIDC clients, secrets, and endpoints
- Write federated login explanation article (`docs/explanation/federated-login.md`) covering the Dex + Forgejo two-layer auth model, login flow, and break-glass access
- Add Dex to `services-check` (HTTP health endpoint + k3s pod check)
- Update Grafana docs with new Authentication section documenting SSO via Dex
- Update Forgejo docs with OAuth2 Provider section documenting its role as upstream identity source
- Add Dex to ringtail workloads table and reference service index
- Move `adopt-oidc-provider` plan to `completed/` with final design reflecting actual implementation

## Test plan
- [ ] `mise run services-check` passes (includes new Dex checks)
- [ ] `docs-check-links` passes (all wiki-links resolve)
- [ ] `docs-check-index` passes (new docs are indexed)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/223
2026-02-19 20:44:23 -08:00
0cdc143227 Deploy Dex OIDC identity provider with Grafana SSO (#222)
## Summary
- Deploys Dex OIDC identity provider on ringtail k3s cluster as central authentication service
- Integrates Grafana as first SSO client via `auth.generic_oauth`
- Uses Kubernetes CRD storage backend (no PVC needed)
- All secrets (bcrypt hash, client secrets) injected via ExternalSecrets from 1Password item "Dex (blumeops)"
- NixOS-built container image via `containers/dex/default.nix`

## Pre-requisites (manual, before deployment)
1. Create 1Password item "Dex (blumeops)" in `blumeops` vault with fields:
   - `password`: strong generated password for Dex login
   - `static-password-hash`: bcrypt hash of above (`htpasswd -BnC 10 eblume`, copy hash after `eblume:`)
   - `grafana-client-secret`: random 32-char hex (`openssl rand -hex 16`)
2. Build container: `mise run container-tag-and-release dex v1.0.0`

## Deployment sequence
1. Build container: `mise run container-tag-and-release dex v1.0.0`
2. Deploy Caddy: `mise run provision-indri -- --tags caddy`
3. Sync ArgoCD: `argocd app sync apps` → `argocd app sync dex`
4. Verify Dex: `curl https://dex.ops.eblu.me/.well-known/openid-configuration`
5. Sync Grafana: `argocd app sync grafana-config` → `argocd app sync grafana`
6. Test SSO: Visit `https://grafana.ops.eblu.me/login`, click "Sign in with Dex"

## Verification
- [ ] Container image exists: `mise run container-list` shows `dex:v1.0.0-nix`
- [ ] `curl https://dex.ops.eblu.me/.well-known/openid-configuration` returns valid OIDC discovery
- [ ] `curl https://dex.ops.eblu.me/healthz` returns healthy
- [ ] Grafana login shows "Sign in with Dex" button alongside local login
- [ ] OIDC flow: click Dex → enter credentials → redirect back → logged in as Admin
- [ ] Break-glass: local admin login still works
- [ ] `mise run services-check` passes

## Files changed
| File | Action | Purpose |
|------|--------|---------|
| `containers/dex/default.nix` | Create | NixOS container build |
| `argocd/apps/dex.yaml` | Create | ArgoCD app targeting ringtail |
| `argocd/manifests/dex/*` (8 files) | Create | K8s manifests (RBAC, ExternalSecret, Deployment, Service, Ingress) |
| `argocd/manifests/grafana-config/external-secret-dex-oauth.yaml` | Create | Grafana OIDC client secret |
| `argocd/manifests/grafana-config/kustomization.yaml` | Modify | Add new ExternalSecret resource |
| `argocd/manifests/grafana/values.yaml` | Modify | Add `auth.generic_oauth` config + envFromSecrets |
| `ansible/roles/caddy/defaults/main.yml` | Modify | Add `dex.ops.eblu.me` reverse proxy entry |
| `docs/changelog.d/feature-dex-oidc.feature.md` | Create | Changelog fragment |

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/222
2026-02-19 20:24:24 -08:00
b876e39981 Replace Homepage Helm chart with kustomize manifests and custom Dockerfile (#221)
## Summary
- Replace third-party Helm chart (jameswynn/homepage v2.1.0, pinned at app v1.2.0) with plain kustomize manifests and a custom Dockerfile building from forge mirror at v1.10.1
- Adds Dockerfile (`containers/homepage/`) with multi-stage build (node:22-slim builder, node:22-alpine runtime)
- Creates kustomize manifests: Deployment, Service, ConfigMap (6 config files), ServiceAccount, ClusterRole, ClusterRoleBinding
- Keeps existing ingress-tailscale.yaml and all 6 ExternalSecret resources unchanged
- Updates ArgoCD app definition from multi-source Helm to single directory source

## Prerequisite
- Homepage source mirrored at forge.ops.eblu.me/eblume/homepage.git 
- Container must be built and pushed before syncing: `mise run container-release homepage v1.10.1`

## Deployment and Testing
- [ ] Build and push container image: `mise run container-release homepage v1.10.1`
- [ ] Branch-test via ArgoCD: `argocd app set homepage --revision feature/homepage-kustomize && argocd app sync homepage`
- [ ] Verify dashboard loads at go.ops.eblu.me / go.tail8d86e.ts.net
- [ ] Verify k8s autodiscovery works (services appear on dashboard)
- [ ] Verify widgets load (weather, Forgejo, Jellyfin, etc.)
- [ ] After merge: `argocd app set homepage --revision main && argocd app sync homepage`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/221
2026-02-19 18:29:19 -08:00
869f6bd20d Review: update-documentation doc (#220)
## Summary
- Add missing workflow step 8: Fly.io proxy cache purge after deploy
- Remove misleading "Directory" column from changelog fragment types table (all fragments use flat `<name>.<type>.md` pattern, not subdirectories)
- Stamp `last-reviewed: 2026-02-19`

## Review notes
Verified all claims against actual workflow YAML, Dagger pipeline, ArgoCD manifests, towncrier config, and Quartz config files. Everything else checks out.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/220
2026-02-19 17:40:05 -08:00
cabd0bc9cf Update Frigate zone masks and expand alert notifications (#219)
## Summary
- Synced driveway_entrance zone coordinates from live Frigate config (adjusted mask boundaries)
- Added `inertia: 3` and `loitering_time: 0` to driveway_entrance zone
- Expanded review alerts to require either `driveway_entrance` or `driveway` zone (was entrance only)
- Updated frigate-notify config to allow alerts from both `driveway_entrance` and `driveway` zones

## Deployment and Testing
- [ ] Merge and sync frigate ArgoCD app on ringtail
- [ ] Sync frigate-notify (restart pod to pick up ConfigMap change)
- [ ] Verify alerts fire for person/car in driveway zone

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/219
2026-02-19 17:32:02 -08:00
291fff345c Fix services-check and update docs for Frigate migration to ringtail (#218)
## Summary
- Move mosquitto, ntfy, frigate, frigate-notify pod checks from `minikube-indri` to `k3s-ringtail` context in `services-check`
- Add `nvidia-device-plugin` pod check for ringtail k3s
- Rename "Kubernetes pods" section to "Indri minikube pods" for clarity
- Update 8 documentation files to reflect the migration completed in PRs #216/#217

## Files Changed
| File | Change |
|------|--------|
| `mise-tasks/services-check` | Move 4 pod checks to k3s-ringtail, add nvidia-device-plugin |
| `docs/reference/services/frigate.md` | Image→tensorrt, detector→ONNX/CUDA, shm→512Mi |
| `docs/reference/infrastructure/ringtail.md` | List actual k3s workloads |
| `docs/reference/infrastructure/indri.md` | Note frigate migration |
| `docs/explanation/architecture.md` | Add ringtail to diagram + compute layer |
| `docs/reference/kubernetes/cluster.md` | Note two clusters, add k3s section |
| `docs/reference/reference.md` | Update frigate/ntfy location |
| `docs/how-to/plans/completed/operationalize-reolink-camera.md` | Add post-completion migration note |
| `CLAUDE.md` | Add k3s-ringtail context guidance |

## Test plan
- [ ] `mise run services-check` — all checks pass
- [ ] Review each doc for accuracy against deployed state

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/218
2026-02-19 14:38:21 -08:00
d5d32fe91f Port Frigate NVR to ringtail k3s with GPU acceleration (#217)
## Summary

- Enable NVIDIA container toolkit on ringtail NixOS and configure k3s containerd with nvidia runtime
- Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet) to expose `nvidia.com/gpu` resources
- Re-target Frigate from indri minikube (arm64, ZMQ detector) to ringtail k3s (x86_64, TensorRT/ONNX)
- Switch Frigate image to `-tensorrt` variant with GPU resource limits and increased shared memory

## Manual Prerequisites

1. **NFS access**: Verify ringtail can mount `sifaka:/volume1/frigate`
   ```fish
   ssh ringtail 'sudo mount -t nfs sifaka:/volume1/frigate /mnt/storage1 && ls /mnt/storage1 && sudo umount /mnt/storage1'
   ```
2. **YOLO model**: Verify `/volume1/frigate/models/yolov9m.onnx` exists on sifaka

## Deployment Steps

1. Provision ringtail: `mise run provision-ringtail`
2. Sync ArgoCD apps: `argocd app sync apps --prune`
3. Deploy NVIDIA device plugin: `argocd app sync nvidia-device-plugin`
4. Verify GPU: `kubectl --context=k3s-ringtail get nodes -o json | jq '.items[].status.capacity'`
5. Deploy Frigate: `argocd app sync frigate`

## Verification

- [ ] `nvidia.com/gpu: 1` visible in node capacity
- [ ] Frigate pod running with GPU allocated
- [ ] Frigate UI loads at `https://nvr.ops.eblu.me`
- [ ] Detector shows ONNX/TensorRT on System page
- [ ] Camera feed with bounding boxes in live view
- [ ] TensorRT engine build completes (watch logs on first start)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/217
2026-02-19 14:27:04 -08:00
16a4a9a616 Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216)
## Summary
- Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired
- Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s
- Move ntfy ArgoCD app from indri minikube to ringtail k3s
- Update Frigate docs to reflect detector removal and planned RTX 4080 migration
- Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail

## Deployment

After merge:
1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps:
   ```
   argocd app sync apps --prune
   ```
2. Sync new ringtail apps:
   ```
   argocd app sync mosquitto-ringtail
   argocd app sync ntfy-ringtail
   ```
3. Manually clean up the detector LaunchAgent on indri:
   ```
   ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist'
   ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist'
   ```

## Notes
- Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns)
- ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup
- Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed
- Frigate + frigate-notify will be ported to ringtail in a follow-up PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216
2026-02-19 11:22:44 -08:00
61ca1ca305 Deploy Tailscale operator on ringtail k3s cluster (#215)
## Summary
- Extract shared Tailscale operator resources (CRDs, RBAC, Deployment, ProxyClass, DNSConfig) into `tailscale-operator-base/` so both clusters reference the same manifests
- Add `tailscale-operator-ringtail/` overlay with 1-replica ProxyGroup and ExternalSecret for the shared OAuth client
- Add ArgoCD Application targeting `ringtail.tail8d86e.ts.net:6443`
- Update `.yamllint.yaml` ignore path for the moved `operator.yaml`

## Deployment and Testing
- [ ] Sync `apps` app to pick up the new Application definition
- [ ] `argocd app sync tailscale-operator-ringtail`
- [ ] Verify ExternalSecret syncs: `kubectl --context=k3s-ringtail -n tailscale get externalsecret`
- [ ] Verify operator pod runs: `kubectl --context=k3s-ringtail -n tailscale get pods`
- [ ] Verify ProxyGroup ready: `kubectl --context=k3s-ringtail -n tailscale get proxygroups`
- [ ] Verify indri operator still works: `argocd app diff tailscale-operator`
- [ ] Check Tailscale admin for new operator device with `tag:k8s-operator`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/215
2026-02-19 09:33:05 -08:00
695089499e Nix container build for nettest (#214)
## Summary
- Add `containers/nettest/default.nix` using `dockerTools.buildLayeredImage` with curl, jq, dnsutils, cacert, and bash — equivalent to the existing Dockerfile
- Update `container-tag-and-release` to require `--nix` or `--dockerfile` flag when both build types exist for a container
- Update `container-list` to show `[dockerfile+nix]` label when both exist

## Deployment and Testing
- [ ] SSH to ringtail, run `nix build -f containers/nettest/default.nix -o result` to verify the nix expression builds
- [ ] Tag `nettest-nix-v1.0.0`, confirm `build-container-nix` workflow runs on `nix-container-builder` runner and pushes to registry
- [ ] Smoke test on ringtail k3s: `kubectl run nettest --image=registry.ops.eblu.me/blumeops/nettest:v1.0.0 --restart=Never && kubectl logs nettest`
- [ ] Verify `mise run container-list` shows `[dockerfile+nix]` for nettest
- [ ] Verify `mise run container-tag-and-release nettest v1.1.0` prompts for build type

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/214
2026-02-19 08:42:58 -08:00
b475a1fcd7 Fix 1Password secret tasks always reporting changed in ringtail playbook (#213)
## Summary
- Replace `changed_when: true` with `register` + output inspection on the two 1Password secret tasks in `ringtail.yml`
- Tasks now correctly report `ok` when the secret content hasn't changed, and `changed` only when `kubectl apply` outputs `configured` or `created`

## Test plan
- [ ] Run `mise run provision-ringtail` twice — second run should show both tasks as `ok` not `changed`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/213
2026-02-19 07:25:24 -08:00
8f89239c78 Inhibit idle lock for fullscreen windows on ringtail (#212)
## Summary
- Adds `inhibit_idle fullscreen` window commands to sway config on ringtail
- Covers both Wayland-native (`app_id`) and XWayland (`class`) windows
- Prevents swayidle from locking the screen during gamepad-only gaming sessions where controller input isn't detected by the Wayland idle tracker

## Notes
This is a blanket fullscreen inhibit. A more targeted approach (daemon monitoring `/dev/input` gamepad events) may be desired later to allow idle lock during long-running fullscreen apps like Factorio.

## Deployment and Testing
- [ ] `mise run provision-ringtail` to deploy
- [ ] Run a fullscreen app and verify swayidle doesn't lock after 15 minutes
- [ ] Verify lock still activates when no fullscreen window is present

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/212
2026-02-19 07:20:05 -08:00
9829a6f971 Add screen lock and idle management to ringtail (#211)
## Summary
- Configure **swayidle** to lock screen (swaylock) after 15 minutes of inactivity
- Turn off display (DPMS) after 60 minutes, auto-restore on activity
- **swaylock** themed with Catppuccin Macchiato to match existing Sway config
- Add `Mod4+l` keybinding for manual screen lock
- Add PAM service for swaylock authentication
- Disable system suspend/hibernate entirely (workstation should never sleep)

## What changes
All changes in `nixos/ringtail/configuration.nix`:
- `security.pam.services.swaylock` — required for swaylock to authenticate on NixOS
- `systemd.sleep.extraConfig` — blocks all sleep/hibernate modes
- `programs.swaylock` (home-manager) — lock screen appearance config
- `services.swayidle` (home-manager) — idle timeout daemon with lock + DPMS events
- New keybinding `Mod4+l` for manual lock

## Deployment and Testing
- [ ] `mise run provision-ringtail`
- [ ] Verify swayidle is running: `systemctl --user status swayidle`
- [ ] Test manual lock with `Super+l`
- [ ] Verify display DPMS off after idle (can lower timeout temporarily to test)
- [ ] Confirm machine does not suspend: `systemctl status sleep.target`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/211
2026-02-19 06:46:37 -08:00
630ebcd12d Add ringtail DeviceTags and homelab-to-homelab SSH rule (#210)
## Summary
- Add `ringtail` DeviceTags Pulumi resource with `tag:homelab` + `tag:blumeops` (matching indri/sifaka pattern)
- Remove the bootstrap `ringtail_key` auth key — ringtail is already on the tailnet
- Add SSH ACL rule allowing `tag:homelab` → `tag:homelab` SSH, unblocking cross-host management (e.g., ringtail running ansible against indri)

## Deployment and Testing
- [ ] `mise run tailnet-preview` — dry run, confirm diff
- [ ] `mise run tailnet-up` — apply
- [ ] From ringtail: `ssh indri 'hostname'` — should succeed

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/210
2026-02-18 21:48:11 -08:00
aa04618829 Fix k3s health check to use explicit KUBECONFIG path
k3s kubectl on ringtail needs KUBECONFIG set since the eblume
user doesn't have it in their default environment.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 21:26:00 -08:00
1f2134bf0a Fix provision-ringtail ls-remote matching with mirror refs
git ls-remote returns multiple lines when a mirror ref exists
(e.g. refs/remotes/remote_mirror_*/main). Take only the first
line to avoid a false mismatch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 21:22:46 -08:00
918df9e642 Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209)
## Summary

  Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions
  runner for building containers with Nix.

  ### K3s cluster
  - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint)
  - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale
  - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`)
  - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access
  - `kubectl` added to system packages

  ### 1Password Connect + External Secrets Operator
  - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`,
  `external-secrets-config-ringtail`
  - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server
  - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password`
  namespace in post_tasks

  ### Systemd Forgejo Actions runner
  - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host
  - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH
  - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env`
  - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access

  ### Nix container build workflow
  - New `.forgejo/workflows/build-container-nix.yaml` triggers on `*-nix-v[0-9]*` tags (e.g. `nettest-nix-v1.0.0`)
  - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy`
  - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering

  ### Mise task updates
  - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`)
  - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`)

  ## Post-merge

  1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild
  2. Register k3s cluster in ArgoCD (first time only):
     ```fish
     ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' | \
       sed 's|127.0.0.1|ringtail.tail8d86e.ts.net|' > /tmp/k3s-ringtail.yaml
     set -x KUBECONFIG /tmp/k3s-ringtail.yaml
     argocd cluster add default --name k3s-ringtail
  3. Sync ArgoCD apps in order: 1password-connect-ringtail -> external-secrets-crds-ringtail -> external-secrets-ringtail ->
  external-secrets-config-ringtail
  4. Verify runner: ssh ringtail 'systemctl status gitea-runner-nix-container-builder'
  5. Check Forgejo admin panel for ringtail-nix-builder runner online
  6. Test: create containers/<name>/default.nix, tag with <name>-nix-v0.1.0

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209
2026-02-18 21:15:30 -08:00
535f897054 Polish ringtail NixOS config and add documentation (#208)
## Summary
- Fix Super+Return keybinding to launch wezterm in sway
- Set fish as default login shell
- Remove `initialPassword` (real password already set)
- Add 1Password CLI + GUI, chezmoi, and dev tool packages (neovim, eza, fd, fzf, zoxide, starship, atuin, bat, ripgrep)
- Add ringtail reference card, update host inventory and reference index
- Changelog fragment

## Post-merge deployment
- `mise run provision-ringtail` to rebuild NixOS
- On ringtail: launch 1Password GUI, enable CLI integration (Settings > Developer > CLI integration)
- Chezmoi needs `.chezmoiignore` updates in the dotfiles repo (separate task)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/208
2026-02-18 17:53:47 -08:00
b76f2314c2 Add force: true to ringtail git task
nixos-rebuild can dirty the tree (e.g. flake.lock updates), which
blocks the Ansible git module. Force ensures we always reset to the
upstream state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:32:23 -08:00
7bf46f4e28 Add flake.lock for ringtail NixOS config
Prevents 'Git tree is dirty' warnings during nixos-rebuild.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:31:21 -08:00
5a087c10df Fix deprecated greetd.tuigreet package reference
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:30:01 -08:00
4b7491c58f Add python3 to ringtail for Ansible compatibility
NixOS doesn't include Python by default. Ansible needs it on the
managed host for module execution.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:29:09 -08:00
b08ed98881 Enable passwordless sudo for wheel group on ringtail
Required for Ansible unattended provisioning via become: true.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:25:32 -08:00
8ee6c1271a Add --accept-routes and --ssh to tailscale config
Makes tailscale settings declarative so they persist across rebuilds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:24:17 -08:00
aaf7e73c27 Fix sway on NVIDIA proprietary drivers
Sway/wlroots refuses to start on proprietary NVIDIA by default.
Add --unsupported-gpu flag and disable hardware cursors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:08:26 -08:00
104e49d337 Allow unfree packages for NVIDIA drivers and Steam
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 08:56:27 -08:00
b9d813cde1 Add NixOS configuration for ringtail workstation (#207)
## Summary
- NixOS flake for ringtail (gaming/compute workstation, RTX 4080) in `nixos/ringtail/`
- Declarative disk partitioning via disko (GPT, 512M EFI + ext4 root on NVMe)
- NVIDIA proprietary drivers, sway/Wayland desktop, greetd, PipeWire, Steam
- Tailscale integration for tailnet connectivity
- Ansible playbook + `mise run provision-ringtail` for ongoing management
- Pulumi auth key (`tag:homelab`, `tag:blumeops`) for tailnet bootstrap

## Deployment Order
1. **Merge PR**
2. `pulumi up` in tailscale stack → creates auth key
3. Retrieve auth key: `pulumi stack output ringtail_authkey --show-secrets`
4. On ringtail NixOS installer:
   - `nix run github:nix-community/disko -- --mode disko /tmp/disk-config.nix` (or from cloned repo)
   - `nixos-install --flake github:eblume/blumeops?dir=nixos/ringtail#ringtail`
5. Reboot, `tailscale up --auth-key=<key>`
6. Verify: `tailscale status`, SSH from gilbert

## Test plan
- [ ] Review NixOS configuration for completeness
- [ ] Verify disko partition layout matches ringtail hardware
- [ ] Run `pulumi preview` for tailscale stack
- [ ] Install NixOS on ringtail
- [ ] Confirm tailscale connectivity
- [ ] Confirm sway desktop works
- [ ] Test `mise run provision-ringtail` for ongoing management

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/207
2026-02-18 08:24:25 -08:00
5f9b024b4a Add Apple Silicon ZMQ detector for Frigate (#206)
## Summary

- New `frigate_detector` ansible role deploys the [apple-silicon-detector](https://github.com/frigate-nvr/apple-silicon-detector) as a LaunchAgent on indri
- Switches Frigate from ONNX CPU detector (~117ms) to ZMQ detector backed by CoreML/Neural Engine (~15ms)
- Removes detect FPS cap (no longer needed with fast inference)
- Updates Frigate docs and adds changelog fragment

## Deployment

### Phase 1: Deploy detector on indri (one-time setup + ansible)
```fish
ssh indri 'git clone https://github.com/frigate-nvr/apple-silicon-detector.git ~/code/3rd/apple-silicon-detector'
ssh indri 'cd ~/code/3rd/apple-silicon-detector && make install'
mise run provision-indri -- --tags frigate_detector --check --diff  # dry run
mise run provision-indri -- --tags frigate_detector                 # apply
ssh indri 'launchctl list mcquack.eblume.frigate-detector'          # verify running
ssh indri 'tail ~/Library/Logs/mcquack.frigate-detector.out.log'    # verify bound
```

### Phase 2: Test connectivity
```fish
kubectl --context=minikube-indri -n frigate exec deploy/frigate -- nc -vz host.minikube.internal 5555
```

### Phase 3: Deploy Frigate config (branch workflow)
```fish
argocd app set frigate --revision feature/frigate-zmq-detector && argocd app sync frigate
```

### Phase 4: Post-deploy checks
- [ ] Pod starts, no config errors
- [ ] `/api/stats` shows detector type zmq, inference_speed ~15ms
- [ ] detect_fps uncapped
- [ ] Recordings and MQTT events flowing
- [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/206
2026-02-17 19:03:28 -08:00
f45897b7c7 Upgrade Frigate 0.16.4 → 0.17.0-rc2 (#205)
## Summary

- Bump Frigate image from `0.16.4-standard-arm64` to `0.17.0-rc2-standard-arm64`
- Adapt `record` config to 0.17 schema: `retain.days`/`mode: all` → `continuous.days`
- Update service docs and version tracker

This is the first step toward the Apple Silicon ZMQ detector. The existing ONNX detector is kept so we can validate the upgrade independently.

## What is NOT changing

- Detector config (still `type: onnx` with YOLO-NAS-s)
- go2rtc streams, MQTT, cameras, zones, review rules
- frigate-notify, storage PVs, Grafana dashboard

## Deployment and Testing

- [ ] `argocd app set frigate --revision upgrade-frigate-0.17 && argocd app sync frigate`
- [ ] Pod starts, `/api/version` returns `0.17.0-rc2`
- [ ] No config errors in pod logs
- [ ] Frigate web UI loads at `https://nvr.ops.eblu.me`
- [ ] Live view works, detection running (`/api/stats` shows `detection_fps > 0`)
- [ ] Recordings being created (`/api/recordings/summary`)
- [ ] MQTT events flowing (check frigate-notify logs)
- [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/205
2026-02-17 16:56:12 -08:00
acd213559e Fix frigate live view by capping detect FPS (#204)
## Summary
- Cap detect FPS to 2 to prevent recording segment backlog from ONNX inference bottleneck (~750ms/frame on ARM64 CPU)
- Sync motion masks from live config (added second mask area)
- Update driveway_entrance zone coordinates from live config
- Add explicit alert labels `[person, car]` while keeping `required_zones: [driveway_entrance]`

## Context
The "No frames have been received" error on the gablecam live view was caused by the detect stream falling behind — ONNX YOLO-NAS-s takes ~750ms per inference on ARM64 CPU, but the sub-stream sends 5 FPS. This caused recording segments to pile up and the ffmpeg watchdog to repeatedly kill/restart the process, creating gaps in the live view.

## Test plan
- [ ] Sync ArgoCD `frigate` app to branch and verify pod restarts cleanly
- [ ] Check `/api/stats` — `skipped_fps` should drop significantly, `process_fps` should be close to 2
- [ ] Verify live view at https://nvr.ops.eblu.me/#gablecam no longer shows "No frames" error
- [ ] Verify detections and alerts still work in the driveway_entrance zone

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/204
2026-02-17 16:18:02 -08:00
1e96866dd3 Grafana helm chart upgrade plan 2026-02-17 11:15:34 -08:00
b9d1acaf3a Service review for external-secrets 2026-02-17 10:48:09 -08:00
105a2c8c08 Update External Secrets Helm chart 1.3.1 → 2.0.0 (#203)
## Summary
- Bump External Secrets Operator Helm chart from `helm-chart-1.3.1` to `helm-chart-2.0.0` (operator v1.3.2)
- Updates both the operator app and CRDs app `targetRevision`
- No Helm values changes needed — `installCRDs`, `resources`, `webhook`, `certController` keys are unchanged

## Breaking changes in chart 2.0.0
- **Removed providers:** Alibaba and Device42 (unmaintained) — does not affect our 1Password setup
- **Templating engine v1 deprecated** — our ExternalSecrets don't set `engineVersion`, so they use the default (v2)
- **Webhook `failurePolicy`** for SecretStore is now dynamic

## Deployment
1. Sync CRDs first: `argocd app set external-secrets-crds --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets-crds`
2. Sync operator: `argocd app set external-secrets --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets`
3. Verify: `kubectl --context=minikube-indri -n external-secrets get pods`
4. After merge, set both apps back to `--revision main`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/203
2026-02-17 10:43:21 -08:00
5fbe70d1ba Port ntfy to locally built container image (#202)
All checks were successful
Build Container / build (push) Successful in 6m28s
## Summary
- Add `containers/ntfy/Dockerfile` — three-stage build (Node web UI, Go+CGO server, Alpine runtime) pinned to commit SHA `a03a37fe` (v2.17.0), sourced from forge mirror
- Update ntfy deployment image from `binwiederhier/ntfy:v2.17.0` to `registry.ops.eblu.me/blumeops/ntfy:v1.0.0`
- Note fish shell in CLAUDE.md

## Deployment
After merge, release the container image:
```fish
mise run container-tag-and-release ntfy v1.0.0
```
Then sync:
```fish
argocd app sync ntfy
```

## Test plan
- [x] `docker build` succeeds
- [x] `dagger call build --src=. --container-name=ntfy` succeeds (exit 0, container ID printed)
- [x] `ntfy --help` works in built container
- [ ] Tag and release `ntfy-v1.0.0` after merge
- [ ] Verify ntfy pod starts with new image
- [ ] Verify health endpoint responds at `ntfy.ops.eblu.me/v1/health`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/202
2026-02-17 10:18:20 -08:00
3e604d8fdc Review ntfy: upgrade to v2.17.0 and add reference docs (#201)
## Summary
- Upgrade ntfy from v2.11.0 to v2.17.0 (6 minor releases, no breaking changes)
- Add reference doc for ntfy service
- Add reference doc for frigate service (ntfy's sole producer via frigate-notify)
- Update reference index and service-versions.yaml tracking

## Notable upstream changes (v2.12.0–v2.17.0)
- **v2.14.0:** Declarative users/ACL config in files
- **v2.15.0:** `require-login` flag for topic-level auth
- **v2.16.0:** Dead man's switch (heartbeat) notifications, notification update/delete
- **v2.17.0:** Priority templating, crash fixes (nil pointer panics)

## Deployment and Testing
- [ ] ArgoCD sync ntfy after merge
- [ ] Verify ntfy pod healthy with new image
- [ ] Send a test notification via `curl -d "test" https://ntfy.ops.eblu.me/test`
- [ ] Verify frigate-notify still delivers alerts to ntfy

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/201
2026-02-17 09:51:40 -08:00
54c3b0a5f3 Expanded some CLAUDE.md stuff manualy 2026-02-17 07:54:34 -08:00
2f599a15bd Fix zk-docs broken path after how-to reorg
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-17 07:32:54 -08:00
Forgejo Actions
530460171a Update docs release to v1.9.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-17 07:30:39 -08:00
27d8f3cf1f Review gandi-operations doc and reorganize how-to guides (#200)
## Summary
- **Doc review:** Reviewed `gandi-operations.md` — added `last-reviewed` frontmatter, verified all wiki-links, confirmed Pulumi state has no drift
- **Gandi reference fix:** Added missing `cv.eblu.me` CNAME row to `gandi.md` DNS records table (was present in Pulumi but undocumented)
- **Pulumi comment fix:** Updated stale `README.md` reference in `__main__.py` to point to `docs/how-to/gandi-operations.md`
- **How-to reorg:** Moved 14 how-to guides into 3 subdirectories (`deployment/`, `configuration/`, `operations/`), collapsed the Documentation and Database index sections into Configuration and Operations respectively

## Verification
- `docs-check-links` — all 180 wiki-links valid
- `docs-check-filenames` — all 90 filenames unique
- `dns-preview` — 5 resources unchanged, no drift
- All pre-commit hooks pass

## Test plan
- [ ] Verify docs site builds correctly with new paths
- [ ] Spot-check a few wiki-links from other pages to moved how-to guides

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/200
2026-02-17 07:29:33 -08:00
Forgejo Actions
8a48171acf Update docs release to v1.9.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-16 21:25:47 -08:00
779b7d6709 Eliminate double towncrier run in release workflow (#199)
## Summary

- Added a new `build_quartz` Dagger function that builds the Quartz site from a pre-processed source tree (no towncrier)
- Reordered the release workflow so towncrier runs **once** on the runner, then passes the updated working tree to `build-quartz`
- `build_docs` and `build_changelog` are preserved for standalone use — `build_docs` now delegates to `build_quartz` internally

## Motivation

Previously towncrier ran twice per release: once inside a Dagger container (via `build_docs` → `build_changelog`) and once on the runner to capture CHANGELOG.md changes for the git commit. This was wasteful and fragile — if towncrier behavior changed, the two runs could produce different results.

## Test plan

- [ ] Review diff to confirm workflow step ordering is correct
- [ ] Trigger a release and confirm towncrier runs only once
- [ ] Verify the docs tarball contains the updated CHANGELOG.md
- [ ] `dagger call build-quartz --src=. --version=vX.Y.Z` should work standalone

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/199
2026-02-16 21:24:34 -08:00
627e2b7894 Add UniFi admin link to homepage dashboard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-16 19:15:46 -08:00
d35c26d2b0 Fix mosquitto image tag: use 2.0.22 instead of nonexistent 2.1.2 (#198)
## Summary
- The `eclipse-mosquitto:2.1.2` tag doesn't exist on Docker Hub — the 2.1.x series only publishes `-alpine` variants
- Corrects the pinned tag to `2.0.22`, the latest non-alpine version (matching what the old floating `:2` tag was resolving to)
- Updates tracking file and changelog fragment accordingly

## Context
The previous PR #197 pinned mosquitto from floating `:2` to `2.1.2`, but the new pod failed with `ErrImagePull` ("manifest unknown"). The old pod is still running on `:2`.

## Test plan
- [ ] Verify `eclipse-mosquitto:2.0.22` pulls successfully
- [ ] Verify mosquitto pod restarts and passes readiness/liveness probes
- [ ] `mise run services-check` passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/198
2026-02-16 17:19:32 -08:00
0aab73af40 Bump mosquitto to 2.1.2 and tailscale-operator to v1.94.2 (#197)
## Summary
- Pin mosquitto from floating `:2` tag to `2.1.2` (latest upstream, released Feb 9 2026)
- Bump tailscale k8s-operator and proxy images from `v1.94.1` to `v1.94.2`
- Record 7 reviewed services in `service-versions.yaml` (first service review pass)

## Services reviewed (11 total)
| Service | Deployed | Latest | Status |
|---------|----------|--------|--------|
| prometheus | v3.9.1 | v3.9.1 | Current |
| loki | 3.6.5 | 3.6.5 | Current |
| kube-state-metrics | v2.18.0 | v2.18.0 | Current |
| mosquitto | :2 (floating) | 2.1.2 | **Pinned in this PR** |
| frigate | 0.16.4 | 0.16.4 | Current |
| alloy-k8s | v1.13.1 | v1.13.1 | Current |
| tailscale-operator | v1.94.1 | v1.94.2 | **Bumped in this PR** |
| ntfy | v2.11.0 | v2.17.0 | Stale (future PR) |
| frigate-notify | v0.3.5 | v0.5.4 | Stale (future PR) |
| homepage | chart 2.1.0 | app v1.10.1 | Stale (future PR) |
| grafana | chart 8.8.2 | chart 10.5.15 | Stale (future PR) |

## Deployment and Testing
- [ ] `argocd app sync apps`
- [ ] `argocd app set mosquitto --revision service-review/mosquitto-tailscale-operator && argocd app sync mosquitto`
- [ ] `argocd app set tailscale-operator --revision service-review/mosquitto-tailscale-operator && argocd app sync tailscale-operator`
- [ ] Verify mosquitto pod restarts with pinned image
- [ ] Verify tailscale operator and proxy pods update
- [ ] `mise run services-check`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/197
2026-02-16 17:14:38 -08:00
faf9682b55 Add service version review system (#196)
## Summary

- Add `service-versions.yaml` tracking file with 33 services and upstream release URLs
- Add `mise run service-review` task (Python uv script) mirroring the docs-review UX
- Add `review-services` how-to article covering the review process by service type
- Add `[[review-services]]` link to the how-to index Knowledge Base table

## Deployment and Testing

- [x] `mise run service-review` displays 33 services, all "never reviewed"
- [x] `mise run service-review -- --type ansible` filters to 7 Ansible services
- [x] `mise run service-review -- --limit 5` shows 5 rows
- [x] `mise run docs-check-links` — no broken wiki-links
- [x] `mise run docs-check-frontmatter` — new doc passes validation
- [x] All pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/196
2026-02-16 17:02:56 -08:00
Forgejo Actions
994bed0693 Update docs release to v1.9.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-16 15:51:12 -08:00
2c55c2316e Review expose-service-publicly doc (#195)
## Summary
- Replace stale inline code listings (fly.toml, Dockerfile, start.sh, nginx.conf, mise tasks, CI workflow) with brief descriptions pointing readers to the actual `fly/` and `mise-tasks/` files — prevents future drift
- Add observability sidecar section documenting the Alloy integration (logs → Loki, metrics → Prometheus)
- Fix broken internal wiki-link (`[[#7. Update Tailscale ACLs if needed]]` → correct heading)
- Update per-service nginx templates to current patterns (deferred DNS resolution via `set $upstream` variable, `proxy_intercept_errors`, error pages)
- Add `cv.eblu.me` to verification steps (live service not previously documented)
- Add `last-reviewed: 2026-02-16` frontmatter
- Net -187 lines (58 added, 245 removed)

## Deployment and Testing
- [x] All pre-commit hooks pass (link validation, frontmatter, filenames)
- [ ] Docs site renders correctly after merge

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/195
2026-02-16 15:49:55 -08:00
74294094e3 Fix navidrome custom container image v1.0.2 (#194)
## Summary
- Switch navidrome deployment from upstream `deluan/navidrome:0.60.3` back to custom image `registry.ops.eblu.me/blumeops/navidrome:v1.0.2`
- The v1.0.1 image was tagged before the `USER 65534` removal commit, so it still ran as a non-root user that couldn't write to the SQLite data directory
- v1.0.2 is built from current main which includes both the `zlib-dev` build fix and the non-root user removal

## Deployment and Testing
- [ ] Wait for CI to build `navidrome:v1.0.2` image
- [ ] Sync via ArgoCD and verify pod starts without CrashLoopBackOff
- [ ] Verify navidrome UI accessible at https://navidrome.ops.eblu.me

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/194
2026-02-16 08:24:33 -08:00
7ffbd12ac8 Fix Frigate parked car re-detection and enable writable config (#193)
All checks were successful
Build Container / build (push) Successful in 12s
## Summary
- Remove car-specific `max_frames: 150` which was causing a forget-and-re-detect loop on parked cars (every ~30 seconds at 5fps)
- Set `stationary.interval: 0` so Frigate never re-runs detection on stationary objects
- Replace read-only configmap subPath mount with initContainer + emptyDir, so Frigate UI changes (zones, masks) persist at runtime

## Context
Frigate was spamming notifications because `max_frames` for cars caused it to "forget" a parked car after 150 frames, then immediately re-detect it as a brand new object. The fix follows [Frigate's official parked cars guide](https://docs.frigate.video/guides/parked_cars/).

The writable config change also unblocks using `required_zones` for car alerts — zones can now be drawn in the Frigate UI and will survive until pod reschedule (at which point they should be baked into the configmap via IaC).

## Test plan
- [ ] Sync frigate app via ArgoCD and verify pod starts with initContainer
- [ ] Confirm parked cars no longer trigger repeated alerts
- [ ] Draw a zone/mask in Frigate UI, save, verify it persists after Frigate restart
- [ ] Set up `driveway_entrance` required zone for car alerts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/193
2026-02-15 17:48:14 -08:00
6c41338b36 Revert navidrome to upstream image pending container fix
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 08:22:04 -08:00
ad3ffbbf87 Remove non-root user from navidrome container
The SQLite data directory needs write access, matching upstream behavior.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 08:20:58 -08:00
accbb80683 Update navidrome image to v1.0.1
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 08:18:17 -08:00
982ce3dff3 Add zlib-dev to navidrome build stage for taglib linking
All checks were successful
Build Container / build (push) Successful in 2m23s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-15 08:17:20 -08:00
996441876d Document container build pattern and port navidrome (#192)
Some checks failed
Build Container / build (push) Failing after 4m28s
## Summary
- Add how-to guide (`docs/how-to/build-container-image.md`) covering the full container build workflow: directory layout, Dagger local builds, mise release task, and common patterns with links to existing containers
- Port navidrome from upstream `deluan/navidrome:0.60.3` to a custom three-stage build (`containers/navidrome/Dockerfile`) using Node + Go + Alpine
- Update navidrome deployment to use `registry.ops.eblu.me/blumeops/navidrome:v1.0.0`

## Deployment and Testing
- [x] `dagger call build --src=. --container-name=navidrome` builds successfully
- [ ] After merge: `mise run container-tag-and-release navidrome v1.0.0`
- [ ] After image published: `argocd app sync navidrome` and verify pod starts

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/192
2026-02-15 08:05:11 -08:00
Forgejo Actions
26c1ff5ce6 Update docs release to v1.9.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-15 07:43:00 -08:00
22f418d0dc Doc review: connect-to-postgres, create-release-artifact-workflow, deploy-k8s-service (#191)
## Summary

Review session covering 3 docs, plus a codebase-wide cleanup:

### Docs reviewed
- **connect-to-postgres** — verified end-to-end (psql connection tested), stamped
- **create-release-artifact-workflow** — clarified that `build-blumeops.yaml` is only a version bump example (not a packages API example)
- **deploy-k8s-service** — fixed stale repoURL (`indri:2200` → `forge.ops.eblu.me:2222`), wrong Caddy config keys (`upstream` → `backend`, added missing `host`), updated Homepage group to "Services", added Tailscale tag documentation

### Codebase cleanup
- Migrated all remaining `op item get --fields` calls to `op read` URI syntax across 7 files (docs, READMEs, YAML comments)
- Simplified the `op read` vs `op item get` guidance in CLAUDE.md

## Side findings (not addressed)
- New `immich-pg` CNPG cluster not yet documented in the postgresql reference card

## Test plan
- [x] `psql` connection to `pg.ops.eblu.me` verified
- [x] All pre-commit hooks pass
- [x] `docs-check-links`, `docs-check-index`, `docs-check-frontmatter` pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/191
2026-02-15 07:42:01 -08:00
Forgejo Actions
b2b5879e3c Update docs release to v1.9.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-14 21:32:27 -08:00
04c7f3c45a Deploy Frigate NVR stack with Mosquitto, Ntfy, and frigate-notify (#190)
## Summary

Deploy a cloud-free NVR stack for the GableCam (ReoLink Elite Floodlight at 192.168.1.159):

- **Mosquitto** — shared MQTT broker in `mqtt` namespace (cluster-internal, no auth)
- **Ntfy** — self-hosted push notifications in `ntfy` namespace, exposed at `ntfy.tail8d86e.ts.net` / `ntfy.ops.eblu.me`
- **Frigate** — NVR with GableCam via HTTP-FLV, ONNX CPU detection, NFS recordings on sifaka, exposed at `nvr.tail8d86e.ts.net` / `nvr.ops.eblu.me`
- **frigate-notify** — bridges Frigate detection events (person, car, dog, cat) to Ntfy alerts via MQTT

Also includes:
- Prometheus scrape target for Frigate metrics
- Grafana dashboard for Frigate (status, inference speed, FPS, CPU/memory, storage)
- Caddy reverse proxy entries for `nvr.ops.eblu.me` and `ntfy.ops.eblu.me`

## Prerequisites

- [ ] Create NFS share `frigate` on sifaka (`/volume1/frigate`, RW for indri)
- [ ] Create 1Password item "Reolink Floodlight Camera" in `blumeops` vault with `username` and `password` fields

## Deployment (after merge)

```bash
argocd app sync apps
argocd app sync mosquitto
argocd app sync ntfy
argocd app sync frigate
argocd app sync grafana-config
argocd app sync prometheus
mise run provision-indri -- --tags caddy
mise run services-check
```

## Verification

- [ ] Mosquitto pod running, accepting connections on 1883
- [ ] Ntfy web UI accessible at `ntfy.ops.eblu.me`
- [ ] Frigate web UI at `nvr.ops.eblu.me` showing GableCam live feed
- [ ] Object detection working (ONNX, person/car/dog/cat)
- [ ] Recordings appearing in NFS share on sifaka
- [ ] frigate-notify sending detection alerts to Ntfy
- [ ] Prometheus scraping Frigate metrics
- [ ] Grafana dashboard showing Frigate data

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/190
2026-02-14 21:27:44 -08:00
2fad37f500 Add missing doc index files to zk-docs context priming
Adds docs/index.md, explanation/explanation.md, how-to/plans/plans.md,
and how-to/plans/completed/completed.md so AI sessions get full
awareness of all doc sections and in-flight plans.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 12:08:22 -08:00
f376c02b76 Move segment-home-network to completed plans
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 10:48:32 -08:00
2252e5e60d Update segmentation plan: mark completed, fix firewall details
Reflect actual UX7 zone-based firewall UI, correct streaming port
(8096 not 443), note indri DHCP reservation, mark plan as completed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-14 10:43:17 -08:00
657bb28fd1 Abandon UniFi IaC, add manual network segmentation plan (#189)
## Summary

- Abandon the UniFi Pulumi IaC approach after provider bugs caused a network outage (no-op update reset undeclared properties on the default LAN network)
- Remove untracked IaC artifacts (`pulumi/unifi/`, `mise-tasks/unifi-preview`, `mise-tasks/unifi-up`) locally
- Mark `add-unifi-pulumi-stack` plan as Abandoned with explanation
- Create new `segment-home-network` plan for manual three-network segmentation (Main/IoT/Guest) via UX7 web UI
- Rewrite UniFi reference card to remove all Pulumi/IaC references
- Update plan and how-to indexes

## Test plan

- [x] `docs-check-links` passes
- [x] `docs-check-index` passes
- [x] Pre-commit hooks pass
- [ ] Review segmentation plan for completeness before executing manually

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/189
2026-02-14 09:47:04 -08:00
eec1edf43d Add how-to guide for connecting to PostgreSQL via psql (#188)
## Summary
- Add new how-to guide (`connect-to-postgres.md`) with the `psql` command using `op read` for 1Password credentials
- Add "Database" section to the how-to index linking to the new guide
- Link the new guide from the PostgreSQL reference card's Related section

## Test plan
- [x] Verified `psql` connection works from gilbert using the documented command
- [ ] Review doc formatting and content

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/188
2026-02-14 07:18:06 -08:00
49ec05041c Update UniFi Pulumi plan: switch to ubiquiti-community provider (#187)
## Summary
- Switch provider from filipowm/unifi (inactive maintainer, showstopper bug #94 wiping firewall rules) to ubiquiti-community/unifi (actively maintained, API key auth)
- Add UX7 config backup prerequisite before adopting IaC
- Fix safety guard: check default route interface instead of hostname (runs from gilbert, not indri)
- Update 1Password paths to match actual item (`op://blumeops/unifi/credential`)
- Fix ringtail references: not a Raspberry Pi, stays on WiFi (removed from wired topology)
- Update doc steps for already-existing reference files

## Test plan
- [x] Pre-commit hooks pass
- [x] `docs-check-links` pass
- [x] `docs-check-index` pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/187
2026-02-13 20:02:16 -08:00
b77ae19f20 Fix 1Password Connect credentials for chart 2.3.0
Chart 2.3.0 mounts credentials as a file with standard k8s base64
encoding. The old double-encoding workaround (credentials-base64 in
stringData) now produces invalid JSON. Use raw JSON (credentials-file)
instead.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 17:30:45 -08:00
8f4708e26f Fix navidrome image tag: remove v prefix (0.60.3 not v0.60.3)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 17:23:12 -08:00
b3747f6c95 Tier 1 version bumps (#186)
All checks were successful
Build Container / build (push) Successful in 8s
## Summary

Audit and upgrade of all deployed images, helm charts, and custom container Dockerfiles to latest stable versions. This PR covers Tier 1 (low-risk minor/patch bumps only).

### Upstream images
| Image | Old | New |
|-------|-----|-----|
| kube-state-metrics | v2.13.0 | v2.18.0 |
| prometheus | v3.2.1 | v3.9.1 |
| loki | 3.3.2 | 3.6.5 |
| alloy | v1.5.1 | v1.13.1 |
| tailscale (proxy + operator) | v1.92.5 | v1.94.1 |
| navidrome | :latest | v0.60.3 (pinned) |

### Helm charts
| Chart | Old | New |
|-------|-----|-----|
| CloudNativePG | v0.27.0 | v0.27.1 |
| 1Password Connect | 2.2.1 | 2.3.0 |

### Custom containers (Dockerfiles updated, images not yet tagged)
| Container | Changes | New tag |
|-----------|---------|---------|
| miniflux | 2.2.16→2.2.17 (security), alpine 3.22 | v1.1.0 |
| kubectl | v1.34.1→v1.34.4, alpine 3.22 | v1.1.0 |
| kiwix-serve | alpine 3.22 | v1.1.0 |
| nettest | alpine 3.22 | v0.14.0 |
| transmission | alpine 3.22, pkg 4.0.6-r4 | v1.1.0 |

All custom containers verified with local `dagger call build`.

### Deferred to Tier 2 (separate PRs)
- Forgejo runner 6→12 (major version scheme change)
- Docker DinD 27→29
- Grafana chart 8→11 (repo migration)
- External Secrets 1→2 (breaking changes)
- Python 3.12→3.13, Elixir 1.18→1.19, Node 22→24
- Transmission 4.0.6→4.1.0 (not in Alpine yet)

## Deployment

After merge:
1. Tag custom containers: `mise run container-tag-and-release <name> <version>` for each
2. Wait for CI builds to complete
3. `argocd app sync apps` then sync individual apps, or let ArgoCD auto-detect

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/186
2026-02-13 17:16:37 -08:00
81690dae0f Review add-ansible-role doc (#185)
## Summary
- Replace `op item get --fields` with `op read` for secrets (matches playbook and CLAUDE.md guidance)
- Change `tags: [<role>]` to `tags: <role>` to match actual playbook style
- Remove redundant `listen:` from handler example, add `changed_when: true`
- Name handler after specific service (e.g. `Restart <service>`) to match real roles
- Add `last-reviewed: 2026-02-13` frontmatter

## Also noted (not fixed here)
Two other docs still use the old `op item get` pattern:
- `docs/how-to/troubleshooting.md:72` (ArgoCD login command)
- `docs/how-to/gandi-operations.md:35` (Gandi token export)

These can be fixed in their own review cycles.

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/185
2026-02-13 16:54:42 -08:00
5b91a1c315 Review why-gitops doc (#184)
## Summary
- Fix misleading `[[tailscale|Pulumi]]` wiki-link → `[[pulumi]]`
- Simplify `[[ansible|Ansible]]` and `[[argocd|ArgoCD]]` to plain wiki-links per convention
- Rename "Tailnet" layer to "Network" to reflect Pulumi's full scope (Tailscale ACLs + Gandi DNS)
- Fix `apt install` → `brew install` (indri is macOS)
- Add `[[pulumi]]` to Related section
- Add `last-reviewed: 2026-02-13` frontmatter

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/184
2026-02-13 16:48:06 -08:00
d5c00192d5 Configure DinD to use Zot as pull-through registry mirror (#183)
## Summary
- Add `daemon.json` with `registry-mirrors` to the forgejo-runner ConfigMap, pointing DinD at `http://host.minikube.internal:5050`
- Mount `daemon.json` into the DinD sidecar at `/etc/docker/daemon.json` via `subPath`
- Docker Hub pulls during Dagger CI builds will now route through Zot's pull-through cache, reducing bandwidth and avoiding rate limits

## Deployment and Testing
- [ ] `argocd app sync forgejo-runner`
- [ ] Exec into DinD container: `docker info` should show the registry mirror
- [ ] Trigger a workflow build and check Zot logs for cache hits

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/183
2026-02-13 12:36:03 -08:00
ba9b251759 Update forgejo-runner image to v3.2.0
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 12:16:52 -08:00
d0c18043b7 Revert forgejo-runner image to v3.1.0
v3.2.0 build failed (GitHub download timeout), rolling back to
working image while it rebuilds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 12:07:51 -08:00
fdd3f6483a Update forgejo-runner image to v3.2.0
All checks were successful
Build Container / build (push) Successful in 7m31s
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 11:08:57 -08:00
e364bdd238 Upgrade Node.js from 20 to 22 LTS (#182)
Some checks failed
Build Container / build (push) Failing after 11m14s
## Summary
- Upgrade Dagger docs build image from `node:20-slim` to `node:22-slim`
- Upgrade forgejo-runner container from Node 20 to Node 22
- Fixes Quartz 4.5.2 `EBADENGINE` warning (requires Node >= 22)
- Node 20 EOL is 2026-04-30

Both builds verified locally via Dagger.

## Deployment
1. Merge this PR
2. Tag and release forgejo-runner v3.2.0: `mise run container-tag-and-release forgejo-runner v3.2.0`
3. Update RUNNER_LABELS version in `argocd/manifests/forgejo-runner/deployment.yaml` from `v3.1.0` to `v3.2.0`
4. `argocd app sync forgejo-runner`

The Dagger docs build change takes effect immediately on merge (no container release needed).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/182
2026-02-13 11:07:41 -08:00
Forgejo Actions
02b1397f1a Update docs release to v1.8.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-13 10:36:04 -08:00
0098ac37e0 Move non-secret runner env vars to deployment spec (#181)
## Summary
- Move FORGEJO_URL, RUNNER_NAME, and RUNNER_LABELS from ExternalSecret template to deployment env vars
- ExternalSecret now only contains the actual secret (RUNNER_TOKEN)
- Image version changes in RUNNER_LABELS now trigger automatic pod rollouts

## Deployment
1. Merge this PR
2. `argocd app sync forgejo-runner` — the deployment spec change will auto-roll the pod

No manual restart needed — that's the whole point :)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/181
2026-02-13 10:29:23 -08:00
52bbf88aa6 Update forgejo-runner image to v3.1.0
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 10:21:43 -08:00
2fad8db639 Add yq to forgejo-runner and replace sed YAML edits (#180)
All checks were successful
Build Container / build (push) Successful in 1m31s
## Summary
- Install yq in the forgejo-runner container image for structured YAML editing
- Replace fragile `sed` regex patterns with `yq` in `build-blumeops.yaml` and `cv-deploy.yaml` workflows

## Deployment
1. Merge this PR
2. Tag and release forgejo-runner v3.1.0: `mise run container-tag-and-release forgejo-runner v3.1.0`
3. Update runner label in `argocd/manifests/forgejo-runner/external-secret.yaml` from `v3.0.2` to `v3.1.0`
4. Sync the forgejo-runner app: `argocd app sync forgejo-runner`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/180
2026-02-13 10:20:27 -08:00
4942dee182 Update homepage layout for new Content/Misc groups
Replace old Apps/Observability/Infrastructure layout entries with
Content and Misc to match the recategorized ingress annotations.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 09:16:40 -08:00
ca6a845604 Move ArgoCD to Misc homepage group and rename ingress file
ArgoCD's tailscale ingress was missed in the recategorization (filed as
service-tailscale.yaml instead of ingress-tailscale.yaml). Fix the group
annotation and rename the file to match the convention used by all other
services.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-13 09:13:32 -08:00
48ce5b4120 Recategorize homepage into Content and Misc groups (#179)
## Summary
- Replace the three homepage groups (Apps, Observability, Infrastructure) with two cleaner groups
- **Content**: Immich, Kiwix, Miniflux, DJ, Grafana
- **Misc**: CV, TeslaMate, Transmission, Docs, Prometheus, PyPI

## Deployment and Testing
- [ ] Sync affected ingresses via ArgoCD (all 11 services)
- [ ] Verify homepage shows the two new groups correctly

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/179
2026-02-13 09:09:22 -08:00
Forgejo Actions
e21277ae83 Update docs release to v1.8.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-12 19:20:27 -08:00
517080aeab Add reference/tools/ category with Dagger, ArgoCD CLI, Ansible, and Pulumi cards (#178)
## Summary

- Create `docs/reference/tools/` with four reference cards: Dagger (build engine), ArgoCD CLI (deployment workflows), Ansible (config management), and Pulumi (DNS/Tailscale IaC)
- Move `ansible/roles.md` → `tools/ansible.md`, broadened with CLI patterns and dry-run usage
- Update `reference.md` index: add "Tools" section, remove old "Ansible" section
- Update `update-documentation.md` to reflect Dagger build process (workflow steps, manual build recipe, runner environment)
- Update `adopt-dagger-ci.md` plan to note how-to articles were handled via reference card + existing how-to updates
- Fix all broken `[[roles]]` wiki-links across 5 files → `[[ansible]]`

## Verification

- `docs-check-links` ✓ — no broken wiki-links
- `docs-check-index` ✓ — all docs referenced in category index
- `docs-check-filenames` ✓ — no duplicate filenames
- All pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/178
2026-02-12 19:18:46 -08:00
9c789a1868 Fix cache hit rate on APM and Fly.io dashboards (#177)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m19s
## Summary
- Remove `match_all = true` from `flyio_nginx_cache_requests_total` in Alloy so the metric only counts requests that go through the proxy cache (excludes health checks with empty `cache_status`)
- Change dashboard queries from `rate(...[5m])` to `increase(...[$__range])` — aggregates over the full dashboard time window instead of a 5-minute sliding window, giving meaningful ratios for low-traffic static sites
- Add null/NaN value mapping to show "No traffic" in neutral color instead of blank/red

## Root cause
Health check requests from Fly.io hit the default nginx server block (no `proxy_cache`), producing entries with empty `upstream_cache_status`. With `match_all = true`, these were counted in the cache metric, diluting the Fly.io dashboard ratio. For APM dashboards, `rate()[5m]` on low-traffic sites with 24h cache validity almost always returns either all-HITs (100%) or no data (blank → red background).

## Deployment
- Fly.io proxy redeploy needed for Alloy config change
- ArgoCD sync for dashboard ConfigMap changes

## Test plan
- [ ] Redeploy Fly.io proxy
- [ ] Sync grafana-config in ArgoCD
- [ ] Verify CV APM cache hit ratio shows a real percentage (not 100%)
- [ ] Verify Docs APM shows "No traffic" in neutral color when idle, real ratio when visited
- [ ] Verify Fly.io proxy dashboard cache ratio excludes health checks

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/177
2026-02-12 18:40:48 -08:00
9717863f65 Update CV release to v1.0.3, add X-Clacks-Overhead header (#176)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m5s
## Summary
- Update CV release URL from v1.0.2 to v1.0.3
- Add `X-Clacks-Overhead: GNU Terry Pratchett` header to both `docs.eblu.me` and `cv.eblu.me` server blocks in the Fly.io proxy nginx config

## Deployment and Testing
- [ ] Sync CV app: `argocd app sync cv`
- [ ] Verify CV is serving v1.0.3 content
- [ ] Deploy fly proxy (workflow or `mise run fly-deploy`)
- [ ] Verify header: `curl -sI https://docs.eblu.me | grep -i clacks`
- [ ] Verify header: `curl -sI https://cv.eblu.me | grep -i clacks`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/176
2026-02-12 17:08:22 -08:00
ed5c9c9b48 Update CV release to v1.0.2 (#175)
## Summary
- Update `CV_RELEASE_URL` in cv deployment from v1.0.1 to v1.0.2

## Deployment and Testing
- [ ] `argocd app sync cv` after merge
- [ ] Verify cv.eblu.me serves updated content

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/175
2026-02-12 16:18:55 -08:00
Forgejo Actions
70d8881959 Update docs release to v1.7.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-12 14:13:12 -08:00
7dc03c0af1 Add CV to services-check, update homepage link (#174)
## Summary
- Add CV to services-check (tailnet endpoint + public cv.eblu.me)
- Update CV homepage annotation to point to cv.eblu.me instead of cv.ops.eblu.me

## Deployment and Testing
- [ ] `argocd app sync cv` (homepage link change)
- [ ] `mise run services-check` passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/174
2026-02-12 14:10:03 -08:00
df372fccb6 Expose CV publicly at cv.eblu.me (#173)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m57s
## Summary
- Add nginx server block for `cv.eblu.me` (static site, same pattern as docs)
- Add DNS CNAME record in Pulumi (`cv.eblu.me` → `blumeops-proxy.fly.dev`)
- Add `cv.eblu.me` cert to `fly-setup` mise task
- Tag CV Tailscale ingress with `tag:flyio-target` for ACL access
- Remove `/_error` test endpoint from docs proxy

## Deployment and Testing
- [ ] `argocd app set cv --revision cv/public-cv-eblu-me && argocd app sync cv`
- [ ] `fly certs add cv.eblu.me -a blumeops-proxy`
- [ ] `mise run fly-deploy`
- [ ] Verify proxy: `curl -I -H "Host: cv.eblu.me" https://blumeops-proxy.fly.dev/`
- [ ] `mise run dns-preview` then `mise run dns-up`
- [ ] Verify live: `curl -I https://cv.eblu.me`
- [ ] Merge, then `argocd app set cv --revision main && argocd app sync cv`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/173
2026-02-12 14:05:00 -08:00
a68542a602 Update CV release to v1.0.1 (#172)
## Summary
- Update `CV_RELEASE_URL` from v0.1.0 to v1.0.1 in the CV deployment manifest

## Deployment and Testing
- [ ] `argocd app sync cv` after merge
- [ ] Verify cv.ops.eblu.me serves updated resume

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/172
2026-02-12 13:38:05 -08:00
Forgejo Actions
200be39492 Update docs release to v1.7.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-12 11:46:38 -08:00
63e67b927a Add CV service reference card and docs updates (#171)
## Summary
- Add CV service reference card (`docs/reference/services/cv.md`)
- Add CV to services index, ArgoCD apps registry, and Caddy proxied services table
- Changelog fragment

## Files changed
- `docs/reference/services/cv.md` (new) — CV reference card
- `docs/reference/reference.md` — Add CV to services table
- `docs/reference/kubernetes/apps.md` — Add CV to app registry
- `docs/reference/services/caddy.md` — Add CV to proxied k8s services
- `docs/changelog.d/cv-docs.doc.md` (new) — Changelog fragment

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/171
2026-02-12 11:45:32 -08:00
545433055e Add release artifact workflow how-to and changelog fragments (#170)
All checks were successful
Build Container / build (push) Successful in 17s
## Summary
- How-to guide for creating release artifact workflows with Forgejo packages
- Changelog fragment for the multi-repo forgejo_actions_secrets Ansible role change
- Changelog fragment for the new docs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/170
2026-02-12 11:20:05 -08:00
cd25dae8f9 Extend forgejo_actions_secrets role to support multiple repos
Uses subelements loop to sync secrets across repos. Adds FORGE_TOKEN
to the cv repo for package uploads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 11:15:28 -08:00
01e19023ee Add CV/resume web app at cv.ops.eblu.me (#169)
## Summary
- nginx container (`containers/cv/`) downloads and serves a content tarball at startup (same pattern as quartz)
- ArgoCD app + k8s manifests (deployment, service, Tailscale ingress)
- Caddy route for `cv.ops.eblu.me`
- Deploy workflow: resolves "latest" or specific version from Forgejo packages, updates deployment, syncs ArgoCD
- Content is built and released from the separate [cv repo](https://forge.ops.eblu.me/eblume/cv)

## Deployment steps (after merge)
1. `mise run container-tag-and-release cv v1.0.0`
2. Run "Release CV" workflow in cv repo (SPECIFIC_VERSION `v0.1.0`)
3. Run "Deploy CV" workflow in blumeops (default: latest)
4. `mise run provision-indri -- --tags caddy`
5. Verify at `https://cv.ops.eblu.me/`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/169
2026-02-12 11:09:41 -08:00
Forgejo Actions
a800bdc8b9 Update docs release to v1.6.9
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 21:28:40 -08:00
1fc0c1b18b Fix Dagger towncrier container using UTC instead of local time (#167)
## Summary
- The `build_changelog` Dagger container (`python:3.12-slim`) defaults to UTC, causing towncrier to stamp tomorrow's date when releases are cut in the evening PST.
- This is the root cause of the docs website (built via Dagger) showing Feb 12 while the repo CHANGELOG (built directly on the runner) correctly showed Feb 11.
- Fix: set `TZ=America/Los_Angeles` in the Dagger container before running towncrier.

## Verified
- `docker run --rm python:3.12-slim` → `towncrier _get_date()` returns `2026-02-12` (wrong)
- `docker run --rm -e TZ=America/Los_Angeles python:3.12-slim` → returns `2026-02-11` (correct)

## Test plan
- [ ] Merge, then trigger a build-blumeops release
- [ ] Verify the CHANGELOG date on https://docs.eblu.me/CHANGELOG matches the repo

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/167
2026-02-11 21:27:35 -08:00
Forgejo Actions
b36b30ef7a Update docs release to v1.6.8
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 21:12:50 -08:00
f3319c753c Update deploy-k8s-service doc with ProxyGroup ingress pattern (#166)
## Summary
- Updated the "Configure Ingress" section to use the current ProxyGroup pattern (`proxy-group: "ingress"`, `defaultBackend`, `tls.hosts`)
- Replaced the old per-ingress proxy example that used `rules:` with `host:` (which breaks ProxyGroup routing)
- Added key points explaining why `defaultBackend` is required and what each annotation does
- Updated checklist to mention ProxyGroup

## Test plan
- [ ] Review rendered doc for accuracy against existing ingress manifests

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/166
2026-02-11 21:10:42 -08:00
Forgejo Actions
0528a6f712 Update docs release to v1.6.7
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 18:07:11 -08:00
1fd54dbab7 Close Dagger CI plan (Phases 1-3 complete) (#165)
## Summary

- Move `adopt-dagger-ci.md` to new `docs/how-to/plans/completed/` archive
- Update plan status to "Phases 1–3 complete" with all verification checklists checked
- Update runner description to reflect what actually stayed (Docker CLI, Node.js needed)
- Document known issue: changelog dates still show UTC
- Create completed plans index, update plans and how-to indexes
- Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/165
2026-02-11 18:04:39 -08:00
Forgejo Actions
9e0487b523 Update docs release to v1.6.6
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 17:57:59 -08:00
20a25557d6 Bump runner image to v3.0.2 (restore Docker CLI)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 17:53:04 -08:00
24d2a55f9e Restore Docker CLI to runner image for Dagger engine (#164)
Some checks failed
Build Container / build (push) Failing after 3s
## Summary

- Dagger shells out to the `docker` binary to provision its BuildKit engine container
- Phase 3 removed `docker-ce-cli`, breaking all `dagger call` invocations in CI
- This restores `docker-ce-cli` (without buildx/skopeo — those aren't needed)

## Test plan

- [ ] Build locally, release as v3.0.2, update manifest, sync
- [ ] Trigger docs build workflow and verify Dagger engine starts

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/164
2026-02-11 17:49:33 -08:00
0006e6bf17 Bump runner image to v3.0.1 (restore Node.js)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 17:38:45 -08:00
23fb036b92 Restore Node.js to runner image for JavaScript Actions (#163)
Some checks failed
Build Container / build (push) Failing after 2s
## Summary

- Restores Node.js 20 LTS to the Forgejo runner job image
- `actions/checkout@v4` and other JavaScript Actions require `node` in the job container
- The Phase 3 simplification (PR #162) accidentally removed it, breaking all CI runs

## Changes

- `containers/forgejo-runner/Dockerfile`: Add `gnupg` (for nodesource GPG key) and Node.js 20 via nodesource
- Changelog fragment

## Test plan

- [ ] Merge, release as `forgejo-runner-v3.0.1`
- [ ] Update runner manifest to v3.0.1, sync, restart pod
- [ ] Trigger a workflow_dispatch and verify `actions/checkout` succeeds

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/163
2026-02-11 17:35:33 -08:00
3d84483513 Update runner job image to forgejo-runner:v3.0.0
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 17:27:50 -08:00
95364dcb48 Simplify runner image (Dagger Phase 3) (#162)
All checks were successful
Build Container / build (push) Successful in 1m13s
## Summary

With Phases 1 and 2 complete, the runner image no longer needs most of its bundled tools. This PR strips it down and adds what was missing.

**Removed** (now inside Dagger containers):
- Node.js 24.x
- Docker CLI + buildx plugin
- skopeo
- gnupg, lsb-release, xz-utils

**Added:**
- `tzdata` — fixes the TZ env var (#159, #160, #161) so `TZ=America/Los_Angeles` actually works
- `flyctl` — was being installed from scratch every release

**Workflow changes:**
- Remove "Ensure Dagger CLI" bootstrap steps from both workflows (Dagger is in the image)
- Remove "Install flyctl" step from build-blumeops (flyctl is in the image)
- Remove job-level `TZ` from build-blumeops (moved to runner configmap `runner.envs`)
- Set `TZ: America/Los_Angeles` in runner configmap so all job containers inherit it

## Deployment

After merge:
1. Build and release the new runner image: `mise run container-release forgejo-runner v2.0.0`
2. Sync the runner: `argocd app sync forgejo-runner`
3. Verify: `kubectl -n forgejo-runner exec deploy/forgejo-runner -c runner -- date` (but the real test is running a docs release and checking the changelog date)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/162
2026-02-11 17:24:20 -08:00
Forgejo Actions
996afbcf6f Update docs release to v1.6.5
[skip ci]
2026-02-11 17:10:29 -08:00
e84ffb7d7f Set TZ on build-blumeops workflow job (#161)
## Summary

The runner pod's `TZ` env var (#159, #160) doesn't propagate to workflow job containers — jobs run inside Docker containers spawned by the DinD sidecar, not in the runner process itself. Set `TZ: America/Los_Angeles` at the job level so `uvx towncrier build` uses the correct timezone.

This is the actual fix for the Feb 12 changelog dates. The runner pod TZ is still useful for runner daemon logs but doesn't affect job execution.

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/161
2026-02-11 17:06:44 -08:00
Forgejo Actions
6ce03df819 Update docs release to v1.6.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-12 01:01:23 +00:00
2a04ab26b7 Mount host zoneinfo into runner for TZ support (#160)
## Summary

The `TZ=America/Los_Angeles` env var from #159 has no effect because the `forgejo/runner` image doesn't ship tzdata. Mount the node's `/usr/share/zoneinfo` into the container so the timezone database is available.

## Deployment

After merge, sync forgejo-runner and verify:
```
argocd app sync forgejo-runner
kubectl -n forgejo-runner exec deploy/forgejo-runner -c runner -- date
# Should show PST/PDT, not UTC
```

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/160
2026-02-11 16:57:11 -08:00
42ebc2b122 Fix Forgejo runner timezone (UTC -> America/Los_Angeles) (#159)
## Summary

- Set `TZ=America/Los_Angeles` on the Forgejo runner container

The runner pod defaults to UTC. When releases are cut in the evening PST, towncrier stamps changelog entries with tomorrow's date (e.g., v1.6.2 shows 2026-02-12 despite being released on the evening of Feb 11 PST).

## Deployment

After merge, sync the forgejo-runner ArgoCD app:
```
argocd app sync forgejo-runner
```

The runner pod will restart with the new timezone. Note: the v1.6.2 changelog entry will remain dated 2026-02-12; future entries will use PST dates, so dates may appear non-sequential once.

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/159
2026-02-11 16:53:41 -08:00
Forgejo Actions
e5d1e795e0 Update docs release to v1.6.3
[skip ci]
2026-02-12 00:46:35 +00:00
b0bac91ca9 Fix frontmatter field name for Quartz date display (#158)
## Summary

- Rename `date-modified` -> `modified` in all 80 docs and the `docs-check-frontmatter` task

Quartz's `CreatedModifiedDate` plugin recognizes `modified`, `lastmod`, `updated`, and `last-modified` — but not `date-modified`. The wrong field name caused Quartz to ignore frontmatter dates entirely and fall through to filesystem timestamps (UTC inside Dagger), showing Feb 12 on pages built late on Feb 11 PST.

## Test plan

- [x] `mise run docs-check-frontmatter` passes
- [ ] Kick off docs release after merge — verify rendered dates match frontmatter values

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/158
2026-02-11 16:45:12 -08:00
Forgejo Actions
a75089d8ef Update docs release to v1.6.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-12 00:35:02 +00:00
b197bd5f58 Adopt Dagger CI for docs build (Phase 2) (#157)
## Summary

Migrates the docs build pipeline to Dagger (Phase 2 of the Dagger CI adoption plan).

- **Backfill `date-modified` frontmatter** on all 80 docs — Dagger's `--src=.` excludes `.git`, so Quartz can't use git history for page dates. Frontmatter dates work with or without git.
- **New `docs-check-frontmatter` mise task + pre-commit hook** — validates all docs have `title`, `tags`, and `date-modified`
- **New Dagger functions** — `build_changelog` (towncrier in Python container) and `build_docs` (chains changelog → Quartz build in Node container, returns tarball)
- **Simplified CI workflow** — the ~44-line inline Quartz build (clone, npm ci, build, tar, cleanup) is replaced by `dagger call build-docs`. Changelog step remains local on the runner since towncrier needs to modify the host working tree for the git commit.

### Design decisions

- **Towncrier runs twice in CI**: once inside Dagger (for the docs tarball) and once on the runner (for the git commit). This is intentional — Dagger's directory export is additive and can't delete the consumed changelog fragments from the host.
- **Artifact hosting stays on Forgejo Releases** (not migrated to Forgejo Packages as the plan doc originally suggested). That migration can happen independently.
- **`date-modified` frontmatter** preserved even though `build_changelog` installs git — the git there is only for towncrier's `git add` call, not for history. The local iteration story (`dagger call build-docs --src=. --version=dev` with uncommitted changes) depends on frontmatter dates.

### Local iteration

```bash
dagger call build-docs --src=. --version=dev export --path=./docs-dev.tar.gz
tar tf docs-dev.tar.gz | head -20
```

## Deployment and Testing

- [x] `dagger call build-docs --src=. --version=dev` produces valid 1.1MB tarball (149 HTML pages)
- [x] Pre-commit hooks pass (including new `docs-check-frontmatter`)
- [ ] Full `workflow_dispatch` run after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/157
2026-02-11 16:33:16 -08:00
738b39a321 Mark Phase 1 verification checklist items as complete
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 15:49:49 -08:00
1bc2b421a8 Adopt Dagger CI for container builds (Phase 1) (#156)
All checks were successful
Build Container / build (push) Successful in 13s
## Summary

- Add Dagger Python module (`.dagger/`) with `build` and `publish` functions for container images
- Replace Docker buildx + skopeo composite action with `dagger call publish` in `build-container.yaml`
- BuildKit's native push is compatible with Zot — **skopeo workaround eliminated**
- Add Dagger CLI (v0.19.11) to forgejo-runner Dockerfile, bump runner to v2.6.0
- Bootstrap step in workflow curl-installs dagger if not in runner (for first build on v2.5.1 runner)
- Delete old `.forgejo/actions/build-push-image/` composite action
- Add GPLv3 LICENSE

## Verified locally

- `dagger call build --src=. --container-name=nettest` — builds ✓
- `dagger call publish --src=. --container-name=nettest --version=dagger-test` — pushed to Zot ✓
- `dagger call build --src=. --container-name=forgejo-runner` — new runner image builds ✓
- Dagger CLI accessible inside built runner image ✓

## Deployment sequence (after merge)

1. `mise run container-tag-and-release forgejo-runner v2.6.0` — old runner bootstraps dagger via curl, builds new runner
2. `argocd app sync forgejo-runner` — runner restarts with v2.6.0 (dagger baked in)
3. `mise run container-tag-and-release nettest v0.13.0` — end-to-end test of new pipeline
4. `mise run container-list` — verify tags

## Not included (future phases)

- Phase 2: docs build + Forgejo packages migration
- Phase 3: runner simplification (remove skopeo, Node.js, etc.)
- Phase 4: future workflows

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/156
2026-02-11 15:38:31 -08:00
faebc98c3c Fix blumeops-tasks for Todoist API v1 migration (#155)
## Summary
- Migrate from deprecated Todoist REST API v2 (`410 Gone`) to new unified API v1
- Add cursor-based pagination for project and task listing endpoints
- Switch 1Password credential retrieval from `op item get --fields` to `op read`

## Testing
- [x] `mise run blumeops-tasks` returns all 9 tasks successfully
- [x] Pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/155
2026-02-11 14:33:37 -08:00
Forgejo Actions
362ae22ab7 Update docs release to v1.6.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 21:37:34 +00:00
3c4b5b6c10 Add changelog fragment for cache purge BusyBox fix
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 13:36:44 -08:00
cef7611cba Wrap fly ssh cache purge in sh -c for BusyBox
fly ssh console -C doesn't run through a shell, so && was passed as
literal arguments to rm. Wrap in sh -c to get proper shell parsing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 13:35:11 -08:00
Forgejo Actions
eca01a9546 Update docs release to v1.6.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 21:33:57 +00:00
0efcce2984 Purge Fly.io proxy cache after docs release (#154)
## Summary
- The Fly.io nginx proxy caches docs responses for 24h (`proxy_cache_valid 200 1d`)
- After a release, docs.eblu.me kept serving stale content until the cache expired
- This caused v1.5.4 to show v1.5.3 on the CHANGELOG page
- Adds `flyctl` install and `fly ssh console` cache purge steps to the build workflow, running after the ArgoCD deploy completes

## Test plan
- [ ] Next release should show the correct version on docs.eblu.me/CHANGELOG immediately
- [ ] Verify the `fly ssh console` command succeeds in the workflow logs

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/154
2026-02-11 13:33:26 -08:00
Forgejo Actions
ab6661f5dd Update docs release to v1.5.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 20:17:12 +00:00
a59ff04249 Review security-model.md (#153)
## Summary
- Fix Ansible secret example: replaced incorrect `op item get --fields` with `op read` to match project convention
- Add new "Tailscale Operator Privileges" section documenting the operator's namespaced RBAC and OAuth client permissions
- Stamp `last-reviewed: 2026-02-11`

## Review Notes
First review of this doc (previously never reviewed). Verified:
- All wiki-links resolve
- ACL structure matches actual `pulumi/tailscale/policy.hujson`
- TruffleHog pre-commit config exists as documented
- Ansible `op read` pattern matches actual usage in playbooks/roles

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/153
2026-02-11 12:16:32 -08:00
834c9fa57b Bump Fly.io proxy VM to 512MB, fix TruffleHog scanning (#152)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m37s
## Summary
- Bump Fly.io proxy VM memory from 256MB to 512MB — Alloy was OOM-killed, causing the Grafana Fly.io dashboard to lose metrics
- Fix TruffleHog pre-commit hook to scan only staged changes (`--since-commit HEAD`) instead of full repo history
- Sanitize example credential URL in Reolink camera plan doc

## Deployment and Testing
- [ ] Fly.io deploy triggers automatically on merge (workflow watches `fly/**`)
- [ ] After deploy, verify Alloy is running: `fly ssh console -a blumeops-proxy -C "ps aux"` should show alloy process
- [ ] Grafana Fly.io dashboard should start populating within ~1 minute

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/152
2026-02-11 12:03:51 -08:00
651fed8f1a Transcribe backlog tasks into plan documents (#151)
## Summary
- **adopt-oidc-provider:** Dex-based OIDC identity provider for SSO across services (status: Planning — service dependency/recovery design needed)
- **harden-zot-registry:** OIDC + API key auth and tag immutability for zot (depends on OIDC provider + Dagger CI)
- **forgejo-actions-dashboard:** Custom textfile Prometheus exporter + Grafana dashboard for Forgejo Actions CI metrics
- **operationalize-reolink-camera:** Cloud-free Frigate NVR with ONNX detection, NFS ring buffer recording to sifaka (depends on network segmentation)
- **add-unifi-pulumi-stack:** Expanded with NFS security motivation, BlumeOps Services subnet, IoT/appliance segregation, firewall rules

## Test plan
- [x] Pre-commit hooks pass (all 3 commits)
- [x] `docs-check-links` passes
- [x] `docs-check-index` passes
- [x] `docs-check-filenames` passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/151
2026-02-11 11:47:23 -08:00
430f2c6ec5 Add plans for Dagger CI/CD and upstream fork strategy (#150)
## Summary

Two new plan documents in `docs/how-to/plans/`:

- **adopt-dagger-ci** — Migrate CI/CD build logic from Forgejo Actions YAML to Dagger (Python SDK). Forgejo Actions stays as a thin trigger layer. Covers:
  - Container builds with local iteration (`dagger call build ... terminal`)
  - Docs builds with Forgejo packages migration (replacing Forgejo releases)
  - Runner simplification (only Docker + dagger CLI needed)
  - Secrets handling via Dagger's `Secret` type
  - Future: forked project builds, Python packages, pre-merge validation

- **upstream-fork-strategy** — Stacked-branch pattern for maintaining forks of upstream projects. Covers:
  - Daily automated rebase with conflict detection and issue creation
  - Branch model: `upstream/main` → `blumeops` → `feature/*`
  - Quartz fork as first instance, enabling `last-reviewed` frontmatter rendering in docs
  - Upstream PR path for contributing changes back

## Context

These plans emerged from evaluating alternatives to the GHA ecosystem (BuildKite, Concourse, Earthly) for CI/CD. Dagger was chosen for its local iteration story, Python-native pipelines, and zero-infrastructure requirements. The fork strategy is a prerequisite for customizing Quartz and other upstream tools.

Neither plan is ready for execution yet — they are design documents for future work.

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/150
2026-02-11 10:20:14 -08:00
Forgejo Actions
a106f92c38 Update docs release to v1.5.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 15:53:49 +00:00
aab19c97fe Restore docker buildx build (#149)
All checks were successful
Build Container / build (push) Successful in 40s
## Summary
- Switch build action back to `docker buildx build` now that runner v2.5.1 (with `docker-buildx-plugin`) is deployed

## Test plan
- [ ] Merge and tag `nettest-v0.12.0` to verify buildx works end-to-end

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/149
2026-02-10 21:21:19 -08:00
f0ac04fb8a Bootstrap buildx: revert to docker build, bump runner to v2.5.1 (#148)
All checks were successful
Build Container / build (push) Successful in 1m56s
## Summary
- Temporarily revert composite action to `docker build` so we can build the runner image (chicken-and-egg: current runner v2.5.0 doesn't have buildx)
- Bump runner label to `v2.5.1` so after sync the new runner image (with buildx) gets used

## Deployment plan
1. Merge this PR
2. Tag `forgejo-runner-v2.5.1` — builds with legacy `docker build` (one last time)
3. Sync forgejo-runner in ArgoCD to pick up the v2.5.1 label
4. Follow-up PR: switch action back to `docker buildx build`
5. Tag `nettest-v0.12.0` to verify buildx works end-to-end

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/148
2026-02-10 21:17:14 -08:00
2fc5aa82b1 Add docker-buildx-plugin to forgejo-runner (#147)
Some checks failed
Build Container / build (push) Failing after 3s
## Summary
- Install `docker-buildx-plugin` alongside `docker-ce-cli` in the forgejo-runner image
- Fixes `docker buildx build` failing with "unknown flag: --tag" from #146

## Test plan
- [ ] Merge and release `forgejo-runner-v2.5.1`
- [ ] Update runner configmap/labels if needed to use new image
- [ ] Re-tag `nettest-v0.11.1` (or `v0.12.0`) to verify build-container workflow succeeds

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/147
2026-02-10 21:14:29 -08:00
cb36f1784f Switch CI builds to docker buildx (#146)
Some checks failed
Build Container / build (push) Failing after 4s
## Summary
- Replace deprecated `docker build` with `docker buildx build` in the build-push-image composite action
- Remove redundant build/run comments from nettest Dockerfile

## Test plan
- [ ] Merge and tag `nettest-v1.1.0` (or similar) to trigger the build-container workflow
- [ ] Verify the build succeeds without the deprecation warning

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/146
2026-02-10 21:03:41 -08:00
19741021d6 Doc review for explanation/explanation.md 2026-02-10 16:00:01 -08:00
0dce806107 Add plan and reference card for UniFi Express 7 Pulumi stack (#145)
## Summary
- Rewrites the UniFi Pulumi plan doc to use filipowm/unifi Terraform provider via `pulumi package add terraform-provider` (replaces pulumiverse_unifi approach)
- Adds network segmentation goals (main/guest/IoT WiFi zones) and API key auth
- Creates UniFi reference card (`docs/reference/infrastructure/unifi.md`) with topology diagram
- Updates all documentation indexes (plans.md, how-to.md, hosts.md, reference.md)

## What's Deferred
Actual stack scaffolding (`pulumi/unifi/`), mise tasks, and `pulumi import` are blocked on switch purchase and cabling. The plan doc captures everything needed for a future execution session.

## Verification
- `docs-check-links` passes (all wiki-links resolve)
- `docs-check-index` passes (unifi.md referenced in reference.md)
- Pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/145
2026-02-10 15:36:13 -08:00
f65d11d55b Update BorgBase repo ID after recreation (#144)
## Summary
- Previous BorgBase repo (k04ljcd7) had corrupted segments from interrupted backup attempts
- Recreated as u3ugi1x1 (same US region, same SSH key, same append-only settings)
- Updates repo path in Ansible defaults and known_hosts hostname in tasks

## Post-merge
1. `mise run provision-indri -- --tags borgmatic`
2. `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'`
3. `mise x -- borgmatic create --repository borgbase-offsite --verbosity 1 --progress`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/144
2026-02-10 13:19:15 -08:00
0d5f48e2c2 Document op read vs op item get convention (#143)
## Summary
- Adds guidance to CLAUDE.md: use `op read` for secret values, `op item get` only for metadata
- Fixes the argocd login example which used `op item get --fields`
- `op item get --fields` wraps multi-line values in quotes, which corrupts keys and other secrets

Discovered while verifying the sifaka borg repokey in 1Password — hashes didn't match until we switched to `op read`.

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/143
2026-02-10 13:09:55 -08:00
d045a5d76a Add BorgBase offsite backup repository (#142)
## Summary
- Adds BorgBase as a second borgmatic repository for offsite backups (US region, append-only)
- SSH key managed via 1Password, deployed to indri by Ansible
- Borgmatic `ssh_command` configured to use the dedicated BorgBase key
- BorgBase host key pinned in known_hosts via Ansible

## Post-merge deployment steps
1. Provision borgmatic: `mise run provision-indri -- --tags borgmatic`
2. Initialize the BorgBase repo: `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'`
3. Export and store the borg repokey: `ssh indri 'borg key export ssh://k04ljcd7@k04ljcd7.repo.borgbase.com/./repo'` → save to 1Password
4. Verify first backup: `ssh indri 'mise x -- borgmatic create --repository borgbase-offsite --verbosity 1'`

## BorgBase setup (already done)
- Account created, API token in 1Password (`borgbase` item in blumeops vault)
- SSH keypair generated, stored in 1Password, public key uploaded to BorgBase (ID: 200815)
- Repository `indri-borgmatic` created (ID: k04ljcd7, US region, append-only, 2-day alert)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/142
2026-02-10 12:47:02 -08:00
54afa0750b Add how-to guide for restoring 1Password backup from borgmatic (#141)
## Summary
- New how-to guide at `docs/how-to/restore-1password-backup.md` with step-by-step procedure for extracting and decrypting a 1Password `.1pux` export from borgmatic backup
- **End-to-end verified**: extracted from today's borg archive, decrypted age key with openssl, decrypted .1pux with age → valid 31MB zip with vault data
- Cross-links added from: disaster-recovery, 1password, borgmatic, backups policy, and how-to index
- Updated disaster-recovery.md from TBD stub to include a procedures table

## Deployment and Testing
- [x] Verified full extraction + decryption flow against live borgmatic archive
- [x] `docs-check-links` passes — all wiki-links valid
- [ ] Review guide for clarity and completeness

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/141
2026-02-10 10:55:00 -08:00
b5746e62c2 Add migration plan for Forgejo brew-to-source transition (#140)
## Summary
- Add `docs/how-to/plans/migrate-forgejo-from-brew.md` — full Diataxis-style plan covering background, one-time migration steps, Ansible role changes (with exact code), verification checklist, and future considerations
- Add `docs/how-to/plans/plans.md` — new plans subdirectory index for upcoming migration/transition plans
- Update `docs/how-to/how-to.md` with a Plans section
- Update `docs/tutorials/exploring-the-docs.md` to mention plans in the doc structure table and quick-path sections for Owner and AI audiences

## Test plan
- [x] `docs-check-links` passes
- [x] `docs-check-index` passes
- [x] All pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/140
2026-02-10 10:18:53 -08:00
41dfae1f80 Add CNI conflict troubleshooting to restart-indri how-to (#139)
## Summary
- Documents a troubleshooting procedure for broken pod networking after unclean shutdown
- During minikube recovery, a stale `1-k8s.conflist` CNI config can override kindnet's `10-kindnet.conflist`, causing new pods to use bridge+firewall networking instead of kindnet's ptp — breaking pod-to-pod communication
- Covers symptoms (DNS failures, liveness probe timeouts), diagnosis steps, and the fix

## Context
Encountered this during the 2026-02-10 power outage. Immich, kiwix, and transmission were all crash-looping for ~8 hours due to the CNI conflict. The minikube ansible role's clean boot detection has been improved (#137) so this may not recur, but the troubleshooting guide is valuable if it does.

## Test plan
- [x] Documentation only — no code changes
- [x] Pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/139
2026-02-10 07:24:42 -08:00
e5d1e979b8 Add power infrastructure reference card (#138)
## Summary
- New `power.md` reference card documenting the full power chain: AC grid → Anker SOLIX F2000 battery → CyberPower CP1000PFCLCD UPS → homelab
- Lists all devices on the UPS (indri, sifaka, UniFi Express 7, Starlink)
- Replaced inline UPS entry on indri card with link to the new power card
- Added power card to reference index

Context: power chain was upgraded — the Anker battery station now sits between grid power and the UPS, providing extended runtime for grid outages.

## Test plan
- [x] docs-check-links, docs-check-index, docs-check-filenames all pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/138
2026-02-09 23:03:13 -08:00
d76d675b29 Fix minikube role skipping start when kubelet/apiserver are stopped (#137)
## Summary
- After a power loss, minikube's Docker container (host) restarts but kubelet/apiserver remain stopped
- The ansible role's status check used `--format='{{.Host}}'` which only examined the host VM state
- When host=Running but kubelet/apiserver=Stopped, the role skipped `minikube start`
- Fixed to use full `minikube status` exit code (returns non-zero when any component is unhealthy)
- Simplified all downstream conditions to use exit code instead of string matching

## Test plan
- [x] Verified the fix correctly skips `minikube start` when cluster is already fully running
- [x] Pre-commit hooks pass (ansible-lint, yamllint, etc.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/137
2026-02-09 23:03:01 -08:00
a5765f9cf2 Add op-backup mise task for encrypted 1Password disaster recovery (#136)
## Summary
- Adds `mise run op-backup` task that encrypts a 1Password .1pux export with `age` using the master password + secret key as passphrase, SCPs to indri for borgmatic pickup, then deletes the plaintext
- Adds `age` to the Brewfile
- Borgmatic already backs up `/Users/erichblume/Documents` on indri, which covers the `1password-backup/` subdirectory — no config change needed

## Disaster recovery
1. Restore borgmatic archive to retrieve the `.age` file
2. Open Emergency Kit from safety deposit box
3. `age --decrypt <file>.age > export.1pux` (passphrase: `{master_password}:{secret_key}`)
4. Open `.1pux` with 1Password or unzip to inspect

## Usage
```
# Export all vaults from 1Password desktop app as .1pux, then:
mise run op-backup ~/Documents/1Password-export.1pux

# Or run without args for interactive prompt:
mise run op-backup
```

## Test plan
- [ ] `brew install age`
- [ ] Export a test vault from 1Password as .1pux
- [ ] Run `mise run op-backup` with the export path
- [ ] Verify encrypted file appears on indri at `~/Documents/1password-backup/`
- [ ] Verify plaintext .1pux is deleted from gilbert
- [ ] Test decryption: `age --decrypt <file>.age > test.1pux` with password:secret_key
- [ ] Verify decrypted .1pux can be opened/unzipped

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/136
2026-02-09 20:37:39 -08:00
85e36cd807 Operations and observability for sifaka NAS (#135)
## Summary
- Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring
- Formalize existing `node_exporter` container under Ansible management
- Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus
- Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime)
- Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS
- Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles

## Prerequisites before deploy
- [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP)
- [ ] Verify `ssh eblume@sifaka 'docker ps'` works
- [ ] Run `mise run provision-sifaka` to deploy containers
- [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes
- [ ] `argocd app sync prometheus` + `argocd app sync grafana-config`

## Test plan
- [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics`
- [ ] Verify Prometheus targets page shows both sifaka jobs as UP
- [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135
2026-02-09 17:44:05 -08:00
4ee643a81d Serve friendly error page when Fly.io proxy upstreams are unreachable (#133)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m50s
## Summary
- Adds a branded 503 error page served when upstreams are unreachable (indri offline, Tailscale tunnel down, emergency shutoff, etc.)
- Stale cache is still served first when available (`proxy_cache_use_stale` takes priority)
- Test endpoint at `docs.eblu.me/_error` to preview the page without killing upstreams
- `proxy_intercept_errors on` also catches error responses returned by the upstream itself

## Files Changed
- `fly/error.html` — Self-contained error page (dark theme, links to BlumeOps repo)
- `fly/nginx.conf` — `error_page`, `internal` location, `/_error` test location, `proxy_intercept_errors`
- `fly/Dockerfile` — COPY error.html into image

## Test Plan
- [ ] Deploy to Fly.io
- [ ] Visit `docs.eblu.me/_error` to verify the page renders
- [ ] Optionally stop indri/Tailscale to confirm the page shows on real 502/503/504

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/133
2026-02-09 12:01:24 -08:00
959b6842bc Zero-downtime Fly.io deploys (#132)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m40s
## Summary
- Start nginx after Tailscale connects (community best practice for Tailscale sidecars)
- Switch to `bluegreen` deploy strategy — old machine serves until new one is healthy
- Replace top-level `[checks]` with `[[http_service.checks]]` — only service-level checks gate traffic routing ([confirmed by Fly.io staff](https://community.fly.io/t/clarifying-the-types-of-health-checks/20379))
- Remove sentinel file and nginx if-check (no longer needed)

Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable.

## Deployment and Testing
- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (watch `fly logs` and Grafana docs-apm)
- [ ] Confirm `fly checks list` shows the new service-level check passing

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/132
2026-02-09 11:34:19 -08:00
bd61da4f85 Fix 502 errors during Fly.io proxy deploys (#131)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m20s
## Summary
- Health check (`/healthz`) now returns 503 until Tailscale is connected
- `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds
- Fly.io keeps the old machine serving traffic during the ~7s startup window

Previously, nginx passed the health check immediately, Fly.io routed traffic to the new machine, but MagicDNS wasn't available yet — causing upstream DNS timeouts and 502s on every request until Tailscale connected.

## Deployment and Testing
- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard)
- [ ] Confirm health check transitions from 503 → 200 in `fly logs`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131
2026-02-09 11:07:36 -08:00
3415cad38c Log real client IPs via Fly-Client-IP header (#130)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 59s
## Summary
- Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header
- Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki
- Keeps `remote_addr` (the internal proxy IP) for debugging

Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs.

## Deployment and Testing
- [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory
- [ ] Verify in Grafana that new log lines include `client_ip` with real IPs
- [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130
2026-02-09 11:02:06 -08:00
Forgejo Actions
92a1081302 Update docs release to v1.5.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-09 15:30:21 +00:00
9e361cf38f Add docs-review task with last-reviewed frontmatter tracking (#129)
## Summary
- New `docs-review` mise task replaces `docs-review-random` — sorts docs by `last-reviewed` frontmatter field (never-reviewed first, then oldest)
- Updated review-documentation how-to to explain the new workflow and how to mark cards as reviewed
- Updated ai-assistance-guide task table to reference `docs-review`

## Test plan
- [x] `mise run docs-review` runs and shows staleness table + most stale doc
- [x] `mise run docs-review -- --limit 5` respects the limit flag
- [x] All pre-commit checks pass (links, index, filenames)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/129
2026-02-09 07:29:45 -08:00
c6f8fcd346 Fix fly-deploy WARNING by starting nginx before Tailscale (#128)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m3s
## Summary
- Start nginx before Tailscale in `start.sh` so port 8080 is bound immediately, eliminating the "app is not listening on the expected address" WARNING during `fly deploy`
- Switch `proxy_pass` to use a variable with `resolver 100.100.100.100 valid=30s` so nginx can start without resolving MagicDNS names at config load time
- DNS results cached 30s per worker — no per-request lookup overhead

## Context
The WARNING was a race condition: Fly checks for listeners right after the machine starts, but `start.sh` ran ~5-10s of Tailscale setup before starting nginx. The health check always passed later, but the warning was noisy.

## Test plan
- [ ] Merge and let the deploy-fly workflow trigger
- [ ] Check runner logs for absence of the WARNING
- [ ] Verify `docs.eblu.me` still serves correctly
- [ ] Verify `/healthz` still passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/128
2026-02-09 07:01:58 -08:00
a0b076172f Fix Immich/Homepage Ingress host matching, add missing service checks (#127)
## Summary

- Fix Immich Ingress `host: photos` causing 404 with ProxyGroup (same FQDN mismatch as Prometheus/Loki)
- Migrate Homepage from old per-service Tailscale proxy to shared ProxyGroup (was the last holdout)
- Add Immich and Navidrome to `services-check` HTTP endpoints

## Deployment Notes

- Already tested on branch: Immich and Homepage both return 200 via Caddy
- Homepage's old Helm-managed Ingress was deleted manually; ArgoCD may recreate it on sync — prune with `argocd app sync homepage --prune` after merge
- Old per-service `ts-homepage-*` pod in tailscale namespace can be cleaned up after confirming ProxyGroup works

## Test Plan

- [x] `curl https://photos.ops.eblu.me/` returns 200
- [x] `curl https://go.ops.eblu.me/` returns 200
- [ ] `mise run services-check` fully passes after merge

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/127
2026-02-08 22:12:50 -08:00
e6cf7e47e0 Restrict flyio-proxy ACLs to dedicated tag:flyio-target endpoints (#126)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m8s
## Summary
- Introduce `tag:flyio-target` so services must explicitly opt in to be reachable by the fly.io proxy
- Replace broad `tag:k8s` and `tag:homelab` grants with the new tag in the ACL rule and test
- Add `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to docs, loki, and prometheus Ingresses
- Switch Alloy push endpoints from `*.ops.eblu.me` (Caddy) to `*.tail8d86e.ts.net` (Tailscale Ingress)
- Update docs: flyio-proxy, caddy, tailscale, forgejo (future public access + security checklist), expose-service-publicly

## Manual step (not in PR)
Update the k8s operator OAuth client in the Tailscale admin console to include `tag:flyio-target` in its scope. Without this, the operator cannot assign the new tag to Ingress proxy nodes.

## Deployment order
1. **Pulumi ACLs** — `mise run tailnet-preview && mise run tailnet-up`
2. **OAuth client** — Manual update in Tailscale admin console
3. **K8s Ingresses** — `argocd app sync apps && argocd app sync docs loki prometheus`
4. **Fly.io proxy** — `mise run fly-deploy`
5. **Verify** — `mise run services-check`, check Grafana dashboards

## Test plan
- [ ] `mise run tailnet-preview` shows clean diff
- [ ] `argocd app diff docs`, `argocd app diff loki`, `argocd app diff prometheus` show only annotation additions
- [ ] After deploy: Grafana dashboards show continued log/metric flow
- [ ] `curl -sf https://docs.eblu.me` returns 200
- [ ] `mise run services-check` passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/126
2026-02-08 21:54:18 -08:00
7f41621c7f Migrate Ansible op calls to op read URI syntax (#125)
## Summary
- Convert all 12 `op item get ... --fields ... --reveal` calls in Ansible to the newer `op read "op://vault/item/field"` syntax
- Remove the `regex_replace` workaround on the Fly deploy token (no longer needed since `op read` returns clean unquoted values)
- Covers `ansible/playbooks/indri.yml`, `ansible/roles/caddy/tasks/main.yml`, `ansible/roles/jellyfin_metrics/tasks/main.yml`, and `ansible/roles/alloy/tasks/main.yml`

## Test plan
- [x] `mise run provision-indri -- --check --diff` dry run passes (ok=67, failed=0)
- [x] No `op item get` calls remain in `ansible/` directory
- [x] All pre-commit hooks pass (yaml, ansible-lint, TruffleHog, etc.)
- [ ] Full provision run after merge to confirm secrets resolve correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/125
2026-02-08 10:52:43 -08:00
234c46c302 Filter blumeops-tasks to hide future-dated tasks (#124)
## Summary
- Tasks with a due date are now only shown when due today or earlier
- Recurring tasks stay hidden until their next occurrence is actionable
- Tasks without a due date continue to always display

## Test plan
- [x] Ran `mise run blumeops-tasks` — verified 18 undated tasks display correctly
- [x] Confirmed "BlumeOps doc review" (due tomorrow) is correctly hidden

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/124
2026-02-08 10:38:44 -08:00
Forgejo Actions
c8d0af6644 Update docs release to v1.5.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 18:06:46 +00:00
cc54b4f565 Add Fly.io proxy observability via embedded Alloy (#123)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m16s
## Summary

- Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
- Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting)
- Create two Grafana dashboards: **Docs APM** (per-service view filtered by `host="docs.eblu.me"`) and **Fly.io Proxy Health** (aggregate proxy health across all upstream services)

## Changed Files

| File | Change |
|------|--------|
| `fly/nginx.conf` | Add JSON `log_format` + `access_log`, add `stub_status` endpoint |
| `fly/Dockerfile` | COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config |
| `fly/alloy.river` | **New** — Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Start Alloy after Tailscale, before nginx |
| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` | **New** — Docs APM dashboard |
| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` | **New** — Fly.io Proxy Health dashboard |
| `argocd/manifests/grafana-config/kustomization.yaml` | Register new dashboard configmaps |
| `docs/reference/services/flyio-proxy.md` | Document observability setup |

## Deployment and Testing

- [ ] `mise run fly-deploy` — rebuild container with Alloy
- [ ] `curl https://docs.eblu.me/` — generate traffic
- [ ] `fly logs -a blumeops-proxy` — verify Alloy startup
- [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}`
- [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}`
- [ ] `argocd app sync grafana-config` — deploy dashboards
- [ ] Verify dashboards show data in Grafana
- [ ] `mise run services-check` — no regressions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123
2026-02-08 10:05:38 -08:00
61862b44c0 Add docs.eblu.me and Fly.io to services-check (#122)
## Summary

- Add public docs (`docs.eblu.me`) and Fly.io healthz checks to `mise run services-check`

## Test plan

- [ ] `mise run services-check` includes new "Public services (via Fly.io)" section

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/122
2026-02-08 02:48:15 -08:00
6b1167a345 Fix Fly.io deploy token quoting (#121)
## Summary

- Strip literal quotes from Fly.io deploy token when syncing to Forgejo Actions secrets
- The `op` CLI wraps values containing spaces in quotes; the Fly.io token format is `FlyV1 <key>` (contains a space)
- This caused the CI deploy workflow to fail with a 401 auth error

## Test plan

- [x] `mise run provision-indri -- --tags forgejo_actions_secrets` succeeds
- [ ] Re-trigger deploy-fly workflow after merge — should authenticate successfully

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/121
2026-02-08 02:43:45 -08:00
Forgejo Actions
c46d55060d Update docs release to v1.5.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 10:37:30 +00:00
64a78422b1 Add Fly.io public reverse proxy for docs.eblu.me (#120)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 9s
## Summary

- Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale
- First service exposed: `docs.eblu.me` — the Quartz static docs site
- Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME
- Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow

## Key details

- Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed
- Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts
- nginx caches aggressively for the static site; health check is on the default_server block
- ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only
- DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev`

## Test plan

- [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok`
- [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status`
- [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert
- [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected)
- [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120
2026-02-08 02:36:19 -08:00
fbedaf2833 docs/expose-service-publicly pt2 - fly.io (#119)
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/119
2026-02-08 00:38:27 -08:00
ae2445c99a Add how-to guide for public service exposure via Cloudflare (#118)
## Summary
- Adds `docs/how-to/expose-service-publicly.md` documenting the full plan for exposing `docs.eblu.me` to the public internet
- Covers Cloudflare Tunnel + CDN architecture, DNS migration from Gandi, Caddy TLS changes, Pulumi IaC, k8s cloudflared deployment, and verification steps
- Pattern is reusable for future public services
- Marked as "Plan — not yet implemented" status

## Test plan
- [x] `docs-check-links` passes
- [x] `docs-check-index` passes
- [x] All pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/118
2026-02-07 22:09:52 -08:00
Forgejo Actions
11c76d4768 Update docs release to v1.4.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 05:45:40 +00:00
dc46eb7def Update all docs titles to human-readable (#117)
## Summary
- Updated frontmatter `title:` in all 63 doc cards from slug-case to human-readable (e.g. `borgmatic` → `Borgmatic`, `ai-assistance-guide` → `AI Assistance Guide`)
- Titles now closely match file stems so `[[wiki-links]]` render naturally without alternate anchor text
- Corrected titles that diverged from stems (e.g. `host-inventory` → `Hosts`, `grafana-alloy` → `Alloy`, `argocd-applications` → `Apps`)
- Deleted `title-test-alpha.md` and `title-test-beta.md` test cards and removed their reference index entry

## Deployment and Testing
- [x] `docs-check-links` passes — all wiki-links valid
- [x] `docs-check-index` passes
- [x] `docs-check-filenames` passes
- [ ] Verify titles render correctly on docs site after deploy

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/117
2026-02-07 21:44:57 -08:00
Forgejo Actions
ab7efd8c1c Update docs release to v1.4.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 05:27:23 +00:00
a7d6d44d3d Remove title slug check and test duplicate titles (#116)
## Summary
- Remove `docs-check-titles` pre-commit hook and mise task — wiki-links resolve by filename stem, not frontmatter title, so slug-format titles and uniqueness aren't needed
- Add two test cards (`title-test-alpha`, `title-test-beta`) with identical `title: Title Test Card` to verify duplicate titles don't break Quartz or obsidian.nvim
- Retitle `index.md` from `blumeops-documentation` to `BlumeOps`
- Add GitHub and Forgejo repo links to homepage intro

## Test plan
- [ ] Deploy to docs site and verify both test cards render and cross-link correctly
- [ ] Verify homepage title renders as "BlumeOps"
- [ ] Verify repo links on homepage work
- [ ] After confirming, remove test cards in a follow-up

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/116
2026-02-07 21:26:18 -08:00
Forgejo Actions
3f5017f732 Update docs release to v1.4.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 05:03:34 +00:00
8e4afe77e0 Add Gandi DNS docs and rewrite homepage intro (#115)
## Summary
- New reference card (`docs/reference/infrastructure/gandi.md`) covering DNS records, Pulumi config, TLS integration
- New how-to guide (`docs/how-to/gandi-operations.md`) for DNS deployment and PAT cycling with `pbpaste` shortcut
- Rewritten homepage intro for wider audience ahead of public docs.eblu.me
- Cross-linked from reference index, routing, caddy, and how-to index
- Fixed PAT expiration inaccuracy in `pulumi/gandi/README.md` (max is 90 days, not 30)

## Test plan
- [ ] Verify wiki-links resolve in Quartz build
- [ ] Review gandi reference card for accuracy
- [ ] Review gandi-operations how-to for accuracy
- [ ] Check homepage reads well for external visitors

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/115
2026-02-07 21:02:10 -08:00
2cb76de5a2 Fix doc tag inconsistencies and add missing ai changelog type (#114)
## Summary
- Add missing `ai` changelog fragment type to the update-documentation how-to guide (comment and table)
- Consolidate `cicd` → `ci-cd` tag on forgejo.md
- Consolidate `network` → `networking` tag on routing.md and tailscale.md

Found during random doc review via `docs-review-tags`.

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/114
2026-02-06 18:52:36 -08:00
cb343a2e35 Rename doc-* mise tasks to docs-check-* / docs-review-* (#113)
## Summary
- Rename 4 automated-check tasks: `doc-titles` → `docs-check-titles`, `doc-filenames` → `docs-check-filenames`, `doc-links` → `docs-check-links`, `doc-index` → `docs-check-index`
- Rename 3 interactive-review tasks: `doc-random` → `docs-review-random`, `doc-tags` → `docs-review-tags`, `doc-stale` → `docs-review-stale`
- Update all references in `.pre-commit-config.yaml`, `ai-assistance-guide.md`, and `review-documentation.md`
- Historical changelog entries left as-is

## Test plan
- [x] `mise run docs-check-titles` exits 0
- [x] `mise run docs-check-links` exits 0
- [x] `mise run docs-review-tags` exits 0
- [x] `mise run doc-titles` fails with "no task found"
- [x] All pre-commit hooks pass (including renamed hook IDs)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/113
2026-02-06 07:08:46 -08:00
060c7a24e3 Review exploring-the-docs and add doc consistency checks (#112)
## Summary
- Reviewed and cleaned up exploring-the-docs tutorial: simplified wiki-links, fixed broken replication/ reference, added Related section, corrected zk-docs flags to match CLAUDE.md
- Added orphan detection to doc-links (finds docs not linked from any other doc)
- Added new doc tooling: `doc-index` (checks category index coverage), `doc-stale` (staleness report), `doc-tags` (tag inventory)
- Added `doc-index` as a pre-commit hook
- Updated use-pypi-proxy to document env-var-based proxy toggle for pip/uv
- Updated ai-assistance-guide with new doc task descriptions

## Test plan
- [ ] Run `mise run doc-links` — passes
- [ ] Run `mise run doc-index` — passes
- [ ] Run `mise run doc-stale` — informational output
- [ ] Run `mise run doc-tags` — informational output
- [ ] Pre-commit hooks pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/112
2026-02-05 21:12:06 -08:00
61c5328ec2 Update restart-indri docs after power outage recovery (#111)
## Summary
- Simplified restart-indri startup procedure to match reality (most services autostart via mcquack LaunchAgents and brew services)
- Added minikube tailscale serve port fix step (`mise run provision-indri -- --tags minikube`)
- Added Anker SOLIX F2000 GaNPrime UPS to indri reference card

## Context
After a power outage, discovered that the restart-indri docs overstated what needs manual intervention. Docker Desktop, Forgejo, Caddy, and all mcquack services autostart. Only Amphetamine, AutoMounter, and minikube need manual action.

## Test plan
- [ ] Verify restart-indri doc reads clearly
- [ ] Verify indri ref card UPS entry renders correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/111
2026-02-05 21:05:35 -08:00
3b4ff91469 Fix homepage Admin bookmark icons (#110)
## Summary
- Fix broken Pulumi icon: changed `pulumi` to `si-pulumi` (Simple Icons prefix required)
- Fix broken ArgoCD icon: changed `argocd` to `argo-cd` (Dashboard Icons uses hyphenated name)

## Deployment and Testing
- [ ] Sync homepage app in ArgoCD
- [ ] Verify icons appear on go.ops.eblu.me Admin bookmarks section

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/110
2026-02-05 06:29:39 -08:00
ab1386b015 Fix zk-docs 2026-02-04 17:31:21 -08:00
Forgejo Actions
808bc507d8 Update docs release to v1.3.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-05 01:22:10 +00:00
3da455e49c Enforce unique doc filenames and simple wiki-links (#109)
## Summary
- Rename section index files to match their titles (tutorials.md, reference.md, how-to.md, explanation.md) so all filenames are unique
- Convert all ~47 path-based wiki-links to simple filename format across 15 files
- Update doc-filenames task to no longer skip index.md files
- Update doc-links task to reject path-based links containing '/'

This ensures all wiki-links work correctly in obsidian.nvim by making links resolvable by filename alone.

## Testing
- `mise run doc-filenames` - all unique
- `mise run doc-links` - no broken or path-based links
- `mise run doc-titles` - no duplicates

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/109
2026-02-04 17:21:34 -08:00
Forgejo Actions
a03a9faaad Update docs release to v1.3.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 22:40:18 +00:00
7aa0e60b27 Add how-to guide for restarting indri (#108)
## Summary
- Add `docs/how-to/restart-indri.md` with safe shutdown and startup procedures
- Add `docs/reference/services/automounter.md` documenting the SMB share automounter app
- Update indri reference card with GUI applications section
- Update how-to and reference indexes

## Test plan
- [ ] Review restart-indri.md for accuracy
- [ ] Verify AutoMounter details are correct
- [ ] Perform the actual restart using this guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/108
2026-02-04 14:39:48 -08:00
74bd5abe54 Add IaC for Forgejo Actions secrets via Ansible (#107)
## Summary
- New `forgejo_actions_secrets` Ansible role syncs repository-level Actions secrets from 1Password to Forgejo via the Forgejo API
- Replaces manual process of copying secrets from 1Password to Forgejo UI
- Documents the one-time PAT setup requirement in forgejo.md

## Manual Setup Required
Before this role can run, a Forgejo PAT must be created:
1. Go to https://forge.ops.eblu.me/user/settings/applications
2. Create a new token with `write:repository` scope
3. Store it in 1Password → "Forgejo Secrets" item → `api-token` field

This has already been done.

## Test Plan
- [x] Ran `mise run provision-indri -- --tags forgejo_actions_secrets` successfully
- [x] Verified secret synced (API returned 204 = updated existing)
- [x] Ansible-lint passes

🤖 Generated with [Claude Code](https://claude.ai/code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/107
2026-02-04 09:11:01 -08:00
Forgejo Actions
e15caec898 Update docs release to v1.3.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 16:47:27 +00:00
95610d8e54 Fix Quartz build to preserve git history for accurate file dates (#106)
## Summary

Fixes the "isn't yet tracked by git, dates will be inaccurate" warnings by using Quartz's `-d docs` flag instead of symlinking.

## Problem

The previous approach symlinked `content -> docs`, but git doesn't follow symlinks. When Quartz asked git about `content/index.md`, git had no history for that path.

## Solution

Use `npx quartz build -d docs` to tell Quartz to read from `docs/` directly. Now when Quartz asks git about `docs/index.md`, git finds the actual file history.

- CHANGELOG.md is copied (not symlinked) into `docs/` for the build, then removed
- All other files have accurate git-based dates

## Testing

Tested locally - build produces no warnings.

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/106
2026-02-04 08:46:47 -08:00
Forgejo Actions
4aeade1543 Update docs release to v1.3.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 16:26:24 +00:00
03bda41de4 Fix Quartz build to preserve git history for accurate file dates (#105)
## Summary

Fixes the "isn't yet tracked by git, dates will be inaccurate" warnings in the Build docs step by restructuring how Quartz builds the documentation.

## Problem

Previously, we copied docs into Quartz's content folder. Since this was inside a fresh Quartz clone with no history of our files, the `CreatedModifiedDate` plugin couldn't determine accurate dates.

## Solution

Build Quartz from within the blumeops repo instead:
1. Copy Quartz's build system (quartz/, package.json, etc.) into the workspace
2. Symlink `content` -> `docs` (preserves git history)
3. Symlink `docs/CHANGELOG.md` -> `../CHANGELOG.md`
4. Build from workspace root where git can trace file history
5. Clean up artifacts after creating tarball

## Deployment and Testing

- [ ] Run build workflow and verify no "not tracked by git" warnings
- [ ] Verify file dates appear correctly on built docs site

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/105
2026-02-04 08:25:46 -08:00
Forgejo Actions
1835e3e80e Update docs release to v1.3.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 16:14:08 +00:00
efdd569285 Improve build workflow with version bump selection and changelog in releases (#104)
## Summary

- Add `version_type` choice input with options: BUMP_PATCH (default), BUMP_MINOR, BUMP_MAJOR, SPECIFIC_VERSION
- Add optional `specific_version` input for explicit version selection
- Include changelog content in Forgejo release body under "What's Changed" section
- Move CHANGELOG.md to repository root (still copied into docs during Quartz build)
- Add CHANGELOG link to docs index page
- Update doc-links script to recognize build-time docs from repo root

## Changes

**Workflow inputs:**
- Previously: single optional `version` string input
- Now: `version_type` choice dropdown (defaults to BUMP_PATCH) + optional `specific_version` for explicit versions

**Release body:**
- Previously: just asset download instructions
- Now: includes "What's Changed" section with changelog entries for this release

**CHANGELOG.md location:**
- Previously: `docs/CHANGELOG.md`
- Now: `CHANGELOG.md` (repo root), copied into docs content during build

## Deployment and Testing

- [ ] Run build workflow with BUMP_PATCH (default)
- [ ] Run build workflow with BUMP_MINOR
- [ ] Verify changelog appears in release body
- [ ] Verify docs site includes CHANGELOG page

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/104
2026-02-04 08:13:16 -08:00
e720b524d3 Rename indri-services-check to services-check (#103)
## Summary
- Rename `indri-services-check` task to `services-check` since it checks all services (indri native, Kubernetes, HTTP endpoints), not just indri-specific ones
- Update references in CLAUDE.md, ai-assistance-guide.md, and troubleshooting.md

## Deployment and Testing
- [ ] Run `mise run services-check` to verify the task works under its new name

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/103
2026-02-04 07:49:15 -08:00
aabfcf6fc0 Document Forgejo Actions secrets (#102)
## Summary
- Add "Forgejo Actions Secrets" section to forgejo reference card
- Document that `ARGOCD_AUTH_TOKEN` is used by `build-blumeops.yaml` workflow
- Note that secrets are stored in 1Password but manually copied to Forgejo (no auto-sync)
- Add missing `build-blumeops.yaml` to workflows list
- Clarify distinction between server config secrets (1Password → Ansible) vs CI/CD secrets (Forgejo UI)

## Context
The forgejo-runner ArgoCD app was showing OutOfSync because a previous attempt stored `argocd_token` in the ExternalSecret. This was incorrect - the token is actually a Forgejo Actions secret, not a k8s secret. Synced the app to remove the drift and added documentation to prevent future confusion.

🤖 Generated with [Claude Code](https://claude.ai/code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/102
2026-02-04 07:32:32 -08:00
1e13d4b83d Fix Navidrome automatic library scan schedule (#101)
## Summary
- Fix env var name from `ND_SCANSCHEDULE` to `ND_SCANNER_SCHEDULE` (Navidrome uses viper config where dots become underscores)
- Use explicit `@every 1h` format for clarity
- Reorder CLAUDE.md rules to emphasize running zk-docs first

## Root Cause
Navidrome logs showed "Periodic scan is DISABLED" at startup despite the env var being set. The config key is `scanner.schedule`, which translates to `ND_SCANNER_SCHEDULE` (not `ND_SCANSCHEDULE`).

## Deployment and Testing
- [ ] Sync navidrome app: `argocd app sync navidrome`
- [ ] Verify pod restarts with new env var
- [ ] Check logs for "Scheduling scanner" message instead of "Periodic scan is DISABLED"
- [ ] Wait ~1 hour and confirm scan runs automatically

🤖 Generated with [Claude Code](https://claude.ai/code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/101
2026-02-04 07:23:12 -08:00
72f9f21d46 Remove iCloud Photos from borgmatic backup (#100)
## Summary
- Remove ~/Pictures from borgmatic source directories
- Update borgmatic and backup policy documentation
- Add Sifaka-Native Data section to clarify that photos (via Immich), music (via Navidrome), and video (via Jellyfin) are stored directly on Sifaka

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --tags borgmatic --check --diff` to preview changes
- [ ] Run `mise run provision-indri -- --tags borgmatic` to apply
- [ ] Verify borgmatic config no longer includes ~/Pictures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/100
2026-02-04 07:09:28 -08:00
12c92de9e1 Add troubleshooting how-to to zk-docs (#99)
## Summary
- Add the troubleshooting how-to guide to the zk-docs script output
- Placed after how-to/index.md to group how-to content together

## Deployment and Testing
- [x] Verified script runs successfully with `mise run zk-docs`
- [x] Changelog fragment added

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/99
2026-02-04 06:44:23 -08:00
Forgejo Actions
e405a48881 Update docs release to v1.2.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 05:18:37 +00:00
5c79a8dbe2 Add doc-random task and documentation improvements (#98)
## Summary
- Add `doc-random` mise task that selects a random documentation card for review
- Add how-to/knowledgebase section with review-documentation guide
- Add Caddy reference card with proxy configuration details
- Fix replication tutorial sequence (tailscale-setup now links to core-services)
- Fix "BluemeOps" typo in tailscale-setup
- Clean up obsolete zk/ directory references from doc-links

## Deployment and Testing
- [x] `mise run doc-random` works and displays a random card
- [x] `mise run doc-links` passes (all wiki-links valid)
- [x] Pre-commit hooks pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/98
2026-02-03 21:17:58 -08:00
Forgejo Actions
f88da51e23 Update docs release to v1.2.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 04:53:30 +00:00
f8f11121eb Complete Phase 6: documentation cleanup and integration (#97)
## Summary
- Delete `docs/zk/` directory - all useful content migrated to structured docs
- Delete `docs/README.md` - `docs/index.md` is now the documentation root
- Add `devpi` reference card and `use-pypi-proxy` how-to guide
- Add maintenance notes to `indri` reference (sleep prevention, passwordless sudo)
- Add iCloud Photos backup note to `borgmatic` reference
- Rewrite `zk-docs` mise task to prime AI context with key docs instead of legacy cards
- Update `CLAUDE.md` and `README.md` to remove zk references
- Update `exploring-the-docs` with AI context priming section

This completes the Diataxis documentation restructuring. All six phases are now done.

## Deployment and Testing
- [x] Pre-commit hooks pass (including doc-links validator)
- [ ] Build and deploy to docs.ops.eblu.me to verify rendering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/97
2026-02-03 20:52:37 -08:00
Forgejo Actions
16cdffaebf Update docs release to v1.1.5
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 04:34:31 +00:00
0a28622751 Add Phase 5: explanation documentation (#96)
## Summary
- Create `docs/explanation/` directory with index and three explanation articles
- why-gitops: Philosophy of GitOps for homelabs (memory, rollback, AI context)
- architecture: How pieces fit together (ASCII diagrams of hosts, data flow, secrets)
- security-model: Tailscale zero-trust, 1Password secrets, access control philosophy
- Update docs/index.md with How-to and Explanation section links
- Update exploring-the-docs to link Explanation section

Decision log deferred to future work.

## Deployment and Testing
- [x] Pre-commit hooks pass (including doc-links validator)
- [ ] Build and deploy to docs.ops.eblu.me to verify rendering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/96
2026-02-03 20:33:39 -08:00
Forgejo Actions
e426473c59 Update docs release to v1.1.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 04:18:04 +00:00
e311b36b3c Add Phase 4: how-to guides documentation (#95)
## Summary
- Create `docs/how-to/` directory with index and four how-to guides
- deploy-k8s-service: Quick reference for Kubernetes deployments via ArgoCD
- add-ansible-role: Adding new Ansible roles for indri services
- update-tailscale-acls: Modifying Tailscale ACL policies via Pulumi
- troubleshooting: Diagnosing and fixing common issues
- Update exploring-the-docs to include How-to section links
- Update README.md to mark Phase 4 as complete

## Deployment and Testing
- [x] Pre-commit hooks pass (including doc-links validator)
- [ ] Build and deploy to docs.ops.eblu.me to verify rendering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/95
2026-02-03 20:17:24 -08:00
Forgejo Actions
672dbda9d7 Update docs release to v1.1.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 03:07:15 +00:00
d0e37b8b91 Fix towncrier fragment format: use flat <name>.<type>.md
The towncrier config uses the type's `directory` field as the type
identifier in filenames, NOT as subdirectories. Correct format:
  docs/changelog.d/<name>.<type>.md

NOT:
  docs/changelog.d/<type>/<name>.md

- Move fragments to root with type suffix
- Remove empty type subdirectories
- Fix CLAUDE.md instructions
- Fix tutorial examples in contributing.md and ai-assistance-guide.md

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 19:06:14 -08:00
Forgejo Actions
f279891575 Update docs release to v1.1.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 03:02:13 +00:00
c036562dfe Fix towncrier fragment naming and CLAUDE.md instructions
Fragments in subdirectories should be named `<name>.md`, not
`<name>.<type>.md` - the type is already indicated by the directory.

- Rename feature/auto-deploy-docs.feature.md → feature/auto-deploy-docs.md
- Rename misc/+container-tag-no-confirm.misc.md → misc/+container-tag-no-confirm.md
- Update CLAUDE.md with correct fragment path format

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 19:00:26 -08:00
Forgejo Actions
81d99b689d Update docs release to v1.1.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 02:53:17 +00:00
7ebac4aef6 Add Phase 3 tutorials with audience targeting (#94)
## Summary
- Create tutorials directory structure with index page
- Add 5 main tutorials targeting different audiences:
  - **what-is-blumeops** (Reader, AI) - High-level orientation
  - **exploring-the-docs** (All) - Navigation guide
  - **ai-assistance-guide** (AI, Owner) - Context for AI-assisted operations
  - **contributing** (Contributor) - First contribution workflow
  - **replicating-blumeops** (Replicator) - Overview for building similar setup
- Add 4 replication sub-tutorials:
  - tailscale-setup, kubernetes-bootstrap, argocd-config, observability-stack
- Update README.md to mark Phase 3 complete
- Add changelog fragment

Each tutorial explicitly identifies its target audiences and links to reference material rather than re-explaining concepts.

## Deployment and Testing
- [x] All pre-commit hooks pass (doc-links validates wiki links)
- [ ] Build docs via workflow to verify rendering
- [ ] Review content for accuracy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/94
2026-02-03 18:51:57 -08:00
Forgejo Actions
bf03d71780 Update docs release to v1.1.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 01:27:09 +00:00
0d2fdebd4d Add changelog fragment for auto-deploy docs feature
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:25:21 -08:00
82bcd935cd Move DOCS_RELEASE_URL from ConfigMap to Deployment
This ensures ArgoCD sync triggers a pod rollout when the URL changes,
since ConfigMap data changes don't restart pods automatically.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:23:52 -08:00
Forgejo Actions
103cc0deab Update docs release to v1.0.14
- Updated configmap with new DOCS_RELEASE_URL
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 01:18:33 +00:00
aaf5090509 Remove ARGOCD_AUTH_TOKEN from external secret
Workflow secrets come from Forgejo's secret store, not runner env.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:17:53 -08:00
Forgejo Actions
492aa9a104 Update docs release to v1.0.13
- Updated configmap with new DOCS_RELEASE_URL
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 01:15:22 +00:00
3a26d7e49a Update forgejo-runner image to v2.5.0
Fixes argocd CLI download.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:13:37 -08:00
09317fecc1 Fix argocd CLI download in forgejo-runner image
All checks were successful
Build Container / build (push) Successful in 1m40s
- Add -L flag to follow redirects
- Add -f flag to fail on HTTP errors
- Use dpkg --print-architecture as fallback for TARGETARCH
- Verify binary works after download

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:11:36 -08:00
Forgejo Actions
4d3222d91b Update docs release to v1.0.12
- Updated configmap with new DOCS_RELEASE_URL
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 01:07:05 +00:00
f08595a3c0 Update forgejo-runner image to v2.4.0
Includes uv and argocd CLI for auto-deploy workflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:05:09 -08:00
e4d4e7b70e Add changelog fragment for container-tag-and-release change
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:00:53 -08:00
8e1cc4c638 Remove confirmation prompt from container-tag-and-release
All checks were successful
Build Container / build (push) Successful in 1m35s
Allows the task to run non-interactively.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:59:55 -08:00
1f73eb675d Auto-deploy docs from build workflow (#93)
## Summary
- Add `uv` and `argocd` CLI to forgejo-runner container image
- Add `workflow-bot` ArgoCD account with sync permissions (declarative via kustomize patches)
- Add `ARGOCD_AUTH_TOKEN` to forgejo-runner external secret for workflow auth
- Update build workflow to auto-deploy docs after release:
  - Update configmap with new release URL
  - Commit changelog and configmap changes
  - Sync docs app via ArgoCD

## Deployment and Testing
Manual steps required before this can work:
1. [ ] Build and push new forgejo-runner image (v2.4.0)
2. [ ] Sync argocd app to create workflow-bot account
3. [ ] Generate token: `argocd account generate-token --account workflow-bot`
4. [ ] Store token in 1Password under "Forgejo Secrets" with field `argocd_token`
5. [ ] Sync forgejo-runner app to pick up new external secret
6. [ ] Update forgejo-runner deployment to use new image version
7. [ ] Test by running workflow manually

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/93
2026-02-03 16:58:03 -08:00
7d5e6b032b Update docs release to v1.0.11
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:40:06 -08:00
2fef0c120d Fix reference/index wiki-link and support path-based links in doc-links
- Use [[reference/index | reference]] for ambiguous index.md link
- Update doc-links to validate both filenames and paths

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:38:28 -08:00
31564d1d9a Update docs release to v1.0.10
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:32:17 -08:00
9630655cc9 Switch to filename-based wiki-links (Quartz resolves by filename)
- Convert all wiki-links from title-based to filename-based
- Update doc-links to validate against filenames
- Add doc-filenames task for duplicate filename detection
- Consolidate doc hooks into single local block in pre-commit config

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:29:31 -08:00
d359583d0a Update docs release to v1.0.9
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:23:02 -08:00
cd2623760c Add unspaced pipe detection to doc-links validation
Wiki-links with display text must use spaces around the pipe:
- Good: [[target | display]]
- Bad: [[target|display]]

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:20:12 -08:00
9261ec7679 Add spaces around pipe in wiki-links (test for Quartz resolution)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:17:20 -08:00
46a5c3a20f Update docs release to v1.0.8
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:07:54 -08:00
Forgejo Actions
b4c930ba78 Release v1.0.8: Update changelog
Built changelog from towncrier fragments.

[skip ci]
2026-02-04 00:06:51 +00:00
3e4b5c2dd3 Convert wiki-link titles to lowercase slugs (#92)
## Summary
- Convert all frontmatter titles to lowercase-hyphenated format (e.g., `grafana-alloy` instead of `Grafana Alloy`)
- Update all wiki-links to use the new slug format
- Update `doc-titles` task to validate slug format (lowercase, hyphens only)

Quartz appears to require titles without spaces for wiki-link resolution.

## Deployment and Testing
- [x] Pre-commit hooks pass (`doc-titles` and `doc-links`)
- [ ] Build docs v1.0.8 and deploy
- [ ] Verify wiki-links resolve correctly (e.g., `[[grafana-alloy]]`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/92
2026-02-03 16:06:35 -08:00
8d7863e61d Update docs release to v1.0.7
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 15:58:01 -08:00
Forgejo Actions
ceb5fad7e4 Release v1.0.7: Update changelog
Built changelog from towncrier fragments.

[skip ci]
2026-02-03 23:55:57 +00:00
01adc4cf0f Switch to title-based wiki-links (#91)
## Summary
- Remove aliases from all zk cards to prevent them from capturing wiki-links
- Convert all wiki-links from `[[filename|Title]]` to `[[Title]]` format
- Replace `doc-filenames` task with `doc-titles` for duplicate title detection
- Update pre-commit hook to use `doc-titles`

Wiki-links now resolve to reference docs by their frontmatter title, which is more readable and maintainable than filename-based links.

## Deployment and Testing
- [x] Pre-commit hooks pass (including new `doc-titles` check)
- [x] Manually verified zk cards have aliases removed
- [ ] Deploy docs v1.0.7 and verify wiki-links resolve correctly
- [ ] Test links to reference docs (e.g., [[Grafana Alloy]], [[ArgoCD]])

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/91
2026-02-03 15:55:31 -08:00
6162179ac9 Update docs release to v1.0.6
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 15:35:14 -08:00
Forgejo Actions
690b2c9405 Release v1.0.6: Update changelog
Built changelog from towncrier fragments.

[skip ci]
2026-02-03 23:31:22 +00:00
4ccb7b9a26 Fix wiki-links to use filename-based resolution (#90)
## Summary
- Quartz's "shortest" path mode resolves wiki-links by **filename**, not frontmatter title
- Previous PR used title-based links like `[[Grafana Alloy]]` which looked for non-existent `Grafana-Alloy.md`
- Now using filename-based links like `[[alloy|Grafana Alloy]]` which correctly resolve

## Changes
- Rename zk duplicate files with `-log` suffix (e.g., `argocd.md` → `argocd-log.md`)
- Rename `reference/storage/postgresql.md` to `postgresql-storage.md`
- Convert all 175 wiki-links from `[[Title]]` to `[[filename|Title]]` format
- Rename `doc-card-titles` task to `doc-filenames` (checks filename uniqueness, not titles)
- Update pre-commit hook for renamed task

## Deployment and Testing
- [x] Pre-commit hooks pass
- [x] `mise run doc-filenames` shows no duplicate filenames
- [ ] Verify wiki-links work correctly in Quartz build

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/90
2026-02-03 15:30:42 -08:00
8f427beeab Update docs release to v1.0.5
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 15:07:07 -08:00
Forgejo Actions
f93ebdfd5b Release v1.0.5: Update changelog
Built changelog from towncrier fragments.

[skip ci]
2026-02-03 23:05:35 +00:00
a45a78b478 Convert wiki-links to title-based format (#89)
## Summary
- Add `doc-card-titles` mise task to enumerate all doc cards by title/id and detect duplicates
- Remove redundant aliases from zk cards where alias matched the id
- Rename `reference/storage/postgresql.md` title to "PostgreSQL Storage" to avoid duplicate with `reference/services/postgresql.md`
- Convert all 175 path-based wiki-links `[[reference/path|Title]]` to title-based `[[Title]]` format
- Add pre-commit hook to check for duplicate card titles on doc changes

## Deployment and Testing
- [x] Pre-commit hooks pass
- [x] `mise run doc-card-titles` shows no duplicates
- [ ] Verify wiki-links work correctly in Quartz build

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/89
2026-02-03 15:04:38 -08:00
ae64021224 Update docs release to v1.0.4
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 14:43:37 -08:00
b294caa734 Fix wiki-link paths in reference docs
Links like [[services/foo]] need to be [[reference/services/foo]]
to resolve correctly from the docs root.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 14:41:55 -08:00
9904429562 Update docs release to v1.0.3
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 14:37:18 -08:00
236a8ac2c0 Link reference section from docs index
Add navigation to Reference section and clarify that README is
temporary home during Diataxis restructuring.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 14:32:59 -08:00
Forgejo Actions
79bccde1a7 Release v1.0.2: Update changelog
Built changelog from towncrier fragments.

[skip ci]
2026-02-03 22:28:00 +00:00
254b93096a Phase 2: Add Reference section with 24 technical reference cards (#88)
## Summary
- Create `docs/reference/` section with 24 technical reference cards
- Services (16): alloy, argocd, borgmatic, 1password, forgejo, grafana, jellyfin, kiwix, loki, miniflux, navidrome, postgresql, prometheus, teslamate, transmission, zot
- Infrastructure (3): hosts, tailscale, routing
- Kubernetes (2): cluster, apps
- Storage (2): sifaka, backups
- Update README to mark Phase 2 as complete
- Add towncrier changelog fragment

## Deployment and Testing
- [ ] Build docs locally to verify wiki-links resolve
- [ ] Deploy via ArgoCD and verify at docs.ops.eblu.me/reference/

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/88
2026-02-03 14:27:37 -08:00
a606cea5a1 Reorder docs phases: Reference before Tutorials (#87)
## Summary
- Move Reference from Phase 4 to Phase 2 in the documentation restructuring plan
- Update directory layout in README to reflect new phase order
- Add explanatory note about why Reference comes first (so other docs can link to it)

## Deployment and Testing
- [x] Updated docs/README.md with new phase order
- [x] Added towncrier changelog fragment
- [ ] Ready to proceed with Phase 2 (Reference) content after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/87
2026-02-03 13:01:22 -08:00
Forgejo Actions
53f63a9961 Release v1.0.1: Update changelog
Built changelog from towncrier fragments.

[skip ci]
2026-02-03 19:50:54 +00:00
4cb80d9fd9 Add changelog fragment for towncrier adoption
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 11:49:53 -08:00
9a8587b83f Add towncrier changelog system (#86)
## Summary
- Configure towncrier with custom types (feature, bugfix, infra, doc, misc)
- Build initial v0.1.0 changelog from zk management log entries
- Integrate towncrier into build-blumeops workflow
- Update README to mark Phase 1b complete

## How It Works
1. Add changelog fragments to `docs/changelog.d/` as `<id>.<type>.md`
2. When running build-blumeops workflow, towncrier collects fragments
3. CHANGELOG.md is updated and fragments are removed
4. Changes are committed back to main before docs build

## Testing
- [x] Tested `uvx towncrier build` locally
- [ ] Test workflow execution (after merge)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/86
2026-02-03 11:48:13 -08:00
1c86134a62 Phase 1b: Deploy docs hosting with Quartz (#85)
## Summary
- Add ArgoCD Application and manifests for `quartz` service
- Add `docs.ops.eblu.me` to Caddy reverse proxy configuration
- ConfigMap points to blumeops v1.0.0 release tarball
- Tailscale ingress with homepage annotations for auto-discovery

## Deployment and Testing

**Pre-deployment (container build):**
- [ ] Build and tag quartz container: `mise run container-tag-and-release quartz v1.0.0`

**K8s deployment:**
- [ ] Sync apps: `argocd app sync apps`
- [ ] Point quartz at feature branch: `argocd app set quartz --revision feature/docs-phase-1b-hosting`
- [ ] Sync quartz: `argocd app sync quartz`
- [ ] Verify pod is running: `kubectl --context=minikube-indri get pods -n quartz`
- [ ] Verify Tailscale ingress: `kubectl --context=minikube-indri get ingress -n quartz`

**Caddy deployment:**
- [ ] Dry run: `mise run provision-indri -- --tags caddy --check --diff`
- [ ] Apply: `mise run provision-indri -- --tags caddy`

**Verification:**
- [ ] Test https://docs.tail8d86e.ts.net
- [ ] Test https://docs.ops.eblu.me
- [ ] Verify homepage dashboard shows docs link

**Post-merge:**
- [ ] Reset to main: `argocd app set quartz --revision main && argocd app sync quartz`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/85
2026-02-03 10:52:20 -08:00
a1044a8c4a Update docs/README.md: Phase 1a complete, prep for 1b
Mark Phase 1a as complete with link to first release.
Move CHANGELOG to Phase 1b with towncrier note.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 10:16:24 -08:00
95a82321ee Add authentication and error logging to release creation
- Add Authorization header using GITHUB_TOKEN
- Remove silent fail flag to see error responses
- Log API responses for debugging

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:33:47 -08:00
9719fc05f7 Update forgejo-runner to v2.3.0 (Node.js 24)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:29:50 -08:00
d870694e10 Upgrade forgejo-runner to Node.js 24.x LTS
All checks were successful
Build Container / build (push) Successful in 54s
Quartz now requires Node.js >= 22.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:19:58 -08:00
8780928b9a Use GitHub upstream for Quartz until mirror is fixed
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:17:40 -08:00
f11f8a4e89 Fix workflow to handle no existing releases
Remove -f flag from curl so 404 on /releases/latest doesn't fail the
script when there are no releases yet.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 09:15:53 -08:00
b8104d75ad Move zk cards to docs/zk/ for documentation restructuring (#84)
## Summary
- Move all existing zettelkasten cards from `docs/` to `docs/zk/` as a temporary holding area
- Update `zk-docs` mise task to look in the new location
- Add `docs/README.md` explaining the Diataxis-based restructuring plan and target audiences

## Context
This is phase 1 of a multi-phase documentation restructuring effort. The goal is to reorganize docs to follow the Diataxis framework while serving multiple audiences:
1. Erich (owner) - knowledge graph/zk
2. Claude/AI agents - memory and context enrichment
3. New external readers - high-level overview
4. Potential operators/contributors - onboarding
5. Replicators - people wanting to duplicate the approach

## Testing
- [x] Verified `mise run zk-docs` still works with the new path
- [x] Updated obsidian.nvim config (in ~/.config/nvim) to point to new path

## Note
The obsidian.nvim config change is outside this repo but was made as part of this work.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/84
2026-02-03 09:13:50 -08:00
737371ab59 Add pod state observability to minikube dashboard (#83)
## Summary
- Add "Unhealthy Pods" stat panel showing count of pods in error states (ImagePullBackOff, CrashLoopBackOff, etc.) with red background when > 0
- Add "Pods by Waiting Reason" time series chart showing container waiting states over time
- Provides visibility into stuck pods that ArgoCD doesn't track (since it manages CronJobs, not the Jobs/Pods they spawn)

## Context
This addresses the issue where a `zim-watcher` cronjob pod was stuck in `ImagePullBackOff` for 11 days without any alerting. ArgoCD showed the CronJob as "Synced, Healthy" because it only manages the CronJob resource, not its spawned Jobs/Pods.

## Deployment and Testing
- [ ] Sync grafana-config app to test branch
- [ ] Verify dashboard renders correctly
- [ ] Confirm "Unhealthy Pods" shows 0 (green) when no issues
- [ ] Reset to main after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/83
2026-02-03 07:20:05 -08:00
c89e69e25f Add docs/ with blumeops zk cards (#82)
## Summary
- Move 21 blumeops-tagged zettelkasten cards from ~/code/personal/zk/ to docs/
- Create symlink ~/code/personal/zk/blumeops -> blumeops/docs for obsidian integration
- Update zk-docs mise task to read from local docs/ directory
- Add blumeops workspace to obsidian.nvim config (strict=true)

## Benefits
- Docs are now git-managed in the blumeops repo (visible on GitHub)
- Wiki links between blumeops docs continue to work via symlink
- obsidian-sync isolation: docs don't sync to work laptop
- Direct editing via obsidian.nvim with dedicated workspace

## Testing
- [x] Files moved to docs/ (21 files)
- [x] Symlink created: ~/code/personal/zk/blumeops -> blumeops/docs
- [x] zk-docs mise task updated and working
- [ ] Verify obsidian.nvim link resolution (after merge)
- [ ] Verify obsidian backlinks work

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/82
2026-02-02 21:40:53 -08:00
4d97ac4c26 Expand homepage widgets and info panels (#81)
## Summary
- Add greeting and datetime info widgets to homepage header
- Add Miniflux widget showing unread/read counts (via existing API key in 1Password)
- Add Grafana widget showing dashboards/datasources/alerts (via existing credentials in 1Password)
- Add ArgoCD to bookmarks section
- Add TODO comments for widgets needing additional setup (Forgejo, Caddy, UniFi, Glances, Navidrome, Transmission, Immich)

## Deployment and Testing
- [ ] Sync homepage app to deploy new ExternalSecrets
- [ ] Verify greeting and datetime appear in header
- [ ] Verify Miniflux widget shows unread/read counts
- [ ] Verify Grafana widget shows dashboard stats
- [ ] Check that services without credentials still display (just without widgets)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/81
2026-02-02 16:11:20 -08:00
9db4c9d9ae Replace homepage search widget with Quick Launch
Use Quick Launch settings for Kagi search with suggestions instead of
the search widget, which is the proper way to configure keyboard-driven
search in homepage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-02 15:08:35 -08:00
fc0ce955e4 Move DJ to Apps group on Homepage (#80)
## Summary
- Change navidrome homepage group from Media to Apps

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/80
2026-01-31 20:29:43 -08:00
ade21cc49e Add Navidrome music streaming server (#79)
## Summary
- Deploy Navidrome music streaming server to k8s
- NFS mount for music library from sifaka:/volume1/music (read-only)
- Local PVC for SQLite database and config (10Gi)
- Tailscale ingress for dj.tail8d86e.ts.net
- Caddy reverse proxy for dj.ops.eblu.me
- Homepage annotations for dashboard discovery in Media group

## Deployment and Testing
- [ ] Sync `apps` application to pick up new Application definition
- [ ] Set navidrome app to feature branch and sync
- [ ] Verify NFS mount with `kubectl exec`
- [ ] Provision Caddy for dj.ops.eblu.me
- [ ] Access https://dj.ops.eblu.me and create initial admin user
- [ ] Verify Homepage shows DJ in Media group
- [ ] Reset to main and resync after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/79
2026-01-31 20:19:31 -08:00
b8b33b76c8 Remove Plex media server (#78)
## Summary
- Remove plex_metrics ansible role
- Remove Plex Grafana dashboard
- Remove Plex log collection from Alloy config
- Update indri-services-check to check Jellyfin instead of Plex

## Deployment and Testing
- [x] Unloaded plex-metrics LaunchAgent on indri
- [x] Deleted plex-metrics plist and script
- [x] Deleted plex.prom textfile
- [ ] Deploy Alloy config update
- [ ] Sync grafana-config to remove dashboard

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/78
2026-01-30 17:06:00 -08:00
bcc8685316 Add Jellyfin media server deployment (#77)
## Summary
- Add Jellyfin ansible role for native macOS deployment via Homebrew cask
- Add jellyfin_metrics role for Prometheus textfile metrics collection
- Add Caddy routing for jellyfin.ops.eblu.me
- Add Alloy log collection for Jellyfin stdout/stderr
- Add Grafana dashboard for Jellyfin monitoring

## Architecture
Jellyfin runs natively on indri (not in k8s) for full VideoToolbox hardware transcoding support. The M1 Mac Mini can handle ~3 concurrent 4K HDR→SDR transcoding streams.

## Deployment and Testing
- [ ] Deploy Jellyfin: `mise run provision-indri -- --tags jellyfin,jellyfin_metrics,caddy,alloy`
- [ ] Sync Grafana dashboard: `argocd app sync grafana-config`
- [ ] Complete Jellyfin setup wizard at https://jellyfin.ops.eblu.me
- [ ] Generate API key and save to `~/.jellyfin-api-key`
- [ ] Add media libraries (/Volumes/allisonflix/Movies, /Volumes/allisonflix/TV)
- [ ] Enable VideoToolbox hardware transcoding
- [ ] Verify metrics in Grafana dashboard
- [ ] Verify logs in Loki: `{service="jellyfin"}`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/77
2026-01-30 16:57:26 -08:00
23b8897c1f Add provider field to OpenWeatherMap widget
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 15:11:37 -08:00
e830a0f26b Homepage dashboard improvements (#76)
## Summary
- Fix ArgoCD icon (use `argo-cd.png` per Dashboard Icons naming)
- Add Borgmatic backup metrics widget (time since last backup, archive size)
- Add Sifaka NAS disk usage widget (used/total space)
- Create `[[grafana]]` zk card with management notes

## What didn't work
Attempted Grafana iframe embedding for a metrics panel but reverted:
- Homepage iframe widget only supports height classes, not width
- Some panels fail to load even with anonymous auth enabled
- Documented in grafana zk card for future reference

## Deployment and Testing
- [x] ArgoCD icon displays correctly
- [x] Borgmatic metrics show time since backup and archive size
- [x] NAS disk usage shows used/total bytes
- [x] Grafana reverted to authenticated-only access

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/76
2026-01-30 15:05:02 -08:00
38538ad5f0 Replace hajimari with gethomepage (#75)
## Summary
- Remove hajimari (unmaintained since Oct 2022, broken helm deps)
- Add gethomepage (28k stars, actively maintained, monthly releases)
- Migrate custom apps, bookmarks, and search config
- Enable k8s RBAC for service autodiscovery
- Configure Tailscale ingress at go.tail8d86e.ts.net

## Why the switch
Hajimari hasn't released since October 2022. The helm chart has a broken
dependency (bjw-s/common URL is 404), and unreleased code on main has bugs.
gethomepage has similar k8s autodiscovery via ingress annotations and is
very actively maintained.

## Deployment and Testing
- [ ] Delete hajimari app from ArgoCD
- [ ] Delete hajimari namespace
- [ ] Sync apps to pick up new homepage app
- [ ] Sync homepage app
- [ ] Verify go.ops.eblu.me loads

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/75
2026-01-30 13:21:12 -08:00
9fd70c3892 Switch Hajimari to custom fork
- Use chart from forge.ops.eblu.me/eblume/hajimari fork
- Use custom image from registry.ops.eblu.me/blumeops/hajimari
- Enables future customizations (search auto-focus, weather widget)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 11:26:20 -08:00
4c852751db Update forgejo-runner to v2.2.0 (adds skopeo)
All checks were successful
Build Container / build (push) Successful in 13s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 11:13:54 -08:00
dc974858b0 Add skopeo to forgejo-runner image
All checks were successful
Build Container / build (push) Successful in 1m8s
Pre-install skopeo for pushing images to zot registry.
Docker 27's manifest format has compatibility issues with zot,
so we use skopeo for the push step.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 11:11:19 -08:00
fd29244854 Simplify CI: remove Tailscale sidecar, use skopeo for push (#74)
## Summary
- Remove Tailscale sidecar from build-push-image action - registry.ops.eblu.me is directly reachable from k8s pods via Caddy
- Use skopeo for pushing images instead of docker push - Docker 27's manifest format has compatibility issues with zot registry
- Remove tailscale_authkey secret requirement from workflows

## Deployment and Testing
- [x] Tested with nettest-v0.10.0 tag - build succeeded and image pushed to registry

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/74
2026-01-30 10:18:20 -08:00
316a4c4e42 Shorten Hajimari info descriptions and hide URLs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:34:46 -08:00
1c63220dcd Rename bookmarks group from Docs to Admin
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:16:39 -08:00
a40e1dad1b Rename customApps group to "Host Services"
Avoids duplicate "Infrastructure" groups since Hajimari doesn't
merge customApps with discovered apps.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:14:47 -08:00
e9c5153a75 Disable Hajimari for Loki (no useful landing page)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:11:09 -08:00
303fb0ba05 Use simple-icons:immich for Immich dashboard icon
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:08:54 -08:00
08e5546c0b Configure Hajimari with Kagi search and app groups
- Set Kagi as default search provider with simple-icons:kagi
- Add Google and DuckDuckGo as alternative search providers
- Explicitly enable showAppGroups, showAppUrls, showAppInfo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:04:55 -08:00
d1164c8aac Add Hajimari service dashboard (#73)
## Summary
- Add Hajimari as a service dashboard/start page at `go.ops.eblu.me`
- Auto-discovers k8s services from ingress annotations
- Custom apps for non-k8s services: Forgejo, Registry, Sifaka NAS
- Add `nas.ops.eblu.me` Caddy proxy to Synology dashboard

## Services Configured

**Auto-discovered (k8s ingresses with hajimari.io annotations):**
- Grafana, ArgoCD, Prometheus, Loki (Observability)
- Miniflux, Kiwix, Transmission, TeslaMate, Immich (Apps)
- PyPI/devpi (Infrastructure)

**Custom apps (non-k8s):**
- Forgejo (forge.ops.eblu.me)
- Registry (registry.ops.eblu.me)
- Sifaka NAS (nas.ops.eblu.me)

**Bookmarks:**
- Tailscale Admin, 1Password, Pulumi

## Deployment and Testing
- [ ] Sync `apps` application to pick up new Hajimari Application
- [ ] Sync `hajimari` application
- [ ] Run `mise run provision-indri -- --tags caddy` for go/nas proxy entries
- [ ] Re-sync all k8s apps with hajimari annotations (or wait for natural drift)
- [ ] Verify https://go.ops.eblu.me shows dashboard with all services

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/73
2026-01-29 15:51:42 -08:00
3c3c90f206 Rename github readme to avoid display issues on github.com 2026-01-29 11:32:55 -08:00
dae438a5ed Update Immich to v2.5.2 (#72)
## Summary
- Update Immich from v2.5.0 to v2.5.2

## Deployment and Testing
- [ ] Sync apps application: `argocd app sync apps`
- [ ] Point immich at feature branch: `argocd app set immich --revision update-immich-2.5.2`
- [ ] Sync immich: `argocd app sync immich`
- [ ] Verify pods restart and photos.ops.eblu.me is accessible
- [ ] After merge, reset to main: `argocd app set immich --revision main && argocd app sync immich`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/72
2026-01-29 11:25:22 -08:00
0e2df9645d Fix ArgoCD sync drift for apps and immich (#71)
## Summary
- Fix immich app to track `main` branch instead of `feature/immich` for values
- The tailscale-operator ignoreDifferences schema drift will be fixed by syncing the `apps` app

## Deployment and Testing
- [ ] Sync `apps` to fix tailscale-operator schema drift
- [ ] Sync `immich` to pick up correct image versions from main

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/71
2026-01-29 10:24:26 -08:00
2bc826c31f Move metrics scripts from ~/bin to ~/.local/bin (#70)
## Summary
- Update all metrics role defaults to install scripts to ~/.local/bin following XDG conventions
- Scripts already manually moved on indri from ~/bin to ~/.local/bin
- Cleaned up orphaned scripts (devpi-metrics, transmission-metrics, mcquack) and plist files

## Deployment and Testing
- [x] Manually moved scripts on indri
- [x] Deleted orphaned plist files (devpi-metrics, devpi, kiwix-serve, transmission-metrics)
- [x] Deleted orphaned scripts (devpi-metrics, transmission-metrics, mcquack)
- [x] Verified no metrics dependencies on orphaned scripts (checked alloy config and textfile directory)
- [ ] Run ansible to update LaunchAgent plist files with new paths
- [ ] Verify metrics collection continues working

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/70
2026-01-29 09:59:38 -08:00
0d8eb651d4 Fix XID Age graph to show threshold context (#69)
All checks were successful
Build Container / build (push) Successful in 1m10s
## Summary
- Add fixed Y-axis (0-220M) so the 200M autovacuum threshold is always visible
- Add dashed threshold lines at 150M (yellow warning) and 200M (red danger)
- Update title to clarify the threshold

## Context
The raw XID age naturally trends upward between vacuum freezes, which looked alarming without context. Current values (~143K-216K) are at 0.1% of the threshold - completely healthy.

## Deployment and Testing
- [ ] Sync grafana-config app to feature branch
- [ ] Verify threshold lines appear on PostgreSQL dashboard

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/69
2026-01-29 07:08:21 -08:00
0604877db2 Add 'Tesla' prefix to all TeslaMate dashboard titles (#68)
## Summary
- Renamed all 18 TeslaMate Grafana dashboards to include "Tesla" prefix
- Improves organization and discoverability in the dashboard list

## Deployment and Testing
- [ ] Sync grafana-config app: `argocd app set grafana-config --revision feature/rename-tesla-dashboards && argocd app sync grafana-config`
- [ ] Verify dashboards display with "Tesla" prefix in Grafana

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/68
2026-01-29 06:55:44 -08:00
46081f5f10 Update rule 10: also require permission to push to main
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:45:44 -08:00
55522579dc Add rule: never merge PRs without explicit user request
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:43:51 -08:00
a93f2a77e1 Merge pull request 'Migrate remaining secrets to ExternalSecrets' (#67) from feature/migrate-remaining-secrets into main 2026-01-28 20:41:45 -08:00
8f4660915d Fix argocd SSH key format for 1Password Connect
1Password Connect doesn't support ?ssh-format=openssh, so we need a
separate Secure Note item with the OpenSSH-formatted key.

Created new 1Password item: argocd-forge-ssh-key

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:31:16 -08:00
9114aac8f6 Switch all ExternalSecrets to creationPolicy: Owner
ESO now has full ownership of these secrets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:27:16 -08:00
dd6cf20d51 Remove obsolete secret templates
- Delete 13 .yaml.tpl files replaced by ExternalSecrets
- Update immich/README.md with direct CNPG secret copy instructions
- Update miniflux/README.md with context flag and ESO note

Only 1password-connect/secret-credentials.yaml.tpl remains (bootstrap).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:26:37 -08:00
351528474c Add ExternalSecrets for remaining k8s secrets
Migrate 10 secret templates to ESO ExternalSecrets with 1Password Connect:
- databases: eblume, borgmatic, teslamate passwords
- tailscale-operator: OAuth client credentials
- grafana-config: admin password, teslamate datasource
- teslamate: db password, encryption key
- forgejo-runner: runner registration token
- argocd: forge SSH credentials

All use creationPolicy: Merge for safe migration from existing secrets.

Skipped:
- miniflux/secret-db: Uses CNPG secret, not 1Password directly
- immich/secret-db: Requires 1Password item creation first
- 1password-connect: Bootstrap secret, must stay as template

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 19:50:38 -08:00
482414346e Add External Secrets Operator with 1Password Connect (#66) (#66)
## Summary
- Add 1Password Connect server for secrets automation API
- Add External Secrets Operator (ESO) to sync secrets from 1Password to K8s
- Add ClusterSecretStore connecting ESO to 1Password Connect
- Convert devpi secret to ExternalSecret as proof of concept

## Architecture
```
1Password Cloud → 1Password Connect (k8s) → ESO → Native K8s Secrets
```

## Deployment and Testing
- [ ] Mirror Helm charts to forge (connect-helm-charts, external-secrets) - DONE
- [ ] Create 1Password Connect credentials (`op connect server create`)
- [ ] Store credentials in 1Password item "1Password Connect"
- [ ] Bootstrap secret: `op inject -i argocd/manifests/1password-connect/secret-credentials.yaml.tpl | kubectl apply -f -`
- [ ] Deploy 1password-connect: `argocd app sync 1password-connect`
- [ ] Deploy external-secrets: `argocd app sync external-secrets`
- [ ] Deploy external-secrets-config: `argocd app sync external-secrets-config`
- [ ] Test devpi ExternalSecret: `argocd app sync devpi`
- [ ] Verify secret synced: `kubectl get externalsecret -n devpi`

## Future Work
After PoC validated, migrate remaining 12 secret templates to ExternalSecrets:
- databases (3), tailscale-operator (1), grafana-config (2), teslamate (2)
- forgejo-runner (1), argocd (1), immich (1), 1password-connect (1 - self-bootstrap)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/66
2026-01-28 19:30:10 -08:00
3971670832 Remove immich-sync ansible role (#65)
## Summary
- Remove immich_sync ansible role (server-side photo sync via osxphotos)
- The Immich iOS app has built-in automatic backup that replaces this functionality
- iOS app supports foreground/background backup and can sync iCloud photos directly

## Deployment and Testing
- [ ] Clean up files on indri (see manual cleanup commands below)
- [ ] Configure Immich iOS app for automatic backup

### Manual cleanup on indri:
```bash
# Unload and remove LaunchAgent
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist
rm ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist

# Remove script and credentials
rm ~/bin/immich-sync.sh
rm ~/.immich-api-key

# Remove logs
rm ~/Library/Logs/mcquack.immich-sync.*.log

# Optionally remove export directory (check if empty first)
ls ~/Pictures/immich-export
# rm -r ~/Pictures/immich-export
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/65
2026-01-28 08:49:22 -08:00
aa4464f84e Upgrade Immich from v2.4.1 to v2.5.0 (#64)
## Summary
- Upgrades Immich image tag from v2.4.1 to v2.5.0

## Deployment and Testing
- [ ] Point immich ArgoCD app at feature branch and sync
- [ ] Verify pods come up healthy
- [ ] Verify Immich web UI accessible
- [ ] Reset to main and sync after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/64
2026-01-27 20:51:09 -08:00
54945be0e3 Add immich-sync ansible role for photo library sync (#63)
## Summary
- Add `immich_sync` role that syncs macOS Photos Library to Immich
- Uses osxphotos to export photos with metadata to staging directory
- Uses immich-cli (via Docker) to upload to Immich server
- LaunchAgent schedules hourly syncs following mcquack pattern
- API key fetched from 1Password in playbook pre_tasks

## Architecture
```
Photos Library → osxphotos export → ~/Pictures/immich-export/ → immich-cli upload → Immich
```

## Prerequisites (manual)
- Install osxphotos on indri: Add `"pipx:osxphotos" = "latest"` to `~/.config/mise/config.toml`, run `mise install`
- Docker is already installed on indri

## Deployment and Testing
- [ ] Dry run: `mise run provision-indri -- --tags immich_sync --check --diff`
- [ ] Deploy: `mise run provision-indri -- --tags immich_sync`
- [ ] Verify LaunchAgent: `ssh indri 'launchctl list | grep immich'`
- [ ] Test manual sync: `ssh indri '~/bin/immich-sync.sh'`
- [ ] Check logs: `ssh indri 'tail -50 ~/Library/Logs/mcquack.immich-sync.out.log'`
- [ ] Verify photos in Immich at https://photos.ops.eblu.me

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/63
2026-01-26 12:38:38 -08:00
8621996343 Add Immich photo management + migrate forge URLs (#62)
## Summary
- Migrate all ArgoCD app repo URLs from `indri.tail8d86e.ts.net:2200` to `forge.ops.eblu.me:2222`
- Add Immich self-hosted photo management service with:
  - Helm chart deployment via ArgoCD
  - PostgreSQL cluster with pgvecto.rs for AI vector search (immich-pg)
  - NFS storage on sifaka for photo library (2Ti)
  - Tailscale Ingress + Caddy proxy for `photos.ops.eblu.me`
  - Machine learning service for face/object recognition

## Deployment and Testing
- [x] Update ArgoCD repo-creds-forge secret with new URL (one-time manual step)
- [ ] Sync `apps` to pick up new applications
- [ ] Sync all existing apps to verify new forge URL works
- [ ] Sync `blumeops-pg` to deploy immich-pg cluster
- [ ] Wait for immich-pg to be healthy
- [ ] Create immich-db secret from auto-generated password
- [ ] Sync `immich-storage` (PV, PVC, Ingress)
- [ ] Sync `immich` (Helm chart)
- [ ] Run `mise run provision-indri -- --tags caddy` to add photos.ops.eblu.me
- [ ] Verify Immich UI is accessible

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/62
2026-01-26 11:20:11 -08:00
c8b655f177 Build local containers for k8s services (#61)
## Summary
- Move devpi Dockerfile from argocd/manifests to containers/devpi/
- Add containers for: transmission, teslamate, miniflux, kiwix-serve, kubectl
- Update all k8s deployments to use local images (registry.ops.eblu.me/blumeops/*)
- All containers use v1.0.0 tag for initial release

## Containers Added
| Container | Source | Notes |
|-----------|--------|-------|
| devpi | python:3.12-slim | Existing, moved to containers/ |
| kubectl | alpine + download | For zim-watcher CronJob |
| miniflux | Go build from source | v2.2.16 |
| kiwix-serve | Download pre-built binary | v3.8.1 |
| transmission | alpine + apk install | Simpler than linuxserver image |
| teslamate | Elixir build from source | v2.2.0 |

## Deployment and Testing
- [ ] Build and tag devpi-v1.0.0
- [ ] Build and tag kubectl-v1.0.0
- [ ] Build and tag miniflux-v1.0.0
- [ ] Build and tag kiwix-serve-v1.0.0
- [ ] Build and tag transmission-v1.0.0
- [ ] Build and tag teslamate-v1.0.0
- [ ] Sync ArgoCD apps and verify services

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/61
2026-01-25 21:35:57 -08:00
ea42362b6f Migrate Forgejo runner to Kubernetes with DinD (#60)
## Summary
- Deploy Forgejo runner to k8s with Docker-in-Docker sidecar
- Add job execution image with Node.js and Docker CLI
- Retire host-mode runner on indri
- All CI jobs now run containerized in k8s

## Components Added
- `containers/forgejo-runner/Dockerfile` - Job execution image
- `argocd/apps/forgejo-runner.yaml` - ArgoCD Application
- `argocd/manifests/forgejo-runner/` - Kubernetes manifests

## Components Removed
- `ansible/roles/forgejo_runner/` - No longer needed

## Changes to Existing Files
- `.forgejo/workflows/build-container.yaml` - Use `k8s` runner with `DOCKER_HOST` env
- `.github/actionlint.yaml` - Only `k8s` label now valid

## Deployment
1. Apply secret: `op inject -i argocd/manifests/forgejo-runner/secret.yaml.tpl | kubectl --context=minikube-indri apply -f -`
2. Sync ArgoCD: `argocd app sync forgejo-runner`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/60
2026-01-25 19:56:17 -08:00
66badfafd1 Migrate k8s services to Caddy (*.ops.eblu.me) (#59)
All checks were successful
Build Container / build (push) Successful in 13s
## Summary
- Add Caddy reverse proxy routes for all k8s services (grafana, argocd, prometheus, loki, miniflux, devpi, kiwix, torrent, teslamate)
- Add PostgreSQL via Caddy L4 TCP proxy on port 5432
- Caddy proxies to existing Tailscale endpoints - traffic stays local on indri
- Both `*.ops.eblu.me` and `*.tail8d86e.ts.net` URLs continue to work

## Updated References
- Alloy: prometheus/loki push endpoints → `*.ops.eblu.me`
- Borgmatic: PostgreSQL backup host → `pg.ops.eblu.me`
- Devpi: DEVPI_OUTSIDE_URL → `pypi.ops.eblu.me`
- indri-services-check: health check URLs
- CLAUDE.md: argocd login command

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --tags caddy` to deploy new Caddy config
- [ ] Test HTTP services: `curl https://grafana.ops.eblu.me/api/health`
- [ ] Test PostgreSQL: `pg_isready -h pg.ops.eblu.me -p 5432`
- [ ] Run `mise run provision-indri -- --tags alloy` to update Alloy endpoints
- [ ] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic
- [ ] Sync devpi in ArgoCD: `argocd app sync devpi`
- [ ] Re-login to ArgoCD: `argocd login argocd.ops.eblu.me ...`
- [ ] Run `mise run indri-services-check` to verify all services

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/59
2026-01-25 12:56:31 -08:00
d6e6b48f6a Migrate registry to Caddy (registry.ops.eblu.me) (#58)
## Summary
- Update all references from `registry.tail8d86e.ts.net` to `registry.ops.eblu.me`
- Remove `tailscale_serve` ansible role (no longer needed - all services migrated to Caddy)
- Update minikube containerd config for new registry URL
- Update devpi manifest, CI actions, and mise tasks

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --check --diff` (dry run)
- [ ] Run `mise run provision-indri -- --tags minikube` to update containerd config
- [ ] Sync devpi ArgoCD app: `argocd app sync devpi`
- [ ] Manually remove old Tailscale serve entry: `ssh indri 'tailscale serve --service=svc:registry off'`
- [ ] Test registry access: `curl https://registry.ops.eblu.me/v2/_catalog`
- [ ] Run `mise run indri-services-check` to verify all services healthy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/58
2026-01-25 12:06:15 -08:00
9c1b7c7ca1 Update docs for Caddy migration (#57)
## Summary
- Update CLAUDE.md with new service routing documentation
- Document the two DNS domains: `*.ops.eblu.me` (Caddy) vs `*.tail8d86e.ts.net` (Tailscale)
- Fix incorrect service listings (Prometheus/Loki are in k8s, not indri)

## ZK Updates (not in this PR)
Also updated the blumeops zk card with:
- Source code URL (forge is primary, GitHub is mirror)
- Services split into Caddy vs Tailscale sections
- Updated port map for Caddy
- Updated "Adding a New Service" instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/57
2026-01-25 11:52:35 -08:00
1184b4de1d Add Caddy layer4 for Forgejo SSH (#56)
## Summary
- Add layer4 TCP proxy configuration to Caddyfile template for SSH services
- Configure Forgejo SSH on port 2222 → localhost:2200
- Switch HTTPS from port 8443 (testing) to 443 (production)
- Requires Caddy rebuilt with `github.com/mholt/caddy-l4` plugin

## What This Enables
Git+SSH access via `forge.ops.eblu.me:2222` is now accessible from:
- Tailnet clients (gilbert)
- Docker containers on indri
- Kubernetes pods in minikube

This solves the DNS resolution issues where containers couldn't reach Tailscale MagicDNS names.

## Testing Done
- [x] Caddy rebuilt with layer4 plugin
- [x] Validated Caddyfile syntax
- [x] Cleared `svc:forge` from tailscale serve
- [x] Verified HTTPS works: `curl https://forge.ops.eblu.me`
- [x] Verified SSH works: `ssh -p 2222 forgejo@forge.ops.eblu.me`
- [x] Verified git clone works via new endpoint
- [x] Verified minikube pods can reach both HTTPS and SSH endpoints

## Deployment
Caddy is already running with the new config on indri. This PR captures the ansible changes.

## Next Steps
- Update zk docs with new git remote format
- Migrate registry and other services to Caddy
- Retire tailscale_services ansible role

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/56
2026-01-25 11:37:23 -08:00
682a68dc9c Add Caddy reverse proxy for blumeops services (#55)
## Summary
- Add Caddy ansible role following zot pattern (manual build, ansible deploy)
- Caddy built with Gandi DNS plugin for ACME DNS-01 challenges
- Gandi PAT fetched from 1Password and written to secured file on indri
- Configure wildcard TLS for `*.ops.eblu.me`
- Initial services: forge, registry (indri-local)
- Uses port 8443 during testing to avoid Tailscale serve conflicts

## Build Instructions (already done)
On indri:
```bash
cd ~/code/3rd/caddy && mise run build
```

## Deployment and Testing
- [ ] Review Caddyfile configuration
- [ ] Run `mise run provision-indri -- --tags caddy` to deploy
- [ ] Test: `curl -v https://forge.ops.eblu.me:8443` (should get TLS cert)
- [ ] Test: `curl -v https://registry.ops.eblu.me:8443/v2/` (should return `{}`)
- [ ] Once verified, switch to port 443 and migrate services from Tailscale serve

## Files Changed
- `ansible/playbooks/indri.yml` - Add pre_task for Gandi PAT, add caddy role
- `ansible/roles/caddy/` - New role with Caddyfile and LaunchAgent templates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/55
2026-01-25 09:35:06 -08:00
b08faa50cc Add Gandi DNS management via Pulumi (#54)
## Summary
- Restructure Pulumi into separate projects: `pulumi/tailscale/` and `pulumi/gandi/`
- Add Gandi LiveDNS management for `eblu.me` domain
- Create wildcard DNS record `*.ops.eblu.me` → indri's Tailscale IP (100.98.163.89)
- Add mise tasks: `dns-up`, `dns-preview`
- Update `tailnet-up` to pass `--yes` by default
- Document PAT cycling process (expires every 30 days)

## Background
This enables using real DNS names (`*.ops.eblu.me`) that resolve to Tailscale IPs,
which allows containers and other systems to resolve services without depending on
MagicDNS. Since Tailscale IPs (100.x.x.x) are not publicly routable, services remain
tailnet-only while using standard DNS.

## Deployment and Testing
- [ ] Run `cd pulumi/gandi && uv sync` to install dependencies
- [ ] Run `cd pulumi/gandi && pulumi stack init eblu-me` to create stack
- [ ] Run `mise run dns-preview` to verify configuration
- [ ] Run `mise run dns-up` to apply DNS records
- [ ] Verify with `dig +short test.ops.eblu.me` returns `100.98.163.89`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/54
2026-01-25 08:15:46 -08:00
af3536bc17 Simplify indri IP extraction from tailscale status
Some checks failed
Build Container / build (push) Failing after 13s
Use simple grep and awk to parse plain text tailscale status output
instead of trying to parse JSON. Also show the status output for
debugging.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 20:04:28 -08:00
04fdd5906d Get indri's IP from tailscale status for registry access
Some checks failed
Build Container / build (push) Failing after 9s
Use 'tailscale status' to get indri's Tailscale IP and add it to
/etc/hosts for registry hostname resolution. The registry service
runs on indri, so we need indri's IP specifically.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 20:00:57 -08:00
16ccabbc34 Resolve registry IP via Tailscale and add to /etc/hosts
Some checks failed
Build Container / build (push) Failing after 8s
The Tailscale container's DNS doesn't work because it runs in userspace
mode. Instead, resolve the registry IP using 'tailscale ip' and add it
to /etc/hosts inside the container before running skopeo.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 19:44:39 -08:00
b4b8abb2d9 Add tag:ci-gateway to Tailscale ACL for CI builds
Some checks failed
Build Container / build (push) Failing after 13s
- Add tag:ci-gateway to tagOwners
- Grant ci-gateway access to registry on port 443
- Add test for ci-gateway -> registry access

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 19:38:02 -08:00
9f98b3007e Add debugging for Tailscale container failures
Some checks failed
Build Container / build (push) Failing after 7s
Capture container logs when the Tailscale sidecar exits unexpectedly.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 19:32:40 -08:00
424647cd93 Use Tailscale sidecar for container registry push
Some checks failed
Build Container / build (push) Failing after 1m9s
Docker Desktop's VM can't resolve tailnet hostnames. Work around this by:
1. Starting a Tailscale container that joins the tailnet
2. Building the image with docker build
3. Saving to tarball with docker save
4. Pushing via skopeo inside the Tailscale container

Uses TS_CI_GATEWAY_AUTHKEY repository secret for authentication.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 19:29:01 -08:00
b598ea8d5a Add indri-runner-logs script to fetch workflow logs
Fetches logs for Forgejo Actions runs from indri's local storage.
Logs are stored as zstd-compressed files in the forgejo data directory.

Usage: mise run indri-runner-logs <run_id>

Only works for runs executed by the indri-host-runner.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-24 17:25:32 -08:00
31697b4d63 Add nettest container for CI/CD network debugging (#52)
Some checks failed
Build Container / build (push) Failing after 18s
## Summary
- Add `containers/nettest/` with Alpine-based Dockerfile and connectivity test script
- Add `.forgejo/workflows/build-nettest.yaml` workflow triggered by `nettest-v*` tags
- Test script checks DNS resolution and HTTPS connectivity to forge and registry

## Deployment and Testing
- [ ] Merge PR to main
- [ ] Run `mise run container-release nettest v0.1.0` to trigger first build
- [ ] Verify workflow runs successfully and container can reach tailnet services
- [ ] Manually test from minikube: `kubectl run nettest --rm -it --image=registry.tail8d86e.ts.net/blumeops/nettest:v0.1.0`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/52
2026-01-24 16:54:35 -08:00
13b8d16eb2 Remove mention of plans dir
All checks were successful
Test CI / test (push) Successful in 3s
2026-01-24 16:22:39 -08:00
ceba6b3c2c Remove plans, they dont seem to work
All checks were successful
Test CI / test (push) Successful in 3s
2026-01-24 16:21:49 -08:00
8ca8798121 Switch to Buildah for container builds (#51)
All checks were successful
Test CI / test (push) Successful in 4s
## Summary
- Replace Docker with Buildah for container image builds
- No Docker socket required - buildah is daemonless
- Cleaner security model (no privileged containers or socket mounting)
- Remove Docker-related security context from deployment

## Changes
- Update Dockerfile to install buildah/podman instead of docker-cli
- Configure buildah storage with overlay driver and fuse-overlayfs
- Update composite action to use `buildah bud` and `buildah push`
- Add `imagePullPolicy: Always` to ensure fresh image pulls
- Update test workflow to verify buildah/podman

## Testing
- [ ] Runner pod starts successfully
- [ ] Buildah is available in runner
- [ ] Test workflow verifies buildah/podman versions
- [ ] Container build workflow builds and pushes to zot

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/51
2026-01-24 13:30:26 -08:00
784 changed files with 60742 additions and 36990 deletions

View file

@ -0,0 +1,62 @@
---
name: change-classifier
description: Classifies proposed changes as C0/C1/C2 before work begins. Use proactively when the user describes a new task or change, before any implementation starts.
tools: Read, Glob, Grep, Bash
model: haiku
permissionMode: dontAsk
---
You are a change classifier for the BlumeOps infrastructure project. Your job is to assess a proposed change and classify it as C0, C1, or C2 before any work begins.
## Classification Criteria
| Class | Name | When to use | Key trait |
|-------|------|-------------|-----------|
| **C0** | Quick Fix | Small, low-risk, fix-forward safe | Direct to main, no PR |
| **C1** | Human Review | Moderate complexity or risk | Feature branch + PR, docs-first |
| **C2** | Mikado Chain | Multi-phase, multi-session, high complexity | Mikado Branch Invariant |
## Assessment Process
1. Understand what the user wants to change
2. Identify which files/services are affected — use Glob/Grep to check the blast radius
3. Assess risk factors:
- How many files change?
- Are critical services affected (networking, auth, DNS)?
- Is the change easily reversible?
- Could it cause downtime?
- Does it span multiple services or systems?
- Does it require multi-step sequencing?
4. Classify and explain your reasoning
## C0 Indicators
- Single file or small number of related files
- Config value change, version bump, typo fix, doc update
- No service restart needed, or restart is safe
- Easy to fix-forward if wrong
## C1 Indicators
- Multiple files across a service boundary
- New feature or significant behavior change
- Could affect service availability
- Needs human review for correctness
- Touching Ansible roles, ArgoCD manifests, or routing config
## C2 Indicators
- Multi-phase work with ordering dependencies
- Spans multiple sessions or multiple services
- Requires prerequisite changes before the main goal
- User explicitly requests Mikado methodology
- Discovery-heavy work where the full scope isn't known upfront
## Output Format
```
Classification: C0 / C1 / C2
Confidence: high / medium / low
Rationale: <1-2 sentences>
Blast radius: <files/services affected>
Risk factors: <key concerns, if any>
```
If confidence is low, explain what additional information would help. When in doubt, classify one level higher (C0 → C1, C1 → C2).

View file

@ -0,0 +1,36 @@
---
name: infra-health
description: Infrastructure health monitor. Use proactively after deployments, provisioning, or when the user asks about service status. Runs services-check and diagnoses failures.
tools: Bash, Read, Grep, Glob
model: haiku
permissionMode: dontAsk
background: true
---
You are an infrastructure health monitor for the BlumeOps homelab.
When invoked, run the full health check suite and report results:
1. Run `mise run services-check` and capture the full output
2. Parse the results — identify any FAILED services
3. For each failure, provide a brief diagnosis:
- Is the service process down?
- Is it a network/connectivity issue?
- Is it an ArgoCD sync issue?
4. Summarize: total services checked, how many passed, how many failed
If everything is healthy, keep the summary to one line.
If there are failures, group them by category:
- **Process failures** (service not running)
- **HTTP failures** (endpoint not responding)
- **Kubernetes failures** (pod not running, sync issues)
- **Connectivity failures** (SSH, network)
Do NOT attempt to fix anything. Report findings only.
Context:
- Services run across indri (Mac Mini, native + minikube), ringtail (NixOS, k3s), and Fly.io
- Use `--context=minikube-indri` for indri k8s commands, `--context=k3s-ringtail` for ringtail
- HTTP endpoints are proxied through Caddy at `*.ops.eblu.me`
- Public endpoints go through Fly.io at `*.eblu.me`

View file

@ -0,0 +1,69 @@
---
name: mikado-navigator
description: Mikado chain navigator for C2 changes. Use when resuming a C2 chain, checking chain status, or deciding which leaf node to work next. Understands the Mikado Branch Invariant.
tools: Read, Glob, Grep, Bash
model: sonnet
permissionMode: dontAsk
---
You are a Mikado chain navigator for the BlumeOps C2 change process. You help the user understand the current state of a Mikado chain and decide what to do next.
## What You Do
1. Run `mise run docs-mikado --resume` to detect the current chain state
2. Read the relevant Mikado cards (docs in `docs/how-to/` with `status: active`)
3. Analyze the dependency graph and branch position
4. Recommend the next action
## Chain State Analysis
After running `docs-mikado --resume`, interpret the output:
- **Planning phase:** Cards are being added, no code yet. Suggest reviewing the dependency graph for completeness.
- **Mid-cycle:** An `impl` is in progress. Identify which leaf is being worked and what remains.
- **Between cycles:** A leaf was just closed. Identify the next ready leaf and summarize what it requires.
- **Finalized:** The chain is complete and awaiting merge.
- **Invariant violation:** A plan commit was found after impl. Explain the reset procedure.
## Recommending Next Actions
For each ready leaf node:
1. Read the card content to understand what it requires
2. Check if there are related source files (manifests, playbooks, configs)
3. Assess relative complexity and suggest an ordering if multiple leaves are ready
4. Note any potential risks or dependencies not captured in the card graph
## The Mikado Branch Invariant
The branch must always have this structure:
```
main <- [plan commits] <- [impl, close] <- [impl, close] <- ... <- [finalize]
```
Rules:
- First N commits are card-only (plan phase)
- Then repeating cycles of impl + close
- No card introductions after any code commit
- New prerequisites require a branch reset
## Output Format
```
Chain: <name>
Branch: <branch name>
Position: <planning / mid-cycle / between-cycles / etc.>
PR: #<number> (if exists)
Ready leaves:
1. <leaf-stem> — <title> — <brief description of work needed>
2. ...
Recommendation: <what to do next and why>
```
## Important
- Do NOT make any changes. You are advisory only.
- If the user is on `main`, list all active chains and suggest which to resume.
- If PR comments exist, remind the user to check them with `mise run pr-comments <number>`.
- Check for stashed work — resets sometimes leave stashed changes.

View file

@ -0,0 +1,40 @@
# Automated Branch Cleanup
#
# Deletes remote branches that have been merged into main and are older
# than a cutoff (default 30 days). Detects both fast-forward and
# squash-merged branches via the Forgejo API.
#
# Runs on a schedule (~every 10 days) and can be triggered manually
# with a custom cutoff for testing.
name: Branch Cleanup
on:
schedule:
# Approximately every 10 days: 1st, 11th, 21st of each month at 06:00 UTC
- cron: '0 6 1,11,21 * *'
workflow_dispatch:
inputs:
cutoff:
description: 'Delete branches older than N days'
required: false
default: '30'
type: string
jobs:
cleanup:
runs-on: k8s
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Run branch cleanup
env:
FORGEJO_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
CUTOFF="${{ inputs.cutoff || '30' }}"
echo "Running branch cleanup with cutoff=${CUTOFF} days..."
uv run --script mise-tasks/branch-cleanup \
--remote-only \
--yes \
--cutoff "$CUTOFF"

View file

@ -0,0 +1,291 @@
# BlumeOps Release Workflow
#
# Creates a versioned release of BlumeOps with all build artifacts.
# Currently includes:
# - Documentation site (Quartz static build)
# - Changelog (built from towncrier fragments)
#
# Usage:
# 1. Go to Actions > Build BlumeOps > Run workflow
# 2. Select version bump type (patch/minor/major) or choose specific version
# 3. The workflow creates a release with attached artifacts
#
# Documentation asset URL:
# https://forge.eblu.me/eblume/blumeops/releases/download/<tag>/docs-<version>.tar.gz
name: Build BlumeOps
on:
workflow_dispatch:
inputs:
version_type:
description: 'Version bump type'
required: true
default: 'BUMP_PATCH'
type: choice
options:
- BUMP_PATCH
- BUMP_MINOR
- BUMP_MAJOR
- SPECIFIC_VERSION
specific_version:
description: 'Specific version (only used when version_type is SPECIFIC_VERSION, e.g., v1.2.0)'
required: false
default: ''
type: string
jobs:
build:
runs-on: k8s
steps:
- name: Resolve version
id: version
run: |
VERSION_TYPE="${{ inputs.version_type }}"
SPECIFIC_VERSION="${{ inputs.specific_version }}"
# Fetch latest release
echo "Fetching latest release..."
LATEST=$(curl -s "https://forge.eblu.me/api/v1/repos/eblume/blumeops/releases/latest" | jq -r '.tag_name // empty' || true)
if [ -z "$LATEST" ]; then
LATEST="v0.0.0"
echo "No previous releases found, using base version: $LATEST"
else
echo "Latest release: $LATEST"
fi
# Parse current version components (strip 'v' prefix)
CURRENT="${LATEST#v}"
MAJOR=$(echo "$CURRENT" | cut -d. -f1)
MINOR=$(echo "$CURRENT" | cut -d. -f2)
PATCH=$(echo "$CURRENT" | cut -d. -f3)
case "$VERSION_TYPE" in
BUMP_MAJOR)
VERSION="v$((MAJOR + 1)).0.0"
echo "Bumping major: $LATEST -> $VERSION"
;;
BUMP_MINOR)
VERSION="v${MAJOR}.$((MINOR + 1)).0"
echo "Bumping minor: $LATEST -> $VERSION"
;;
BUMP_PATCH)
VERSION="v${MAJOR}.${MINOR}.$((PATCH + 1))"
echo "Bumping patch: $LATEST -> $VERSION"
;;
SPECIFIC_VERSION)
if [ -z "$SPECIFIC_VERSION" ]; then
echo "Error: specific_version is required when version_type is SPECIFIC_VERSION"
exit 1
fi
# Validate format
if [[ ! "$SPECIFIC_VERSION" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
echo "Error: Version must be in format vX.Y.Z (e.g., v1.0.0)"
exit 1
fi
VERSION="$SPECIFIC_VERSION"
echo "Using specific version: $VERSION"
;;
*)
echo "Error: Unknown version_type: $VERSION_TYPE"
exit 1
;;
esac
# Check if this version already exists
if curl -sf "https://forge.eblu.me/api/v1/repos/eblume/blumeops/releases/tags/$VERSION" > /dev/null 2>&1; then
echo "Error: Release $VERSION already exists"
echo "See: https://forge.eblu.me/eblume/blumeops/releases/tag/$VERSION"
exit 1
fi
echo "version=$VERSION" >> "$GITHUB_OUTPUT"
echo "Building BlumeOps release: $VERSION"
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
- name: Build changelog
id: changelog
run: |
VERSION="${{ steps.version.outputs.version }}"
# Run towncrier on the runner so that CHANGELOG.md updates and
# fragment deletions appear in the working tree for both the Quartz
# build (next step) and the git commit step.
# Check if there are any changelog fragments
FRAGMENTS=$(find docs/changelog.d -name "*.md" -not -name ".gitkeep" 2>/dev/null | wc -l)
if [ "$FRAGMENTS" -gt 0 ]; then
echo "Found $FRAGMENTS changelog fragments, building changelog..."
uvx towncrier build --version "$VERSION" --yes
echo "changelog_updated=true" >> "$GITHUB_OUTPUT"
# Extract the changelog section for this release to include in release body
RELEASE_NOTES=$(awk -v ver="$VERSION" '
/^## \[/ {
if (found) exit
if (index($0, "[" ver "]")) found=1
}
found {print}
' CHANGELOG.md | tail -n +2)
echo "$RELEASE_NOTES" > /tmp/release_notes.md
echo "Release notes extracted for $VERSION"
else
echo "No changelog fragments found, skipping towncrier"
echo "changelog_updated=false" >> "$GITHUB_OUTPUT"
echo "" > /tmp/release_notes.md
fi
- name: Build docs
run: |
VERSION="${{ steps.version.outputs.version }}"
TARBALL="docs-${VERSION}.tar.gz"
echo "Building docs via Dagger..."
# Towncrier already ran on the runner above, so the working tree
# has an up-to-date CHANGELOG.md. build-docs now only runs the
# Quartz static site build (no towncrier).
dagger call build-docs --src=. --version="$VERSION" \
export --path="./$TARBALL"
echo "Build complete!"
ls -lh "$TARBALL"
- name: Create release
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
VERSION="${{ steps.version.outputs.version }}"
TARBALL="docs-${VERSION}.tar.gz"
CHANGELOG_UPDATED="${{ steps.changelog.outputs.changelog_updated }}"
echo "Creating release $VERSION..."
# Build release body with changelog if available
{
echo "BlumeOps release $VERSION"
echo ""
if [ "$CHANGELOG_UPDATED" = "true" ] && [ -s /tmp/release_notes.md ]; then
echo "## What's Changed"
echo ""
cat /tmp/release_notes.md
echo ""
fi
echo "## Documentation"
echo ""
echo "Download \`$TARBALL\` directly, or bump \`docs_version\`"
echo "in \`ansible/roles/docs/defaults/main.yml\` and run:"
echo ""
echo "\`\`\`"
echo "mise run provision-indri -- --tags docs"
echo "\`\`\`"
} > /tmp/release_body.txt
# Use jq to properly escape the body for JSON
RELEASE_DATA=$(jq -n \
--arg tag "$VERSION" \
--arg name "BlumeOps $VERSION" \
--rawfile body /tmp/release_body.txt \
'{tag_name: $tag, name: $name, body: $body, draft: false, prerelease: false}')
RELEASE_RESPONSE=$(curl -s \
-X POST \
-H "Content-Type: application/json" \
-H "Authorization: token $GITHUB_TOKEN" \
-d "$RELEASE_DATA" \
"https://forge.eblu.me/api/v1/repos/eblume/blumeops/releases")
echo "API Response: $RELEASE_RESPONSE"
RELEASE_ID=$(echo "$RELEASE_RESPONSE" | jq -r '.id')
if [ -z "$RELEASE_ID" ] || [ "$RELEASE_ID" = "null" ]; then
echo "Error: Failed to create release"
exit 1
fi
echo "Created release ID: $RELEASE_ID"
# Upload the asset
echo "Uploading $TARBALL..."
UPLOAD_RESPONSE=$(curl -s \
-X POST \
-H "Content-Type: application/gzip" \
-H "Authorization: token $GITHUB_TOKEN" \
--data-binary "@$TARBALL" \
"https://forge.eblu.me/api/v1/repos/eblume/blumeops/releases/$RELEASE_ID/assets?name=$TARBALL")
echo "Upload Response: $UPLOAD_RESPONSE"
echo ""
echo "Release created successfully!"
- name: Bump docs_version in ansible role
run: |
VERSION="${{ steps.version.outputs.version }}"
DEFAULTS_FILE="ansible/roles/docs/defaults/main.yml"
echo "Bumping docs_version in $DEFAULTS_FILE to ${VERSION}..."
yq -i ".docs_version = \"${VERSION}\"" "$DEFAULTS_FILE"
echo "Updated defaults:"
grep -E "^docs_version:" "$DEFAULTS_FILE"
- name: Commit release changes
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
VERSION="${{ steps.version.outputs.version }}"
CHANGELOG_UPDATED="${{ steps.changelog.outputs.changelog_updated }}"
# Configure git
git config user.name "Forgejo Actions"
git config user.email "actions@forge.ops.eblu.me"
# Stage deployment changes
git add ansible/roles/docs/defaults/main.yml
# Stage changelog changes if updated
if [ "$CHANGELOG_UPDATED" = "true" ]; then
git add CHANGELOG.md docs/changelog.d/
fi
# Check if there are changes to commit
if git diff --cached --quiet; then
echo "No changes to commit"
else
git commit -m "Update docs release to $VERSION
$([ "$CHANGELOG_UPDATED" = "true" ] && echo "- Built changelog from towncrier fragments")
[skip ci]"
# Push to main
git push origin HEAD:main
echo "Changes committed and pushed"
fi
- name: Summary
run: |
VERSION="${{ steps.version.outputs.version }}"
TARBALL="docs-${VERSION}.tar.gz"
echo "================================================"
echo "BlumeOps Release: $VERSION"
echo "================================================"
echo ""
echo "Release URL:"
echo " https://forge.eblu.me/eblume/blumeops/releases/tag/$VERSION"
echo ""
echo "Asset URL:"
echo " https://forge.eblu.me/eblume/blumeops/releases/download/$VERSION/$TARBALL"
echo ""
echo "To deploy on indri, run from gilbert:"
echo " mise run provision-indri -- --tags docs"
echo ""
echo "Then purge the Fly.io proxy cache:"
echo " fly ssh console -a blumeops-proxy -C \\"
echo " \"sh -c 'rm -rf /tmp/cache && nginx -s reload'\""

View file

@ -0,0 +1,202 @@
# Unified container build workflow
# Manual dispatch only — use `mise run container-build-and-release <name>`.
# Shared Dagger helpers (src/blumeops/) make path-based auto-triggers unreliable,
# so all container builds are triggered explicitly.
# Routes to the correct runner:
# - Dockerfile/Dagger containers build on k8s (indri) via Dagger
# - Nix containers build on nix-container-builder (ringtail) via nix-build + skopeo
name: Build Container
on:
workflow_dispatch:
inputs:
container:
description: 'Container name (directory under containers/)'
required: true
type: string
ref:
description: 'Commit SHA to build (defaults to current HEAD)'
required: false
type: string
jobs:
detect:
runs-on: k8s
outputs:
dagger: ${{ steps.classify.outputs.dagger }}
nix: ${{ steps.classify.outputs.nix }}
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ inputs.ref || github.sha }}
fetch-depth: 2
- name: Classify container build type
id: classify
run: |
CHANGED='["${{ inputs.container }}"]'
echo "Building container: $CHANGED"
# Classify each container by build type (a container can appear in both)
DAGGER='[]'
NIX='[]'
for name in $(echo "$CHANGED" | jq -r '.[]'); do
has_any=false
if [ -f "containers/$name/container.py" ] || [ -f "containers/$name/Dockerfile" ]; then
DAGGER=$(echo "$DAGGER" | jq -c --arg n "$name" '. + [$n]')
has_any=true
fi
if [ -f "containers/$name/default.nix" ]; then
NIX=$(echo "$NIX" | jq -c --arg n "$name" '. + [$n]')
has_any=true
fi
if [ "$has_any" = "false" ]; then
echo "Warning: $name has neither container.py, Dockerfile, nor default.nix — skipping"
fi
done
echo "dagger=$DAGGER" >> "$GITHUB_OUTPUT"
echo "nix=$NIX" >> "$GITHUB_OUTPUT"
echo "Dagger builds: $DAGGER"
echo "Nix builds: $NIX"
build-dagger:
needs: detect
if: needs.detect.outputs.dagger != '[]'
runs-on: k8s
env:
# Send Dagger OTLP telemetry to Tempo. Without a real backend the
# engine's internal proxy returns 500 on /v1/metrics, causing noisy
# retry warnings in every build.
OTEL_EXPORTER_OTLP_ENDPOINT: http://tempo.tracing.svc.cluster.local:4318
strategy:
matrix:
container: ${{ fromJson(needs.detect.outputs.dagger) }}
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ inputs.ref || github.sha }}
- name: Extract version and SHA
id: meta
run: |
CONTAINER="${{ matrix.container }}"
# Try native Dagger pipeline (container.py) first, fall back to Dockerfile
if [ -f "containers/$CONTAINER/container.py" ]; then
VERSION=$(dagger call container-version --container-name="$CONTAINER")
elif [ -f "containers/$CONTAINER/Dockerfile" ]; then
VERSION=$(grep -m1 '^ARG CONTAINER_APP_VERSION=' \
"containers/$CONTAINER/Dockerfile" \
| sed 's/^ARG CONTAINER_APP_VERSION=//')
fi
if [ -z "$VERSION" ]; then
echo "Error: Could not extract version for $CONTAINER"
exit 1
fi
REF="${{ inputs.ref }}"
if [ -z "$REF" ]; then
REF="${GITHUB_SHA}"
fi
SHORT_SHA=$(echo "$REF" | head -c 7)
# Ensure version starts with 'v'
case "$VERSION" in
v*) ;;
*) VERSION="v${VERSION}" ;;
esac
echo "version=$VERSION" >> "$GITHUB_OUTPUT"
echo "sha=$SHORT_SHA" >> "$GITHUB_OUTPUT"
echo "Version: $VERSION, SHA: $SHORT_SHA"
- name: Publish
env:
ZOT_CI_API_KEY: ${{ secrets.ZOT_CI_API_KEY }}
run: |
dagger call publish \
--src=. \
--container-name=${{ matrix.container }} \
--version=${{ steps.meta.outputs.version }} \
--commit-sha=${{ steps.meta.outputs.sha }} \
--registry-password=env:ZOT_CI_API_KEY
build-nix:
needs: detect
if: needs.detect.outputs.nix != '[]'
runs-on: nix-container-builder
strategy:
matrix:
container: ${{ fromJson(needs.detect.outputs.nix) }}
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ inputs.ref || github.sha }}
- name: Extract version and SHA
id: meta
run: |
CONTAINER="${{ matrix.container }}"
NIX_FILE="containers/$CONTAINER/default.nix"
# Extract version = "..." from the nix file
VERSION=$(grep -m1 '^\s*version\s*=\s*"' "$NIX_FILE" \
| sed 's/.*"\(.*\)".*/\1/' || true)
if [ -z "$VERSION" ]; then
echo "Error: No version declaration found in $NIX_FILE"
exit 1
fi
REF="${{ inputs.ref }}"
if [ -z "$REF" ]; then
REF="${GITHUB_SHA}"
fi
SHORT_SHA=$(echo "$REF" | head -c 7)
# Ensure version starts with 'v'
case "$VERSION" in
v*) ;;
*) VERSION="v${VERSION}" ;;
esac
echo "version=$VERSION" >> "$GITHUB_OUTPUT"
echo "sha=$SHORT_SHA" >> "$GITHUB_OUTPUT"
echo "Version: $VERSION, SHA: $SHORT_SHA"
- name: Resolve nixpkgs
id: nixpkgs
run: |
NIXPKGS_PATH=$(nix flake metadata nixpkgs --json | jq -r '.path')
echo "Resolved nixpkgs: $NIXPKGS_PATH"
echo "path=$NIXPKGS_PATH" >> "$GITHUB_OUTPUT"
- name: Build with nix
env:
NIX_PATH: "nixpkgs=${{ steps.nixpkgs.outputs.path }}"
run: |
echo "Building containers/${{ matrix.container }}/default.nix"
echo "NIX_PATH=$NIX_PATH"
nix-build "containers/${{ matrix.container }}/default.nix" -o result
echo "Build complete: $(readlink result)"
- name: Push to registry
env:
ZOT_CI_API_KEY: ${{ secrets.ZOT_CI_API_KEY }}
run: |
CONTAINER="${{ matrix.container }}"
VERSION="${{ steps.meta.outputs.version }}"
SHORT_SHA="${{ steps.meta.outputs.sha }}"
IMAGE="registry.ops.eblu.me/blumeops/$CONTAINER:${VERSION}-${SHORT_SHA}-nix"
echo "Pushing to $IMAGE"
skopeo copy \
--dest-creds="zot-ci:$ZOT_CI_API_KEY" \
"docker-archive:result" \
"docker://$IMAGE"
echo "Push complete: $IMAGE"

View file

@ -0,0 +1,109 @@
# CV Deploy Workflow
#
# Bumps cv_version in ansible/roles/cv/defaults/main.yml and pushes the change.
# Deployment to indri is manual (runner has no SSH access to indri):
# mise run provision-indri -- --tags cv
#
# Usage:
# 1. Release a new CV package from the cv repo first
# 2. Go to Actions > Deploy CV > Run workflow
# 3. Enter the version to deploy, or leave as "latest"
# 4. Run the command above on gilbert to apply
name: Deploy CV
on:
workflow_dispatch:
inputs:
version:
description: 'CV package version to deploy (e.g., v1.0.0, or "latest")'
required: true
default: 'latest'
type: string
jobs:
deploy:
runs-on: k8s
steps:
- name: Resolve version
id: version
run: |
INPUT_VERSION="${{ inputs.version }}"
if [ "$INPUT_VERSION" = "latest" ]; then
echo "Resolving latest CV package version..."
VERSION=$(curl -s "https://forge.eblu.me/api/v1/packages/eblume?type=generic&q=cv" \
| jq -r '[.[] | select(.name == "cv")] | sort_by(.version) | last | .version // empty')
if [ -z "$VERSION" ]; then
echo "Error: No CV packages found"
exit 1
fi
echo "Resolved latest version: $VERSION"
else
VERSION="$INPUT_VERSION"
if [[ ! "$VERSION" =~ ^v[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
echo "Error: Version must be in format vX.Y.Z (e.g., v1.0.0)"
exit 1
fi
fi
# Verify the package exists
TARBALL="cv-${VERSION}.tar.gz"
PACKAGE_URL="https://forge.eblu.me/api/packages/eblume/generic/cv/${VERSION}/${TARBALL}"
if ! curl -fsSL --head "$PACKAGE_URL" > /dev/null 2>&1; then
echo "Error: Package not found at $PACKAGE_URL"
echo "Run the 'Release CV' workflow in the cv repo first."
exit 1
fi
echo "Package verified: $PACKAGE_URL"
echo "version=$VERSION" >> "$GITHUB_OUTPUT"
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Bump cv_version in ansible role
run: |
VERSION="${{ steps.version.outputs.version }}"
DEFAULTS_FILE="ansible/roles/cv/defaults/main.yml"
echo "Bumping cv_version in $DEFAULTS_FILE to ${VERSION}..."
yq -i ".cv_version = \"${VERSION}\"" "$DEFAULTS_FILE"
echo "Updated defaults:"
grep -E "^cv_version:" "$DEFAULTS_FILE"
- name: Commit release changes
env:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
VERSION="${{ steps.version.outputs.version }}"
git config user.name "Forgejo Actions"
git config user.email "actions@forge.ops.eblu.me"
git add ansible/roles/cv/defaults/main.yml
if git diff --cached --quiet; then
echo "No changes to commit (already at $VERSION)"
else
git commit -m "Update CV release to $VERSION
[skip ci]"
git push origin HEAD:main
echo "Changes committed and pushed"
fi
- name: Summary
run: |
VERSION="${{ steps.version.outputs.version }}"
echo "================================================"
echo "CV version bumped: $VERSION"
echo "================================================"
echo ""
echo "To deploy on indri, run from gilbert:"
echo " mise run provision-indri -- --tags cv"
echo ""
echo "Then purge the Fly.io proxy cache:"
echo " fly ssh console -a blumeops-proxy -C \\"
echo " \"sh -c 'rm -rf /tmp/cache && nginx -s reload'\""

View file

@ -0,0 +1,37 @@
name: Deploy Fly.io Proxy
on:
workflow_dispatch:
push:
branches: [main]
paths:
- 'fly/**'
jobs:
deploy:
runs-on: k8s
steps:
- name: Checkout
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- name: Install flyctl
run: |
curl -L https://fly.io/install.sh | sh
echo "/root/.fly/bin" >> "$GITHUB_PATH"
- name: Deploy to Fly.io
env:
FLY_API_TOKEN: ${{ secrets.FLY_DEPLOY_TOKEN }}
run: |
cd fly
fly deploy
- name: Verify health
env:
FLY_API_TOKEN: ${{ secrets.FLY_DEPLOY_TOKEN }}
run: |
fly status -a blumeops-proxy
echo ""
echo "Health check:"
sleep 10
curl -sf https://blumeops-proxy.fly.dev/healthz || echo "Warning: health check failed (may need DNS propagation)"

View file

@ -1,44 +0,0 @@
name: Test CI
on:
push:
branches: [main]
pull_request:
workflow_dispatch:
jobs:
test:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
- name: Verify tools
run: |
echo "=== Node.js ==="
node --version
npm --version
echo ""
echo "=== Git ==="
git --version
echo ""
echo "=== Build tools ==="
make --version | head -1
gcc --version | head -1
echo ""
echo "=== Docker ==="
docker --version
echo ""
echo "=== Other tools ==="
curl --version | head -1
jq --version
- name: Show repo info
run: |
echo "Repository: ${{ github.repository }}"
echo "Event: ${{ github.event_name }}"
echo "Ref: ${{ github.ref }}"
echo "Branch: ${{ github.ref_name }}"
echo ""
echo "=== Files ==="
ls -la

1
.gitattributes vendored Normal file
View file

@ -0,0 +1 @@
/sdk/** linguist-generated

9
.github/USE_FORGE_WORKFLOWS.md vendored Normal file
View file

@ -0,0 +1,9 @@
# .github directory
This directory contains configuration for GitHub-ecosystem tooling only.
**Workflows and actions belong in `.forgejo/`** - this repository uses Forgejo Actions, not GitHub Actions.
## Contents
- `actionlint.yaml` - Configuration for actionlint prek hook (custom runner labels)

4
.github/actionlint.yaml vendored Normal file
View file

@ -0,0 +1,4 @@
self-hosted-runner:
labels:
- k8s
- nix-container-builder

7
.gitignore vendored
View file

@ -1,4 +1,6 @@
.claude/settings.local.json
.claude/agent-memory/
.claude/scheduled_tasks.lock
# Python
__pycache__/
@ -6,5 +8,10 @@ __pycache__/
*.pyo
.venv/
# Dagger (auto-generated SDK)
/sdk/
# OS
.DS_Store
/**/__pycache__
/.env

View file

@ -1,89 +0,0 @@
---
# See https://pre-commit.com for more information
# Run: uvx pre-commit run --all-files
# Install: uvx pre-commit install
repos:
# General file hygiene
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v6.0.0
hooks:
- id: trailing-whitespace
- id: end-of-file-fixer
- id: check-added-large-files
args: ['--maxkb=1000']
- id: check-merge-conflict
- id: check-json
- id: check-yaml
args: ['--unsafe'] # Allow custom tags (ansible uses them)
- id: check-toml
# Secret detection
- repo: https://github.com/trufflesecurity/trufflehog
rev: v3.92.5
hooks:
- id: trufflehog
entry: trufflehog git file://. --no-verification --fail
stages: [pre-commit, pre-push]
# YAML linting
- repo: https://github.com/adrienverge/yamllint
rev: v1.38.0
hooks:
- id: yamllint
args: ['-c', '.yamllint.yaml']
# Ansible linting
- repo: local
hooks:
- id: ansible-lint
name: ansible-lint
entry: env ANSIBLE_ROLES_PATH=ansible/roles ansible-lint
language: python
files: ^ansible/
additional_dependencies:
- ansible-lint>=26.1.1
- ansible-core>=2.15
# Python - ruff for linting and formatting
- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.14.13
hooks:
- id: ruff
args: ['--fix']
- id: ruff-format
# Shell scripts - shellcheck and shfmt
- repo: https://github.com/shellcheck-py/shellcheck-py
rev: v0.10.0.1
hooks:
- id: shellcheck
args: ['--severity=warning']
- repo: https://github.com/scop/pre-commit-shfmt
rev: v3.12.0-2
hooks:
- id: shfmt
args: ['-i', '2', '-ci', '-bn'] # 2-space indent, case indent, binary newline
# TOML - taplo
- repo: https://github.com/ComPWA/taplo-pre-commit
rev: v0.9.3
hooks:
- id: taplo-format
- id: taplo-lint
# JSON formatting (prettier for consistent style)
- repo: https://github.com/rbubley/mirrors-prettier
rev: v3.8.0
hooks:
- id: prettier
types_or: [json]
args: ['--tab-width', '2']
# GitHub/Forgejo Actions workflow linting
- repo: https://github.com/rhysd/actionlint
rev: v1.7.10
hooks:
- id: actionlint-system
files: ^\.forgejo/workflows/

View file

@ -21,11 +21,11 @@ rules:
# Required for ansible-lint compatibility
comments-indentation: false
octal-values:
forbid-implicit-octal: true
forbid-implicit-octal: false
forbid-explicit-octal: true
ignore:
- .venv/
- pulumi/.venv/
# Third-party k8s manifest with non-standard formatting
- argocd/manifests/tailscale-operator/operator.yaml
- argocd/manifests/tailscale-operator-base/operator.yaml

171
AGENTS.md Normal file
View file

@ -0,0 +1,171 @@
# AGENTS.md
Guidance for AI agents working in this repository. See also [[ai-assistance-guide]].
## Overview
blumeops is Erich Blume's GitOps repository for personal infrastructure, orchestrated via tailnet `tail8d86e.ts.net`.
**CRITICAL: Public repo at github.com/eblume/blumeops - never commit secrets!**
**Shell:** The user's interactive shell may differ from the current harness shell. Prefer repo-safe, non-interactive commands when possible, and match the user's shell conventions when giving interactive examples.
## Rules
1. **Always run `mise run ai-docs` at session start**
This will refresh your context with important information you will be assumed to know and follow.
**Read the full output** — never truncate, pipe to `head`/`tail`, or skip sections.
For problems with a large surface area, ask the user if `mise run ai-sources` should also be run — it concatenates all non-doc source files (~270K tokens) for deep codebase context.
2. **Always use `--context=minikube-indri` with kubectl** (or `--context=k3s-ringtail` for ringtail services) - work contexts must never be touched
**NEVER run `minikube delete`** — it destroys all PVs, etcd, and cluster state. Use `minikube stop`/`minikube start` for restarts. If minikube is stuck, see [[restart-indri]]. Full rebuild from scratch requires the DR procedure in [[rebuild-minikube-cluster]].
3. **Classify the change as C0/C1/C2 before starting** (see below) — this determines branching and PR requirements
4. **Feature branches + PRs for C1/C2** - checkout main, pull, create branch, open PR via `tea pr create`. C0 goes direct to main.
5. **Check PR comments with `mise run pr-comments <pr_number>`** before proceeding
6. **Add changelog fragments (all change levels)** - `docs/changelog.d/<name>.<type>.md`
Types: `feature`, `bugfix`, `infra`, `doc`, `ai`, `misc`
Applies to C0, C1, and C2 whenever the change is user-visible or noteworthy.
- **C1/C2:** Use branch name: `<branch>.<type>.md`
- **C0:** Use orphan prefix: `+<descriptive-slug>.<type>.md` (avoids `main.*` collisions)
7. **Test before applying** - dry runs (`--check --diff`), syntax checks, `ssh indri '...'`
8. **Wait for user review before deploying** (C1/C2)
9. **Never merge PRs or push to main without explicit request** (C0 commits to main are fine)
10. **Verify deployments** - `mise run services-check`
## Change Classification
Before starting work, classify the change:
| Class | Name | When to use | Key trait |
|-------|------|-------------|-----------|
| **C0** | Quick Fix | Small, low-risk, fix-forward safe | Direct to main, no PR |
| **C1** | Human Review | Moderate complexity or risk | Feature branch + PR, docs-first |
| **C2** | Mikado Chain | Multi-phase, multi-session, high complexity | Mikado Branch Invariant |
**C0** — commit directly to main. No branch or PR needed. Fix forward if problems arise.
**C1** — feature branch with early PR. Search related docs first, write documentation changes before code, deploy from the unmerged branch (ArgoCD `--revision`, Ansible from checkout). Upgrade to C2 if complexity spirals.
**C2** — branch `mikado/<chain-stem>` governed by the Mikado Branch Invariant: all card commits first, then code progress, then card closures. Commits use `C2(<chain>): plan/impl/close/finalize` convention. Reset the branch when new prerequisites are discovered. Resume with `mise run docs-mikado --resume`.
See [[agent-change-process]] for the full methodology.
## Project Structure
```
./docs/ # documentation (Diataxis, Quartz)
./docs/changelog.d/ # towncrier fragments
./.dagger/ # dagger pipelines
./.forgejo/ # forgejo-runner actions and workflows
./mise-tasks/ # scripts via `mise run`
./ansible/playbooks/ # ansible (indri.yml primary)
./ansible/roles/ # indri service roles
./argocd/apps/ # ArgoCD Application definitions
./argocd/manifests/ # k8s manifests per service
./fly/ # fly.io proxy for public routing
./pulumi/ # Pulumi IaC (tailnet ACLs, dns, cloud)
~/.config/{nvim,fish} # user's shell config, managed by chezmoi
~/code/personal/ # user's projects
~/code/personal/zk # user's zettelkasten (Obsidian-sync). Reference-data source; migrating into heph docs (hephaestus).
~/code/3rd/ # mirrored external projects
~/code/work # FORBIDDEN
```
Other code paths will be listed via ai-docs, this is just an overview. When you
encounter wiki-links (`[[like-this]]`) it is referring to docs/ cards.
## Service Deployment
### Kubernetes (ArgoCD)
Most services run in minikube on indri via ArgoCD (app-of-apps, manual sync). GPU workloads (Frigate, ntfy) run on ringtail's k3s cluster, also managed by ArgoCD.
**PR workflow:**
1. Create branch, modify `argocd/manifests/<service>/`
2. Push. Sync 'apps' app if service definition changed (set --revision to branch).
3. Test on branch: `argocd app set <service> --revision <branch> && argocd app sync <service>`
4. After merge: `argocd app set <service> --revision main && argocd app sync <service>`
**Commands:** `argocd app list|get|diff|sync <app>`
**Login:** `argocd login argocd.ops.eblu.me --sso` (opens browser for Authentik SSO). Admin fallback for break-glass: `argocd login argocd.ops.eblu.me --username admin --password "$(op read 'op://vg6xf6vvfmoh5hqjjhlhbeoaie/srogeebssulhtb6tnqd7ls6qey/password')"`
### Indri (Ansible)
Native services: Forgejo, Zot, Caddy, Borgmatic, Alloy
```fish
mise run provision-indri # full
mise run provision-indri -- --tags <role> # specific
mise run provision-indri -- --check --diff # dry run
```
### Routing
| Domain | Mechanism | Reachable from |
|--------|-----------|----------------|
| `*.eblu.me` | Fly.io proxy (Tailscale tunnel) | public internet |
| `*.ops.eblu.me` | Caddy on indri | k8s pods, containers, tailnet |
| `*.tail8d86e.ts.net` | Tailscale MagicDNS | tailnet clients only |
Check tailscale serve: `ssh indri 'tailscale serve status --json'`
## Container Releases
```fish
mise run container-list # show images/tags
mise run container-release <name> <version> # tag and build
```
The goal is to eventually use only locally built containers in all cases, with
full supply chain control via forge.ops.eblu.me repositories, mirroring source
from upstream.
**After triggering a build** (manual dispatch or push to main), verify the
workflow succeeded before proceeding:
```fish
mise run runner-logs # find the run number
mise run runner-logs <run#> # see jobs in the run
mise run runner-logs <run#> -j <N> # fetch logs on failure
```
This also works for other forge repos (`--repo eblume/hermes`).
## Third-Party Projects
Ask user to mirror on forge first, then clone to `~/code/3rd/<project>/`.
### Sporked Projects
Some mirrored projects are "sporked" — a floating-branch soft-fork strategy
where local patches are continuously rebased on top of upstream. See
[[spork-strategy]] and [[create-a-spork]] for the full methodology.
Sporked projects live in `~/code/3rd/<project>/` with three remotes:
`origin` (eblume/ fork on forge), `mirror` (mirrors/ on forge), `upstream`
(canonical). The `blumeops` branch is the default; `deploy` merges everything.
Create a new spork: `mise run spork-create <mirror-name>`
## Task Discovery
BlumeOps tasks live in [hephaestus](https://github.com/eblume/hephaestus) (`heph`),
the user's self-hosted context/task system. Fetch them with the CLI:
```fish
heph list --project Blumeops --json # outstanding Blumeops tasks as JSON
```
(This replaced the retired `blumeops-tasks` mise task, which read from Todoist.)
Most operational scripts are stored in `./mise-tasks/`. For scripts with any logic or
complexity, use uv run --script 's with explicit dependencies. Complex
workflows with artifacts should become dagger pipelines. Mise tasks are for
development processes and operations - tools for the user or the agent.
## Credentials
Root store is 1Password. Never grab directly - use existing patterns (ansible
pre_tasks, external-secrets, scripts with `op` CLI). It's ok to use `op item
get` without `--reveal` to explore what secrets are available, however.
Prefer `op read "op://vault/item/field"` over `op item get --fields` to avoid
quoting issues with multi-line values.

View file

@ -1,6 +1,9 @@
# CLI tools for blumeops management
brew "actionlint" # GitHub/Forgejo Actions workflow linter
brew "age" # File encryption for 1Password backup (op-backup)
brew "argocd" # ArgoCD CLI for GitOps management
brew "bat" # Syntax-highlighted file concatenation
brew "tea" # Gitea/Forgejo CLI for forge.tail8d86e.ts.net
brew "mise" # Task runner and toolchain manager
brew "tea" # Gitea/Forgejo CLI for forge.ops.eblu.me
brew "flyctl" # Fly.io CLI for public proxy management
brew "podman" # Container CLI (uses VM on macOS, for building/pushing images)

1394
CHANGELOG.md Normal file

File diff suppressed because it is too large Load diff

158
CLAUDE.md
View file

@ -1,157 +1 @@
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
blumeops is Erich Blume's GitOps repository for personal infrastructure management, orchestrated via tailnet `tail8d86e.ts.net`.
**Critical: This repository is published publicly at https://github.com/eblume/blumeops, so never include any secrets!**
## Rules
1. At the start of every session, even if the user asked to do something else, run `mise run zk-docs -- --style=header --color=never --decorations=always` in order to review the `blumeops` documentation in the zettelkasten (zk). zk lives at `~/code/personal/zk`, and is managed via obsidian-sync (not git).
2. When making any changes, start by making sure you're on the `main` git branch and up-to-date, and then create a feature branch. Commit often while working, and create a PR using:
```fish
tea pr create --title "Description of change" --description "$(cat <<'EOF'
## Summary
- First change
- Second change
## Deployment and Testing
- [x] Done thing one
- [ ] Needed thing two
🤖 Generated with [Claude Code](https://claude.com/claude-code)
EOF
)"
```
The user will review your work as you go, and will merge the PR as the last step in the process, even after deploying. After the user reviews the PR and leaves comments, check for unresolved comments with:
```fish
mise run pr-comments <pr_number>
```
Address each unresolved comment before proceeding. The user will resolve comments on the Forge UI as they are addressed.
3. Always keep the zk cards up to date with any changes, and suggest new links to new cards whenever appropriate. Refer back to the zk docs often during the process of planning and making corrections to ensure accuracy, and if you make a mistake, figure out a way to guard against it using the zk.
4. Use `Brewfile` and `mise.toml` to install tools needed on the development workstation (typically hostnamed "gilbert", username "eblume").
5. Services are hosted either on indri directly (via ansible) or in Kubernetes (via ArgoCD). See the "Service Deployment" section below for details.
6. Try to always test changes before applying them. Use syntax checkers, do dry runs (`--check --diff`), run commands manually via `ssh indri 'some command'`, etc.
7. **Wait for user review before deploying.** After creating a PR, do not run deployment commands until the user has had a chance to review the changes. The user will indicate when they're ready to deploy.
8. After deploying changes, try to verify the result. Use `mise run indri-services-check` to do a general service health check.
## Project Structure
```
./mise-tasks/ # management and utility scripts run via `mise run`
./ansible/playbooks/ # ansible playbooks (indri.yml is primary)
./ansible/roles/ # ansible roles for indri-hosted services
./argocd/apps/ # ArgoCD Application definitions (app-of-apps pattern)
./argocd/manifests/ # Kubernetes manifests for each service
./pulumi/ # Pulumi IaC for tailnet ACLs and cloud resources
./plans/ # Migration and project planning documents
~/code/personal/ # projects managed by the user
~/code/3rd/ # external projects, mirrored or downloaded
~/code/work # FORBIDDEN, never go here, avoid searching it
```
## Service Deployment
### Kubernetes Services (via ArgoCD)
Most services run on `k8s.tail8d86e.ts.net`, via minikube on indri. They are managed via ArgoCD using the app-of-apps pattern:
- **Application definitions**: `argocd/apps/<service>.yaml`
- **Manifests**: `argocd/manifests/<service>/`
- **Sync policy**: Manual sync (no auto-sync on git push)
**PR workflow for k8s services:**
1. Create feature branch and add/modify manifests
2. Push branch to forge
3. Sync the `apps` application to pick up new Application definitions:
```fish
argocd app sync apps
```
4. Point the service app at the feature branch for testing:
```fish
argocd app set <service> --revision feature/branch-name
argocd app sync <service>
```
5. Test the deployment
6. After PR merge, reset to main and resync:
```fish
argocd app set <service> --revision main
argocd app sync <service>
```
**Useful commands:**
```fish
argocd app list # List all apps
argocd app get <app> # Get app details
argocd app diff <app> # Preview changes before sync
argocd app sync <app> # Sync an app
kubectl --context=minikube-indri get pods -n <namespace> # Check pods
kubectl --context=minikube-indri logs -n <namespace> <pod> # View logs
```
Note: The user has fish abbreviations `ki` for `kubectl --context=minikube-indri` and `k9i` for `k9s --context=minikube-indri`, but these only work in interactive shells.
**ArgoCD login (when token expires):**
```fish
argocd login argocd.tail8d86e.ts.net --username admin --password "$(op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get srogeebssulhtb6tnqd7ls6qey --fields password --reveal)"
```
### Indri Services (via Ansible)
Some services remain on indri outside of Kubernetes:
- **Zot Registry** - Container registry (k8s depends on it)
- **Prometheus/Loki** - Observability (must survive k8s failures)
- **Borgmatic** - Backup system
- **Grafana Alloy** - Metrics/logs collector
- **Transmission** - BitTorrent for kiwix downloads
**Deployment:**
```fish
mise run provision-indri # Full playbook
mise run provision-indri -- --tags <role> # Specific role
mise run provision-indri -- --check --diff # Dry run
```
### Tailscale Service Hostnames
When migrating a service from indri to k8s, the Tailscale hostname must be freed:
1. Stop the service on indri
2. Clear the tailscale serve entry: `ssh indri 'tailscale serve clear svc:<name>'`
3. Delete the device from Tailscale admin console (user action required)
4. Deploy the k8s Ingress - it will claim the hostname
Use `ssh indri 'tailscale serve status --json'` to check current serve entries (the non-JSON output may be empty even when entries exist).
## Third-Party Projects
When a task requires cloning or using a third-party git repository (e.g., for building from source), **ask the user to mirror it on forge first**, then clone from the mirror:
- Mirror location: `https://forge.tail8d86e.ts.net/eblume/<project>.git`
- Clone to: `~/code/3rd/<project>/`
This avoids external dependencies and ensures the project is available even if the upstream is unreachable.
## Task Discovery
To discover pending blumeops tasks, run:
```fish
mise run blumeops-tasks
```
This fetches tasks from the "Blumeops" project in Todoist (via 1Password for API credentials) and displays them sorted by priority: p1 (urgent), p2 (high), p4 (normal/default), p3 (backlog). The typical workflow is to pick a task from this list at the start of a session, then dive in with planning.
## Credentials
The root store for credentials is 1password, which can be accessed via `op --vault <vaultid> item get <itemid> --field fieldname --reveal`, which will prompt the user for their assent and biometrics or password. Typically, use scripts to defer this action - try not to ever grab credentials directly. For instance, the indri.yml playbook starts with `pre_tasks` to gather the relevant secrets needed to provision its services. Some services have their credentials exported to files `chmod 0600` on indri, but they still start out in 1password. In some cases you can test services with a command that grabs the credential, but try to use environment variables or other arrangements to avoid learning the credential yourself, and warn the user first.
@AGENTS.md

674
LICENSE Normal file
View file

@ -0,0 +1,674 @@
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. <https://fsf.org/>
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Use with the GNU Affero General Public License.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
<one line to give the program's name and a brief idea of what it does.>
Copyright (C) <year> <name of author>
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <https://www.gnu.org/licenses/>.
Also add information on how to contact you by electronic and paper mail.
If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:
<program> Copyright (C) <year> <name of author>
This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
<https://www.gnu.org/licenses/>.
The GNU General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License. But first, please read
<https://www.gnu.org/licenses/why-not-lgpl.html>.

144
README.md
View file

@ -1,89 +1,95 @@
# blumeops
aka "Blue Mops"
Tools and configuration for Erich Blume's personal infrastructure, orchestrated
across a Tailscale tailnet.
This is a homelab, but it's also a testing ground for AI-assisted
infrastructure development. Much of this codebase was initially co-authored with [Claude
Code](https://docs.anthropic.com/en/docs/agents-and-tools/claude-code/overview),
and the repo places heavy emphasis on documentation, process, and change
classification to make that collaboration work well. I don't know entirely how
I feel about LLMs in our current era (there are real concerns about how
training data is sourced and energy subsidy) but it felt important to learn how
to work with these tools.
The full documentation is published at **[docs.eblu.me](https://docs.eblu.me)**
and lives in the [`docs/`](docs/) directory, structured around the
[Diataxis](https://diataxis.fr/) framework and designed to be compatible with
[Obsidian](https://obsidian,nd)/[Obsidian.nvim](https://github.com/obsidian-nvim/obsidian.nvim).
## What runs here
Services are a mix of Kubernetes pods (managed by ArgoCD), macOS LaunchAgent
services (managed by Ansible), and NixOS systemd services (managed by Nix
flakes), all connected via Tailscale:
- **Indri** (Mac Mini M1) - primary server. Most services run in Minikube via
ArgoCD; Forgejo, Caddy, and others run natively as LaunchAgent services via
Ansible.
- **Ringtail** (NixOS desktop, RTX 4080) - GPU workloads (Frigate NVR,
Authentik SSO) on k3s, plus NixOS systemd services.
- **Sifaka** (Synology NAS) - backup target and bulk storage.
Notable services include Grafana/Prometheus/Loki observability, Immich photos,
Jellyfin media, Forgejo git forge, a Zot container registry, and more. Public
access is routed through a Fly.io proxy; everything else is tailnet-only.
## Project structure
```
l0K k..:k.
.:...c. ;c....
....'o x.....
....k x....
... l' 'c....
....,l o'....
.....x k....
.....d. c....
... l x....
.,.d ;c.c'
'c':; x',c.
.:,'o .x.::.
.;:.k ,:.c'
,c.c';:.
.,.:;.
;'.c, l
d',c..:.d.
O.:;. 'c';c
;c.c' .:;.x
o',c. .;:.k
x.::. 'c.l.
dOKl.c, .c,'o
0l'...... ..' .::.ocx.
'o ............ o .... :olx;
x,ox;. ....... .k ....,dKKo;..x
'd,OXXXXk:. ...... ; ;:dXOl;',';l;o;
x,oXXXXXXXXXkc. ... .lc,',':dKNNNx;x;
;o;0KXXXXXXXXXXXX0l. .',ckNNNNNNNNNxco0d
l,d0oOXKOKXXXXKXXXX0. kNNNNNNNNNNNNNXxloo::
.OXxdXKOX0kXXXX0. .KNNNNNNNNNNXONX0o.
,OdxKldXXXXx. ,NNNNNNNNNNNKoc
:.OXXkKo .kNNNNNNNNXx.
':0c .NdNkXkc
ansible/ Ansible playbooks and roles (indri, sifaka)
argocd/apps/ ArgoCD Application definitions
argocd/manifests/ Kubernetes manifests per service
containers/ Custom container builds (Dockerfile + Nix)
docs/ Diataxis documentation (published at docs.eblu.me)
fly/ Fly.io public proxy configuration
mise-tasks/ Operational scripts run via mise
nixos/ NixOS configuration for ringtail
pulumi/ Pulumi IaC (Tailscale ACLs, Gandi DNS)
.dagger/ Dagger CI pipelines
.forgejo/ Forgejo Actions CI/CD workflows
```
*Blue Mops* — GitOps for Erich Blume's personal computing environment.
## Getting started
## What is this?
Infrastructure-as-code for my tailnet (`tail8d86e.ts.net`). This repo contains
ansible playbooks, configuration, and automation for managing my personal
infrastructure.
This codebase was heavily co-authored by Claude Code, as an experiment in
LLM-assisted development. I want to include a personal note here that I don't
know entirely how I feel about LLMs in our current era, but it felt important
to learn.
## Development
### Pre-commit Hooks
This repo uses [pre-commit](https://pre-commit.com) for code quality and consistency. Install hooks with:
You'll need [Homebrew](https://brew.sh) and [mise](https://mise.jdx.dev):
```bash
uvx pre-commit install
brew bundle # install CLI tools (argocd, tea, flyctl, etc.)
mise install # install managed toolchains (ansible, pulumi, dagger, etc.)
prek install # set up git hooks
```
Run all hooks manually:
Git hooks (via [prek](https://github.com/j178/prek)) enforce secret scanning
(TruffleHog), linting, formatting, and custom checks like doc link validation
and the Mikado branch invariant. They run automatically on `git commit`.
Operational tasks are driven through mise. Run `mise tasks` to see what's
available. Key examples:
```bash
uvx pre-commit run --all-files
mise run provision-indri # deploy to indri via Ansible
mise run services-check # verify service health
mise run container-list # list tracked container images
```
Hooks include:
- **General**: trailing whitespace, end-of-file fixer, large files, merge conflicts
- **Secrets**: [TruffleHog](https://github.com/trufflesecurity/trufflehog) for secret detection
- **YAML**: yamllint, ansible-lint
- **Python**: ruff (linting + formatting)
- **Shell**: shellcheck, shfmt
- **TOML**: taplo
- **JSON**: prettier
## AI-assisted development
## CI/CD
This repo is designed to be worked on by both humans and AI agents. The
[`AGENTS.md`](AGENTS.md) file provides shared instructions for agentic tools, and the
[`docs/tutorials/ai-assistance-guide.md`](docs/tutorials/ai-assistance-guide.md)
explains the full workflow.
This repo uses [Forgejo Actions](https://forgejo.org/docs/latest/user/actions/) for CI/CD. Workflows live in `.forgejo/workflows/` (not `.github/workflows/`). The runner executes jobs in host mode within the Kubernetes cluster.
Changes are classified before starting work:
## Documentation
- **C0** - quick fixes, committed directly to main
- **C1** - feature branch + PR, documentation written before code
- **C2** - multi-phase work using the Mikado method for dependency tracking
Detailed documentation lives in my personal zettelkasten, which is not included in this repository. You can view the docs with:
See the [agent change process](docs/explanation/agent-change-process.md) for
details.
```bash
mise run zk-docs
```
## License
The zettelkasten is private at time of writing. If you're interested in the documentation or have questions about this project, please reach out to blume.erich@gmail.com.
[GPLv3](LICENSE)

View file

@ -1,2 +0,0 @@
---
ansible_managed: "Managed by ansible - do not edit. Source: ssh://forgejo@forge.tail8d86e.ts.net/eblume/blumeops.git"

View file

@ -0,0 +1,6 @@
---
ansible_managed: "Managed by ansible - do not edit. Source: ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git"
# Sifaka NAS exporter ports — shared by caddy (indri) and sifaka_exporters roles
sifaka_node_exporter_port: 9100
sifaka_smartctl_exporter_port: 9633

View file

@ -0,0 +1,3 @@
---
ansible_user: eblume
ansible_python_interpreter: /usr/bin/python3

View file

@ -5,6 +5,9 @@ all:
hosts:
indri:
ansible_host: indri
ringtail:
ansible_host: ringtail
ansible_user: eblume
workstations:
hosts:
gilbert:

View file

@ -8,7 +8,7 @@
pre_tasks:
- name: Fetch borgmatic database password
ansible.builtin.command:
cmd: op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get mw2bv5we7woicjza7hc6s44yvy --fields db-password --reveal
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/mw2bv5we7woicjza7hc6s44yvy/db-password"
delegate_to: localhost
register: _borgmatic_db_pw
changed_when: false
@ -22,10 +22,26 @@
no_log: true
tags: [borgmatic]
- name: Fetch BorgBase SSH private key
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/noiobufntsxyzageu7mvlp2nbe/ssh-private-key"
delegate_to: localhost
register: _borgbase_ssh_key
changed_when: false
no_log: true
check_mode: false
tags: [borgmatic]
- name: Set BorgBase SSH key fact
ansible.builtin.set_fact:
borgbase_ssh_private_key: "{{ _borgbase_ssh_key.stdout }}"
no_log: true
tags: [borgmatic]
# Forgejo secrets
- name: Fetch forgejo LFS JWT secret
ansible.builtin.command:
cmd: op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get w3663ffnvkewbftncqxtcpeavy --fields lfs-jwt-secret --reveal
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/w3663ffnvkewbftncqxtcpeavy/lfs-jwt-secret"
delegate_to: localhost
register: _forgejo_lfs_jwt
changed_when: false
@ -35,7 +51,7 @@
- name: Fetch forgejo internal token
ansible.builtin.command:
cmd: op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get w3663ffnvkewbftncqxtcpeavy --fields internal-token --reveal
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/w3663ffnvkewbftncqxtcpeavy/internal-token"
delegate_to: localhost
register: _forgejo_internal_token
changed_when: false
@ -45,7 +61,7 @@
- name: Fetch forgejo OAuth2 JWT secret
ansible.builtin.command:
cmd: op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get w3663ffnvkewbftncqxtcpeavy --fields oauth2-jwt-secret --reveal
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/w3663ffnvkewbftncqxtcpeavy/oauth2-jwt-secret"
delegate_to: localhost
register: _forgejo_oauth2_jwt
changed_when: false
@ -61,6 +77,158 @@
no_log: true
tags: [forgejo]
# Forgejo Actions secrets (synced to Forgejo via API)
- name: Fetch Forgejo API token
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/w3663ffnvkewbftncqxtcpeavy/api-token"
delegate_to: localhost
register: _forgejo_api_token
changed_when: false
no_log: true
check_mode: false
tags: [forgejo_actions_secrets]
- name: Fetch ArgoCD auth token for Forgejo Actions
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/w3663ffnvkewbftncqxtcpeavy/argocd_token"
delegate_to: localhost
register: _forgejo_argocd_token
changed_when: false
no_log: true
check_mode: false
tags: [forgejo_actions_secrets]
- name: Fetch Fly.io deploy token for Forgejo Actions
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/on5slfaygtdjrxmdwezyhfmqsq/deploy-token"
delegate_to: localhost
register: _fly_deploy_token
changed_when: false
no_log: true
check_mode: false
tags: [forgejo_actions_secrets]
- name: Fetch Zot CI API key for Forgejo Actions
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/w3663ffnvkewbftncqxtcpeavy/zot-ci-api"
delegate_to: localhost
register: _zot_ci_api_key
changed_when: false
no_log: true
check_mode: false
tags: [forgejo_actions_secrets]
- name: Set Forgejo Actions secrets facts
ansible.builtin.set_fact:
forgejo_api_token: "{{ _forgejo_api_token.stdout }}"
forgejo_secret_argocd_token: "{{ _forgejo_argocd_token.stdout }}"
forgejo_secret_fly_deploy_token: "{{ _fly_deploy_token.stdout }}"
forgejo_secret_zot_ci_api_key: "{{ _zot_ci_api_key.stdout }}"
no_log: true
tags: [forgejo_actions_secrets]
# Zot OIDC client secret
- name: Fetch zot OIDC client secret
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/oor7os5kapczgpbwv7obkca4y4/zot-client-secret"
delegate_to: localhost
register: _zot_oidc_secret
changed_when: false
no_log: true
check_mode: false
tags: [zot]
- name: Set zot OIDC client secret fact
ansible.builtin.set_fact:
zot_oidc_client_secret: "{{ _zot_oidc_secret.stdout }}"
no_log: true
tags: [zot]
# Caddy Gandi token for ACME DNS-01 challenges
- name: Fetch Gandi PAT for Caddy
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/mco6ka3dc3rmw7zkg2dhia5d2m/pat"
delegate_to: localhost
register: _caddy_gandi_token
changed_when: false
no_log: true
check_mode: false
tags: [caddy]
- name: Set Caddy Gandi token fact
ansible.builtin.set_fact:
caddy_gandi_token: "{{ _caddy_gandi_token.stdout }}"
no_log: true
tags: [caddy]
# Jellyfin SSO client secret
- name: Fetch Jellyfin OIDC client secret
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/oor7os5kapczgpbwv7obkca4y4/jellyfin-client-secret"
delegate_to: localhost
register: _jellyfin_oidc_secret
changed_when: false
no_log: true
check_mode: false
tags: [jellyfin]
- name: Set Jellyfin OIDC client secret fact
ansible.builtin.set_fact:
jellyfin_sso_client_secret: "{{ _jellyfin_oidc_secret.stdout }}"
no_log: true
tags: [jellyfin]
# Jellyfin API key for metrics collection
- name: Fetch Jellyfin API key
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/ceywxkcd3z7najsy2nmmbs2vke/credential"
delegate_to: localhost
register: _jellyfin_metrics_api_key
changed_when: false
no_log: true
check_mode: false
tags: [jellyfin_metrics]
- name: Set Jellyfin API key fact
ansible.builtin.set_fact:
jellyfin_metrics_api_key: "{{ _jellyfin_metrics_api_key.stdout }}"
no_log: true
tags: [jellyfin_metrics]
# Forgejo API token for metrics collection
- name: Fetch Forgejo API token for metrics
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/w3663ffnvkewbftncqxtcpeavy/api-token"
delegate_to: localhost
register: _forgejo_metrics_api_token
changed_when: false
no_log: true
check_mode: false
tags: [forgejo_metrics]
- name: Set Forgejo metrics API token fact
ansible.builtin.set_fact:
forgejo_metrics_api_key: "{{ _forgejo_metrics_api_token.stdout }}"
no_log: true
tags: [forgejo_metrics]
# Devpi root password (PyPI mirror admin)
- name: Fetch devpi root password
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/kyhzfifryqnuk7jeyibmmjvxxm/add more/root password"
delegate_to: localhost
register: _devpi_root_password
changed_when: false
no_log: true
check_mode: false
tags: [devpi]
- name: Set devpi root password fact
ansible.builtin.set_fact:
devpi_root_password: "{{ _devpi_root_password.stdout }}"
no_log: true
tags: [devpi]
roles:
- role: alloy
tags: alloy
@ -70,15 +238,29 @@
tags: borgmatic_metrics
- role: forgejo
tags: forgejo
- role: forgejo_actions_secrets
tags: forgejo_actions_secrets
- role: zot
tags: zot
- role: zot_metrics
tags: zot_metrics
- role: devpi
tags: devpi
- role: minikube
tags: minikube
- role: minikube_metrics
tags: minikube_metrics
- role: plex_metrics
tags: plex_metrics
- role: tailscale_serve
tags: tailscale-serve
- role: jellyfin
tags: jellyfin
- role: jellyfin_metrics
tags: jellyfin_metrics
- role: forgejo_metrics
tags: forgejo_metrics
- role: cv
tags: cv
- role: docs
tags: docs
- role: heph
tags: heph
- role: caddy
tags: caddy

View file

@ -0,0 +1,118 @@
---
- name: Configure ringtail (NixOS)
hosts: ringtail
become: true
pre_tasks:
- name: Fetch 1Password Connect credentials from 1Password
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/1Password Connect/credentials-file"
register: _op_credentials
changed_when: false
delegate_to: localhost
become: false
- name: Fetch 1Password Connect token from 1Password
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/1Password Connect/token"
register: _op_token
changed_when: false
delegate_to: localhost
become: false
- name: Fetch Forgejo runner registration token from 1Password
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/Forgejo Secrets/runner_reg"
register: _runner_reg
changed_when: false
delegate_to: localhost
become: false
- name: Ensure /etc/forgejo-runner directory exists
ansible.builtin.file:
path: /etc/forgejo-runner
state: directory
mode: "0700"
- name: Write Forgejo runner token file
ansible.builtin.copy:
content: "TOKEN={{ _runner_reg.stdout }}"
dest: /etc/forgejo-runner/token.env
mode: "0600"
no_log: true
- name: Ensure /etc/k3s directory exists
ansible.builtin.file:
path: /etc/k3s
state: directory
mode: "0700"
- name: Generate k3s token if not present
ansible.builtin.copy:
content: "{{ lookup('ansible.builtin.password', '/dev/null', chars=['hexdigits'], length=32) }}"
dest: /etc/k3s/token
mode: "0600"
force: false
tasks:
- name: Ensure blumeops repo is present
ansible.builtin.git:
repo: "https://forge.ops.eblu.me/eblume/blumeops.git"
dest: /etc/blumeops
version: "{{ ringtail_commit | default('main') }}"
force: true
register: _repo
- name: Rebuild NixOS
ansible.builtin.command:
cmd: nixos-rebuild switch --flake /etc/blumeops/nixos/ringtail#ringtail
register: _rebuild
changed_when: "'activating the configuration' in _rebuild.stderr"
when: _repo.changed
- name: Verify tailscale is connected
ansible.builtin.command: tailscale status --self --json
register: _ts_status
changed_when: false
failed_when: "'Running' not in _ts_status.stdout"
post_tasks:
- name: Wait for k3s to be ready
ansible.builtin.command: k3s kubectl get nodes
register: _k3s_ready
changed_when: false
retries: 30
delay: 5
until: _k3s_ready.rc == 0
- name: Create 1password namespace
ansible.builtin.command: k3s kubectl create namespace 1password
register: _ns
changed_when: _ns.rc == 0
failed_when: _ns.rc != 0 and 'AlreadyExists' not in _ns.stderr
- name: Create or update op-credentials secret
ansible.builtin.shell:
cmd: |
set -o pipefail
k3s kubectl create secret generic op-credentials \
--namespace=1password \
--from-literal=1password-credentials.json='{{ _op_credentials.stdout }}' \
--dry-run=client -o yaml | k3s kubectl apply -f -
executable: /run/current-system/sw/bin/bash
register: _op_credentials_apply
changed_when: "'configured' in _op_credentials_apply.stdout or 'created' in _op_credentials_apply.stdout"
no_log: true
- name: Create or update onepassword-token secret
ansible.builtin.shell:
cmd: |
set -o pipefail
k3s kubectl create secret generic onepassword-token \
--namespace=1password \
--from-literal=token={{ _op_token.stdout }} \
--dry-run=client -o yaml | k3s kubectl apply -f -
executable: /run/current-system/sw/bin/bash
register: _op_token_apply
changed_when: "'configured' in _op_token_apply.stdout or 'created' in _op_token_apply.stdout"
no_log: true

View file

@ -0,0 +1,7 @@
---
- name: Configure sifaka
hosts: nas
roles:
- role: sifaka_exporters
tags: sifaka_exporters

View file

@ -10,10 +10,10 @@
# Build on dev machine (gilbert), then copy to indri:
#
# 1. Clone from forge mirror:
# git clone ssh://forgejo@forge.tail8d86e.ts.net/eblume/alloy.git ~/code/3rd/alloy
# git clone ssh://forgejo@forge.ops.eblu.me:2222/mirrors/alloy.git ~/code/3rd/alloy
#
# 2. Set up build tools via mise:
# cd ~/code/3rd/alloy && mise use go@1.25 node yarn
# cd ~/code/3rd/alloy && mise use go@1.25.7 node yarn
#
# 3. Build with CGO enabled (default in Makefile):
# cd ~/code/3rd/alloy && mise x -- make alloy
@ -21,7 +21,10 @@
# 4. Copy binary to indri:
# scp ~/code/3rd/alloy/build/alloy indri:~/.local/bin/alloy
#
# 5. Run ansible to deploy config and LaunchAgent
# 5. Ad-hoc codesign on indri (SCP'd binaries get quarantined by macOS):
# ssh indri 'codesign --sign - --force ~/.local/bin/alloy'
#
# 6. Run ansible to deploy config and LaunchAgent
# Binary and paths
alloy_binary: /Users/erichblume/.local/bin/alloy
@ -32,11 +35,11 @@ alloy_log_dir: /Users/erichblume/Library/Logs
# Textfile collector directory (same as node_exporter for compatibility)
alloy_textfile_dir: /opt/homebrew/var/node_exporter/textfile
# Prometheus remote write endpoint (k8s via Tailscale)
alloy_prometheus_url: "https://prometheus.tail8d86e.ts.net/api/v1/write"
# Prometheus remote write endpoint (k8s via Caddy)
alloy_prometheus_url: "https://prometheus.ops.eblu.me/api/v1/write"
# Loki endpoint (k8s via Tailscale)
alloy_loki_url: "https://loki.tail8d86e.ts.net/loki/api/v1/push"
# Loki endpoint (k8s via Caddy)
alloy_loki_url: "https://loki.ops.eblu.me/loki/api/v1/push"
# Instance label for metrics
alloy_instance_label: indri
@ -72,11 +75,12 @@ alloy_mcquack_logs:
- path: /Users/erichblume/Library/Logs/mcquack.zot.err.log
service: zot
stream: stderr
alloy_plex_logs:
- path: /Users/erichblume/Library/Logs/Plex Media Server/Plex Media Server.log
service: plex
- path: /Users/erichblume/Library/Logs/mcquack.jellyfin.out.log
service: jellyfin
stream: stdout
- path: /Users/erichblume/Library/Logs/mcquack.jellyfin.err.log
service: jellyfin
stream: stderr
# Enable log collection (requires Loki to be running)
alloy_collect_logs: true
@ -97,6 +101,10 @@ alloy_op_vault: vg6xf6vvfmoh5hqjjhlhbeoaie
alloy_op_postgres_item: guxu3j7ajhjyey6xxl2ovsl2ui
alloy_op_postgres_field: alloy-user-pw
# Forgejo metrics collection
alloy_collect_forgejo: true
alloy_forgejo_port: 3001
# macOS power metrics collection (via powermetrics, requires root)
alloy_collect_power_metrics: true
alloy_power_metrics_script: /usr/local/bin/macos-power-metrics

View file

@ -38,9 +38,7 @@
- name: Fetch PostgreSQL metrics password from 1Password
ansible.builtin.command:
cmd: >-
op --vault {{ alloy_op_vault }} item get {{ alloy_op_postgres_item }}
--fields {{ alloy_op_postgres_field }} --reveal
cmd: op read "op://{{ alloy_op_vault }}/{{ alloy_op_postgres_item }}/{{ alloy_op_postgres_field }}"
delegate_to: localhost
register: alloy_postgres_password_result
changed_when: false

View file

@ -29,6 +29,11 @@ prometheus.relabel "instance" {
target_label = "instance"
replacement = "{{ alloy_instance_label }}"
}
rule {
target_label = "cluster"
replacement = "indri"
}
}
// Push metrics to Prometheus via remote_write
@ -69,6 +74,18 @@ prometheus.scrape "zot" {
}
{% endif %}
{% if alloy_collect_forgejo | default(false) %}
// ============== FORGEJO METRICS ==============
// Scrape Forgejo's native metrics endpoint
prometheus.scrape "forgejo" {
targets = [{"__address__" = "localhost:{{ alloy_forgejo_port }}"}]
metrics_path = "/metrics"
forward_to = [prometheus.relabel.instance.receiver]
scrape_interval = "{{ alloy_scrape_interval }}"
}
{% endif %}
{% if alloy_collect_logs %}
// ============== LOG COLLECTION ==============
@ -90,15 +107,6 @@ local.file_match "mcquack_logs" {
]
}
// Discover log files - Plex Media Server
local.file_match "plex_logs" {
path_targets = [
{% for log in alloy_plex_logs %}
{__path__ = "{{ log.path }}", service = "{{ log.service }}", stream = "{{ log.stream }}"},
{% endfor %}
]
}
// Read and forward brew service logs
loki.source.file "brew_logs" {
targets = local.file_match.brew_logs.targets
@ -111,12 +119,6 @@ loki.source.file "mcquack_logs" {
forward_to = [loki.relabel.add_host.receiver]
}
// Read and forward Plex logs
loki.source.file "plex_logs" {
targets = local.file_match.plex_logs.targets
forward_to = [loki.relabel.add_host.receiver]
}
// Add host label to all logs
loki.relabel "add_host" {
forward_to = [loki.write.loki.receiver]
@ -125,6 +127,11 @@ loki.relabel "add_host" {
target_label = "host"
replacement = "{{ alloy_instance_label }}"
}
rule {
target_label = "cluster"
replacement = "indri"
}
}
// Write logs to Loki

View file

@ -6,6 +6,16 @@ borgmatic_log_dir: /Users/erichblume/Library/Logs
# Full path to borg binary since LaunchAgent doesn't have homebrew in PATH
borgmatic_local_path: /opt/homebrew/bin/borg
# Borgmatic version — keep in sync with mise.toml in the repo root.
# Ansible installs this via `mise install` so indri doesn't need the repo cloned.
borgmatic_version: "2.1.4"
# Full path to borgmatic binary — called directly by LaunchAgents to avoid
# routing through mise, which triggers macOS TCC permission dialogs for
# protected folders (e.g. ~/Documents) that hang headless LaunchAgent sessions.
# Uses mise's "latest" symlink so version bumps don't break the LaunchAgent path.
borgmatic_bin: /Users/erichblume/.local/share/mise/installs/pipx-borgmatic/latest/bin/borgmatic
# Schedule: runs daily at 2:00 AM
borgmatic_schedule_hour: 2
borgmatic_schedule_minute: 0
@ -16,14 +26,47 @@ borgmatic_source_directories:
- /opt/homebrew/var/forgejo
- /Users/erichblume/.config/borgmatic
- /Users/erichblume/Documents
- /Users/erichblume/Pictures
- /Users/erichblume/.local/share/borgmatic/k8s-dumps
# Shower app prize-photo uploads (sifaka SMB mount). Mounted manually
# on indri via Finder — see docs/how-to/operations/shower-app.md.
- /Volumes/shower
# Backup repository
# Backup repositories
borgmatic_repositories:
- path: /Volumes/backups/borg/
label: sifaka-borg-backups
encryption: repokey
append_only: true
- path: ssh://u3ugi1x1@u3ugi1x1.repo.borgbase.com/./repo
label: borgbase-offsite
encryption: repokey
append_only: true
# BorgBase SSH key (fetched from 1Password in playbook pre_tasks)
borgmatic_borgbase_ssh_key_path: /Users/erichblume/.ssh/borgbase_ed25519
# Directory for pre-backup database dumps from k8s pods
borgmatic_k8s_dump_dir: /Users/erichblume/.local/share/borgmatic/k8s-dumps
# K8s SQLite databases to dump before backup via kubectl exec
# Each entry runs: kubectl exec <pod-selector> -- sqlite3 <path> ".backup /tmp/backup.db"
# then copies the dump to borgmatic_k8s_dump_dir/<name>.db
borgmatic_k8s_sqlite_dumps:
- name: mealie
namespace: mealie
label_selector: app=mealie
db_path: /app/data/mealie.db
# migrated to ringtail (wave-1); ssh to ringtail and run k3s kubectl
# there, same as shower below.
target: ssh:eblume@ringtail
- name: shower
namespace: shower
label_selector: app=shower
db_path: /app/data/db.sqlite3
# ssh to ringtail and run k3s kubectl there — avoids needing a
# ringtail kubeconfig on indri. k3s.yaml on ringtail is
# world-readable (mode 644), so no sudo required.
target: ssh:eblume@ringtail
# Exclude patterns
borgmatic_exclude_patterns: []
@ -39,14 +82,42 @@ borgmatic_keep_yearly: 1000
# PostgreSQL databases to backup (streamed via pg_dump)
# Password is read from ~/.pgpass (managed by this role)
# pg_dump_command must be full path since LaunchAgent doesn't have homebrew in PATH
# --- Immich photo library backup (BorgBase offsite only) ---
borgmatic_photos_config: /Users/erichblume/.config/borgmatic/photos.yaml
borgmatic_photos_source_directories:
- /Volumes/photos/library
- /Volumes/photos/upload
borgmatic_photos_borgbase_repo: ssh://xcrtl5tg@xcrtl5tg.repo.borgbase.com/./repo
# Schedule: runs daily at 4:00 AM (offset from main backup at 2:00 AM)
borgmatic_photos_schedule_hour: 4
borgmatic_photos_schedule_minute: 0
# Retention: photos are precious, keep more history
borgmatic_photos_keep_daily: 7
borgmatic_photos_keep_monthly: 12
borgmatic_photos_keep_yearly: 1000
borgmatic_pg_dump_command: /opt/homebrew/opt/postgresql@18/bin/pg_dump
borgmatic_postgresql_databases:
# k8s PostgreSQL (CloudNativePG)
# k8s PostgreSQL (CloudNativePG) via Caddy L4 proxy
- name: miniflux
hostname: pg.tail8d86e.ts.net
hostname: pg.ops.eblu.me
port: 5432
username: borgmatic
- name: authentik
hostname: pg.ops.eblu.me
port: 5432
username: borgmatic
# migrated to ringtail blumeops-pg (wave-1); port 5434 = Caddy L4 route
- name: teslamate
hostname: pg.tail8d86e.ts.net
port: 5432
hostname: pg.ops.eblu.me
port: 5434
username: borgmatic
- name: paperless
hostname: pg.ops.eblu.me
port: 5434
username: borgmatic
# immich-pg cluster (VectorChord) via Caddy L4 on port 5433
- name: immich
hostname: pg.ops.eblu.me
port: 5433
username: borgmatic

View file

@ -4,3 +4,9 @@
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.borgmatic.plist 2>/dev/null || true
launchctl load ~/Library/LaunchAgents/mcquack.eblume.borgmatic.plist
changed_when: true
- name: Reload borgmatic-photos
ansible.builtin.shell: |
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.borgmatic-photos.plist 2>/dev/null || true
launchctl load ~/Library/LaunchAgents/mcquack.eblume.borgmatic-photos.plist
changed_when: true

View file

@ -1,6 +1,11 @@
---
# Note: borgmatic is installed via mise (pipx), not managed here.
# This role manages the config file and scheduled LaunchAgent.
# Borgmatic is installed via mise (pipx) and called directly by LaunchAgents.
# This role manages installation, config, and the scheduled LaunchAgents.
- name: Install borgmatic via mise
ansible.builtin.command: mise install pipx:borgmatic@{{ borgmatic_version }}
register: borgmatic_install
changed_when: "'installed' in borgmatic_install.stderr"
- name: Ensure borgmatic config directory exists
ansible.builtin.file:
@ -14,11 +19,52 @@
ansible.builtin.copy:
content: |
# Managed by ansible (borgmatic role) - k8s PostgreSQL backup credentials
pg.tail8d86e.ts.net:5432:*:borgmatic:{{ borgmatic_db_password }}
# 5432 = minikube blumeops-pg, 5433 = immich-pg, 5434 = ringtail blumeops-pg
pg.ops.eblu.me:5432:*:borgmatic:{{ borgmatic_db_password }}
pg.ops.eblu.me:5433:*:borgmatic:{{ borgmatic_db_password }}
pg.ops.eblu.me:5434:*:borgmatic:{{ borgmatic_db_password }}
dest: ~/.pgpass
mode: '0600'
no_log: true
# BorgBase offsite backup - SSH key and host verification
- name: Deploy BorgBase SSH private key
ansible.builtin.copy:
content: "{{ borgbase_ssh_private_key }}\n"
dest: "{{ borgmatic_borgbase_ssh_key_path }}"
mode: '0600'
no_log: true
- name: Add BorgBase host keys to known_hosts
ansible.builtin.known_hosts:
name: "{{ item }}"
key: "{{ item }} ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIGU0mISTyHBw9tBs6SuhSq8tvNM8m9eifQxM+88TowPO"
state: present
loop:
- u3ugi1x1.repo.borgbase.com
- xcrtl5tg.repo.borgbase.com
- name: Ensure k8s dump directory exists
ansible.builtin.file:
path: "{{ borgmatic_k8s_dump_dir }}"
state: directory
mode: '0700'
when: borgmatic_k8s_sqlite_dumps | length > 0
- name: Ensure ~/bin exists
ansible.builtin.file:
path: "{{ ansible_env.HOME }}/bin"
state: directory
mode: '0755'
when: borgmatic_k8s_sqlite_dumps | length > 0
- name: Deploy k8s SQLite dump helper script
ansible.builtin.template:
src: k8s-sqlite-dump.sh.j2
dest: "{{ ansible_env.HOME }}/bin/borgmatic-k8s-sqlite-dump"
mode: '0755'
when: borgmatic_k8s_sqlite_dumps | length > 0
- name: Deploy borgmatic configuration
ansible.builtin.template:
src: config.yaml.j2
@ -43,3 +89,30 @@
when: borgmatic_launchctl_check.rc != 0
changed_when: true
failed_when: false
# --- Immich photo library backup (BorgBase offsite only) ---
- name: Deploy borgmatic photos configuration
ansible.builtin.template:
src: photos.yaml.j2
dest: "{{ borgmatic_photos_config }}"
mode: '0600'
- name: Deploy borgmatic-photos LaunchAgent plist
ansible.builtin.template:
src: borgmatic-photos.plist.j2
dest: ~/Library/LaunchAgents/mcquack.eblume.borgmatic-photos.plist
mode: '0644'
notify: Reload borgmatic-photos
- name: Check if borgmatic-photos LaunchAgent is loaded
ansible.builtin.command: launchctl list mcquack.eblume.borgmatic-photos
register: borgmatic_photos_launchctl_check
changed_when: false
failed_when: false
- name: Load borgmatic-photos LaunchAgent if not loaded
ansible.builtin.command: launchctl load ~/Library/LaunchAgents/mcquack.eblume.borgmatic-photos.plist
when: borgmatic_photos_launchctl_check.rc != 0
changed_when: true
failed_when: false

View file

@ -0,0 +1,36 @@
<?xml version="1.0" encoding="UTF-8"?>
<!-- {{ ansible_managed }} -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>KeepAlive</key>
<false/>
<key>Label</key>
<string>mcquack.eblume.borgmatic-photos</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/opt/homebrew/bin:/usr/bin:/bin</string>
</dict>
<key>ProgramArguments</key>
<array>
<string>{{ borgmatic_bin }}</string>
<string>--config</string>
<string>{{ borgmatic_photos_config }}</string>
<string>create</string>
</array>
<key>RunAtLoad</key>
<false/>
<key>StandardErrorPath</key>
<string>{{ borgmatic_log_dir }}/mcquack.borgmatic-photos.err.log</string>
<key>StandardOutPath</key>
<string>{{ borgmatic_log_dir }}/mcquack.borgmatic-photos.out.log</string>
<key>StartCalendarInterval</key>
<dict>
<key>Hour</key>
<integer>{{ borgmatic_photos_schedule_hour }}</integer>
<key>Minute</key>
<integer>{{ borgmatic_photos_schedule_minute }}</integer>
</dict>
</dict>
</plist>

View file

@ -14,16 +14,13 @@
</dict>
<key>ProgramArguments</key>
<array>
<string>/opt/homebrew/opt/mise/bin/mise</string>
<string>x</string>
<string>--</string>
<string>borgmatic</string>
<string>{{ borgmatic_bin }}</string>
<string>--config</string>
<string>{{ borgmatic_config }}</string>
<string>create</string>
</array>
<key>RunAtLoad</key>
<true/>
<false/>
<key>StandardErrorPath</key>
<string>{{ borgmatic_log_dir }}/mcquack.borgmatic.err.log</string>
<key>StandardOutPath</key>

View file

@ -31,6 +31,26 @@ exclude_patterns:
encryption_passcommand: {{ borgmatic_encryption_passcommand }}
{% if borgmatic_k8s_sqlite_dumps %}
# Pre-backup: dump SQLite databases from k8s pods.
# Uses sqlite3.backup() for a safe, consistent copy.
#
# Quoting/escaping is delegated to ~/bin/borgmatic-k8s-sqlite-dump
# (deployed by the borgmatic ansible role). Each entry's `target`
# is either:
# - local:<context> -> local kubectl with --context (mealie etc.)
# - ssh:<user@host> -> ssh + k3s kubectl on the cluster host,
# used for ringtail since indri's kubeconfig
# deliberately doesn't carry that context.
before_backup:
- mkdir -p {{ borgmatic_k8s_dump_dir }}
{% for db in borgmatic_k8s_sqlite_dumps %}
- {{ ansible_env.HOME }}/bin/borgmatic-k8s-sqlite-dump {{ db.target }} {{ db.namespace }} {{ db.label_selector }} {{ db.db_path }} {{ db.name }} {{ borgmatic_k8s_dump_dir }}/{{ db.name }}.db
{% endfor %}
{% endif %}
ssh_command: ssh -o IdentitiesOnly=yes -i {{ borgmatic_borgbase_ssh_key_path }}
# Retention policy
keep_daily: {{ borgmatic_keep_daily }}
keep_monthly: {{ borgmatic_keep_monthly }}

View file

@ -0,0 +1,73 @@
#!/usr/bin/env bash
# {{ ansible_managed }}
#
# Helper script invoked by borgmatic's before_backup hook to capture a
# k8s pod's SQLite database. Keeps the borgmatic config readable by
# pulling all the quoting out of YAML.
#
# Usage:
# borgmatic-k8s-sqlite-dump <target> <namespace> <selector> \
# <db_path> <name> <dump_target>
#
# <target> is one of:
# local:<context> - run local kubectl with --context=<context>
# ssh:<user@host> - ssh to host and run k3s kubectl there
# (no indri-side kubeconfig needed)
#
# <namespace> - k8s namespace of the pod
# <selector> - label selector to find the pod (e.g. app=shower)
# <db_path> - absolute path inside the pod to the SQLite DB
# <name> - short name used for temp filenames
# <dump_target> - file on this host to receive the dump
set -euo pipefail
target=${1:?missing target}
namespace=${2:?missing namespace}
selector=${3:?missing selector}
db_path=${4:?missing db path}
name=${5:?missing name}
dump_target=${6:?missing dump target}
# Stage the backup next to the source DB (a guaranteed-writable volume);
# minimal nix images (e.g. mealie) have no /tmp.
pod_tmp="$(dirname "$db_path")/.borgmatic-backup-${name}.db"
python_backup='import sqlite3; sqlite3.connect("'"$db_path"'").backup(sqlite3.connect("'"$pod_tmp"'"))'
mode=${target%%:*}
ref=${target#*:}
case "$mode" in
local)
# Pulls dump bytes out via "kubectl exec -- cat" rather than
# "kubectl cp", which would otherwise need tar inside the pod
# (nix-built images like shower don't bundle tar).
context=$ref
kubectl="/opt/homebrew/bin/kubectl --context=$context -n $namespace"
pod=$($kubectl get pod -l "$selector" \
-o jsonpath='{.items[0].metadata.name}')
$kubectl exec "$pod" -- python3 -c "$python_backup"
$kubectl exec "$pod" -- cat "$pod_tmp" > "$dump_target"
$kubectl exec "$pod" -- rm -f "$pod_tmp"
;;
ssh)
host=$ref
# Force bash on the remote (user's login shell on ringtail is
# fish). Pipe the script via stdin to dodge nested quoting.
# The dump bytes come back over the ssh stdout stream — no
# intermediate scp, no tar requirement in the pod.
ssh "$host" bash <<EOF > "$dump_target"
set -euo pipefail
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
pod=\$(k3s kubectl -n "$namespace" get pod -l "$selector" -o jsonpath='{.items[0].metadata.name}')
k3s kubectl -n "$namespace" exec "\$pod" -- python3 -c '$python_backup' 1>&2
k3s kubectl -n "$namespace" exec "\$pod" -- cat "$pod_tmp"
k3s kubectl -n "$namespace" exec "\$pod" -- rm -f "$pod_tmp" 1>&2
EOF
;;
*)
echo "borgmatic-k8s-sqlite-dump: unknown target mode: $mode" >&2
echo " expected local:<context> or ssh:<user@host>" >&2
exit 1
;;
esac

View file

@ -0,0 +1,37 @@
# {{ ansible_managed }}
#
# Borgmatic config for immich photo library backup.
# Backs up library/ and upload/ from /Volumes/photos (sifaka SMB mount)
# to BorgBase offsite ONLY. Excludes encoded-video/, thumbs/, backups/
# since those are regenerable from originals.
#
# Separate from the main borgmatic config to keep concerns isolated:
# - main config: indri data → sifaka + borgbase
# - this config: sifaka photos → borgbase (different repo)
local_path: {{ borgmatic_local_path }}
source_directories:
{% for dir in borgmatic_photos_source_directories %}
- {{ dir }}
{% endfor %}
source_directories_must_exist: true
repositories:
- path: {{ borgmatic_photos_borgbase_repo }}
label: borgbase-immich-photos
encryption: repokey
append_only: true
encryption_passcommand: {{ borgmatic_encryption_passcommand }}
ssh_command: ssh -o IdentitiesOnly=yes -o ServerAliveInterval=30 -o ServerAliveCountMax=5 -i {{ borgmatic_borgbase_ssh_key_path }}
# Save checkpoints every 10 minutes so interrupted backups don't lose all progress
checkpoint_interval: 600
# Retention policy — photos are precious, keep more history
keep_daily: {{ borgmatic_photos_keep_daily }}
keep_monthly: {{ borgmatic_photos_keep_monthly }}
keep_yearly: {{ borgmatic_photos_keep_yearly }}

View file

@ -1,7 +1,15 @@
---
borgmatic_metrics_repo: /Volumes/backups/borg/
# Borg repositories to collect metrics from
# Each entry needs a path (local or ssh://) and a label for Prometheus metrics
borgmatic_metrics_repos:
- path: /Volumes/backups/borg/
label: sifaka-local
- path: ssh://xcrtl5tg@xcrtl5tg.repo.borgbase.com/./repo
label: borgbase-immich-photos
borgmatic_metrics_passcommand: cat /Users/erichblume/.borg/config.yaml
borgmatic_metrics_ssh_key: /Users/erichblume/.ssh/borgbase_ed25519
borgmatic_metrics_dir: /opt/homebrew/var/node_exporter/textfile
borgmatic_metrics_script: /Users/erichblume/bin/borgmatic-metrics
borgmatic_metrics_script: /Users/erichblume/.local/bin/borgmatic-metrics
borgmatic_metrics_interval: 3600 # seconds between metric collection (hourly)
borgmatic_metrics_log_dir: /opt/homebrew/var/log

View file

@ -1,11 +1,12 @@
#!/bin/bash
# {{ ansible_managed }}
# Collects borg backup metrics for node_exporter textfile collector
# Supports multiple repositories with a repo label for Prometheus
set -euo pipefail
export BORG_PASSCOMMAND="{{ borgmatic_metrics_passcommand }}"
BORG_REPO="{{ borgmatic_metrics_repo }}"
export BORG_RSH="ssh -o IdentitiesOnly=yes -i {{ borgmatic_metrics_ssh_key }}"
OUTPUT_FILE="{{ borgmatic_metrics_dir }}/borgmatic.prom"
TEMP_FILE="${OUTPUT_FILE}.tmp"
@ -13,129 +14,109 @@ TEMP_FILE="${OUTPUT_FILE}.tmp"
BORG_CMD="/opt/homebrew/bin/borg"
JQ_CMD="/opt/homebrew/bin/jq"
# Get repository info
repo_json=$($BORG_CMD info --json "$BORG_REPO" 2>/dev/null) || {
echo "Failed to get borg repo info" >&2
# Write down metric
cat > "$TEMP_FILE" << 'EOF'
# Start fresh
cat > "$TEMP_FILE" << 'EOF'
# HELP borgmatic_up Borg backup repository is accessible
# TYPE borgmatic_up gauge
borgmatic_up 0
EOF
mv "$TEMP_FILE" "$OUTPUT_FILE"
exit 0
}
# Get archive list
archives_json=$($BORG_CMD list --json "$BORG_REPO" 2>/dev/null) || {
echo "Failed to list borg archives" >&2
exit 1
}
# Extract repository stats
total_size=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.total_size')
total_csize=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.total_csize')
unique_size=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.unique_size')
unique_csize=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.unique_csize')
total_chunks=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.total_chunks')
unique_chunks=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.total_unique_chunks')
# Count archives
archive_count=$(echo "$archives_json" | $JQ_CMD -r '.archives | length')
# Get last archive info
last_archive_name=$(echo "$archives_json" | $JQ_CMD -r '.archives[-1].name // empty')
if [ -n "$last_archive_name" ]; then
# Get detailed info for the last archive
last_archive_json=$($BORG_CMD info --json "${BORG_REPO}::${last_archive_name}" 2>/dev/null) || {
echo "Failed to get last archive info" >&2
last_archive_json=""
}
if [ -n "$last_archive_json" ]; then
last_original_size=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].stats.original_size')
last_compressed_size=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].stats.compressed_size')
last_deduplicated_size=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].stats.deduplicated_size')
last_nfiles=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].stats.nfiles')
last_start=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].start')
last_end=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].end')
last_duration=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].duration')
# Convert timestamp to unix epoch
last_timestamp=$(date -j -f "%Y-%m-%dT%H:%M:%S" "${last_start%.*}" "+%s" 2>/dev/null || echo "0")
fi
fi
# Write metrics
cat > "$TEMP_FILE" << EOF
# HELP borgmatic_up Borg backup repository is accessible
# TYPE borgmatic_up gauge
borgmatic_up 1
# HELP borgmatic_repo_original_size_bytes Total original size of all archives (sum of what each backup contains)
# TYPE borgmatic_repo_original_size_bytes gauge
borgmatic_repo_original_size_bytes $total_size
# HELP borgmatic_repo_compressed_size_bytes Total compressed size of all archives
# TYPE borgmatic_repo_compressed_size_bytes gauge
borgmatic_repo_compressed_size_bytes $total_csize
# HELP borgmatic_repo_deduplicated_size_bytes Actual disk usage after deduplication (unique data)
# TYPE borgmatic_repo_deduplicated_size_bytes gauge
borgmatic_repo_deduplicated_size_bytes $unique_csize
# HELP borgmatic_repo_total_chunks Total number of chunks across all archives
# TYPE borgmatic_repo_total_chunks gauge
borgmatic_repo_total_chunks $total_chunks
# HELP borgmatic_repo_unique_chunks Number of unique chunks (after deduplication)
# TYPE borgmatic_repo_unique_chunks gauge
borgmatic_repo_unique_chunks $unique_chunks
# HELP borgmatic_archive_count Number of archives in the repository
# TYPE borgmatic_archive_count gauge
borgmatic_archive_count $archive_count
EOF
# Add last archive metrics if available
if [ -n "${last_original_size:-}" ]; then
cat >> "$TEMP_FILE" << EOF
# HELP borgmatic_last_archive_original_size_bytes Original size of the last archive (data being backed up)
# TYPE borgmatic_last_archive_original_size_bytes gauge
borgmatic_last_archive_original_size_bytes $last_original_size
# HELP borgmatic_last_archive_compressed_size_bytes Compressed size of the last archive
# TYPE borgmatic_last_archive_compressed_size_bytes gauge
borgmatic_last_archive_compressed_size_bytes $last_compressed_size
# HELP borgmatic_last_archive_deduplicated_size_bytes Deduplicated size of last archive (new data added)
# TYPE borgmatic_last_archive_deduplicated_size_bytes gauge
borgmatic_last_archive_deduplicated_size_bytes $last_deduplicated_size
# HELP borgmatic_last_archive_files Number of files in the last archive
# TYPE borgmatic_last_archive_files gauge
borgmatic_last_archive_files $last_nfiles
# HELP borgmatic_last_archive_timestamp Unix timestamp of the last backup
# TYPE borgmatic_last_archive_timestamp gauge
borgmatic_last_archive_timestamp $last_timestamp
# HELP borgmatic_last_archive_duration_seconds Duration of the last backup in seconds
# TYPE borgmatic_last_archive_duration_seconds gauge
borgmatic_last_archive_duration_seconds ${last_duration:-0}
EOF
# Collect per-source-directory sizes
cat >> "$TEMP_FILE" << 'EOF'
# HELP borgmatic_source_size_bytes Size of each backup source directory in bytes
# TYPE borgmatic_source_size_bytes gauge
EOF
# List archive contents and group by source directory
$BORG_CMD list "${BORG_REPO}::${last_archive_name}" --format "{size} {path}{NL}" 2>/dev/null | awk '
collect_repo_metrics() {
local repo_path="$1"
local repo_label="$2"
# Get repository info
repo_json=$($BORG_CMD info --json "$repo_path" 2>/dev/null) || {
echo "Failed to get borg repo info for $repo_label" >&2
echo "borgmatic_up{repo=\"$repo_label\"} 0" >> "$TEMP_FILE"
return
}
# Get archive list
archives_json=$($BORG_CMD list --json "$repo_path" 2>/dev/null) || {
echo "Failed to list borg archives for $repo_label" >&2
echo "borgmatic_up{repo=\"$repo_label\"} 0" >> "$TEMP_FILE"
return
}
# Extract repository stats
total_size=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.total_size')
total_csize=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.total_csize')
unique_size=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.unique_size')
unique_csize=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.unique_csize')
total_chunks=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.total_chunks')
unique_chunks=$(echo "$repo_json" | $JQ_CMD -r '.cache.stats.total_unique_chunks')
archive_count=$(echo "$archives_json" | $JQ_CMD -r '.archives | length')
cat >> "$TEMP_FILE" << EOF
borgmatic_up{repo="$repo_label"} 1
borgmatic_repo_original_size_bytes{repo="$repo_label"} $total_size
borgmatic_repo_compressed_size_bytes{repo="$repo_label"} $total_csize
borgmatic_repo_deduplicated_size_bytes{repo="$repo_label"} $unique_csize
borgmatic_repo_total_chunks{repo="$repo_label"} $total_chunks
borgmatic_repo_unique_chunks{repo="$repo_label"} $unique_chunks
borgmatic_archive_count{repo="$repo_label"} $archive_count
EOF
# Get last archive info
last_archive_name=$(echo "$archives_json" | $JQ_CMD -r '.archives[-1].name // empty')
if [ -z "$last_archive_name" ]; then
return
fi
# Get detailed info for the last archive
last_archive_json=$($BORG_CMD info --json "${repo_path}::${last_archive_name}" 2>/dev/null) || {
echo "Failed to get last archive info for $repo_label" >&2
return
}
last_original_size=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].stats.original_size')
last_compressed_size=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].stats.compressed_size')
last_deduplicated_size=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].stats.deduplicated_size')
last_nfiles=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].stats.nfiles')
last_start=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].start')
last_duration=$(echo "$last_archive_json" | $JQ_CMD -r '.archives[0].duration')
# Convert timestamp to unix epoch
last_timestamp=$(date -j -f "%Y-%m-%dT%H:%M:%S" "${last_start%.*}" "+%s" 2>/dev/null || echo "0")
cat >> "$TEMP_FILE" << EOF
borgmatic_last_archive_original_size_bytes{repo="$repo_label"} $last_original_size
borgmatic_last_archive_compressed_size_bytes{repo="$repo_label"} $last_compressed_size
borgmatic_last_archive_deduplicated_size_bytes{repo="$repo_label"} $last_deduplicated_size
borgmatic_last_archive_files{repo="$repo_label"} $last_nfiles
borgmatic_last_archive_timestamp{repo="$repo_label"} $last_timestamp
borgmatic_last_archive_duration_seconds{repo="$repo_label"} ${last_duration:-0}
EOF
# Collect per-source-directory sizes
$BORG_CMD list "${repo_path}::${last_archive_name}" --format "{size} {path}{NL}" 2>/dev/null | awk -v repo="$repo_label" '
{
size = $1
path = $2
@ -145,8 +126,10 @@ EOF
else if (path ~ /^Users\/[^\/]+\/devpi/) { source = "devpi" }
else if (path ~ /^Users\/[^\/]+\/code\/personal\/zk/) { source = "Zettelkasten" }
else if (path ~ /^Users\/[^\/]+\/.config\/borgmatic/) { source = "borgmatic_config" }
else if (path ~ /^Users\/[^\/]+\/.local\/share\/borgmatic/) { source = "k8s_dumps" }
else if (path ~ /^opt\/homebrew\/var\/forgejo/) { source = "Forgejo" }
else if (path ~ /^opt\/homebrew\/var\/loki/) { source = "Loki" }
else if (path ~ /^Volumes\/photos/) { source = "immich_photos" }
else if (path ~ /^borgmatic\/postgresql_databases/) { source = "PostgreSQL" }
else if (path ~ /^borgmatic\//) { source = "borgmatic_metadata" }
else { source = "other" }
@ -155,10 +138,15 @@ EOF
}
END {
for (src in totals) {
printf "borgmatic_source_size_bytes{source=\"%s\"} %.0f\n", src, totals[src]
printf "borgmatic_source_size_bytes{repo=\"%s\",source=\"%s\"} %.0f\n", repo, src, totals[src]
}
}' >> "$TEMP_FILE"
fi
}
# Collect metrics for each configured repository
{% for repo in borgmatic_metrics_repos %}
collect_repo_metrics "{{ repo.path }}" "{{ repo.label }}"
{% endfor %}
# Atomic move
mv "$TEMP_FILE" "$OUTPUT_FILE"

View file

@ -0,0 +1,128 @@
---
# Caddy reverse proxy configuration
# Caddy is built from ~/code/3rd/caddy with Gandi DNS and Layer 4 plugins
caddy_repo_dir: /Users/erichblume/code/3rd/caddy
caddy_binary: "{{ caddy_repo_dir }}/bin/caddy"
caddy_config_dir: /Users/erichblume/.config/caddy
caddy_data_dir: /Users/erichblume/.local/share/caddy
caddy_log_dir: /Users/erichblume/Library/Logs
# Gandi API token file (written by ansible, chmod 0600)
# Caddy reads this file for ACME DNS-01 challenges
caddy_gandi_token_file: /Users/erichblume/.config/caddy/gandi-token
# Domain configuration
caddy_domain: ops.eblu.me
# HTTPS port (443 is standard)
caddy_https_port: 443
# Services to proxy
# Format: { name: "service", host: "hostname", backend: "url" }
caddy_services:
# Indri-local services
- name: forge
host: "forge.{{ caddy_domain }}"
backend: "http://localhost:3001"
- name: registry
host: "registry.{{ caddy_domain }}"
backend: "http://localhost:5050"
- name: jellyfin
host: "jellyfin.{{ caddy_domain }}"
backend: "http://localhost:8096"
# K8s services (via Tailscale Ingress)
# Caddy proxies to existing Tailscale endpoints - traffic stays local
- name: grafana
host: "grafana.{{ caddy_domain }}"
backend: "https://grafana.tail8d86e.ts.net"
- name: argocd
host: "argocd.{{ caddy_domain }}"
backend: "https://argocd.tail8d86e.ts.net"
- name: prometheus
host: "prometheus.{{ caddy_domain }}"
backend: "https://prometheus.tail8d86e.ts.net"
- name: loki
host: "loki.{{ caddy_domain }}"
backend: "https://loki.tail8d86e.ts.net"
- name: miniflux
host: "feed.{{ caddy_domain }}"
backend: "https://feed.tail8d86e.ts.net"
- name: devpi
host: "pypi.{{ caddy_domain }}"
backend: "http://localhost:3141"
- name: heph
host: "heph.{{ caddy_domain }}"
backend: "http://localhost:8787" # hephaestus hub (server mode) + PWA shell
- name: kiwix
host: "kiwix.{{ caddy_domain }}"
backend: "https://kiwix.tail8d86e.ts.net"
- name: torrent
host: "torrent.{{ caddy_domain }}"
backend: "https://torrent.tail8d86e.ts.net"
- name: teslamate
host: "tesla.{{ caddy_domain }}"
backend: "https://tesla.tail8d86e.ts.net"
- name: immich
host: "photos.{{ caddy_domain }}"
backend: "https://photos.tail8d86e.ts.net"
- name: navidrome
host: "dj.{{ caddy_domain }}"
backend: "https://dj.tail8d86e.ts.net"
- name: homepage
host: "go.{{ caddy_domain }}"
backend: "https://go.tail8d86e.ts.net"
- name: docs
host: "docs.{{ caddy_domain }}"
kind: static
root: "{{ docs_content_dir }}"
try_html: true # Quartz: path → path/ → path.html → 404.html
- name: cv
host: "cv.{{ caddy_domain }}"
kind: static
root: "{{ cv_content_dir }}"
download_paths:
- path: /resume.pdf
filename: erich-blume-resume.pdf
- name: nvr
host: "nvr.{{ caddy_domain }}"
backend: "https://nvr.tail8d86e.ts.net"
- name: authentik
host: "authentik.{{ caddy_domain }}"
backend: "https://authentik.tail8d86e.ts.net"
cache_policy: spa
- name: ntfy
host: "ntfy.{{ caddy_domain }}"
backend: "https://ntfy.tail8d86e.ts.net"
- name: ollama
host: "ollama.{{ caddy_domain }}"
backend: "https://ollama.tail8d86e.ts.net"
- name: mealie
host: "meals.{{ caddy_domain }}"
backend: "https://meals.tail8d86e.ts.net"
- name: paperless
host: "paperless.{{ caddy_domain }}"
backend: "https://paperless.tail8d86e.ts.net"
- name: shower
host: "shower.{{ caddy_domain }}"
backend: "https://shower.tail8d86e.ts.net"
- name: sifaka
host: "nas.{{ caddy_domain }}"
backend: "http://sifaka:5000"
# Layer 4 (TCP) services
# Format: { port: external_port, backend: "host:port" }
caddy_tcp_services:
- port: 2222
backend: "localhost:2200" # Forgejo SSH
- port: 5432
backend: "pg.tail8d86e.ts.net:5432" # PostgreSQL (blumeops-pg)
- port: 5433
backend: "immich-pg.tail8d86e.ts.net:5432" # PostgreSQL (immich-pg)
- port: 5434
backend: "blumeops-pg-ringtail.tail8d86e.ts.net:5432" # PostgreSQL (blumeops-pg on ringtail)
- port: "{{ sifaka_node_exporter_port }}"
backend: "sifaka:{{ sifaka_node_exporter_port }}" # Sifaka node_exporter
- port: "{{ sifaka_smartctl_exporter_port }}"
backend: "sifaka:{{ sifaka_smartctl_exporter_port }}" # Sifaka smartctl_exporter

View file

@ -0,0 +1,6 @@
---
- name: Restart caddy
ansible.builtin.shell: |
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.caddy.plist 2>/dev/null || true
launchctl load ~/Library/LaunchAgents/mcquack.eblume.caddy.plist
changed_when: true

View file

@ -0,0 +1,80 @@
---
# Caddy reverse proxy deployment
# Binary is built manually - see ~/code/3rd/caddy/mise.toml
- name: Verify caddy binary exists
ansible.builtin.stat:
path: "{{ caddy_binary }}"
register: caddy_bin
failed_when: not caddy_bin.stat.exists
changed_when: false
- name: Create caddy config directory
ansible.builtin.file:
path: "{{ caddy_config_dir }}"
state: directory
mode: "0755"
- name: Create caddy data directory
ansible.builtin.file:
path: "{{ caddy_data_dir }}"
state: directory
mode: "0755"
- name: Fetch Gandi PAT (when running with --tags caddy)
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/mco6ka3dc3rmw7zkg2dhia5d2m/pat"
delegate_to: localhost
register: _caddy_gandi_token_fallback
changed_when: false
no_log: true
check_mode: false
when: caddy_gandi_token is not defined
- name: Set Gandi token fact (fallback)
ansible.builtin.set_fact:
caddy_gandi_token: "{{ _caddy_gandi_token_fallback.stdout }}"
no_log: true
when: caddy_gandi_token is not defined
- name: Write Gandi token file
ansible.builtin.copy:
content: "{{ caddy_gandi_token }}"
dest: "{{ caddy_gandi_token_file }}"
mode: "0600"
no_log: true
notify: Restart caddy
- name: Deploy Caddyfile
ansible.builtin.template:
src: Caddyfile.j2
dest: "{{ caddy_config_dir }}/Caddyfile"
mode: "0644"
notify: Restart caddy
- name: Deploy caddy wrapper script
ansible.builtin.template:
src: caddy-wrapper.sh.j2
dest: "{{ caddy_config_dir }}/caddy-wrapper.sh"
mode: "0755"
notify: Restart caddy
- name: Deploy caddy LaunchAgent plist
ansible.builtin.template:
src: caddy.plist.j2
dest: ~/Library/LaunchAgents/mcquack.eblume.caddy.plist
mode: "0644"
notify: Restart caddy
- name: Check if caddy LaunchAgent is loaded
ansible.builtin.command:
cmd: launchctl list mcquack.eblume.caddy
register: caddy_launchctl
changed_when: false
failed_when: false
- name: Load caddy LaunchAgent
ansible.builtin.command:
cmd: launchctl load ~/Library/LaunchAgents/mcquack.eblume.caddy.plist
when: caddy_launchctl.rc != 0
changed_when: true

View file

@ -0,0 +1,87 @@
# Caddy reverse proxy for blumeops services
# Managed by ansible - do not edit manually
#
# All *.{{ caddy_domain }} requests are proxied to backend services.
# TLS certificates are obtained via ACME DNS-01 challenge using Gandi.
{
# Global options
admin off
{% if caddy_tcp_services %}
# Layer 4 (TCP) routing
layer4 {
{% for tcp_svc in caddy_tcp_services %}
:{{ tcp_svc.port }} {
route {
proxy {{ tcp_svc.backend }}
}
}
{% endfor %}
}
{% endif %}
}
# Wildcard certificate for all services
*.{{ caddy_domain }}:{{ caddy_https_port }} {
tls {
dns gandi {env.GANDI_BEARER_TOKEN}
}
{% for service in caddy_services %}
@{{ service.name }} host {{ service.host }}
handle @{{ service.name }} {
{% if service.kind | default('proxy') == 'static' %}
root * {{ service.root }}
encode gzip
# Long-cache fingerprinted assets; everything else stays default.
@{{ service.name }}_assets path_regexp \.(css|js|png|jpg|jpeg|gif|ico|svg|woff|woff2)$
header @{{ service.name }}_assets Cache-Control "public, max-age=31536000, immutable"
{% for dl in service.download_paths | default([]) %}
@{{ service.name }}_dl{{ loop.index }} path {{ dl.path }}
header @{{ service.name }}_dl{{ loop.index }} Content-Disposition `attachment; filename="{{ dl.filename }}"`
{% endfor %}
{% if service.try_html | default(false) %}
# Quartz clean URLs: path → path/ → path.html → /404.html (200).
# Caddy's handle_errors is a top-level directive and can't live in
# this nested handle, so the 404 page rides as the final try_files
# candidate (served with 200 — acceptable for a human-facing 404).
try_files {path} {path}/ {path}.html /404.html
{% endif %}
file_server
{% else %}
{% if service.cache_policy | default('') == 'spa' %}
# SPA cache policy: hashed static assets are immutable, HTML must revalidate.
# Prevents stale HTML from referencing chunk hashes that no longer exist.
@{{ service.name }}_static path /static/dist/*
header @{{ service.name }}_static Cache-Control "public, max-age=31536000, immutable"
@{{ service.name }}_html path /if/*
header @{{ service.name }}_html Cache-Control "no-cache"
{% endif %}
{% if service.backend.startswith('https://') %}
reverse_proxy {{ service.backend }} {
# Caddy v2.11+ rewrites Host to upstream for HTTPS backends.
# Preserve the original Host so services see *.ops.eblu.me.
header_up Host {http.request.host}
}
{% else %}
reverse_proxy {{ service.backend }}
{% endif %}
{% endif %}
}
{% endfor %}
# Fallback for unknown hosts
handle {
respond "Unknown service" 404
}
}
# Base domain (ops.eblu.me)
{{ caddy_domain }}:{{ caddy_https_port }} {
tls {
dns gandi {env.GANDI_BEARER_TOKEN}
}
respond "blumeops services - use a subdomain (e.g., forge.{{ caddy_domain }})"
}

View file

@ -0,0 +1,6 @@
#!/bin/bash
# Wrapper script for Caddy that loads the Gandi token from file
# Managed by ansible - do not edit manually
export GANDI_BEARER_TOKEN=$(cat {{ caddy_gandi_token_file }})
exec {{ caddy_binary }} run --config {{ caddy_config_dir }}/Caddyfile

View file

@ -0,0 +1,36 @@
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>mcquack.eblume.caddy</string>
<key>ProgramArguments</key>
<array>
<string>{{ caddy_config_dir }}/caddy-wrapper.sh</string>
</array>
<key>WorkingDirectory</key>
<string>{{ caddy_data_dir }}</string>
<key>EnvironmentVariables</key>
<dict>
<key>XDG_DATA_HOME</key>
<string>/Users/erichblume/.local/share</string>
<key>XDG_CONFIG_HOME</key>
<string>/Users/erichblume/.config</string>
</dict>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>{{ caddy_log_dir }}/mcquack.caddy.out.log</string>
<key>StandardErrorPath</key>
<string>{{ caddy_log_dir }}/mcquack.caddy.err.log</string>
</dict>
</plist>

View file

@ -0,0 +1,10 @@
---
# CV / resume static site (native, replaces minikube Deployment)
# Caddy serves cv_content_dir directly via the static-kind service block.
cv_version: "v1.0.3"
cv_release_url: "https://forge.ops.eblu.me/api/packages/eblume/generic/cv/{{ cv_version }}/cv-{{ cv_version }}.tar.gz"
cv_home: /Users/erichblume/blumeops/cv
cv_content_dir: "{{ cv_home }}/content"
cv_version_sentinel: "{{ cv_home }}/.installed-version"

View file

@ -0,0 +1,57 @@
---
# cv role — download and extract the CV release tarball into cv_content_dir.
# Caddy serves the directory directly; there is no daemon to manage.
#
# Idempotency: a sentinel file records the installed cv_version. The
# download/extract steps only run when the sentinel doesn't match cv_version.
#
# We use curl rather than ansible.builtin.get_url because the forge generic-
# packages endpoint returns 405 on HEAD requests, which get_url issues before
# downloading.
- name: Ensure cv home exists
ansible.builtin.file:
path: "{{ cv_home }}"
state: directory
mode: '0755'
- name: Read installed cv version sentinel
ansible.builtin.slurp:
src: "{{ cv_version_sentinel }}"
register: cv_installed_raw
failed_when: false
changed_when: false
- name: Set installed cv version fact
ansible.builtin.set_fact:
cv_installed_version: >-
{{ (cv_installed_raw.content | b64decode).strip()
if (cv_installed_raw.content is defined) else '' }}
- name: Recreate cv content dir
ansible.builtin.file:
path: "{{ cv_content_dir }}"
state: "{{ item }}"
mode: '0755'
loop:
- absent
- directory
when: cv_installed_version != cv_version
- name: Download and extract cv release tarball
ansible.builtin.shell:
cmd: >-
set -euo pipefail;
curl -fsSL {{ cv_release_url | quote }} -o {{ cv_home }}/cv.tar.gz &&
tar -xzf {{ cv_home }}/cv.tar.gz -C {{ cv_content_dir }} &&
rm -f {{ cv_home }}/cv.tar.gz
executable: /bin/bash
when: cv_installed_version != cv_version
changed_when: true
- name: Write cv version sentinel
ansible.builtin.copy:
content: "{{ cv_version }}\n"
dest: "{{ cv_version_sentinel }}"
mode: '0644'
when: cv_installed_version != cv_version

View file

@ -0,0 +1,21 @@
---
# devpi PyPI caching mirror (native launchd, replaces minikube StatefulSet)
devpi_home: /Users/erichblume/devpi
devpi_venv: "{{ devpi_home }}/venv"
devpi_server_dir: "{{ devpi_home }}/server-dir"
devpi_binary: "{{ devpi_venv }}/bin/devpi-server"
devpi_init_binary: "{{ devpi_venv }}/bin/devpi-init"
devpi_python_version: "3.12"
devpi_server_version: "6.19.3"
devpi_web_version: "5.0.2"
devpi_host: 127.0.0.1
devpi_port: 3141
devpi_outside_url: "https://pypi.ops.eblu.me"
devpi_log_dir: /Users/erichblume/Library/Logs
# uv binary on indri — mise shim so version bumps via `mise upgrade uv` flow through transparently
devpi_uv_binary: /Users/erichblume/.local/share/mise/shims/uv

View file

@ -0,0 +1,6 @@
---
- name: Restart devpi
ansible.builtin.shell: |
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.devpi.plist 2>/dev/null || true
launchctl load ~/Library/LaunchAgents/mcquack.eblume.devpi.plist
changed_when: true

View file

@ -0,0 +1,71 @@
---
# devpi role — devpi-server in a uv-managed venv, run via LaunchAgent.
# Replaces the prior minikube StatefulSet; see [[devpi-on-indri]].
#
# The root password is fetched in the indri.yml playbook pre_tasks and
# exposed as `devpi_root_password`.
- name: Ensure devpi home exists
ansible.builtin.file:
path: "{{ devpi_home }}"
state: directory
mode: '0755'
- name: Ensure devpi server-dir exists
ansible.builtin.file:
path: "{{ devpi_server_dir }}"
state: directory
mode: '0700'
- name: Create devpi venv if missing
ansible.builtin.command:
cmd: "{{ devpi_uv_binary }} venv --python {{ devpi_python_version }} {{ devpi_venv }}"
creates: "{{ devpi_venv }}/bin/python"
- name: Install devpi-server and devpi-web into venv
# Always bootstrap from upstream PyPI — devpi is the index it would otherwise resolve through,
# and that's a circular dependency (devpi cannot install itself from itself).
ansible.builtin.command:
cmd: >-
{{ devpi_uv_binary }} pip install
--python {{ devpi_venv }}/bin/python
--index-url https://pypi.org/simple/
devpi-server=={{ devpi_server_version }}
devpi-web=={{ devpi_web_version }}
register: devpi_pip_install
changed_when: "'Installed' in devpi_pip_install.stdout or 'Uninstalled' in devpi_pip_install.stdout"
notify: Restart devpi
- name: Check if devpi server-dir is initialized
ansible.builtin.stat:
path: "{{ devpi_server_dir }}/.serverversion"
register: devpi_serverversion
- name: Initialize devpi server-dir
ansible.builtin.command:
cmd: >-
{{ devpi_init_binary }}
--serverdir {{ devpi_server_dir }}
--root-passwd {{ devpi_root_password }}
when: not devpi_serverversion.stat.exists
changed_when: true
no_log: true
- name: Deploy devpi LaunchAgent plist
ansible.builtin.template:
src: devpi.plist.j2
dest: ~/Library/LaunchAgents/mcquack.eblume.devpi.plist
mode: '0644'
notify: Restart devpi
- name: Check if devpi LaunchAgent is loaded
ansible.builtin.command: launchctl list mcquack.eblume.devpi
register: devpi_launchctl_check
changed_when: false
failed_when: false
- name: Load devpi LaunchAgent if not loaded
ansible.builtin.command: launchctl load ~/Library/LaunchAgents/mcquack.eblume.devpi.plist
when: devpi_launchctl_check.rc != 0
changed_when: true
failed_when: false

View file

@ -0,0 +1,34 @@
<?xml version="1.0" encoding="UTF-8"?>
<!-- {{ ansible_managed }} -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>mcquack.eblume.devpi</string>
<key>ProgramArguments</key>
<array>
<string>{{ devpi_binary }}</string>
<string>--serverdir</string>
<string>{{ devpi_server_dir }}</string>
<string>--host</string>
<string>{{ devpi_host }}</string>
<string>--port</string>
<string>{{ devpi_port }}</string>
<string>--outside-url</string>
<string>{{ devpi_outside_url }}</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>{{ devpi_venv }}/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
</dict>
<key>StandardOutPath</key>
<string>{{ devpi_log_dir }}/mcquack.devpi.out.log</string>
<key>StandardErrorPath</key>
<string>{{ devpi_log_dir }}/mcquack.devpi.err.log</string>
</dict>
</plist>

View file

@ -0,0 +1,10 @@
---
# Docs (Quartz-built static site) — replaces minikube Deployment.
# Caddy serves docs_content_dir directly via the static-kind service block,
# with Quartz-style try_files (path → path/ → path.html → 404).
docs_version: "v1.17.0"
docs_release_url: "https://forge.eblu.me/eblume/blumeops/releases/download/{{ docs_version }}/docs-{{ docs_version }}.tar.gz"
docs_home: /Users/erichblume/blumeops/docs
docs_content_dir: "{{ docs_home }}/content"
docs_version_sentinel: "{{ docs_home }}/.installed-version"

View file

@ -0,0 +1,57 @@
---
# docs role — download and extract the Quartz-built docs tarball into
# docs_content_dir. Caddy serves the directory directly with Quartz-style
# try_files; there is no daemon to manage.
#
# Idempotency: a sentinel file records the installed docs_version. The
# download/extract steps only run when the sentinel doesn't match docs_version.
#
# Mirrors the cv role's curl-based download for consistency, even though the
# forge releases endpoint here does support HEAD.
- name: Ensure docs home exists
ansible.builtin.file:
path: "{{ docs_home }}"
state: directory
mode: '0755'
- name: Read installed docs version sentinel
ansible.builtin.slurp:
src: "{{ docs_version_sentinel }}"
register: docs_installed_raw
failed_when: false
changed_when: false
- name: Set installed docs version fact
ansible.builtin.set_fact:
docs_installed_version: >-
{{ (docs_installed_raw.content | b64decode).strip()
if (docs_installed_raw.content is defined) else '' }}
- name: Recreate docs content dir
ansible.builtin.file:
path: "{{ docs_content_dir }}"
state: "{{ item }}"
mode: '0755'
loop:
- absent
- directory
when: docs_installed_version != docs_version
- name: Download and extract docs release tarball
ansible.builtin.shell:
cmd: >-
set -euo pipefail;
curl -fsSL {{ docs_release_url | quote }} -o {{ docs_home }}/docs.tar.gz &&
tar -xzf {{ docs_home }}/docs.tar.gz -C {{ docs_content_dir }} &&
rm -f {{ docs_home }}/docs.tar.gz
executable: /bin/bash
when: docs_installed_version != docs_version
changed_when: true
- name: Write docs version sentinel
ansible.builtin.copy:
content: "{{ docs_version }}\n"
dest: "{{ docs_version_sentinel }}"
mode: '0644'
when: docs_installed_version != docs_version

View file

@ -4,22 +4,27 @@
forgejo_app_name: Forgejo
forgejo_app_slogan: "Beyond coding. We Forge."
forgejo_run_user: forgejo
forgejo_run_user: erichblume
forgejo_run_mode: prod
# Paths (brew-managed for now, will change to mcquack in Phase 3)
forgejo_work_path: /opt/homebrew/var/forgejo
# Source build paths
forgejo_repo_dir: /Users/erichblume/code/3rd/forgejo
forgejo_binary: "{{ forgejo_repo_dir }}/forgejo"
# Data paths (migrated from brew to ~/forgejo)
forgejo_work_path: /Users/erichblume/forgejo
forgejo_config_path: "{{ forgejo_work_path }}/custom/conf/app.ini"
forgejo_data_path: "{{ forgejo_work_path }}/data"
forgejo_repo_root: "{{ forgejo_data_path }}/forgejo-repositories"
forgejo_lfs_path: "{{ forgejo_data_path }}/lfs"
forgejo_log_path: "{{ forgejo_work_path }}/log"
forgejo_log_dir: /Users/erichblume/Library/Logs
# Server settings
forgejo_http_addr: 0.0.0.0
forgejo_http_port: 3001
forgejo_domain: forge.tail8d86e.ts.net
forgejo_ssh_domain: "{{ forgejo_domain }}"
forgejo_domain: forge.eblu.me
forgejo_ssh_domain: forge.ops.eblu.me
forgejo_root_url: "https://{{ forgejo_domain }}/"
forgejo_offline_mode: true
@ -27,7 +32,7 @@ forgejo_offline_mode: true
forgejo_disable_ssh: false
forgejo_start_ssh_server: true
forgejo_builtin_ssh_user: forgejo
forgejo_ssh_port: 22
forgejo_ssh_port: 2222
forgejo_ssh_listen_port: 2200
forgejo_lfs_start_server: true

View file

@ -1,4 +1,6 @@
---
- name: Restart forgejo
ansible.builtin.command: brew services restart forgejo
ansible.builtin.shell: |
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist 2>/dev/null || true
launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist
changed_when: true

View file

@ -1,16 +1,34 @@
---
# Forgejo role
# Forgejo role — source-built binary with LaunchAgent
#
# Currently uses brew-managed forgejo. Phase 3 of ci-cd-bootstrap will
# transition to mcquack LaunchAgent with CI-built binary.
# ONE-TIME SETUP (before running ansible):
#
# 1. Clone forgejo from codeberg (avoid circular dependency):
# ssh indri 'git clone https://codeberg.org/forgejo/forgejo.git ~/code/3rd/forgejo'
#
# 2. Add forge mirror as secondary remote:
# ssh indri 'cd ~/code/3rd/forgejo && git remote add forge https://forge.eblu.me/mirrors/forgejo.git'
#
# 3. Build (mise.toml handles Go/Node versions and build tags):
# ssh indri 'cd ~/code/3rd/forgejo && mise run build'
#
# 4. Run ansible to deploy config and LaunchAgent
#
# Secrets (lfs_jwt_secret, internal_token, oauth2_jwt_secret) are fetched
# from 1Password in the playbook pre_tasks.
- name: Install forgejo via homebrew
community.general.homebrew:
name: forgejo
state: present
- name: Verify forgejo binary exists
ansible.builtin.stat:
path: "{{ forgejo_binary }}"
register: forgejo_binary_stat
- name: Fail if forgejo binary not found
ansible.builtin.fail:
msg: |
Forgejo binary not found at {{ forgejo_binary }}.
Please build from source first:
ssh indri 'cd ~/code/3rd/forgejo && mise run build'
when: not forgejo_binary_stat.stat.exists
- name: Ensure forgejo config directory exists
ansible.builtin.file:
@ -25,8 +43,21 @@
mode: '0600'
notify: Restart forgejo
- name: Ensure forgejo service is started
ansible.builtin.command: brew services start forgejo
register: forgejo_brew_start
changed_when: "'Successfully started' in forgejo_brew_start.stdout"
- name: Deploy forgejo LaunchAgent plist
ansible.builtin.template:
src: forgejo.plist.j2
dest: ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist
mode: '0644'
notify: Restart forgejo
- name: Check if forgejo LaunchAgent is loaded
ansible.builtin.command: launchctl list mcquack.eblume.forgejo
register: forgejo_launchctl_check
changed_when: false
failed_when: false
- name: Load forgejo LaunchAgent if not loaded
ansible.builtin.command: launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist
when: forgejo_launchctl_check.rc != 0
changed_when: true
failed_when: false

View file

@ -20,6 +20,8 @@ SSH_LISTEN_PORT = {{ forgejo_ssh_listen_port }}
LFS_START_SERVER = {{ forgejo_lfs_start_server | lower }}
LFS_JWT_SECRET = {{ forgejo_lfs_jwt_secret }}
OFFLINE_MODE = {{ forgejo_offline_mode | lower }}
REVERSE_PROXY_LIMIT = 2
REVERSE_PROXY_TRUSTED_PROXIES = *
[database]
DB_TYPE = {{ forgejo_db_type }}
@ -40,7 +42,7 @@ ENABLED = false
REGISTER_EMAIL_CONFIRM = false
ENABLE_NOTIFY_MAIL = false
DISABLE_REGISTRATION = {{ forgejo_disable_registration | lower }}
ALLOW_ONLY_EXTERNAL_REGISTRATION = false
ALLOW_ONLY_EXTERNAL_REGISTRATION = true
ENABLE_CAPTCHA = false
REQUIRE_SIGNIN_VIEW = {{ forgejo_require_signin_view | lower }}
DEFAULT_KEEP_EMAIL_PRIVATE = false
@ -52,9 +54,19 @@ NO_REPLY_ADDRESS = noreply.indri
ENABLE_OPENID_SIGNIN = false
ENABLE_OPENID_SIGNUP = false
[mirror]
DEFAULT_INTERVAL = 8h
MIN_INTERVAL = 10m
[cron.update_checker]
ENABLED = false
[cron.archive_cleanup]
ENABLED = true
RUN_AT_START = true
SCHEDULE = @midnight
OLDER_THAN = 2h
[session]
PROVIDER = {{ forgejo_session_provider }}
@ -77,6 +89,17 @@ PASSWORD_HASH_ALGO = pbkdf2_hi
[oauth2]
JWT_SECRET = {{ forgejo_oauth2_jwt_secret }}
[oauth2_client]
ENABLE_AUTO_REGISTRATION = true
ACCOUNT_LINKING = login
USERNAME = nickname
REGISTER_EMAIL_CONFIRM = false
[metrics]
ENABLED = true
ENABLED_ISSUE_BY_LABEL = false
ENABLED_ISSUE_BY_REPOSITORY = false
[actions]
ENABLED = {{ forgejo_actions_enabled | lower }}
DEFAULT_ACTIONS_URL = {{ forgejo_actions_default_url }}

View file

@ -0,0 +1,26 @@
<?xml version="1.0" encoding="UTF-8"?>
<!-- {{ ansible_managed }} -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>mcquack.eblume.forgejo</string>
<key>ProgramArguments</key>
<array>
<string>{{ forgejo_binary }}</string>
<string>-w</string>
<string>{{ forgejo_work_path }}</string>
<string>-c</string>
<string>{{ forgejo_config_path }}</string>
<string>web</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>{{ forgejo_log_dir }}/mcquack.forgejo.out.log</string>
<key>StandardErrorPath</key>
<string>{{ forgejo_log_dir }}/mcquack.forgejo.err.log</string>
</dict>
</plist>

View file

@ -0,0 +1,24 @@
---
# Forgejo Actions Secrets role configuration
#
# This role syncs repository-level Actions secrets from 1Password to Forgejo
# via the Forgejo API.
forgejo_actions_secrets_api_url: "https://forge.eblu.me/api/v1"
forgejo_actions_secrets_owner: eblume
# Secrets to sync per repo.
# Each entry: {repo: "name", secrets: [{name: "SECRET_NAME", value_var: "ansible_fact_name"}]}
forgejo_actions_secrets_repos:
- repo: blumeops
secrets:
- name: ARGOCD_AUTH_TOKEN
value_var: forgejo_secret_argocd_token
- name: FLY_DEPLOY_TOKEN
value_var: forgejo_secret_fly_deploy_token
- name: ZOT_CI_API_KEY
value_var: forgejo_secret_zot_ci_api_key
- repo: cv
secrets:
- name: FORGE_TOKEN
value_var: forgejo_api_token

View file

@ -0,0 +1,32 @@
---
# Forgejo Actions Secrets role
#
# Syncs repository-level Actions secrets from 1Password to Forgejo via API.
#
# NOTE: This role runs on indri, which is also where Forgejo runs. The API
# calls go from indri back to itself (via the public URL through Caddy).
# This is intentional - it keeps the role simple and uses the same URL
# that workflows use.
#
# Secrets (forgejo_api_token, forgejo_secret_*) are fetched from 1Password
# in the playbook pre_tasks to minimize password prompts during provisioning.
- name: Sync Actions secrets to Forgejo
ansible.builtin.uri:
url: "{{ forgejo_actions_secrets_api_url }}/repos/{{ forgejo_actions_secrets_owner }}/{{ item.0.repo }}/actions/secrets/{{ item.1.name }}"
method: PUT
headers:
Authorization: "token {{ forgejo_api_token }}"
Content-Type: "application/json"
body_format: json
body:
data: "{{ lookup('vars', item.1.value_var) }}"
status_code: [201, 204]
register: forgejo_actions_secrets_result
# API returns 201 for create, 204 for update. We can't check if value changed
# (secrets are write-only), so only report changed when creating new secrets.
changed_when: forgejo_actions_secrets_result.status == 201
loop: "{{ forgejo_actions_secrets_repos | subelements('secrets') }}"
loop_control:
label: "{{ item.0.repo }}/{{ item.1.name }}"
no_log: true

View file

@ -0,0 +1,20 @@
---
# Forgejo metrics collection configuration
# Forgejo server URL
forgejo_metrics_url: "http://localhost:3001"
# Path to file containing Forgejo API token (should have 600 permissions)
forgejo_metrics_api_key_file: "/Users/erichblume/.forgejo-api-key"
# Metrics collection interval in seconds
forgejo_metrics_interval: 60
# Output directory for prometheus textfile collector
forgejo_metrics_dir: /opt/homebrew/var/node_exporter/textfile
# Script installation path
forgejo_metrics_script: /Users/erichblume/.local/bin/forgejo-metrics
# Log directory for metrics script output
forgejo_metrics_log_dir: /opt/homebrew/var/log

View file

@ -0,0 +1,6 @@
---
- name: Reload forgejo-metrics
ansible.builtin.shell: |
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.forgejo-metrics.plist 2>/dev/null || true
launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo-metrics.plist
changed_when: true

View file

@ -0,0 +1,55 @@
---
- name: Fetch Forgejo API token (when running with --tags forgejo_metrics)
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/w3663ffnvkewbftncqxtcpeavy/api-token"
delegate_to: localhost
register: forgejo_metrics_api_key_fallback
changed_when: false
no_log: true
check_mode: false
when: forgejo_metrics_api_key is not defined
- name: Set Forgejo API token fact (fallback)
ansible.builtin.set_fact:
forgejo_metrics_api_key: "{{ forgejo_metrics_api_key_fallback.stdout }}"
no_log: true
when: forgejo_metrics_api_key is not defined
- name: Write Forgejo API token file
ansible.builtin.copy:
content: "{{ forgejo_metrics_api_key }}"
dest: "{{ forgejo_metrics_api_key_file }}"
mode: '0600'
no_log: true
- name: Ensure bin directory exists
ansible.builtin.file:
path: "{{ forgejo_metrics_script | dirname }}"
state: directory
mode: '0755'
- name: Deploy forgejo metrics collection script
ansible.builtin.template:
src: forgejo-metrics.sh.j2
dest: "{{ forgejo_metrics_script }}"
mode: '0755'
notify: Reload forgejo-metrics
- name: Deploy forgejo-metrics LaunchAgent plist
ansible.builtin.template:
src: forgejo-metrics.plist.j2
dest: ~/Library/LaunchAgents/mcquack.eblume.forgejo-metrics.plist
mode: '0644'
notify: Reload forgejo-metrics
- name: Check if forgejo-metrics LaunchAgent is loaded
ansible.builtin.command: launchctl list mcquack.eblume.forgejo-metrics
register: forgejo_metrics_launchctl_check
changed_when: false
failed_when: false
- name: Load forgejo-metrics LaunchAgent if not loaded
ansible.builtin.command: launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo-metrics.plist
when: forgejo_metrics_launchctl_check.rc != 0
changed_when: true
failed_when: false

View file

@ -0,0 +1,26 @@
<?xml version="1.0" encoding="UTF-8"?>
<!-- {{ ansible_managed }} -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>mcquack.eblume.forgejo-metrics</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/opt/homebrew/bin:/usr/bin:/bin</string>
</dict>
<key>ProgramArguments</key>
<array>
<string>{{ forgejo_metrics_script }}</string>
</array>
<key>StartInterval</key>
<integer>{{ forgejo_metrics_interval }}</integer>
<key>RunAtLoad</key>
<true/>
<key>StandardErrorPath</key>
<string>{{ forgejo_metrics_log_dir }}/mcquack.forgejo-metrics.err.log</string>
<key>StandardOutPath</key>
<string>{{ forgejo_metrics_log_dir }}/mcquack.forgejo-metrics.out.log</string>
</dict>
</plist>

View file

@ -0,0 +1,162 @@
#!/bin/bash
# {{ ansible_managed }}
# Collects Forgejo repository health metrics for node_exporter textfile collector
set -euo pipefail
FORGEJO_URL="{{ forgejo_metrics_url }}"
API_KEY_FILE="{{ forgejo_metrics_api_key_file }}"
OUTPUT_FILE="{{ forgejo_metrics_dir }}/forgejo.prom"
TEMP_FILE="${OUTPUT_FILE}.tmp"
TOKEN=$(cat "$API_KEY_FILE" 2>/dev/null | tr -d '\n' || true)
# Authenticated API request; returns empty string on failure
api() {
curl -sf -H "Authorization: token ${TOKEN}" -H "Accept: application/json" \
"${FORGEJO_URL}/api/v1${1}" 2>/dev/null || echo ""
}
# jq helper: convert ISO 8601 timestamp (with any tz offset) to epoch seconds
# jq's fromdate only handles Z, so we parse the offset and apply it manually
JQ_EPOCH='def epoch: sub("[.][0-9]+"; "") | if test("[+-][0-9]{2}:[0-9]{2}$") then capture("^(?<dt>.*)(?<sign>[+-])(?<h>[0-9]{2}):(?<m>[0-9]{2})$") | (.dt + "Z" | fromdate) as $base | ((.h | tonumber) * 3600 + (.m | tonumber) * 60) as $off | if .sign == "-" then $base + $off else $base - $off end else sub("Z$"; "") + "Z" | fromdate end;'
forgejo_up=0
if curl -sf "${FORGEJO_URL}/api/v1/version" >/dev/null 2>&1; then
forgejo_up=1
fi
{
# --- Metric type declarations ---
cat << 'HEADER'
# HELP forgejo_up Forgejo server is up and responding
# TYPE forgejo_up gauge
# HELP forgejo_repo_open_pull_requests Number of open pull requests
# TYPE forgejo_repo_open_pull_requests gauge
# HELP forgejo_repo_open_issues Number of open issues
# TYPE forgejo_repo_open_issues gauge
# HELP forgejo_repo_language_bytes Repository language size in bytes
# TYPE forgejo_repo_language_bytes gauge
# HELP forgejo_repo_releases_total Total number of releases
# TYPE forgejo_repo_releases_total gauge
# HELP forgejo_repo_latest_release_timestamp_seconds Unix timestamp of the latest release
# TYPE forgejo_repo_latest_release_timestamp_seconds gauge
# HELP forgejo_repo_latest_commit_timestamp_seconds Unix timestamp of the latest commit on default branch
# TYPE forgejo_repo_latest_commit_timestamp_seconds gauge
# HELP forgejo_actions_runs_total Action runs by status from most recent 30
# TYPE forgejo_actions_runs_total gauge
# HELP forgejo_actions_run_duration_seconds Duration of the latest completed run per workflow in seconds
# TYPE forgejo_actions_run_duration_seconds gauge
# HELP forgejo_actions_last_success_timestamp_seconds Unix timestamp of last successful run per workflow
# TYPE forgejo_actions_last_success_timestamp_seconds gauge
# HELP forgejo_actions_jobs_waiting Number of action runs currently waiting or queued
# TYPE forgejo_actions_jobs_waiting gauge
# HELP forgejo_actions_jobs_running Number of action runs currently in progress
# TYPE forgejo_actions_jobs_running gauge
HEADER
echo "forgejo_up ${forgejo_up}"
if [ "$forgejo_up" -eq 1 ] && [ -n "$TOKEN" ]; then
# Discover all repos accessible to the token owner
repos_json=$(api "/repos/search?limit=50")
[ -z "$repos_json" ] && repos_json='{"data":[]}'
repo_count=$(echo "$repos_json" | jq '.data | length' 2>/dev/null || echo "0")
for i in $(seq 0 $((repo_count - 1))); do
repo_data=$(echo "$repos_json" | jq ".data[$i]")
full_name=$(echo "$repo_data" | jq -r '.full_name')
[ -z "$full_name" ] || [ "$full_name" = "null" ] && continue
r="$full_name"
# Basic repo metrics (from search results — no extra API call)
echo "forgejo_repo_open_pull_requests{repo=\"${r}\"} $(echo "$repo_data" | jq '.open_pr_counter // 0')"
echo "forgejo_repo_open_issues{repo=\"${r}\"} $(echo "$repo_data" | jq '.open_issues_count // 0')"
default_branch=$(echo "$repo_data" | jq -r '.default_branch // "main"')
# --- Languages ---
langs=$(api "/repos/${r}/languages")
if [ -n "$langs" ] && echo "$langs" | jq -e 'type == "object" and length > 0' >/dev/null 2>&1; then
echo "$langs" | jq -r --arg r "$r" \
'to_entries[] | "forgejo_repo_language_bytes{repo=\"\($r)\",language=\"\(.key)\"} \(.value)"' \
2>/dev/null || true
fi
# --- Releases ---
releases=$(api "/repos/${r}/releases?limit=50")
if [ -n "$releases" ] && echo "$releases" | jq -e 'type == "array"' >/dev/null 2>&1; then
echo "forgejo_repo_releases_total{repo=\"${r}\"} $(echo "$releases" | jq 'length')"
# Latest release timestamp and version
echo "$releases" | jq -r --arg r "$r" "${JQ_EPOCH}"'
if length > 0 then
.[0] |
"forgejo_repo_latest_release_timestamp_seconds{repo=\"\($r)\",version=\"\(.tag_name)\"} \((.published_at // .created_at // .created) | epoch)"
else empty end' 2>/dev/null || true
else
echo "forgejo_repo_releases_total{repo=\"${r}\"} 0"
fi
# --- Latest commit on default branch ---
commits=$(api "/repos/${r}/commits?limit=1&sha=${default_branch}")
if [ -n "$commits" ] && echo "$commits" | jq -e 'type == "array" and length > 0' >/dev/null 2>&1; then
echo "$commits" | jq -r --arg r "$r" "${JQ_EPOCH}"'
.[0] |
"forgejo_repo_latest_commit_timestamp_seconds{repo=\"\($r)\"} \((.created // .commit.committer.date) | epoch)"' \
2>/dev/null || true
fi
# --- Action runs ---
runs_json=$(api "/repos/${r}/actions/runs?limit=30")
if [ -n "$runs_json" ] && echo "$runs_json" | jq -e '.workflow_runs | type == "array"' >/dev/null 2>&1; then
# Count by status
echo "$runs_json" | jq -r --arg r "$r" '
.workflow_runs | group_by(.status) | .[] |
"forgejo_actions_runs_total{repo=\"\($r)\",status=\"\(.[0].status)\"} \(length)"' \
2>/dev/null || true
# Jobs waiting/running
waiting=$(echo "$runs_json" | jq '[.workflow_runs[] | select(.status == "waiting" or .status == "queued")] | length' 2>/dev/null || echo "0")
running=$(echo "$runs_json" | jq '[.workflow_runs[] | select(.status == "running")] | length' 2>/dev/null || echo "0")
echo "forgejo_actions_jobs_waiting{repo=\"${r}\"} ${waiting}"
echo "forgejo_actions_jobs_running{repo=\"${r}\"} ${running}"
# Discover current workflow files on the default branch (.forgejo/ or .github/)
current_wfs=""
for wf_dir in .forgejo/workflows .github/workflows; do
wf_list=$(api "/repos/${r}/contents/${wf_dir}?ref=${default_branch}")
if [ -n "$wf_list" ] && echo "$wf_list" | jq -e 'type == "array"' >/dev/null 2>&1; then
current_wfs=$(echo "$wf_list" | jq -r '[.[].name] | join(",")' 2>/dev/null || true)
break
fi
done
# Per-workflow: latest completed run duration and last success timestamp
# Only include workflows that currently exist on the default branch
# Forgejo fields: workflow_id (filename), created/stopped, duration (nanoseconds)
if [ -n "$current_wfs" ]; then
echo "$runs_json" | jq -r --arg r "$r" --arg wfs "$current_wfs" "${JQ_EPOCH}"'
($wfs | split(",")) as $current |
[.workflow_runs[] | select((.status == "success" or .status == "failure") and (.workflow_id | IN($current[])))] |
if length > 0 then
group_by(.workflow_id) | .[] |
(sort_by(.created) | reverse) as $sorted |
($sorted[0]) as $latest |
($latest.workflow_id | sub("[.]ya?ml$"; "")) as $wf |
"forgejo_actions_run_duration_seconds{repo=\"\($r)\",workflow=\"\($wf)\"} \(($latest.duration // 0) / 1000000000 | floor)",
([$sorted[] | select(.status == "success")] |
if length > 0 then
.[0] as $last_ok |
"forgejo_actions_last_success_timestamp_seconds{repo=\"\($r)\",workflow=\"\($wf)\"} \($last_ok.stopped | epoch)"
else empty end)
else empty end' 2>/dev/null || true
fi
fi
done
fi
} > "$TEMP_FILE"
# Atomic move
mv "$TEMP_FILE" "$OUTPUT_FILE"

View file

@ -0,0 +1,49 @@
---
# hephaestus hub — the canonical heph replica (server mode) on indri.
# Other devices (e.g. gilbert) are spokes that sync against this hub.
# See [[set-up-sync-hub]] and [[host-heph-pwa]] in the hephaestus repo.
# Pinned release used for the initial `cargo install` and the PWA shell.
# After bootstrap, hephd's own --self-update keeps the binary current; this
# pin only governs the first install and the bundled PWA shell version.
heph_version: v1.2.1
# Anonymous public HTTPS clone — matches hephd's INSTALL_GIT_URL so the initial
# install and unattended self-update build from the same source (no ssh-agent).
heph_repo_url: https://forge.eblu.me/eblume/hephaestus.git
heph_bin_dir: /Users/erichblume/.cargo/bin
heph_binary: "{{ heph_bin_dir }}/hephd"
# rustc/cargo here are rustup shims. The bare (non-mise) environment that the
# launchagent and ansible run in falls back to rustup's *default* toolchain,
# which can lag behind heph's rust-version floor (Cargo.toml: 1.89). Pin the
# channel explicitly so both the bootstrap build and unattended self-update
# always use a current toolchain regardless of the host's rustup default.
heph_rust_toolchain: stable
heph_data_dir: /Users/erichblume/.local/share/heph
heph_db: "{{ heph_data_dir }}/heph.db"
heph_socket: "{{ heph_data_dir }}/hephd.sock"
heph_log_dir: /Users/erichblume/Library/Logs
# Version-pinned source checkout; the PWA static shell is served directly from
# its heph-pwa/ subdir (no copy), keeping shell and hub in lockstep at heph_version.
heph_pwa_src_dir: /Users/erichblume/.cache/heph-pwa-src
heph_web_root: "{{ heph_pwa_src_dir }}/heph-pwa"
# Hub listens on all interfaces so tailnet spokes can reach it directly
# (http://indri.tail8d86e.ts.net:8787) and Caddy can proxy heph.ops.eblu.me.
# Access is gated by Authentik OIDC regardless — tailnet reachability is not
# enough (this is the owner's most sensitive data).
heph_http_addr: 0.0.0.0:8787
heph_port: 8787
heph_external_url: https://heph.ops.eblu.me
# Authentik OIDC — issuer + audience together turn hub auth on. The audience is
# the device-code client id (see argocd/manifests/authentik heph blueprint).
heph_oidc_issuer: https://authentik.ops.eblu.me/application/o/heph/
heph_oidc_audience: heph
# Self-update poll interval (seconds). 10 minutes.
heph_self_update_interval_secs: 600

View file

@ -0,0 +1,6 @@
---
- name: Restart heph
ansible.builtin.shell: |
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.heph.plist 2>/dev/null || true
launchctl load ~/Library/LaunchAgents/mcquack.eblume.heph.plist
changed_when: true

View file

@ -0,0 +1,82 @@
---
# hephaestus hub (server mode) on indri.
#
# DATA SEEDING (one-time, Path A — do this BEFORE the first provision so the hub
# adopts gilbert's existing data instead of being born empty):
#
# 1. On the seed device (gilbert): heph daemon stop
# 2. Copy its store to indri: scp ~/.local/share/heph/heph.db \
# indri:~/.local/share/heph/heph.db
# 3. On indri, give the hub its OWN device origin (keeps gilbert's owner_id +
# data; hephd regenerates a fresh origin on next start when it is missing):
# sqlite3 ~/.local/share/heph/heph.db "DELETE FROM meta WHERE key='origin';"
# 4. Run this role (installs hephd, stages the PWA, loads the launchagent).
#
# hephd auto-creates an empty store on first start if none exists, so seeding is
# optional — skip it only if you intend a fresh, empty hub.
- name: Ensure heph data directory exists
ansible.builtin.file:
path: "{{ heph_data_dir }}"
state: directory
mode: '0700'
- name: Check for installed hephd binary
ansible.builtin.stat:
path: "{{ heph_binary }}"
register: heph_binary_stat
# Bootstrap install only when hephd is absent. Thereafter hephd's own
# --self-update keeps it current; ansible must not fight (or downgrade) it.
# This builds from source and can take several minutes on a cold cargo cache.
- name: Bootstrap-install heph + hephd from the forge ({{ heph_version }})
ansible.builtin.command:
cmd: >-
{{ heph_bin_dir }}/cargo install --locked
--git {{ heph_repo_url }}
--tag {{ heph_version }}
heph hephd
environment:
PATH: "{{ heph_bin_dir }}:/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin"
RUSTUP_TOOLCHAIN: "{{ heph_rust_toolchain }}"
when: not heph_binary_stat.stat.exists
changed_when: true
notify: Restart heph
# Checkout provides the PWA shell at {{ heph_web_root }} (heph-pwa/ subdir),
# served directly by hephd. Static files are read from disk per request, so a
# version bump needs no restart; the service worker (CACHE = "heph-pwa-vN")
# evicts stale assets on next load.
- name: Ensure heph cache parent directory exists
ansible.builtin.file:
path: "{{ heph_pwa_src_dir | dirname }}"
state: directory
mode: '0755'
- name: Stage heph-pwa source at {{ heph_version }}
ansible.builtin.git:
repo: "{{ heph_repo_url }}"
dest: "{{ heph_pwa_src_dir }}"
version: "{{ heph_version }}"
depth: 1
single_branch: true
force: true
- name: Deploy heph LaunchAgent plist
ansible.builtin.template:
src: heph.plist.j2
dest: ~/Library/LaunchAgents/mcquack.eblume.heph.plist
mode: '0644'
notify: Restart heph
- name: Check if heph LaunchAgent is loaded
ansible.builtin.command: launchctl list mcquack.eblume.heph
register: heph_launchctl_check
changed_when: false
failed_when: false
- name: Load heph LaunchAgent if not loaded
ansible.builtin.command: launchctl load ~/Library/LaunchAgents/mcquack.eblume.heph.plist
when: heph_launchctl_check.rc != 0
changed_when: true
failed_when: false

View file

@ -0,0 +1,50 @@
<?xml version="1.0" encoding="UTF-8"?>
<!-- {{ ansible_managed }} -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>mcquack.eblume.heph</string>
<key>ProgramArguments</key>
<array>
<string>{{ heph_binary }}</string>
<string>--mode</string>
<string>server</string>
<string>--http-addr</string>
<string>{{ heph_http_addr }}</string>
<string>--db</string>
<string>{{ heph_db }}</string>
<string>--socket</string>
<string>{{ heph_socket }}</string>
<string>--web-root</string>
<string>{{ heph_web_root }}</string>
<string>--oidc-issuer</string>
<string>{{ heph_oidc_issuer }}</string>
<string>--oidc-audience</string>
<string>{{ heph_oidc_audience }}</string>
<string>--self-update</string>
<string>--self-update-interval-secs</string>
<string>{{ heph_self_update_interval_secs }}</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>EnvironmentVariables</key>
<dict>
<!-- cargo + toolchain on PATH so --self-update can run `cargo install`. -->
<key>PATH</key>
<string>{{ heph_bin_dir }}:/opt/homebrew/bin:/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
<key>HOME</key>
<string>/Users/erichblume</string>
<!-- Pin the rustup channel: the launchagent runs without mise, so a bare
cargo shim would otherwise use rustup's (stale) default toolchain. -->
<key>RUSTUP_TOOLCHAIN</key>
<string>{{ heph_rust_toolchain }}</string>
</dict>
<key>StandardOutPath</key>
<string>{{ heph_log_dir }}/mcquack.heph.out.log</string>
<key>StandardErrorPath</key>
<string>{{ heph_log_dir }}/mcquack.heph.err.log</string>
</dict>
</plist>

View file

@ -0,0 +1,30 @@
---
# Jellyfin media server configuration
# Port Jellyfin listens on
jellyfin_port: 8096
# Data directory (standard macOS location)
jellyfin_data_dir: "{{ ansible_env.HOME }}/Library/Application Support/jellyfin"
# Media path (NFS mount from sifaka)
jellyfin_media_path: /Volumes/allisonflix
# Homebrew cask application path
jellyfin_cask_app_path: /Applications/Jellyfin.app
# Binary path inside the cask app
jellyfin_binary: "{{ jellyfin_cask_app_path }}/Contents/MacOS/jellyfin"
# Web client path (different from binary location in Homebrew cask)
jellyfin_webdir: "{{ jellyfin_cask_app_path }}/Contents/Resources/jellyfin-web"
# Log directory
jellyfin_log_dir: "{{ ansible_env.HOME }}/Library/Logs"
# SSO plugin configuration
jellyfin_sso_plugin_version: "4.0.0.3"
jellyfin_sso_client_id: jellyfin
jellyfin_sso_client_secret: ""
jellyfin_sso_provider_name: authentik
jellyfin_plugins_dir: "{{ jellyfin_data_dir }}/plugins"

View file

@ -0,0 +1,6 @@
---
- name: Reload jellyfin
ansible.builtin.shell: |
launchctl unload ~/Library/LaunchAgents/mcquack.jellyfin.plist 2>/dev/null || true
launchctl load ~/Library/LaunchAgents/mcquack.jellyfin.plist
changed_when: true

View file

@ -0,0 +1,77 @@
---
- name: Install Jellyfin via Homebrew cask
community.general.homebrew_cask:
name: jellyfin
state: present
- name: Ensure Jellyfin data directory exists
ansible.builtin.file:
path: "{{ jellyfin_data_dir }}"
state: directory
mode: '0755'
- name: Deploy Jellyfin LaunchAgent plist
ansible.builtin.template:
src: mcquack.jellyfin.plist.j2
dest: ~/Library/LaunchAgents/mcquack.jellyfin.plist
mode: '0644'
notify: Reload jellyfin
- name: Check if Jellyfin LaunchAgent is loaded
ansible.builtin.command: launchctl list mcquack.jellyfin
register: jellyfin_launchctl_check
changed_when: false
failed_when: false
- name: Load Jellyfin LaunchAgent if not loaded
ansible.builtin.command: launchctl load ~/Library/LaunchAgents/mcquack.jellyfin.plist
when: jellyfin_launchctl_check.rc != 0
changed_when: true
failed_when: false
# SSO plugin installation
- name: Ensure SSO-Auth plugin directory exists
ansible.builtin.file:
path: "{{ jellyfin_plugins_dir }}/SSO-Auth_{{ jellyfin_sso_plugin_version }}"
state: directory
mode: '0755'
- name: Download SSO-Auth plugin archive
ansible.builtin.get_url:
url: "https://github.com/9p4/jellyfin-plugin-sso/releases/download/v{{ jellyfin_sso_plugin_version }}/sso-authentication_{{ jellyfin_sso_plugin_version }}.zip"
dest: "/tmp/sso-authentication_{{ jellyfin_sso_plugin_version }}.zip"
mode: '0644'
- name: Extract SSO-Auth plugin
ansible.builtin.unarchive:
src: "/tmp/sso-authentication_{{ jellyfin_sso_plugin_version }}.zip"
dest: "{{ jellyfin_plugins_dir }}/SSO-Auth_{{ jellyfin_sso_plugin_version }}"
remote_src: true
notify: Reload jellyfin
- name: Ensure plugin configurations directory exists
ansible.builtin.file:
path: "{{ jellyfin_plugins_dir }}/configurations"
state: directory
mode: '0755'
- name: Deploy SSO-Auth plugin configuration
ansible.builtin.template:
src: sso-auth.xml.j2
dest: "{{ jellyfin_plugins_dir }}/configurations/SSO-Auth.xml"
mode: '0644'
notify: Reload jellyfin
# Branding — add SSO login button to login page
- name: Ensure Jellyfin config directory exists
ansible.builtin.file:
path: "{{ jellyfin_data_dir }}/config"
state: directory
mode: '0755'
- name: Deploy Jellyfin branding configuration
ansible.builtin.template:
src: branding.xml.j2
dest: "{{ jellyfin_data_dir }}/config/branding.xml"
mode: '0644'
notify: Reload jellyfin

View file

@ -0,0 +1,7 @@
<?xml version="1.0" encoding="utf-8"?>
<!-- {{ ansible_managed }} -->
<BrandingOptions xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<LoginDisclaimer>&lt;form action="/sso/OID/start/{{ jellyfin_sso_provider_name }}"&gt;&lt;button class="raised block emby-button button-submit" type="submit" style="margin:2em 0"&gt;Sign in with Authentik&lt;/button&gt;&lt;/form&gt;</LoginDisclaimer>
<CustomCss />
<SplashscreenEnabled>false</SplashscreenEnabled>
</BrandingOptions>

View file

@ -0,0 +1,33 @@
<?xml version="1.0" encoding="UTF-8"?>
<!-- {{ ansible_managed }} -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>mcquack.jellyfin</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/opt/homebrew/bin:/usr/bin:/bin</string>
</dict>
<key>ProgramArguments</key>
<array>
<string>{{ jellyfin_binary }}</string>
<string>--service</string>
<string>--datadir</string>
<string>{{ jellyfin_data_dir }}</string>
<string>--webdir</string>
<string>{{ jellyfin_webdir }}</string>
</array>
<key>WorkingDirectory</key>
<string>{{ jellyfin_data_dir }}</string>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardErrorPath</key>
<string>{{ jellyfin_log_dir }}/mcquack.jellyfin.err.log</string>
<key>StandardOutPath</key>
<string>{{ jellyfin_log_dir }}/mcquack.jellyfin.out.log</string>
</dict>
</plist>

View file

@ -0,0 +1,33 @@
<?xml version="1.0" encoding="utf-8"?>
<!-- {{ ansible_managed }} -->
<PluginConfiguration xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<SamlConfigs />
<OidConfigs>
<item>
<key><string>{{ jellyfin_sso_provider_name }}</string></key>
<value>
<PluginConfiguration>
<OidEndpoint>https://authentik.ops.eblu.me/application/o/jellyfin</OidEndpoint>
<OidClientId>{{ jellyfin_sso_client_id }}</OidClientId>
<OidSecret>{{ jellyfin_sso_client_secret }}</OidSecret>
<Enabled>true</Enabled>
<EnableAuthorization>true</EnableAuthorization>
<EnableAllFolders>true</EnableAllFolders>
<EnabledFolders />
<AdminRoles><string>admins</string></AdminRoles>
<Roles />
<EnableFolderRoles>false</EnableFolderRoles>
<FolderRoleMappings />
<RoleClaim>groups</RoleClaim>
<OidScopes>
<string>openid</string>
<string>email</string>
<string>profile</string>
</OidScopes>
<SchemeOverride>https</SchemeOverride>
<CanonicalLinks />
</PluginConfiguration>
</value>
</item>
</OidConfigs>
</PluginConfiguration>

View file

@ -0,0 +1,20 @@
---
# Jellyfin metrics collection configuration
# Jellyfin server URL
jellyfin_metrics_url: "http://localhost:8096"
# Path to file containing Jellyfin API key (should have 600 permissions)
jellyfin_metrics_api_key_file: "/Users/erichblume/.jellyfin-api-key"
# Metrics collection interval in seconds
jellyfin_metrics_interval: 60
# Output directory for prometheus textfile collector
jellyfin_metrics_dir: /opt/homebrew/var/node_exporter/textfile
# Script installation path
jellyfin_metrics_script: /Users/erichblume/.local/bin/jellyfin-metrics
# Log directory for metrics script output
jellyfin_metrics_log_dir: /opt/homebrew/var/log

View file

@ -0,0 +1,6 @@
---
- name: Reload jellyfin-metrics
ansible.builtin.shell: |
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.jellyfin-metrics.plist 2>/dev/null || true
launchctl load ~/Library/LaunchAgents/mcquack.eblume.jellyfin-metrics.plist
changed_when: true

View file

@ -0,0 +1,55 @@
---
- name: Fetch Jellyfin API key (when running with --tags jellyfin_metrics)
ansible.builtin.command:
cmd: op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/ceywxkcd3z7najsy2nmmbs2vke/credential"
delegate_to: localhost
register: jellyfin_metrics_api_key_fallback
changed_when: false
no_log: true
check_mode: false
when: jellyfin_metrics_api_key is not defined
- name: Set Jellyfin API key fact (fallback)
ansible.builtin.set_fact:
jellyfin_metrics_api_key: "{{ jellyfin_metrics_api_key_fallback.stdout }}"
no_log: true
when: jellyfin_metrics_api_key is not defined
- name: Write Jellyfin API key file
ansible.builtin.copy:
content: "{{ jellyfin_metrics_api_key }}"
dest: "{{ jellyfin_metrics_api_key_file }}"
mode: '0600'
no_log: true
- name: Ensure bin directory exists
ansible.builtin.file:
path: "{{ jellyfin_metrics_script | dirname }}"
state: directory
mode: '0755'
- name: Deploy jellyfin metrics collection script
ansible.builtin.template:
src: jellyfin-metrics.sh.j2
dest: "{{ jellyfin_metrics_script }}"
mode: '0755'
notify: Reload jellyfin-metrics
- name: Deploy jellyfin-metrics LaunchAgent plist
ansible.builtin.template:
src: jellyfin-metrics.plist.j2
dest: ~/Library/LaunchAgents/mcquack.eblume.jellyfin-metrics.plist
mode: '0644'
notify: Reload jellyfin-metrics
- name: Check if jellyfin-metrics LaunchAgent is loaded
ansible.builtin.command: launchctl list mcquack.eblume.jellyfin-metrics
register: jellyfin_metrics_launchctl_check
changed_when: false
failed_when: false
- name: Load jellyfin-metrics LaunchAgent if not loaded
ansible.builtin.command: launchctl load ~/Library/LaunchAgents/mcquack.eblume.jellyfin-metrics.plist
when: jellyfin_metrics_launchctl_check.rc != 0
changed_when: true
failed_when: false

View file

@ -4,7 +4,7 @@
<plist version="1.0">
<dict>
<key>Label</key>
<string>mcquack.eblume.plex-metrics</string>
<string>mcquack.eblume.jellyfin-metrics</string>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
@ -12,15 +12,15 @@
</dict>
<key>ProgramArguments</key>
<array>
<string>{{ plex_metrics_script }}</string>
<string>{{ jellyfin_metrics_script }}</string>
</array>
<key>StartInterval</key>
<integer>{{ plex_metrics_interval }}</integer>
<integer>{{ jellyfin_metrics_interval }}</integer>
<key>RunAtLoad</key>
<true/>
<key>StandardErrorPath</key>
<string>{{ plex_metrics_log_dir }}/plex-metrics.err.log</string>
<string>{{ jellyfin_metrics_log_dir }}/jellyfin-metrics.err.log</string>
<key>StandardOutPath</key>
<string>{{ plex_metrics_log_dir }}/plex-metrics.out.log</string>
<string>{{ jellyfin_metrics_log_dir }}/jellyfin-metrics.out.log</string>
</dict>
</plist>

View file

@ -0,0 +1,137 @@
#!/bin/bash
# {{ ansible_managed }}
# Collects Jellyfin Media Server metrics for node_exporter textfile collector
set -euo pipefail
JELLYFIN_URL="{{ jellyfin_metrics_url }}"
API_KEY_FILE="{{ jellyfin_metrics_api_key_file }}"
OUTPUT_FILE="{{ jellyfin_metrics_dir }}/jellyfin.prom"
TEMP_FILE="${OUTPUT_FILE}.tmp"
# Read API key from file
get_api_key() {
if [ -f "$API_KEY_FILE" ]; then
cat "$API_KEY_FILE" | tr -d '\n'
else
echo ""
fi
}
# Make API request with optional API key
api_request() {
local endpoint="$1"
local use_auth="${2:-true}"
local api_key
local url="${JELLYFIN_URL}${endpoint}"
if [ "$use_auth" = "true" ]; then
api_key=$(get_api_key)
if [ -n "$api_key" ]; then
curl -s -H "Accept: application/json" -H "X-Emby-Token: $api_key" "$url" 2>/dev/null
else
curl -s -H "Accept: application/json" "$url" 2>/dev/null
fi
else
curl -s -H "Accept: application/json" "$url" 2>/dev/null
fi
}
# Initialize metrics
jellyfin_up=0
jellyfin_version=""
jellyfin_sessions_total=0
jellyfin_sessions_playing=0
jellyfin_sessions_paused=0
jellyfin_transcode_sessions_total=0
# Library metrics will be built dynamically
library_metrics=""
# Check server health (no auth required)
health=$(api_request "/health" false)
if [ "$health" = "Healthy" ]; then
jellyfin_up=1
fi
# Get system info for version (requires auth)
if [ "$jellyfin_up" -eq 1 ] && [ -f "$API_KEY_FILE" ]; then
system_info=$(api_request "/System/Info")
if [ -n "$system_info" ]; then
jellyfin_version=$(echo "$system_info" | jq -r '.Version // ""')
fi
# Get library counts (virtual folders)
libraries=$(api_request "/Library/VirtualFolders")
if [ -n "$libraries" ] && echo "$libraries" | jq -e '.' > /dev/null 2>&1; then
# Process each library
while IFS=$'\t' read -r lib_name lib_type lib_id; do
if [ -n "$lib_name" ] && [ -n "$lib_type" ]; then
# Get item count for this library
# Map collection type to item type for counting
case "$lib_type" in
movies) item_type="Movie" ;;
tvshows) item_type="Series" ;;
music) item_type="MusicAlbum" ;;
*) item_type="" ;;
esac
if [ -n "$item_type" ] && [ -n "$lib_id" ]; then
items=$(api_request "/Items?parentId=${lib_id}&recursive=true&includeItemTypes=${item_type}&limit=0")
item_count=$(echo "$items" | jq -r '.TotalRecordCount // 0' 2>/dev/null || echo "0")
library_metrics="${library_metrics}jellyfin_library_items{library=\"${lib_name}\",type=\"${lib_type}\"} ${item_count}
"
fi
fi
done < <(echo "$libraries" | jq -r '.[] | [.Name, .CollectionType, .ItemId] | @tsv' 2>/dev/null || true)
fi
# Get active sessions
sessions=$(api_request "/Sessions")
if [ -n "$sessions" ] && echo "$sessions" | jq -e '.' > /dev/null 2>&1; then
jellyfin_sessions_total=$(echo "$sessions" | jq -r 'length')
# Count playing sessions (NowPlayingItem is present and IsPaused is false)
jellyfin_sessions_playing=$(echo "$sessions" | jq -r '[.[] | select(.NowPlayingItem != null and .PlayState.IsPaused == false)] | length')
# Count paused sessions
jellyfin_sessions_paused=$(echo "$sessions" | jq -r '[.[] | select(.NowPlayingItem != null and .PlayState.IsPaused == true)] | length')
# Count transcode sessions (TranscodingInfo is present)
jellyfin_transcode_sessions_total=$(echo "$sessions" | jq -r '[.[] | select(.TranscodingInfo != null)] | length')
fi
fi
# Write metrics
cat > "$TEMP_FILE" << EOF
# HELP jellyfin_up Jellyfin Media Server is up and responding
# TYPE jellyfin_up gauge
jellyfin_up ${jellyfin_up}
# HELP jellyfin_version_info Jellyfin Media Server version information
# TYPE jellyfin_version_info gauge
jellyfin_version_info{version="${jellyfin_version}"} 1
# HELP jellyfin_sessions_total Total number of active Jellyfin sessions
# TYPE jellyfin_sessions_total gauge
jellyfin_sessions_total ${jellyfin_sessions_total}
# HELP jellyfin_sessions_playing Number of sessions currently playing
# TYPE jellyfin_sessions_playing gauge
jellyfin_sessions_playing ${jellyfin_sessions_playing}
# HELP jellyfin_sessions_paused Number of sessions currently paused
# TYPE jellyfin_sessions_paused gauge
jellyfin_sessions_paused ${jellyfin_sessions_paused}
# HELP jellyfin_transcode_sessions_total Number of sessions being transcoded
# TYPE jellyfin_transcode_sessions_total gauge
jellyfin_transcode_sessions_total ${jellyfin_transcode_sessions_total}
# HELP jellyfin_library_items Number of items in each Jellyfin library
# TYPE jellyfin_library_items gauge
${library_metrics}
EOF
# Atomic move
mv "$TEMP_FILE" "$OUTPUT_FILE"

View file

@ -37,9 +37,9 @@
msg: "WARNING: Docker does not appear to be running. Please start Docker Desktop manually."
when: minikube_docker_status.rc != 0
- name: Check if minikube cluster exists
- name: Check minikube cluster status
ansible.builtin.command:
cmd: minikube status --format={% raw %}'{{.Host}}'{% endraw %}
cmd: minikube status
register: minikube_status
changed_when: false
failed_when: false
@ -63,11 +63,11 @@
failed_when: false # Don't fail - may need manual intervention
when:
- minikube_docker_status.rc == 0
- minikube_status.rc != 0 or 'Running' not in minikube_status.stdout
- minikube_status.rc != 0
- name: Check minikube status after start attempt
ansible.builtin.command:
cmd: minikube status --format={% raw %}'{{.Host}}'{% endraw %}
cmd: minikube status
register: minikube_final_status
changed_when: false
failed_when: false
@ -75,7 +75,38 @@
- name: Warn if minikube failed to start
ansible.builtin.debug:
msg: "WARNING: minikube may not have started properly. Run 'minikube start' manually on indri if needed. Status: {{ minikube_final_status.stdout | default('unknown') }}"
when: minikube_final_status.rc != 0 or 'Running' not in minikube_final_status.stdout
when: minikube_final_status.rc != 0
# The storage-provisioner is a bare Pod (no controller). If the node restarts
# via Docker Desktop rather than `minikube start`, kubelet brings back static
# pods (apiserver, etcd) but bare pods like storage-provisioner are lost.
# `minikube start` on a running cluster is safe and re-applies all addons.
- name: Check storage-provisioner pod is running
ansible.builtin.command:
cmd: kubectl -n kube-system get pod storage-provisioner -o jsonpath='{.status.phase}'
register: minikube_storage_provisioner
changed_when: false
failed_when: false
when: minikube_final_status.rc == 0
- name: Re-run minikube start to restore addons
ansible.builtin.command:
cmd: >
minikube start
--driver={{ minikube_driver }}
--container-runtime={{ minikube_container_runtime }}
--cpus={{ minikube_cpus }}
--memory={{ minikube_memory }}
--disk-size={{ minikube_disk_size }}
{% for name in minikube_apiserver_names %}
--apiserver-names={{ name }}
{% endfor %}
--apiserver-port={{ minikube_apiserver_port }}
--listen-address={{ minikube_listen_address }}
when:
- minikube_final_status.rc == 0
- minikube_storage_provisioner.stdout | default('') != 'Running'
changed_when: true
# Configure containerd to use zot registry as pull-through cache
# With docker driver, use host.minikube.internal to reach the host
@ -85,32 +116,32 @@
ansible.builtin.command:
cmd: minikube ssh --native-ssh=false "sudo mkdir -p /etc/containerd/certs.d/{{ item }}"
loop:
- registry.tail8d86e.ts.net
- registry.ops.eblu.me
- docker.io
- ghcr.io
- quay.io
changed_when: false
when: minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
when: minikube_final_status.rc == 0
# Private registry (registry.tail8d86e.ts.net) - direct to zot
- name: Check registry.tail8d86e.ts.net config
# Private registry (registry.ops.eblu.me) - direct to zot
- name: Check registry.ops.eblu.me config
ansible.builtin.command:
cmd: minikube ssh --native-ssh=false "cat /etc/containerd/certs.d/registry.tail8d86e.ts.net/hosts.toml 2>/dev/null || echo ''"
cmd: minikube ssh --native-ssh=false "cat /etc/containerd/certs.d/registry.ops.eblu.me/hosts.toml 2>/dev/null || echo ''"
register: minikube_registry_config
changed_when: false
when: minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
when: minikube_final_status.rc == 0
- name: Configure registry.tail8d86e.ts.net mirror
- name: Configure registry.ops.eblu.me mirror
ansible.builtin.command:
cmd: |
minikube ssh --native-ssh=false 'echo "server = \"http://host.minikube.internal:5050\"
[host.\"http://host.minikube.internal:5050\"]
capabilities = [\"pull\", \"resolve\", \"push\"]
skip_verify = true" | sudo tee /etc/containerd/certs.d/registry.tail8d86e.ts.net/hosts.toml'
skip_verify = true" | sudo tee /etc/containerd/certs.d/registry.ops.eblu.me/hosts.toml'
changed_when: true
when:
- minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
- minikube_final_status.rc == 0
- "'host.minikube.internal:5050' not in minikube_registry_config.stdout"
notify: Restart containerd in minikube
@ -120,7 +151,7 @@
cmd: minikube ssh --native-ssh=false "cat /etc/containerd/certs.d/docker.io/hosts.toml 2>/dev/null || echo ''"
register: minikube_dockerio_config
changed_when: false
when: minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
when: minikube_final_status.rc == 0
- name: Configure docker.io mirror through zot
ansible.builtin.command:
@ -132,7 +163,7 @@
skip_verify = true" | sudo tee /etc/containerd/certs.d/docker.io/hosts.toml'
changed_when: true
when:
- minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
- minikube_final_status.rc == 0
- "'host.minikube.internal:5050' not in minikube_dockerio_config.stdout"
notify: Restart containerd in minikube
@ -142,7 +173,7 @@
cmd: minikube ssh --native-ssh=false "cat /etc/containerd/certs.d/ghcr.io/hosts.toml 2>/dev/null || echo ''"
register: minikube_ghcr_config
changed_when: false
when: minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
when: minikube_final_status.rc == 0
- name: Configure ghcr.io mirror through zot
ansible.builtin.command:
@ -154,7 +185,7 @@
skip_verify = true" | sudo tee /etc/containerd/certs.d/ghcr.io/hosts.toml'
changed_when: true
when:
- minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
- minikube_final_status.rc == 0
- "'host.minikube.internal:5050' not in minikube_ghcr_config.stdout"
notify: Restart containerd in minikube
@ -164,7 +195,7 @@
cmd: minikube ssh --native-ssh=false "cat /etc/containerd/certs.d/quay.io/hosts.toml 2>/dev/null || echo ''"
register: minikube_quay_config
changed_when: false
when: minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
when: minikube_final_status.rc == 0
- name: Configure quay.io mirror through zot
ansible.builtin.command:
@ -176,7 +207,7 @@
skip_verify = true" | sudo tee /etc/containerd/certs.d/quay.io/hosts.toml'
changed_when: true
when:
- minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
- minikube_final_status.rc == 0
- "'host.minikube.internal:5050' not in minikube_quay_config.stdout"
notify: Restart containerd in minikube
@ -188,13 +219,13 @@
cmd: kubectl config view --minify -o jsonpath="{.clusters[0].cluster.server}"
register: minikube_api_url
changed_when: false
when: minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
when: minikube_final_status.rc == 0
- name: Extract API server port from URL
ansible.builtin.set_fact:
minikube_api_port: "{{ minikube_api_url.stdout | regex_search(':([0-9]+)$', '\\1') | first }}"
when:
- minikube_final_status.rc == 0 and 'Running' in minikube_final_status.stdout
- minikube_final_status.rc == 0
- minikube_api_url.stdout is defined
- name: Check current tailscale serve config for k8s

View file

@ -1,6 +1,6 @@
---
minikube_metrics_dir: /opt/homebrew/var/node_exporter/textfile
minikube_metrics_script: /Users/erichblume/bin/minikube-metrics
minikube_metrics_script: /Users/erichblume/.local/bin/minikube-metrics
minikube_metrics_interval: 60 # seconds between metric collection
minikube_metrics_log_dir: /opt/homebrew/var/log
minikube_metrics_user_home: /Users/erichblume

View file

@ -1,20 +0,0 @@
---
# Plex metrics collection configuration
# Plex server URL
plex_metrics_url: "http://localhost:32400"
# Path to file containing Plex token (should have 600 permissions)
plex_metrics_token_file: "/Users/erichblume/.plex-token"
# Metrics collection interval in seconds
plex_metrics_interval: 60
# Output directory for prometheus textfile collector
plex_metrics_dir: /opt/homebrew/var/node_exporter/textfile
# Script installation path
plex_metrics_script: /Users/erichblume/bin/plex-metrics
# Log directory for metrics script output
plex_metrics_log_dir: /opt/homebrew/var/log

View file

@ -1,6 +0,0 @@
---
- name: Reload plex-metrics
ansible.builtin.shell: |
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.plex-metrics.plist 2>/dev/null || true
launchctl load ~/Library/LaunchAgents/mcquack.eblume.plex-metrics.plist
changed_when: true

View file

@ -1,4 +0,0 @@
---
# Role ordering is controlled by indri.yml playbook - do not add dependencies here
# (Ansible's tag accumulation prevents proper deduplication when using meta dependencies)
dependencies: []

View file

@ -1,32 +0,0 @@
---
- name: Ensure bin directory exists
ansible.builtin.file:
path: "{{ plex_metrics_script | dirname }}"
state: directory
mode: '0755'
- name: Deploy plex metrics collection script
ansible.builtin.template:
src: plex-metrics.sh.j2
dest: "{{ plex_metrics_script }}"
mode: '0755'
notify: Reload plex-metrics
- name: Deploy plex-metrics LaunchAgent plist
ansible.builtin.template:
src: plex-metrics.plist.j2
dest: ~/Library/LaunchAgents/mcquack.eblume.plex-metrics.plist
mode: '0644'
notify: Reload plex-metrics
- name: Check if plex-metrics LaunchAgent is loaded
ansible.builtin.command: launchctl list mcquack.eblume.plex-metrics
register: plex_metrics_launchctl_check
changed_when: false
failed_when: false
- name: Load plex-metrics LaunchAgent if not loaded
ansible.builtin.command: launchctl load ~/Library/LaunchAgents/mcquack.eblume.plex-metrics.plist
when: plex_metrics_launchctl_check.rc != 0
changed_when: true
failed_when: false

View file

@ -1,133 +0,0 @@
#!/bin/bash
# {{ ansible_managed }}
# Collects Plex Media Server metrics for node_exporter textfile collector
set -euo pipefail
PLEX_URL="{{ plex_metrics_url }}"
TOKEN_FILE="{{ plex_metrics_token_file }}"
OUTPUT_FILE="{{ plex_metrics_dir }}/plex.prom"
TEMP_FILE="${OUTPUT_FILE}.tmp"
# Read token from file
get_token() {
if [ -f "$TOKEN_FILE" ]; then
cat "$TOKEN_FILE" | tr -d '\n'
else
echo ""
fi
}
# Make API request with optional token
api_request() {
local endpoint="$1"
local use_token="${2:-true}"
local token
local url="${PLEX_URL}${endpoint}"
if [ "$use_token" = "true" ]; then
token=$(get_token)
if [ -n "$token" ]; then
curl -s -H "Accept: application/json" -H "X-Plex-Token: $token" "$url" 2>/dev/null
else
curl -s -H "Accept: application/json" "$url" 2>/dev/null
fi
else
curl -s -H "Accept: application/json" "$url" 2>/dev/null
fi
}
# Initialize metrics
plex_up=0
plex_version=""
plex_sessions_total=0
plex_sessions_playing=0
plex_sessions_paused=0
plex_transcode_sessions_total=0
plex_transcode_video=0
plex_transcode_audio=0
# Library metrics will be built dynamically
library_metrics=""
# Check server identity (no auth required)
identity=$(api_request "/identity" false)
if echo "$identity" | jq -e '.MediaContainer.machineIdentifier' > /dev/null 2>&1; then
plex_up=1
plex_version=$(echo "$identity" | jq -r '.MediaContainer.version // ""')
fi
# If server is up, get additional metrics (require auth)
if [ "$plex_up" -eq 1 ] && [ -f "$TOKEN_FILE" ]; then
# Get library sections
sections=$(api_request "/library/sections")
# Process each library using jq
while IFS=$'\t' read -r lib_key lib_type lib_title; do
if [ -n "$lib_key" ] && [ -n "$lib_type" ]; then
# Get library details for item count
lib_detail=$(api_request "/library/sections/${lib_key}/all?X-Plex-Container-Start=0&X-Plex-Container-Size=0")
lib_size=$(echo "$lib_detail" | jq -r '.MediaContainer.totalSize // .MediaContainer.size // 0')
library_metrics="${library_metrics}plex_library_items{library=\"${lib_title}\",type=\"${lib_type}\"} ${lib_size}
"
fi
done < <(echo "$sections" | jq -r '.MediaContainer.Directory[] | [.key, .type, .title] | @tsv' 2>/dev/null || true)
# Get active sessions
sessions=$(api_request "/status/sessions")
if echo "$sessions" | jq -e '.MediaContainer' > /dev/null 2>&1; then
plex_sessions_total=$(echo "$sessions" | jq -r '.MediaContainer.size // 0')
# Count playing vs paused
plex_sessions_playing=$(echo "$sessions" | jq -r '[.MediaContainer.Metadata[]? | select(.Player.state == "playing")] | length')
plex_sessions_paused=$(echo "$sessions" | jq -r '[.MediaContainer.Metadata[]? | select(.Player.state == "paused")] | length')
# Count transcode sessions
plex_transcode_video=$(echo "$sessions" | jq -r '[.MediaContainer.Metadata[]? | select(.TranscodeSession.videoDecision == "transcode")] | length')
plex_transcode_audio=$(echo "$sessions" | jq -r '[.MediaContainer.Metadata[]? | select(.TranscodeSession.audioDecision == "transcode")] | length')
plex_transcode_sessions_total=$(echo "$sessions" | jq -r '[.MediaContainer.Metadata[]? | select(.TranscodeSession)] | length')
fi
fi
# Write metrics
cat > "$TEMP_FILE" << EOF
# HELP plex_up Plex Media Server is up and responding
# TYPE plex_up gauge
plex_up ${plex_up}
# HELP plex_version_info Plex Media Server version information
# TYPE plex_version_info gauge
plex_version_info{version="${plex_version}"} 1
# HELP plex_sessions_total Total number of active Plex sessions
# TYPE plex_sessions_total gauge
plex_sessions_total ${plex_sessions_total}
# HELP plex_sessions_playing Number of sessions currently playing
# TYPE plex_sessions_playing gauge
plex_sessions_playing ${plex_sessions_playing}
# HELP plex_sessions_paused Number of sessions currently paused
# TYPE plex_sessions_paused gauge
plex_sessions_paused ${plex_sessions_paused}
# HELP plex_transcode_sessions_total Number of sessions being transcoded
# TYPE plex_transcode_sessions_total gauge
plex_transcode_sessions_total ${plex_transcode_sessions_total}
# HELP plex_transcode_video_sessions Number of sessions transcoding video
# TYPE plex_transcode_video_sessions gauge
plex_transcode_video_sessions ${plex_transcode_video}
# HELP plex_transcode_audio_sessions Number of sessions transcoding audio
# TYPE plex_transcode_audio_sessions gauge
plex_transcode_audio_sessions ${plex_transcode_audio}
# HELP plex_library_items Number of items in each Plex library
# TYPE plex_library_items gauge
${library_metrics}
EOF
# Atomic move
mv "$TEMP_FILE" "$OUTPUT_FILE"

View file

@ -0,0 +1,15 @@
---
# Docker images for Prometheus exporters on sifaka NAS
# Ports are defined in group_vars/all.yml (shared with caddy role)
sifaka_exporters_docker: /volume1/@appstore/ContainerManager/usr/bin/docker
sifaka_exporters_node_exporter_image: "prom/node-exporter:latest"
sifaka_exporters_node_exporter_name: "prom-node-exporter-1"
sifaka_exporters_smartctl_exporter_image: "prometheuscommunity/smartctl-exporter:latest"
sifaka_exporters_smartctl_exporter_name: "smartctl-exporter"
# Synology uses /dev/sata* instead of /dev/sd* — smartctl can't auto-detect them
sifaka_exporters_smartctl_devices:
- /dev/sata1
- /dev/sata2
- /dev/sata3
- /dev/sata4

View file

@ -0,0 +1,12 @@
---
- name: Restart node_exporter
ansible.builtin.command: "{{ sifaka_exporters_docker }} restart {{ sifaka_exporters_node_exporter_name }}"
become: true
listen: Restart node_exporter
changed_when: true
- name: Restart smartctl_exporter
ansible.builtin.command: "{{ sifaka_exporters_docker }} restart {{ sifaka_exporters_smartctl_exporter_name }}"
become: true
listen: Restart smartctl_exporter
changed_when: true

View file

@ -0,0 +1,91 @@
---
# Manage Prometheus exporter containers on sifaka NAS
# Uses command module to avoid requiring docker Python SDK on Synology
# Requires passwordless sudo for docker — see docs/reference/storage/sifaka.md
# --- node_exporter ---
- name: Pull node_exporter image
ansible.builtin.command: "{{ sifaka_exporters_docker }} pull {{ sifaka_exporters_node_exporter_image }}"
become: true
register: sifaka_exporters_node_pull
changed_when: "'Downloaded newer image' in sifaka_exporters_node_pull.stdout"
- name: Check if node_exporter container exists
ansible.builtin.command: "{{ sifaka_exporters_docker }} inspect {{ sifaka_exporters_node_exporter_name }} --format {% raw %}'{{.Config.Image}}'{% endraw %}"
become: true
register: sifaka_exporters_node_inspect
changed_when: false
failed_when: false
- name: Remove node_exporter container if image changed
ansible.builtin.command: "{{ sifaka_exporters_docker }} rm -f {{ sifaka_exporters_node_exporter_name }}"
become: true
when:
- sifaka_exporters_node_inspect.rc == 0
- sifaka_exporters_node_inspect.stdout != sifaka_exporters_node_exporter_image
changed_when: true
- name: Start node_exporter container
ansible.builtin.command:
argv:
- "{{ sifaka_exporters_docker }}"
- run
- -d
- "--name={{ sifaka_exporters_node_exporter_name }}"
- --restart=always
- --net=host
- "{{ sifaka_exporters_node_exporter_image }}"
become: true
register: sifaka_exporters_node_start
when: >
sifaka_exporters_node_inspect.rc != 0 or
sifaka_exporters_node_inspect.stdout != sifaka_exporters_node_exporter_image
changed_when: sifaka_exporters_node_start.rc == 0
# --- smartctl_exporter ---
- name: Pull smartctl_exporter image
ansible.builtin.command: "{{ sifaka_exporters_docker }} pull {{ sifaka_exporters_smartctl_exporter_image }}"
become: true
register: sifaka_exporters_smartctl_pull
changed_when: "'Downloaded newer image' in sifaka_exporters_smartctl_pull.stdout"
- name: Check if smartctl_exporter container exists
ansible.builtin.command: "{{ sifaka_exporters_docker }} inspect {{ sifaka_exporters_smartctl_exporter_name }} --format {% raw %}'{{.Config.Image}}'{% endraw %}"
become: true
register: sifaka_exporters_smartctl_inspect
changed_when: false
failed_when: false
- name: Remove smartctl_exporter container if image changed
ansible.builtin.command: "{{ sifaka_exporters_docker }} rm -f {{ sifaka_exporters_smartctl_exporter_name }}"
become: true
when:
- sifaka_exporters_smartctl_inspect.rc == 0
- sifaka_exporters_smartctl_inspect.stdout != sifaka_exporters_smartctl_exporter_image
changed_when: true
- name: Build smartctl_exporter device arguments
ansible.builtin.set_fact:
sifaka_exporters_smartctl_device_args: >-
{{ sifaka_exporters_smartctl_devices | map('regex_replace', '^(.*)$', '--smartctl.device=\1') | list }}
- name: Start smartctl_exporter container
ansible.builtin.command:
argv: >-
{{ [
sifaka_exporters_docker, 'run', '-d',
'--name=' + sifaka_exporters_smartctl_exporter_name,
'--restart=always',
'--privileged',
'--user=root',
'-p', sifaka_smartctl_exporter_port | string + ':' + sifaka_smartctl_exporter_port | string,
sifaka_exporters_smartctl_exporter_image
] + sifaka_exporters_smartctl_device_args }}
become: true
register: sifaka_exporters_smartctl_start
when: >
sifaka_exporters_smartctl_inspect.rc != 0 or
sifaka_exporters_smartctl_inspect.stdout != sifaka_exporters_smartctl_exporter_image
changed_when: sifaka_exporters_smartctl_start.rc == 0

View file

@ -1,17 +0,0 @@
---
# Tailscale serve configuration for this host
# Each service maps a Tailscale service name to local endpoints
tailscale_serve_services:
- name: svc:forge
https:
port: 443
upstream: http://localhost:3001
tcp:
port: 22
upstream: tcp://localhost:2200
- name: svc:registry
https:
port: 443
upstream: http://localhost:5050

View file

@ -1,4 +0,0 @@
---
# Role ordering is controlled by indri.yml playbook - do not add dependencies here
# (Ansible's tag accumulation prevents proper deduplication when using meta dependencies)
dependencies: []

View file

@ -1,38 +0,0 @@
---
- name: Get current tailscale serve status
ansible.builtin.command: tailscale serve status --json
register: tailscale_serve_status
changed_when: false
- name: Parse serve status
ansible.builtin.set_fact:
tailscale_serve_config: "{{ ((tailscale_serve_status.stdout | default('{}', true)) | from_json).Services | default({}) }}"
# Configure HTTPS if service doesn't have Web config yet
- name: Configure HTTPS services
ansible.builtin.command: >
tailscale serve --service="{{ item.name }}"
--https={{ item.https.port }} {{ item.https.upstream }}
loop: "{{ tailscale_serve_services }}"
when:
- item.https is defined
- tailscale_serve_config[item.name] is not defined or tailscale_serve_config[item.name].Web is not defined
register: tailscale_serve_https_result
changed_when: true
failed_when: false
# Configure TCP if service doesn't have the specific port configured yet
- name: Configure TCP services
ansible.builtin.command: >
tailscale serve --service="{{ item.name }}"
--tcp={{ item.tcp.port }} {{ item.tcp.upstream }}
loop: "{{ tailscale_serve_services }}"
when:
- item.tcp is defined
- tailscale_serve_config[item.name] is not defined or
tailscale_serve_config[item.name].TCP is not defined or
tailscale_serve_config[item.name].TCP[item.tcp.port | string] is not defined or
tailscale_serve_config[item.name].TCP[item.tcp.port | string].TCPForward is not defined
register: tailscale_serve_tcp_result
changed_when: true
failed_when: false

View file

@ -5,6 +5,8 @@ zot_data_dir: /Users/erichblume/zot
zot_config_dir: /Users/erichblume/.config/zot
zot_port: 5050
zot_log_dir: /Users/erichblume/Library/Logs
zot_external_url: https://registry.ops.eblu.me
zot_oidc_issuer: https://authentik.ops.eblu.me/application/o/zot/
# Pull-through cache registries (on-demand sync)
zot_sync_registries:

View file

@ -46,6 +46,14 @@
mode: '0644'
notify: Restart zot
- name: Deploy zot OIDC credentials
ansible.builtin.template:
src: oidc-credentials.json.j2
dest: "{{ zot_config_dir }}/oidc-credentials.json"
mode: '0600'
notify: Restart zot
when: zot_oidc_client_secret is defined
- name: Deploy zot LaunchAgent plist
ansible.builtin.template:
src: zot.plist.j2

View file

@ -8,7 +8,44 @@
},
"http": {
"address": "0.0.0.0",
"port": "{{ zot_port }}"
"port": "{{ zot_port }}",
"externalUrl": "{{ zot_external_url }}",
"auth": {
"openid": {
"providers": {
"oidc": {
"credentialsFile": "{{ zot_config_dir }}/oidc-credentials.json",
"issuer": "{{ zot_oidc_issuer }}",
"scopes": ["openid", "email", "profile"],
"claimMapping": {
"username": "preferred_username"
}
}
}
},
"apikey": true
},
"accessControl": {
"metrics": {
"users": [""]
},
"repositories": {
"**": {
"policies": [
{
"groups": ["artifact-workloads"],
"actions": ["read", "create"]
},
{
"groups": ["admins"],
"actions": ["read", "create", "update", "delete"]
}
],
"anonymousPolicy": ["read"],
"defaultPolicy": ["read"]
}
}
}
},
"log": {
"level": "info"

View file

@ -0,0 +1,4 @@
{
"clientid": "zot",
"clientsecret": "{{ zot_oidc_client_secret }}"
}

View file

@ -16,6 +16,11 @@
<true/>
<key>KeepAlive</key>
<true/>
<key>EnvironmentVariables</key>
<dict>
<key>PATH</key>
<string>/usr/local/bin:/usr/bin:/bin:/usr/sbin:/sbin</string>
</dict>
<key>StandardOutPath</key>
<string>{{ zot_log_dir }}/mcquack.zot.out.log</string>
<key>StandardErrorPath</key>

Some files were not shown because too many files have changed in this diff Show more