blumeops

Author	SHA1	Message	Date
Erich Blume	a2f1e06224	Add hephaestus sync hub to indri (launchagent, PWA, device-code OIDC) (#369 ) Makes indri the canonical heph hub for the hub-and-spoke task/context system, deployed as a self-updating LaunchAgent managed by Ansible. Other devices (gilbert) attach as offline-capable spokes. ## What's here - `ansible/roles/heph` (tag `heph`) — bootstrap `cargo install hephd` (only if absent; `--self-update` keeps it current after), version-pinned `heph-pwa` checkout served via `--web-root`, launchagent `mcquack.eblume.heph`: ``` hephd --mode server --http-addr 0.0.0.0:8787 --db … --web-root … --oidc-issuer …/o/heph/ --oidc-audience heph --self-update --self-update-interval-secs 600 ``` `~/.cargo/bin` is on the agent `PATH` so self-update's `cargo install` works. - Caddy — `heph.ops.eblu.me → localhost:8787` (TLS for the PWA secure context). - Authentik — new `heph` public device-code OIDC app + `default-device-code-flow` bound to the default brand's `flow_device_code` (verified live: brand `authentik-default`, field currently unset → additive). - Docs — `services/hephaestus.md` (Path-A seeding runbook + spoke caveat), `indri.md`, changelog fragment. ## Three features requested - Autoupdate — 10-min interval (`--self-update-interval-secs 600`). - PWA — `--web-root` (confirmed shipped in v1.2.0). - Spoke — gilbert reconfig documented (post-merge step). ## Deploy plan (not done yet — awaiting review) 1. Seed from gilbert (Path A): `heph daemon stop` → copy `heph.db` → `DELETE FROM meta WHERE key='origin'`. 2. Sync Authentik `apps`/blueprint; verify blueprint status via API (not just logs). 3. `provision-indri --tags heph,caddy` from this branch. 4. Point gilbert at the hub + `heph auth login`. ## Known follow-ups (heph-side, tracked in the Hephaestus project) - `heph daemon` can't bake hub/spoke config or pass `--self-update-interval-secs` → worked around by the ansible plist. - Path-A seeding lacks a clean `hephd --owner-id`/seed command → manual `meta.origin` reset for now. - Self-update moves hephd ahead of the ansible-pinned PWA shell over time (drift; tolerated by the SW cache, revisit on next release). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #369	2026-06-05 06:46:58 -07:00
Erich Blume	4e25180b0a	C0: clone blumeops via tailnet on ringtail provision Switch ringtail.yml from forge.eblu.me (Fly proxy, WAN) to forge.ops.eblu.me (Caddy on indri, tailnet). Ringtail is always on the tailnet — the WAN round-trip was overhead and made provision-ringtail fail any time Fly was slow or down.	2026-05-28 07:13:40 -07:00
Erich Blume	8d634861f6	C1: migrate cv + docs from minikube to indri-native (#342 ) ## Summary Replace the cv (`cv.eblu.me`) and docs (`docs.eblu.me`) minikube Deployments with indri-native ansible roles. Caddy serves the extracted release tarballs directly via a new `kind: static` service-block — no daemon, no nginx pod, no ProxyGroup ingress on the request path. Mirrors the rationale of the recent devpi migration; part of the broader minikube wind-down. ## What's in this commit - `ansible/roles/{cv,docs}` — sentinel-gated tarball download + extract into `~/{cv,docs}/content/` - `ansible/roles/caddy/` — new `kind: static` branch in the Caddyfile template (encoded gzip, immutable cache headers for fingerprinted assets, optional `try_html` for Quartz-style clean URLs, optional per-path `download_paths` for the resume PDF's `Content-Disposition`) - `ansible/playbooks/indri.yml` — wires `cv` and `docs` roles before `caddy` - `service-versions.yaml` — both services flip to `type: ansible`. `docs.current-version` stays at `1.28.2` for this commit so `container-version-check` keeps passing while `containers/quartz/Dockerfile` still exists; it moves to the docs release tag in the cleanup commit - `.forgejo/workflows/{cv-deploy,build-blumeops}.yaml` — deploy step now bumps `cv_version`/`docs_version` in the role defaults and pushes; running ansible + purging the Fly cache is manual from gilbert (matches devpi) - Docs: `docs/how-to/operations/{cv,docs}-on-indri.md`, updated `docs/reference/services/{cv,docs}.md`, changelog fragment ## What is not in this commit The dead artifacts. After PR review and successful cutover, a follow-up commit deletes: - `argocd/apps/{cv,docs}.yaml` and `argocd/manifests/{cv,docs}/` - `containers/cv/`, `containers/quartz/` - `CONTAINER_TO_SERVICE['quartz']` mapping in `mise-tasks/container-version-check` - bumps `docs.current-version` in `service-versions.yaml` to the release tag ## Cutover plan (manual, from gilbert, after review) 1. Take down old: - Remove the cv and docs Applications: `argocd app delete cv --cascade && argocd app delete docs --cascade` - Verify k8s namespaces gone: `kubectl --context=minikube-indri get ns \| grep -E '^(cv\|docs)\\b'` (should be empty) - Verify tailnet MagicDNS no longer advertises the VIPs: `nslookup cv.tail8d86e.ts.net` and `nslookup docs.tail8d86e.ts.net` should both fail 2. Bring up new: - `mise run provision-indri -- --tags cv,docs,caddy --check --diff` (already validated on branch) - `mise run provision-indri -- --tags cv,docs,caddy` - `fly ssh console -a blumeops-proxy -C "sh -c 'rm -rf /tmp/cache && nginx -s reload'"` 3. Verify: `mise run services-check` and the curl checks listed in `docs/how-to/operations/{cv,docs}-on-indri.md` 4. Cleanup commit + merge. Total expected downtime: minutes (not the few-hour budget you authorized). ## Test plan - [ ] `mise run provision-indri -- --tags cv,docs --check --diff` clean - [ ] `mise run provision-indri -- --tags caddy --check --diff` shows only the cv + docs blocks changing as previewed in the PR thread - [ ] After cutover: `cv.eblu.me`, `cv.ops.eblu.me`, `docs.eblu.me`, `docs.ops.eblu.me` all return 200 - [ ] `cv.eblu.me/resume.pdf` includes `Content-Disposition: attachment` - [ ] A clean Quartz URL (e.g. `docs.eblu.me/explanation/agent-change-process`) resolves to the right page - [ ] `mise run services-check` clean - [ ] `mise run service-review --type ansible` shows cv and docs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #342	2026-04-29 14:55:11 -07:00
Erich Blume	14ca0160ba	Migrate devpi from minikube to indri (launchd) (#341 ) ## Summary Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent. ## What changed - New ansible role `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box). - Caddy: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`. - Playbook: `indri.yml` gains pre_tasks for the root password and the new role. - service-versions.yaml: devpi flipped from `type: argocd` to `type: ansible`. - ArgoCD: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted. - Docs: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list. ## Already deployed (live on indri) - Service running: `launchctl list mcquack.eblume.devpi` → PID 53888 - `curl https://pypi.ops.eblu.me/+api` returns 200 ✅ - `mise run docs-mikado` works again ✅ - 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/` - Minikube namespace and PVC fully reclaimed ## Test plan - [ ] `mise run services-check` (after merge) - [ ] CI workflows that use devpi succeed - [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines) ## Context This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain. Reviewed-on: #341	2026-04-29 13:38:36 -07:00
Erich Blume	a87c997ee1	Expose Forgejo publicly at forge.eblu.me (#278 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m28s Details ## Summary Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service. - Forgejo hardening: Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO) - Tailscale Ingress: ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint - Fly.io proxy: nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit - Authentik: OAuth callback updated to forge.eblu.me - DNS/TLS: CNAME record in Pulumi, cert in fly-setup - Rename: ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is) ## Deployment Order 1. `mise run provision-indri -- --tags forgejo` (config changes) 2. Verify forge.ops.eblu.me still works 3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator` 4. Verify `curl https://forge.tail8d86e.ts.net` 5. `cd fly && fly deploy` 6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/` 7. `fly certs add forge.eblu.me -a blumeops-proxy` 8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik` 9. `mise run dns-preview && mise run dns-up` 10. Full verification (see below) 11. Rehearse `mise run fly-shutoff` 12. After merge: reset ArgoCD revisions to main, re-sync ## Verification Checklist - [ ] forge.eblu.me loads, shows public repos - [ ] forge.ops.eblu.me still works from tailnet - [ ] SSH clone via forge.ops.eblu.me:2222 works - [ ] HTTPS clone via forge.eblu.me works - [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH - [ ] /swagger returns 403 - [ ] Rapid login attempts trigger 429 rate limit - [ ] fail2ban bans after 5 failed logins in 10 minutes - [ ] ArgoCD can still sync (SSH unaffected) - [ ] `mise run fly-shutoff` stops all public traffic - [ ] `mise run services-check` passes Reviewed-on: #278	2026-03-03 08:40:41 -08:00
Erich Blume	2c081eed28	Add Forgejo repository health metrics and Grafana dashboard (#245 ) ## Summary - New `forgejo_metrics` Ansible role that queries the Forgejo REST API every 60s and writes Prometheus textfile metrics (open PRs, issues, languages, releases, commits, Actions runs/duration/success) - Grafana dashboard "Forgejo Repository Health" with 12 panels across 4 rows: overview stats, CI/CD health, repository info, and staleness tracking - Deletes superseded `forgejo-actions-dashboard` plan doc (this implementation covers a broader scope) ## Deployment and Testing - [ ] `mise run provision-indri -- --tags forgejo_metrics` to deploy the collector - [ ] `ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/forgejo.prom'` to verify metrics - [ ] `argocd app sync grafana-config` to deploy the dashboard - [ ] Check Grafana dashboard "Forgejo Repository Health" loads with data - [ ] `mise run services-check` passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/245	2026-02-22 11:16:03 -08:00
Erich Blume	07fb48626d	Add Authentik SSO integration for Jellyfin (#239 ) ## Summary - Add Authentik OIDC provider + application for Jellyfin via blueprint (all authenticated users allowed, no policy binding) - Wire `jellyfin-client-secret` through ExternalSecret and Authentik worker deployment - Install [jellyfin-plugin-sso](https://github.com/9p4/jellyfin-plugin-sso) v4.0.0.3 via Ansible, with OIDC config template - Authentik `admins` group maps to Jellyfin administrator role - Local login left enabled; SSO is additive ## Deployment and Testing - [ ] Sync ArgoCD `authentik` app on branch — verify provider + application appear in Authentik admin - [ ] `mise run provision-indri -- --tags jellyfin --check --diff` (dry run) - [ ] `mise run provision-indri -- --tags jellyfin` (deploy plugin + config) - [ ] Test SSO flow: `https://jellyfin.ops.eblu.me/sso/OID/start/authentik` - [ ] Verify `eblume` account auto-links via `preferred_username` match - [ ] Verify admins group → Jellyfin admin - [ ] Reset ArgoCD app revision to main after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/239	2026-02-21 20:05:44 -08:00
Erich Blume	ff63679efb	Enable zot registry auth + wire CI credentials (#237 ) ## Summary - Enable OIDC + API key authentication on zot registry with three-tier accessControl - `anonymousPolicy: ["read"]` — anyone can pull - `artifact-workloads` group: `["read", "create"]` — CI push, no overwrite/delete - `admins` group: `["read", "create", "update", "delete"]` — break-glass - Wire both CI push paths (Dagger and Nix/skopeo) with `ZOT_CI_API_KEY` credentials - Add `artifact-workloads` PolicyBinding in Authentik blueprint for zot app access - Add `ZOT_CI_API_KEY` to Forgejo Actions secrets via existing ansible role Completes the `wire-ci-registry-auth` and `harden-zot-registry` Mikado cards. ## Manual Deployment Steps (after merge) 1. Deploy Authentik blueprint: `argocd app sync authentik` 2. In Authentik admin UI: set a password for the `zot-ci` service account 3. Deploy zot config: `mise run provision-indri -- --tags zot` 4. Log in to `https://registry.ops.eblu.me` as `zot-ci` via OIDC → generate API key 5. Store API key in 1Password as `zot-ci-apikey` in blumeops vault 6. Sync Forgejo secrets: `mise run provision-indri -- --tags forgejo_actions_secrets` 7. Trigger a test container build to verify CI push 8. Verify anonymous pull: `curl -sf https://registry.ops.eblu.me/v2/_catalog` ## Uncertainties - Zot `accessControl` group matching with OIDC: Groups from Authentik's `profile` scope claim should map to zot policy groups, but the exact claim-to-group matching needs runtime verification - `http.auth.apikey: true`: This config key is documented but needs verification against the specific zot version built from source on indri - API key permissions: Need to confirm zot API keys inherit the generating user's group for accessControl evaluation ## Test Plan - [ ] `mise run provision-indri -- --check --diff --tags zot` shows expected config changes - [ ] Anonymous pull works after deploy - [ ] Unauthenticated push fails (401) - [ ] OIDC browser login redirects to Authentik and back - [ ] API key push works after key generation - [ ] CI push succeeds with both Dagger and skopeo paths - [ ] `mise run services-check` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/237	2026-02-21 12:20:29 -08:00
Erich Blume	21b6533aea	Register Zot as OIDC client in Authentik (#236 ) ## Summary - Add Authentik blueprint (`zot.yaml`) with OAuth2 provider, application, `artifact-workloads` group, and `zot-ci` service account - Wire `zot-client-secret` through ExternalSecret → worker Deployment env var → blueprint `!Env` - Add Ansible pre_task to fetch OIDC secret from 1Password (item ID `oor7os5kapczgpbwv7obkca4y4`) - Add `oidc-credentials.json.j2` template and deploy task in zot role (with `when` guard) ## Manual Steps Required Before Deploy 1. Generate client secret: `openssl rand -hex 32` 2. Store in 1Password: add field `zot-client-secret` to "Authentik (blumeops)" item in vault `blumeops` ## What This Does NOT Do - Does NOT modify `config.json.j2` (that's the root goal `harden-zot-registry`) - Does NOT wire CI auth (that's `wire-ci-registry-auth`) - Does NOT set service account password or API keys (manual post-deploy) ## Verification After ArgoCD sync: - [ ] Authentik admin UI shows "Zot Registry" application - [ ] OIDC discovery at `https://authentik.ops.eblu.me/application/o/zot/.well-known/openid-configuration` returns valid JSON - [ ] Blueprint status is `successful` - [ ] `artifact-workloads` group exists with `zot-ci` service account 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/236	2026-02-21 08:45:06 -08:00
Erich Blume	16a4a9a616	Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216 ) ## Summary - Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired - Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s - Move ntfy ArgoCD app from indri minikube to ringtail k3s - Update Frigate docs to reflect detector removal and planned RTX 4080 migration - Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail ## Deployment After merge: 1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps: ``` argocd app sync apps --prune ``` 2. Sync new ringtail apps: ``` argocd app sync mosquitto-ringtail argocd app sync ntfy-ringtail ``` 3. Manually clean up the detector LaunchAgent on indri: ``` ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ``` ## Notes - Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns) - ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup - Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed - Frigate + frigate-notify will be ported to ringtail in a follow-up PR 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216	2026-02-19 11:22:44 -08:00
Erich Blume	b475a1fcd7	Fix 1Password secret tasks always reporting changed in ringtail playbook (#213 ) ## Summary - Replace `changed_when: true` with `register` + output inspection on the two 1Password secret tasks in `ringtail.yml` - Tasks now correctly report `ok` when the secret content hasn't changed, and `changed` only when `kubectl apply` outputs `configured` or `created` ## Test plan - [ ] Run `mise run provision-ringtail` twice — second run should show both tasks as `ok` not `changed` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/213	2026-02-19 07:25:24 -08:00
Erich Blume	918df9e642	Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209 ) ## Summary Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions runner for building containers with Nix. ### K3s cluster - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint) - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`) - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access - `kubectl` added to system packages ### 1Password Connect + External Secrets Operator - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`, `external-secrets-config-ringtail` - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password` namespace in post_tasks ### Systemd Forgejo Actions runner - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env` - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access ### Nix container build workflow - New `.forgejo/workflows/build-container-nix.yaml` triggers on `-nix-v[0-9]` tags (e.g. `nettest-nix-v1.0.0`) - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy` - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering ### Mise task updates - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`) - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`) ## Post-merge 1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild 2. Register k3s cluster in ArgoCD (first time only): ```fish ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' \| \ sed 's\|127.0.0.1\|ringtail.tail8d86e.ts.net\|' > /tmp/k3s-ringtail.yaml set -x KUBECONFIG /tmp/k3s-ringtail.yaml argocd cluster add default --name k3s-ringtail 3. Sync ArgoCD apps in order: 1password-connect-ringtail -> external-secrets-crds-ringtail -> external-secrets-ringtail -> external-secrets-config-ringtail 4. Verify runner: ssh ringtail 'systemctl status gitea-runner-nix-container-builder' 5. Check Forgejo admin panel for ringtail-nix-builder runner online 6. Test: create containers/<name>/default.nix, tag with <name>-nix-v0.1.0 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209	2026-02-18 21:15:30 -08:00
Erich Blume	535f897054	Polish ringtail NixOS config and add documentation (#208 ) ## Summary - Fix Super+Return keybinding to launch wezterm in sway - Set fish as default login shell - Remove `initialPassword` (real password already set) - Add 1Password CLI + GUI, chezmoi, and dev tool packages (neovim, eza, fd, fzf, zoxide, starship, atuin, bat, ripgrep) - Add ringtail reference card, update host inventory and reference index - Changelog fragment ## Post-merge deployment - `mise run provision-ringtail` to rebuild NixOS - On ringtail: launch 1Password GUI, enable CLI integration (Settings > Developer > CLI integration) - Chezmoi needs `.chezmoiignore` updates in the dotfiles repo (separate task) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/208	2026-02-18 17:53:47 -08:00
Erich Blume	b76f2314c2	Add force: true to ringtail git task nixos-rebuild can dirty the tree (e.g. flake.lock updates), which blocks the Ansible git module. Force ensures we always reset to the upstream state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:32:23 -08:00
Erich Blume	b9d813cde1	Add NixOS configuration for ringtail workstation (#207 ) ## Summary - NixOS flake for ringtail (gaming/compute workstation, RTX 4080) in `nixos/ringtail/` - Declarative disk partitioning via disko (GPT, 512M EFI + ext4 root on NVMe) - NVIDIA proprietary drivers, sway/Wayland desktop, greetd, PipeWire, Steam - Tailscale integration for tailnet connectivity - Ansible playbook + `mise run provision-ringtail` for ongoing management - Pulumi auth key (`tag:homelab`, `tag:blumeops`) for tailnet bootstrap ## Deployment Order 1. Merge PR 2. `pulumi up` in tailscale stack → creates auth key 3. Retrieve auth key: `pulumi stack output ringtail_authkey --show-secrets` 4. On ringtail NixOS installer: - `nix run github:nix-community/disko -- --mode disko /tmp/disk-config.nix` (or from cloned repo) - `nixos-install --flake github:eblume/blumeops?dir=nixos/ringtail#ringtail` 5. Reboot, `tailscale up --auth-key=<key>` 6. Verify: `tailscale status`, SSH from gilbert ## Test plan - [ ] Review NixOS configuration for completeness - [ ] Verify disko partition layout matches ringtail hardware - [ ] Run `pulumi preview` for tailscale stack - [ ] Install NixOS on ringtail - [ ] Confirm tailscale connectivity - [ ] Confirm sway desktop works - [ ] Test `mise run provision-ringtail` for ongoing management 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/207	2026-02-18 08:24:25 -08:00
Erich Blume	5f9b024b4a	Add Apple Silicon ZMQ detector for Frigate (#206 ) ## Summary - New `frigate_detector` ansible role deploys the [apple-silicon-detector](https://github.com/frigate-nvr/apple-silicon-detector) as a LaunchAgent on indri - Switches Frigate from ONNX CPU detector (~117ms) to ZMQ detector backed by CoreML/Neural Engine (~15ms) - Removes detect FPS cap (no longer needed with fast inference) - Updates Frigate docs and adds changelog fragment ## Deployment ### Phase 1: Deploy detector on indri (one-time setup + ansible) ```fish ssh indri 'git clone https://github.com/frigate-nvr/apple-silicon-detector.git ~/code/3rd/apple-silicon-detector' ssh indri 'cd ~/code/3rd/apple-silicon-detector && make install' mise run provision-indri -- --tags frigate_detector --check --diff # dry run mise run provision-indri -- --tags frigate_detector # apply ssh indri 'launchctl list mcquack.eblume.frigate-detector' # verify running ssh indri 'tail ~/Library/Logs/mcquack.frigate-detector.out.log' # verify bound ``` ### Phase 2: Test connectivity ```fish kubectl --context=minikube-indri -n frigate exec deploy/frigate -- nc -vz host.minikube.internal 5555 ``` ### Phase 3: Deploy Frigate config (branch workflow) ```fish argocd app set frigate --revision feature/frigate-zmq-detector && argocd app sync frigate ``` ### Phase 4: Post-deploy checks - [ ] Pod starts, no config errors - [ ] `/api/stats` shows detector type zmq, inference_speed ~15ms - [ ] detect_fps uncapped - [ ] Recordings and MQTT events flowing - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/206	2026-02-17 19:03:28 -08:00
Erich Blume	d045a5d76a	Add BorgBase offsite backup repository (#142 ) ## Summary - Adds BorgBase as a second borgmatic repository for offsite backups (US region, append-only) - SSH key managed via 1Password, deployed to indri by Ansible - Borgmatic `ssh_command` configured to use the dedicated BorgBase key - BorgBase host key pinned in known_hosts via Ansible ## Post-merge deployment steps 1. Provision borgmatic: `mise run provision-indri -- --tags borgmatic` 2. Initialize the BorgBase repo: `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'` 3. Export and store the borg repokey: `ssh indri 'borg key export ssh://k04ljcd7@k04ljcd7.repo.borgbase.com/./repo'` → save to 1Password 4. Verify first backup: `ssh indri 'mise x -- borgmatic create --repository borgbase-offsite --verbosity 1'` ## BorgBase setup (already done) - Account created, API token in 1Password (`borgbase` item in blumeops vault) - SSH keypair generated, stored in 1Password, public key uploaded to BorgBase (ID: 200815) - Repository `indri-borgmatic` created (ID: k04ljcd7, US region, append-only, 2-day alert) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/142	2026-02-10 12:47:02 -08:00
Erich Blume	85e36cd807	Operations and observability for sifaka NAS (#135 ) ## Summary - Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring - Formalize existing `node_exporter` container under Ansible management - Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus - Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime) - Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS - Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles ## Prerequisites before deploy - [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP) - [ ] Verify `ssh eblume@sifaka 'docker ps'` works - [ ] Run `mise run provision-sifaka` to deploy containers - [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes - [ ] `argocd app sync prometheus` + `argocd app sync grafana-config` ## Test plan - [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics` - [ ] Verify Prometheus targets page shows both sifaka jobs as UP - [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135	2026-02-09 17:44:05 -08:00
Erich Blume	7f41621c7f	Migrate Ansible op calls to op read URI syntax (#125 ) ## Summary - Convert all 12 `op item get ... --fields ... --reveal` calls in Ansible to the newer `op read "op://vault/item/field"` syntax - Remove the `regex_replace` workaround on the Fly deploy token (no longer needed since `op read` returns clean unquoted values) - Covers `ansible/playbooks/indri.yml`, `ansible/roles/caddy/tasks/main.yml`, `ansible/roles/jellyfin_metrics/tasks/main.yml`, and `ansible/roles/alloy/tasks/main.yml` ## Test plan - [x] `mise run provision-indri -- --check --diff` dry run passes (ok=67, failed=0) - [x] No `op item get` calls remain in `ansible/` directory - [x] All pre-commit hooks pass (yaml, ansible-lint, TruffleHog, etc.) - [ ] Full provision run after merge to confirm secrets resolve correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/125	2026-02-08 10:52:43 -08:00
Erich Blume	6b1167a345	Fix Fly.io deploy token quoting (#121 ) ## Summary - Strip literal quotes from Fly.io deploy token when syncing to Forgejo Actions secrets - The `op` CLI wraps values containing spaces in quotes; the Fly.io token format is `FlyV1 <key>` (contains a space) - This caused the CI deploy workflow to fail with a 401 auth error ## Test plan - [x] `mise run provision-indri -- --tags forgejo_actions_secrets` succeeds - [ ] Re-trigger deploy-fly workflow after merge — should authenticate successfully 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/121	2026-02-08 02:43:45 -08:00
Erich Blume	64a78422b1	Add Fly.io public reverse proxy for docs.eblu.me (#120 ) Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 9s Details ## Summary - Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale - First service exposed: `docs.eblu.me` — the Quartz static docs site - Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME - Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow ## Key details - Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed - Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts - nginx caches aggressively for the static site; health check is on the default_server block - ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only - DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev` ## Test plan - [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok` - [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status` - [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert - [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected) - [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120	2026-02-08 02:36:19 -08:00
Erich Blume	74bd5abe54	Add IaC for Forgejo Actions secrets via Ansible (#107 ) ## Summary - New `forgejo_actions_secrets` Ansible role syncs repository-level Actions secrets from 1Password to Forgejo via the Forgejo API - Replaces manual process of copying secrets from 1Password to Forgejo UI - Documents the one-time PAT setup requirement in forgejo.md ## Manual Setup Required Before this role can run, a Forgejo PAT must be created: 1. Go to https://forge.ops.eblu.me/user/settings/applications 2. Create a new token with `write:repository` scope 3. Store it in 1Password → "Forgejo Secrets" item → `api-token` field This has already been done. ## Test Plan - [x] Ran `mise run provision-indri -- --tags forgejo_actions_secrets` successfully - [x] Verified secret synced (API returned 204 = updated existing) - [x] Ansible-lint passes 🤖 Generated with [Claude Code](https://claude.ai/code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/107	2026-02-04 09:11:01 -08:00
Erich Blume	b8b33b76c8	Remove Plex media server (#78 ) ## Summary - Remove plex_metrics ansible role - Remove Plex Grafana dashboard - Remove Plex log collection from Alloy config - Update indri-services-check to check Jellyfin instead of Plex ## Deployment and Testing - [x] Unloaded plex-metrics LaunchAgent on indri - [x] Deleted plex-metrics plist and script - [x] Deleted plex.prom textfile - [ ] Deploy Alloy config update - [ ] Sync grafana-config to remove dashboard 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/78	2026-01-30 17:06:00 -08:00
Erich Blume	bcc8685316	Add Jellyfin media server deployment (#77 ) ## Summary - Add Jellyfin ansible role for native macOS deployment via Homebrew cask - Add jellyfin_metrics role for Prometheus textfile metrics collection - Add Caddy routing for jellyfin.ops.eblu.me - Add Alloy log collection for Jellyfin stdout/stderr - Add Grafana dashboard for Jellyfin monitoring ## Architecture Jellyfin runs natively on indri (not in k8s) for full VideoToolbox hardware transcoding support. The M1 Mac Mini can handle ~3 concurrent 4K HDR→SDR transcoding streams. ## Deployment and Testing - [ ] Deploy Jellyfin: `mise run provision-indri -- --tags jellyfin,jellyfin_metrics,caddy,alloy` - [ ] Sync Grafana dashboard: `argocd app sync grafana-config` - [ ] Complete Jellyfin setup wizard at https://jellyfin.ops.eblu.me - [ ] Generate API key and save to `~/.jellyfin-api-key` - [ ] Add media libraries (/Volumes/allisonflix/Movies, /Volumes/allisonflix/TV) - [ ] Enable VideoToolbox hardware transcoding - [ ] Verify metrics in Grafana dashboard - [ ] Verify logs in Loki: `{service="jellyfin"}` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/77	2026-01-30 16:57:26 -08:00
Erich Blume	3971670832	Remove immich-sync ansible role (#65 ) ## Summary - Remove immich_sync ansible role (server-side photo sync via osxphotos) - The Immich iOS app has built-in automatic backup that replaces this functionality - iOS app supports foreground/background backup and can sync iCloud photos directly ## Deployment and Testing - [ ] Clean up files on indri (see manual cleanup commands below) - [ ] Configure Immich iOS app for automatic backup ### Manual cleanup on indri: ```bash # Unload and remove LaunchAgent launchctl unload ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist rm ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist # Remove script and credentials rm ~/bin/immich-sync.sh rm ~/.immich-api-key # Remove logs rm ~/Library/Logs/mcquack.immich-sync.*.log # Optionally remove export directory (check if empty first) ls ~/Pictures/immich-export # rm -r ~/Pictures/immich-export ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/65	2026-01-28 08:49:22 -08:00
Erich Blume	54945be0e3	Add immich-sync ansible role for photo library sync (#63 ) ## Summary - Add `immich_sync` role that syncs macOS Photos Library to Immich - Uses osxphotos to export photos with metadata to staging directory - Uses immich-cli (via Docker) to upload to Immich server - LaunchAgent schedules hourly syncs following mcquack pattern - API key fetched from 1Password in playbook pre_tasks ## Architecture ``` Photos Library → osxphotos export → ~/Pictures/immich-export/ → immich-cli upload → Immich ``` ## Prerequisites (manual) - Install osxphotos on indri: Add `"pipx:osxphotos" = "latest"` to `~/.config/mise/config.toml`, run `mise install` - Docker is already installed on indri ## Deployment and Testing - [ ] Dry run: `mise run provision-indri -- --tags immich_sync --check --diff` - [ ] Deploy: `mise run provision-indri -- --tags immich_sync` - [ ] Verify LaunchAgent: `ssh indri 'launchctl list \| grep immich'` - [ ] Test manual sync: `ssh indri '~/bin/immich-sync.sh'` - [ ] Check logs: `ssh indri 'tail -50 ~/Library/Logs/mcquack.immich-sync.out.log'` - [ ] Verify photos in Immich at https://photos.ops.eblu.me 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/63	2026-01-26 12:38:38 -08:00
Erich Blume	ea42362b6f	Migrate Forgejo runner to Kubernetes with DinD (#60 ) ## Summary - Deploy Forgejo runner to k8s with Docker-in-Docker sidecar - Add job execution image with Node.js and Docker CLI - Retire host-mode runner on indri - All CI jobs now run containerized in k8s ## Components Added - `containers/forgejo-runner/Dockerfile` - Job execution image - `argocd/apps/forgejo-runner.yaml` - ArgoCD Application - `argocd/manifests/forgejo-runner/` - Kubernetes manifests ## Components Removed - `ansible/roles/forgejo_runner/` - No longer needed ## Changes to Existing Files - `.forgejo/workflows/build-container.yaml` - Use `k8s` runner with `DOCKER_HOST` env - `.github/actionlint.yaml` - Only `k8s` label now valid ## Deployment 1. Apply secret: `op inject -i argocd/manifests/forgejo-runner/secret.yaml.tpl \| kubectl --context=minikube-indri apply -f -` 2. Sync ArgoCD: `argocd app sync forgejo-runner` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/60	2026-01-25 19:56:17 -08:00
Erich Blume	d6e6b48f6a	Migrate registry to Caddy (registry.ops.eblu.me) (#58 ) ## Summary - Update all references from `registry.tail8d86e.ts.net` to `registry.ops.eblu.me` - Remove `tailscale_serve` ansible role (no longer needed - all services migrated to Caddy) - Update minikube containerd config for new registry URL - Update devpi manifest, CI actions, and mise tasks ## Deployment and Testing - [ ] Run `mise run provision-indri -- --check --diff` (dry run) - [ ] Run `mise run provision-indri -- --tags minikube` to update containerd config - [ ] Sync devpi ArgoCD app: `argocd app sync devpi` - [ ] Manually remove old Tailscale serve entry: `ssh indri 'tailscale serve --service=svc:registry off'` - [ ] Test registry access: `curl https://registry.ops.eblu.me/v2/_catalog` - [ ] Run `mise run indri-services-check` to verify all services healthy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/58	2026-01-25 12:06:15 -08:00
Erich Blume	682a68dc9c	Add Caddy reverse proxy for blumeops services (#55 ) ## Summary - Add Caddy ansible role following zot pattern (manual build, ansible deploy) - Caddy built with Gandi DNS plugin for ACME DNS-01 challenges - Gandi PAT fetched from 1Password and written to secured file on indri - Configure wildcard TLS for `*.ops.eblu.me` - Initial services: forge, registry (indri-local) - Uses port 8443 during testing to avoid Tailscale serve conflicts ## Build Instructions (already done) On indri: ```bash cd ~/code/3rd/caddy && mise run build ``` ## Deployment and Testing - [ ] Review Caddyfile configuration - [ ] Run `mise run provision-indri -- --tags caddy` to deploy - [ ] Test: `curl -v https://forge.ops.eblu.me:8443` (should get TLS cert) - [ ] Test: `curl -v https://registry.ops.eblu.me:8443/v2/` (should return `{}`) - [ ] Once verified, switch to port 443 and migrate services from Tailscale serve ## Files Changed - `ansible/playbooks/indri.yml` - Add pre_task for Gandi PAT, add caddy role - `ansible/roles/caddy/` - New role with Caddyfile and LaunchAgent templates 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/55	2026-01-25 09:35:06 -08:00
Erich Blume	8ca8798121	Switch to Buildah for container builds (#51 ) All checks were successful Test CI / test (push) Successful in 4s Details ## Summary - Replace Docker with Buildah for container image builds - No Docker socket required - buildah is daemonless - Cleaner security model (no privileged containers or socket mounting) - Remove Docker-related security context from deployment ## Changes - Update Dockerfile to install buildah/podman instead of docker-cli - Configure buildah storage with overlay driver and fuse-overlayfs - Update composite action to use `buildah bud` and `buildah push` - Add `imagePullPolicy: Always` to ensure fresh image pulls - Update test workflow to verify buildah/podman ## Testing - [ ] Runner pod starts successfully - [ ] Buildah is available in runner - [ ] Test workflow verifies buildah/podman versions - [ ] Container build workflow builds and pushes to zot 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/51	2026-01-24 13:30:26 -08:00
Erich Blume	7893c41020	Enable Forgejo Actions (Phase 1) (#48 ) All checks were successful Test CI / test (push) Successful in 0s Details ## Summary - Refactor Forgejo app.ini to be managed by ansible with secrets from 1Password - Enable Forgejo Actions in config (`[actions] ENABLED = true`) - Add `repo.actions` to DEFAULT_REPO_UNITS - Clean up unused MySQL database fields (we use SQLite) ## Phase 1 Progress This PR covers the first part of Phase 1 (ci-cd-bootstrap plan): - [x] Refactor app.ini to ansible template - [x] Store secrets in 1Password - [x] Enable Actions in config - [ ] Deploy config changes (pending review) - [ ] Create runner registration token - [ ] Deploy runner to k8s - [ ] Test with simple workflow ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags forgejo` to deploy - [ ] Verify Forgejo restarts correctly - [ ] Verify Actions tab appears in repo settings 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/48	2026-01-23 17:00:12 -08:00
Erich Blume	17023085cb	Migrate observability stack to Kubernetes (#42 ) Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack. Summary - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal) - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net) - Add ACL rule for port 9187 (CNPG metrics) - Delete obsolete ansible roles for prometheus and loki Changes - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints - pulumi/policy.hujson - ACL for port 9187 - Deleted ansible/roles/prometheus/ and ansible/roles/loki/ Deployment and Testing - Stop prometheus and loki on indri - Sync ArgoCD apps (apps, prometheus, loki, grafana) - Run mise run provision-indri -- --tags alloy - Verify Grafana dashboards show data 🤖 Generated with https://claude.ai/claude-code Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42	2026-01-22 12:06:02 -08:00
Erich Blume	7ec98210a9	P6: Migrate Kiwix and Transmission to Kubernetes (#39 ) ## Summary - Add Transmission BitTorrent daemon to k8s (torrent namespace) - Add Kiwix ZIM archive server to k8s (kiwix namespace) - NFS storage from sifaka for shared torrent/ZIM data - Torrent-sync sidecar in kiwix deployment to manage declarative ZIM list - ZIM-watcher CronJob to auto-restart kiwix when new archives appear - Remove transmission, transmission_metrics, and kiwix ansible roles from indri - Remove svc:kiwix from tailscale_serve defaults ## Key Decisions - Direct NFS mount for kiwix (no PVC) since it shares storage with transmission - Shell wrapper for kiwix-serve command (glob expansion) - Accept HTTP 409 as "ready" in torrent sync (transmission session ID mechanism) - Completed downloads stored in `/downloads/complete/` on sifaka ## Deployment and Testing - [x] Deployed transmission to k8s - [x] Verified transmission web UI at torrent.tail8d86e.ts.net - [x] Moved existing ZIM files to complete folder - [x] Deployed kiwix to k8s - [x] Verified kiwix web UI at kiwix.tail8d86e.ts.net - [x] Stopped old services on indri - [x] Cleared svc:kiwix from Tailscale serve on indri - [x] Updated zk documentation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/39	2026-01-21 18:07:40 -08:00
Erich Blume	21848a7919	P5.1: Migrate minikube from podman to QEMU2 driver (#38 ) ## Summary - Migrate minikube from podman driver to qemu2 driver for proper NFS/SMB volume mount support - Update ansible minikube role with qemu installation and containerd runtime - Remove podman role dependency from indri.yml - Add synology user creation steps and post-migration zot reconfiguration notes ## Why Phase 6 (Kiwix/Transmission migration) was blocked because the podman driver lacks kernel capabilities for filesystem mounts. QEMU2 creates an actual VM with full mount support. ## Deployment and Testing - [ ] Create k8s-storage user on Synology DSM - [ ] Store credentials in 1Password (synology-k8s-storage) - [ ] Export current k8s state - [ ] Stop and delete podman-based minikube cluster - [ ] Run ansible to create QEMU2 cluster - [ ] Test NFS volume mount with test pod - [ ] Redeploy ArgoCD and all apps - [ ] Verify all services healthy - [ ] Reconfigure zot registry mirrors for containerd (post-migration) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/38	2026-01-21 16:03:37 -08:00
Erich Blume	0439fbb704	P5: Migrate devpi to Kubernetes (#34 ) ## Summary - Migrate devpi PyPI caching proxy from indri LaunchAgent to Kubernetes - Custom container image with devpi-server + devpi-web + auto-init - StatefulSet with 50Gi PVC, Tailscale Ingress at pypi.tail8d86e.ts.net - Remove devpi from ansible playbooks and update CLAUDE.md with k8s workflow ## Key Changes - Add CRI-O registry mirror config for registry.tail8d86e.ts.net - Change ArgoCD apps to manual sync (was auto-sync causing issues) - 2Gi memory limit for Whoosh indexer (reclaimed after startup) ## Deployment and Testing - [x] devpi pod healthy in k8s - [x] pip install through proxy works - [x] mcquack 1.0.0 uploaded and installable - [x] Old devpi stopped on indri ## Post-Merge Reset ArgoCD to main: ``` argocd app set apps --revision main && argocd app sync apps argocd app set devpi --revision main && argocd app sync devpi ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/34	2026-01-20 14:55:37 -08:00
Erich Blume	735b643429	P4: Miniflux migration + PostgreSQL consolidation (#33 ) ## Summary - Deploy miniflux in k8s via ArgoCD - Expose via Tailscale Ingress at feed.tail8d86e.ts.net - Retire brew PostgreSQL (no longer needed) - Rename k8s-pg to pg (canonical hostname) - Remove ansible miniflux and postgresql roles - Update borgmatic to backup pg.tail8d86e.ts.net - Update all zk documentation ## Deployment and Testing - [x] Miniflux pod running in k8s - [x] User login works at https://feed.tail8d86e.ts.net - [x] Feeds and entries visible - [x] brew miniflux and postgresql stopped - [x] Tailscale services migrated (feed, pg) - [x] zk documentation updated - [x] Run ansible to apply role removals - [ ] Verify borgmatic backup with new pg hostname 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/33	2026-01-20 09:04:47 -08:00
Erich Blume	7e6742ad24	K8s Migration Phase 2: Grafana to Kubernetes (#30 ) ## Summary - Migrate Grafana from Homebrew/Ansible to Kubernetes deployment - Switch CloudNativePG to use forge-mirrored Helm chart (HTTPS, no auth needed) - Add Grafana Helm chart deployment via ArgoCD with multi-source pattern - Add Grafana config (Tailscale Ingress, 9 dashboard ConfigMaps) - Update Loki to bind 0.0.0.0 for k8s pod access via `host.containers.internal` ## Key Changes - `argocd/apps/grafana.yaml` - Grafana Helm chart Application - `argocd/apps/grafana-config.yaml` - Ingress + dashboard ConfigMaps - `argocd/apps/cloudnative-pg.yaml` - Now uses forge mirror instead of external Helm repo - `ansible/roles/loki/templates/loki-config.yaml.j2` - Bind 0.0.0.0 ## Deployment and Testing - [x] Deploy Loki config change: `mise run provision-indri -- --tags loki` - [x] Create namespace: `ki create namespace monitoring` - [x] Create secret: `op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl \| ki apply -f -` - [x] Sync ArgoCD apps (grafana, grafana-config) - [x] Verify Grafana works at https://grafana.tail8d86e.ts.net - [x] Remove svc:grafana from ansible tailscale_serve - [x] Stop brew grafana: `ssh indri 'brew services stop grafana'` - [x] Delete ansible grafana role 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/30	2026-01-19 14:40:25 -08:00
Erich Blume	19a82373d5	K8s Migration Phase 0: Foundation Infrastructure (#26 ) ## Summary - Step 0.1: Update Pulumi ACLs with tag:registry - Step 0.3: Create Zot registry ansible role with mcquack LaunchAgent - Step 0.4: Add Zot to Tailscale Serve configuration - Step 0.5: Create Zot metrics role for Prometheus scraping - Step 0.6: Add Zot log collection to Alloy - Step 0.7: Update indri-services-check with zot checks - Step 0.8: Add podman role for container runtime - Step 0.9: Add minikube role for Kubernetes cluster - Step 0.10: Configure remote kubectl access with 1Password credentials ## Remaining Steps - [ ] Step 0.11: Add minikube to indri-services-check - [ ] Step 0.12: Create zettelkasten documentation - [ ] Step 0.13: Verify main playbook (already done - roles added) ## Deployment and Testing - [x] Zot registry deployed and accessible at https://registry.tail8d86e.ts.net - [x] Podman machine running on indri - [x] Minikube cluster running on indri - [x] kubectl access from gilbert working with 1Password credentials - [ ] indri-services-check passes all checks 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/26	2026-01-18 12:06:28 -08:00
Erich Blume	3962e5a7de	Fix borgmatic PostgreSQL backup and update backup sources (#21 ) ## Summary - Fix PostgreSQL backup failure by adding explicit `pg_dump_command` path (was failing with "pg_dump: command not found" in LaunchAgent) - Remove `~/code/3rd/kiwix-tools` from backups (was just symlinks to ZIM archives in transmission) - Enable Loki log backup by removing from exclude_patterns ## Deployment and Testing - [x] Dry run with `--check --diff` shows expected changes - [ ] Deploy with `mise run provision-indri -- --tags borgmatic` - [ ] Verify config deployed: `ssh indri 'cat ~/.config/borgmatic/config.yaml'` - [ ] Run manual backup to test: `ssh indri 'mise x -- borgmatic create --verbosity 1'` - [ ] Verify PostgreSQL dump succeeds (no "pg_dump: command not found" error) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/21	2026-01-17 09:22:01 -08:00
Erich Blume	9931829d03	Add pre-commit hooks for code quality (#19 ) ## Summary - Add pre-commit framework with hooks for YAML, Ansible, Python, shell, TOML, JSON, and secret detection - Fix all 91+ ansible-lint violations (variable naming, handler capitalization, changed_when) - Fix shellcheck warnings in mise-tasks scripts - Document pre-commit setup in README.md ## Deployment and Testing - [x] All pre-commit hooks pass (`uvx pre-commit run --all-files`) - [x] Test ansible playbook with `--check` mode - [x] Run `mise run indri-services-check` after deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/19	2026-01-16 19:33:02 -08:00
Erich Blume	812b78bf61	Use explicit PostgreSQL superuser name and fix check mode (#17 ) ## Summary - Add `postgresql_superuser` variable (`eblume`) to prevent PostgreSQL from inheriting OS username during initdb - Update all psql/createdb commands to use explicit `-U` flag - Add `check_mode: false` to op commands so 1Password fetches run during `--check` mode - Add PostgreSQL and Miniflux health checks to indri-services-check ## Test plan - [x] Renamed existing superuser from `erichblume` to `eblume` - [x] Ran `mise run provision-indri -- --tags postgresql --check --diff` successfully - [x] Verified connection as `eblume` superuser via Tailscale - [x] Ran `mise run indri-services-check` - all services healthy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/17	2026-01-16 14:41:36 -08:00
Erich Blume	adf6f4fbe9	Add PostgreSQL and Miniflux services to tailnet (#16 ) ## Summary - Add PostgreSQL 18 as a new service at `pg.tail8d86e.ts.net:5432` - Add Miniflux RSS/Atom feed reader at `feed.tail8d86e.ts.net` - Both services managed via homebrew/brew services - Pulumi ACL tags added (tag:pg, tag:feed) - Alloy log collection configured for both services - Zettelkasten documentation updated ## Manual Setup Required Before running ansible, the following steps are needed on indri: ### 1. Apply Pulumi tags ```bash mise run tailnet-up ``` Then apply tags to indri in Tailscale admin console. ### 2. Create 1Password entries - miniflux PostgreSQL user password - miniflux admin password (for first run) ### 3. Set PostgreSQL user password (after ansible installs postgres) ```bash ssh indri '/opt/homebrew/opt/postgresql@18/bin/psql -c "ALTER USER miniflux PASSWORD '\''your-password'\'';"' ``` ### 4. Create password files on indri ```bash ssh indri 'echo "your-db-password" > ~/.miniflux-db-password && chmod 600 ~/.miniflux-db-password' ssh indri 'echo "your-admin-password" > ~/.miniflux-admin-password && chmod 600 ~/.miniflux-admin-password' ``` ### 5. Create ~/.pgpass for borgmatic ```bash ssh indri 'echo "localhost:5432:miniflux:miniflux:YOUR_PASSWORD" > ~/.pgpass && chmod 600 ~/.pgpass' ``` ### 6. Run ansible with first-run admin creation ```bash mise run provision-indri -- -e miniflux_create_admin=1 ``` ### 7. Update borgmatic config Add to `~/.config/borgmatic/config.yaml` on indri: ```yaml postgresql_databases: - name: miniflux hostname: localhost port: 5432 username: miniflux ``` ### 8. Cleanup after first run ```bash ssh indri 'rm ~/.miniflux-admin-password' ``` ## Test plan - [ ] Run `mise run tailnet-up` and verify Pulumi changes - [ ] Apply tags to indri in Tailscale admin - [ ] Run `mise run provision-indri -- --check --diff` for dry run - [ ] Run `mise run provision-indri -- -e miniflux_create_admin=1` - [ ] Approve services in Tailscale admin - [ ] Verify PostgreSQL: `ssh indri '/opt/homebrew/opt/postgresql@18/bin/pg_isready'` - [ ] Verify Miniflux: `curl https://feed.tail8d86e.ts.net/healthcheck` - [ ] Run `mise run indri-services-check` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/16	2026-01-16 12:30:20 -08:00
Erich Blume	3f4e40f3ae	Add Pulumi for tailnet IaC management (#15 ) ## Summary - Manage tail8d86e.ts.net ACLs, tags, and DNS via Pulumi + Python - State stored in Pulumi Cloud (free tier) to avoid circular dependency - OAuth authentication via 1Password for secure credential management - New mise tasks: `tailnet-preview`, `tailnet-up` ## Architecture Two-layer approach: - Layer 1 (Pulumi): Tailnet-wide config (ACLs, tags, DNS) - Layer 2 (Ansible): Node-local `tailscale serve` config (unchanged) ## Test plan - [x] Exported current ACL from Tailscale API - [x] Imported existing ACL into Pulumi state - [x] Verified `mise run tailnet-preview` shows no changes - [x] Verified `mise run tailnet-up` applies successfully 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/15	2026-01-15 20:55:25 -08:00
Erich Blume	ae1513e7e9	Add Plex Media Server observability (#13 ) ## Summary - Add `plex_metrics` ansible role with textfile collector for Prometheus metrics - Add Plex log collection to Alloy (forwards to Loki) - Add Grafana dashboard for Plex monitoring (status, library counts, sessions, transcoding, logs) ## Metrics Collected - `plex_up` - server health - `plex_version_info` - server version - `plex_sessions_total/playing/paused` - active sessions - `plex_transcode_sessions_total/video/audio` - transcoding status - `plex_library_items{library,type}` - library item counts ## Prerequisites Plex token must be stored at `~/.plex-token` on indri (already done). ## Test plan - [x] Dry-run passed (`mise run provision-indri -- --check --diff`) - [ ] Apply changes (`mise run provision-indri`) - [ ] Verify metrics: `ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/plex.prom'` - [ ] Verify logs in Grafana Explore: `{service="plex"}` - [ ] Check Plex dashboard in Grafana 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/13	2026-01-15 15:27:59 -08:00
Erich Blume	242c1880de	Add Grafana Alloy and Loki for unified observability (#11 ) ## Summary - Add Grafana Alloy to replace node_exporter for metrics collection - Add Loki for log aggregation and storage - Configure Alloy to collect logs from all services (grafana, forgejo, prometheus, tailscale, transmission, devpi, kiwix, borgmatic) - Update Prometheus to accept metrics via remote_write - Add Loki datasource to Grafana ## Test plan - [ ] Run \`mise run provision-indri -- --check --diff\` to verify changes - [ ] Apply with \`mise run provision-indri\` - [ ] Verify services: \`mise run indri-services-check\` - [ ] Check Grafana Explore with Loki datasource - [ ] Query logs: \`{service="grafana"}\` - [ ] Verify metrics still flowing to Prometheus dashboards 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/11	2026-01-15 12:24:13 -08:00
Erich Blume	d8a0ef6482	Add devpi PyPI caching proxy role for indri (#9 ) ## Summary - Add ansible role for devpi-server as a transparent PyPI caching proxy - LaunchAgent with KeepAlive runs via `mise x -- devpi-server` - Listens on port 3141, data stored in `~/devpi` - Health checks added to `indri-services-check` script ## Manual Setup Required (on indri, before provisioning) 1. Add to `~/.config/mise/config.toml`: ```toml [tools] "pipx:devpi-server" = "latest" "pipx:devpi-web" = "latest" "pipx:devpi-client" = "latest" ``` 2. Run `mise install` 3. Initialize: `mise x -- devpi-init --serverdir ~/devpi` ## Post-Provisioning - Set up Tailscale service `pypi` on port 443 → 3141 - Configure client pip.conf with index-url ## Test plan - [x] Ansible syntax check passes - [x] Dry-run: `mise run provision-indri -- --check --diff` - [x] Apply: `mise run provision-indri` - [x] Health check: `mise run indri-services-check` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/9	2026-01-15 08:31:09 -08:00
Erich Blume	7468023cd2	Add transmission dashboard to grafana - Add node_exporter ansible role to enable textfile collector - Add transmission_metrics role with script and LaunchAgent - Collects metrics every 60s via transmission RPC - Writes to /opt/homebrew/var/node_exporter/textfile/transmission.prom - Update grafana role to provision dashboards from files - Add transmission.json dashboard with: - Status indicator, torrent counts - Transfer speeds, cumulative stats - Time series graphs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-14 13:46:51 -08:00
Erich Blume	b865d70456	Add transmission role for torrent-based ZIM downloads - Add new transmission ansible role using homebrew + brew services - Configure transmission to download to ~/transmission with localhost-only RPC - Modify kiwix role to use transmission for downloading ZIM archives via BitTorrent - Add role dependency so running --tags kiwix auto-runs transmission - Keep fallback to direct HTTP download when kiwix_use_transmission: false - Symlink completed downloads from transmission dir to kiwix-tools dir This reduces load on kiwix.org servers and allows downloads to continue in the background without blocking ansible runs. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-14 11:59:24 -08:00
Erich Blume	4b365a1294	Fix borgmatic LaunchAgent to work with mise-installed binaries The LaunchAgent was failing because launchd runs with a minimal PATH that doesn't include mise-installed binaries or homebrew. This adds: - Use `mise x` wrapper to run borgmatic (survives version updates) - Add /opt/homebrew/bin to PATH for borg dependency - Add ansible tags to indri playbook for targeted role runs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-14 10:33:48 -08:00
Erich Blume	d1396b1cfb	Add forgejo role to ansible playbook Manages installation and service via homebrew. Config at /opt/homebrew/var/forgejo/custom/conf/app.ini contains secrets and is not templated - backed up by borgmatic instead. Includes check that fails with restore instructions if config missing. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-13 23:00:46 -08:00

1 2

53 commits