blumeops

Author	SHA1	Message	Date
Erich Blume	21b6533aea	Register Zot as OIDC client in Authentik (#236 ) ## Summary - Add Authentik blueprint (`zot.yaml`) with OAuth2 provider, application, `artifact-workloads` group, and `zot-ci` service account - Wire `zot-client-secret` through ExternalSecret → worker Deployment env var → blueprint `!Env` - Add Ansible pre_task to fetch OIDC secret from 1Password (item ID `oor7os5kapczgpbwv7obkca4y4`) - Add `oidc-credentials.json.j2` template and deploy task in zot role (with `when` guard) ## Manual Steps Required Before Deploy 1. Generate client secret: `openssl rand -hex 32` 2. Store in 1Password: add field `zot-client-secret` to "Authentik (blumeops)" item in vault `blumeops` ## What This Does NOT Do - Does NOT modify `config.json.j2` (that's the root goal `harden-zot-registry`) - Does NOT wire CI auth (that's `wire-ci-registry-auth`) - Does NOT set service account password or API keys (manual post-deploy) ## Verification After ArgoCD sync: - [ ] Authentik admin UI shows "Zot Registry" application - [ ] OIDC discovery at `https://authentik.ops.eblu.me/application/o/zot/.well-known/openid-configuration` returns valid JSON - [ ] Blueprint status is `successful` - [ ] `artifact-workloads` group exists with `zot-ci` service account 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/236	2026-02-21 08:45:06 -08:00
Erich Blume	cd50c1454a	Integrate Forgejo with Authentik OIDC (#228 ) ## Summary - Refactor Authentik blueprints: extract shared `admins` group into `common.yaml`, add `groups` scope mapping to all providers for group-based admin propagation - Add Forgejo OAuth2 provider and application blueprint (`forgejo.yaml`) - Add `forgejo-client-secret` to ExternalSecret and worker deployment env - Configure Forgejo `[oauth2_client]` with `ACCOUNT_LINKING=login` to safely link existing accounts - Update documentation (forgejo.md, authentik.md, federated-login.md) ## Deployment and Testing After merge, deployment requires these steps in order: 1. Authentik (ArgoCD): - `argocd app set authentik --revision feature/forgejo-authentik-oidc && argocd app sync authentik` - Verify: Forgejo app/provider visible in Authentik admin UI - Verify: Grafana SSO still works (blueprint refactor) 2. Forgejo app.ini (Ansible): - `mise run provision-indri -- --tags forgejo --check --diff` (dry run) - `mise run provision-indri -- --tags forgejo` (apply, restarts Forgejo) 3. Create Forgejo auth source (CLI on indri): ``` ssh indri 'sudo -u forgejo /opt/homebrew/bin/forgejo admin auth add-oauth \ --name authentik \ --provider openidConnect \ --key forgejo \ --secret "$(op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/Authentik (blumeops)/forgejo-client-secret")" \ --auto-discover-url https://authentik.ops.eblu.me/application/o/forgejo/.well-known/openid-configuration \ --scopes "openid email profile groups" \ --group-claim-name groups \ --admin-group admins' ``` 4. Link eblume account: Sign in with Authentik on Forgejo, confirm link with local password 5. Verify: `tea repo list`, Forgejo Actions, local password break-glass After merge: `argocd app set authentik --revision main && argocd app sync authentik` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/228	2026-02-20 17:39:50 -08:00
Erich Blume	71cb256527	Deploy Authentik identity provider (C2 Mikado) (#227 ) ## Summary C2 Mikado chain for deploying Authentik as the SSO identity provider, replacing Dex. This PR will evolve over multiple sessions. Each iteration adds documentation (prerequisite cards) and eventually code as leaf nodes are resolved. ## Current Mikado State - Goal: `deploy-authentik` (active) - Leaf prerequisites: - `build-authentik-container` — Build Nix container image - `provision-authentik-database` — Create PostgreSQL database on CNPG cluster - `create-authentik-secrets` — Create 1Password item with credentials ## Process refinements - Updated agent-change-process with lessons from first attempt: reset code before committing cards, open PRs early ## Test plan - [ ] `mise run docs-mikado` shows correct dependency chain - [ ] Leaf nodes can be worked independently - [ ] Container builds on ringtail - [ ] Authentik starts and reaches healthy state - [ ] Forgejo OAuth2 connector works Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/227	2026-02-20 12:55:59 -08:00
Erich Blume	0cdc143227	Deploy Dex OIDC identity provider with Grafana SSO (#222 ) ## Summary - Deploys Dex OIDC identity provider on ringtail k3s cluster as central authentication service - Integrates Grafana as first SSO client via `auth.generic_oauth` - Uses Kubernetes CRD storage backend (no PVC needed) - All secrets (bcrypt hash, client secrets) injected via ExternalSecrets from 1Password item "Dex (blumeops)" - NixOS-built container image via `containers/dex/default.nix` ## Pre-requisites (manual, before deployment) 1. Create 1Password item "Dex (blumeops)" in `blumeops` vault with fields: - `password`: strong generated password for Dex login - `static-password-hash`: bcrypt hash of above (`htpasswd -BnC 10 eblume`, copy hash after `eblume:`) - `grafana-client-secret`: random 32-char hex (`openssl rand -hex 16`) 2. Build container: `mise run container-tag-and-release dex v1.0.0` ## Deployment sequence 1. Build container: `mise run container-tag-and-release dex v1.0.0` 2. Deploy Caddy: `mise run provision-indri -- --tags caddy` 3. Sync ArgoCD: `argocd app sync apps` → `argocd app sync dex` 4. Verify Dex: `curl https://dex.ops.eblu.me/.well-known/openid-configuration` 5. Sync Grafana: `argocd app sync grafana-config` → `argocd app sync grafana` 6. Test SSO: Visit `https://grafana.ops.eblu.me/login`, click "Sign in with Dex" ## Verification - [ ] Container image exists: `mise run container-list` shows `dex:v1.0.0-nix` - [ ] `curl https://dex.ops.eblu.me/.well-known/openid-configuration` returns valid OIDC discovery - [ ] `curl https://dex.ops.eblu.me/healthz` returns healthy - [ ] Grafana login shows "Sign in with Dex" button alongside local login - [ ] OIDC flow: click Dex → enter credentials → redirect back → logged in as Admin - [ ] Break-glass: local admin login still works - [ ] `mise run services-check` passes ## Files changed \| File \| Action \| Purpose \| \|------\|--------\|---------\| \| `containers/dex/default.nix` \| Create \| NixOS container build \| \| `argocd/apps/dex.yaml` \| Create \| ArgoCD app targeting ringtail \| \| `argocd/manifests/dex/*` (8 files) \| Create \| K8s manifests (RBAC, ExternalSecret, Deployment, Service, Ingress) \| \| `argocd/manifests/grafana-config/external-secret-dex-oauth.yaml` \| Create \| Grafana OIDC client secret \| \| `argocd/manifests/grafana-config/kustomization.yaml` \| Modify \| Add new ExternalSecret resource \| \| `argocd/manifests/grafana/values.yaml` \| Modify \| Add `auth.generic_oauth` config + envFromSecrets \| \| `ansible/roles/caddy/defaults/main.yml` \| Modify \| Add `dex.ops.eblu.me` reverse proxy entry \| \| `docs/changelog.d/feature-dex-oidc.feature.md` \| Create \| Changelog fragment \| Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/222	2026-02-19 20:24:24 -08:00
Erich Blume	16a4a9a616	Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216 ) ## Summary - Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired - Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s - Move ntfy ArgoCD app from indri minikube to ringtail k3s - Update Frigate docs to reflect detector removal and planned RTX 4080 migration - Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail ## Deployment After merge: 1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps: ``` argocd app sync apps --prune ``` 2. Sync new ringtail apps: ``` argocd app sync mosquitto-ringtail argocd app sync ntfy-ringtail ``` 3. Manually clean up the detector LaunchAgent on indri: ``` ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ``` ## Notes - Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns) - ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup - Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed - Frigate + frigate-notify will be ported to ringtail in a follow-up PR 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216	2026-02-19 11:22:44 -08:00
Erich Blume	b475a1fcd7	Fix 1Password secret tasks always reporting changed in ringtail playbook (#213 ) ## Summary - Replace `changed_when: true` with `register` + output inspection on the two 1Password secret tasks in `ringtail.yml` - Tasks now correctly report `ok` when the secret content hasn't changed, and `changed` only when `kubectl apply` outputs `configured` or `created` ## Test plan - [ ] Run `mise run provision-ringtail` twice — second run should show both tasks as `ok` not `changed` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/213	2026-02-19 07:25:24 -08:00
Erich Blume	918df9e642	Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209 ) ## Summary Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions runner for building containers with Nix. ### K3s cluster - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint) - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`) - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access - `kubectl` added to system packages ### 1Password Connect + External Secrets Operator - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`, `external-secrets-config-ringtail` - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password` namespace in post_tasks ### Systemd Forgejo Actions runner - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env` - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access ### Nix container build workflow - New `.forgejo/workflows/build-container-nix.yaml` triggers on `-nix-v[0-9]` tags (e.g. `nettest-nix-v1.0.0`) - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy` - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering ### Mise task updates - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`) - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`) ## Post-merge 1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild 2. Register k3s cluster in ArgoCD (first time only): ```fish ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' \| \ sed 's\|127.0.0.1\|ringtail.tail8d86e.ts.net\|' > /tmp/k3s-ringtail.yaml set -x KUBECONFIG /tmp/k3s-ringtail.yaml argocd cluster add default --name k3s-ringtail 3. Sync ArgoCD apps in order: 1password-connect-ringtail -> external-secrets-crds-ringtail -> external-secrets-ringtail -> external-secrets-config-ringtail 4. Verify runner: ssh ringtail 'systemctl status gitea-runner-nix-container-builder' 5. Check Forgejo admin panel for ringtail-nix-builder runner online 6. Test: create containers/<name>/default.nix, tag with <name>-nix-v0.1.0 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209	2026-02-18 21:15:30 -08:00
Erich Blume	535f897054	Polish ringtail NixOS config and add documentation (#208 ) ## Summary - Fix Super+Return keybinding to launch wezterm in sway - Set fish as default login shell - Remove `initialPassword` (real password already set) - Add 1Password CLI + GUI, chezmoi, and dev tool packages (neovim, eza, fd, fzf, zoxide, starship, atuin, bat, ripgrep) - Add ringtail reference card, update host inventory and reference index - Changelog fragment ## Post-merge deployment - `mise run provision-ringtail` to rebuild NixOS - On ringtail: launch 1Password GUI, enable CLI integration (Settings > Developer > CLI integration) - Chezmoi needs `.chezmoiignore` updates in the dotfiles repo (separate task) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/208	2026-02-18 17:53:47 -08:00
Erich Blume	b76f2314c2	Add force: true to ringtail git task nixos-rebuild can dirty the tree (e.g. flake.lock updates), which blocks the Ansible git module. Force ensures we always reset to the upstream state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:32:23 -08:00
Erich Blume	b9d813cde1	Add NixOS configuration for ringtail workstation (#207 ) ## Summary - NixOS flake for ringtail (gaming/compute workstation, RTX 4080) in `nixos/ringtail/` - Declarative disk partitioning via disko (GPT, 512M EFI + ext4 root on NVMe) - NVIDIA proprietary drivers, sway/Wayland desktop, greetd, PipeWire, Steam - Tailscale integration for tailnet connectivity - Ansible playbook + `mise run provision-ringtail` for ongoing management - Pulumi auth key (`tag:homelab`, `tag:blumeops`) for tailnet bootstrap ## Deployment Order 1. Merge PR 2. `pulumi up` in tailscale stack → creates auth key 3. Retrieve auth key: `pulumi stack output ringtail_authkey --show-secrets` 4. On ringtail NixOS installer: - `nix run github:nix-community/disko -- --mode disko /tmp/disk-config.nix` (or from cloned repo) - `nixos-install --flake github:eblume/blumeops?dir=nixos/ringtail#ringtail` 5. Reboot, `tailscale up --auth-key=<key>` 6. Verify: `tailscale status`, SSH from gilbert ## Test plan - [ ] Review NixOS configuration for completeness - [ ] Verify disko partition layout matches ringtail hardware - [ ] Run `pulumi preview` for tailscale stack - [ ] Install NixOS on ringtail - [ ] Confirm tailscale connectivity - [ ] Confirm sway desktop works - [ ] Test `mise run provision-ringtail` for ongoing management 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/207	2026-02-18 08:24:25 -08:00
Erich Blume	5f9b024b4a	Add Apple Silicon ZMQ detector for Frigate (#206 ) ## Summary - New `frigate_detector` ansible role deploys the [apple-silicon-detector](https://github.com/frigate-nvr/apple-silicon-detector) as a LaunchAgent on indri - Switches Frigate from ONNX CPU detector (~117ms) to ZMQ detector backed by CoreML/Neural Engine (~15ms) - Removes detect FPS cap (no longer needed with fast inference) - Updates Frigate docs and adds changelog fragment ## Deployment ### Phase 1: Deploy detector on indri (one-time setup + ansible) ```fish ssh indri 'git clone https://github.com/frigate-nvr/apple-silicon-detector.git ~/code/3rd/apple-silicon-detector' ssh indri 'cd ~/code/3rd/apple-silicon-detector && make install' mise run provision-indri -- --tags frigate_detector --check --diff # dry run mise run provision-indri -- --tags frigate_detector # apply ssh indri 'launchctl list mcquack.eblume.frigate-detector' # verify running ssh indri 'tail ~/Library/Logs/mcquack.frigate-detector.out.log' # verify bound ``` ### Phase 2: Test connectivity ```fish kubectl --context=minikube-indri -n frigate exec deploy/frigate -- nc -vz host.minikube.internal 5555 ``` ### Phase 3: Deploy Frigate config (branch workflow) ```fish argocd app set frigate --revision feature/frigate-zmq-detector && argocd app sync frigate ``` ### Phase 4: Post-deploy checks - [ ] Pod starts, no config errors - [ ] `/api/stats` shows detector type zmq, inference_speed ~15ms - [ ] detect_fps uncapped - [ ] Recordings and MQTT events flowing - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/206	2026-02-17 19:03:28 -08:00
Erich Blume	04c7f3c45a	Deploy Frigate NVR stack with Mosquitto, Ntfy, and frigate-notify (#190 ) ## Summary Deploy a cloud-free NVR stack for the GableCam (ReoLink Elite Floodlight at 192.168.1.159): - Mosquitto — shared MQTT broker in `mqtt` namespace (cluster-internal, no auth) - Ntfy — self-hosted push notifications in `ntfy` namespace, exposed at `ntfy.tail8d86e.ts.net` / `ntfy.ops.eblu.me` - Frigate — NVR with GableCam via HTTP-FLV, ONNX CPU detection, NFS recordings on sifaka, exposed at `nvr.tail8d86e.ts.net` / `nvr.ops.eblu.me` - frigate-notify — bridges Frigate detection events (person, car, dog, cat) to Ntfy alerts via MQTT Also includes: - Prometheus scrape target for Frigate metrics - Grafana dashboard for Frigate (status, inference speed, FPS, CPU/memory, storage) - Caddy reverse proxy entries for `nvr.ops.eblu.me` and `ntfy.ops.eblu.me` ## Prerequisites - [ ] Create NFS share `frigate` on sifaka (`/volume1/frigate`, RW for indri) - [ ] Create 1Password item "Reolink Floodlight Camera" in `blumeops` vault with `username` and `password` fields ## Deployment (after merge) ```bash argocd app sync apps argocd app sync mosquitto argocd app sync ntfy argocd app sync frigate argocd app sync grafana-config argocd app sync prometheus mise run provision-indri -- --tags caddy mise run services-check ``` ## Verification - [ ] Mosquitto pod running, accepting connections on 1883 - [ ] Ntfy web UI accessible at `ntfy.ops.eblu.me` - [ ] Frigate web UI at `nvr.ops.eblu.me` showing GableCam live feed - [ ] Object detection working (ONNX, person/car/dog/cat) - [ ] Recordings appearing in NFS share on sifaka - [ ] frigate-notify sending detection alerts to Ntfy - [ ] Prometheus scraping Frigate metrics - [ ] Grafana dashboard showing Frigate data Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/190	2026-02-14 21:27:44 -08:00
Erich Blume	cd25dae8f9	Extend forgejo_actions_secrets role to support multiple repos Uses subelements loop to sync secrets across repos. Adds FORGE_TOKEN to the cv repo for package uploads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 11:15:28 -08:00
Erich Blume	01e19023ee	Add CV/resume web app at cv.ops.eblu.me (#169 ) ## Summary - nginx container (`containers/cv/`) downloads and serves a content tarball at startup (same pattern as quartz) - ArgoCD app + k8s manifests (deployment, service, Tailscale ingress) - Caddy route for `cv.ops.eblu.me` - Deploy workflow: resolves "latest" or specific version from Forgejo packages, updates deployment, syncs ArgoCD - Content is built and released from the separate [cv repo](https://forge.ops.eblu.me/eblume/cv) ## Deployment steps (after merge) 1. `mise run container-tag-and-release cv v1.0.0` 2. Run "Release CV" workflow in cv repo (SPECIFIC_VERSION `v0.1.0`) 3. Run "Deploy CV" workflow in blumeops (default: latest) 4. `mise run provision-indri -- --tags caddy` 5. Verify at `https://cv.ops.eblu.me/` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/169	2026-02-12 11:09:41 -08:00
Erich Blume	f65d11d55b	Update BorgBase repo ID after recreation (#144 ) ## Summary - Previous BorgBase repo (k04ljcd7) had corrupted segments from interrupted backup attempts - Recreated as u3ugi1x1 (same US region, same SSH key, same append-only settings) - Updates repo path in Ansible defaults and known_hosts hostname in tasks ## Post-merge 1. `mise run provision-indri -- --tags borgmatic` 2. `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'` 3. `mise x -- borgmatic create --repository borgbase-offsite --verbosity 1 --progress` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/144	2026-02-10 13:19:15 -08:00
Erich Blume	d045a5d76a	Add BorgBase offsite backup repository (#142 ) ## Summary - Adds BorgBase as a second borgmatic repository for offsite backups (US region, append-only) - SSH key managed via 1Password, deployed to indri by Ansible - Borgmatic `ssh_command` configured to use the dedicated BorgBase key - BorgBase host key pinned in known_hosts via Ansible ## Post-merge deployment steps 1. Provision borgmatic: `mise run provision-indri -- --tags borgmatic` 2. Initialize the BorgBase repo: `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'` 3. Export and store the borg repokey: `ssh indri 'borg key export ssh://k04ljcd7@k04ljcd7.repo.borgbase.com/./repo'` → save to 1Password 4. Verify first backup: `ssh indri 'mise x -- borgmatic create --repository borgbase-offsite --verbosity 1'` ## BorgBase setup (already done) - Account created, API token in 1Password (`borgbase` item in blumeops vault) - SSH keypair generated, stored in 1Password, public key uploaded to BorgBase (ID: 200815) - Repository `indri-borgmatic` created (ID: k04ljcd7, US region, append-only, 2-day alert) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/142	2026-02-10 12:47:02 -08:00
Erich Blume	d76d675b29	Fix minikube role skipping start when kubelet/apiserver are stopped (#137 ) ## Summary - After a power loss, minikube's Docker container (host) restarts but kubelet/apiserver remain stopped - The ansible role's status check used `--format='{{.Host}}'` which only examined the host VM state - When host=Running but kubelet/apiserver=Stopped, the role skipped `minikube start` - Fixed to use full `minikube status` exit code (returns non-zero when any component is unhealthy) - Simplified all downstream conditions to use exit code instead of string matching ## Test plan - [x] Verified the fix correctly skips `minikube start` when cluster is already fully running - [x] Pre-commit hooks pass (ansible-lint, yamllint, etc.) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/137	2026-02-09 23:03:01 -08:00
Erich Blume	85e36cd807	Operations and observability for sifaka NAS (#135 ) ## Summary - Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring - Formalize existing `node_exporter` container under Ansible management - Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus - Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime) - Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS - Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles ## Prerequisites before deploy - [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP) - [ ] Verify `ssh eblume@sifaka 'docker ps'` works - [ ] Run `mise run provision-sifaka` to deploy containers - [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes - [ ] `argocd app sync prometheus` + `argocd app sync grafana-config` ## Test plan - [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics` - [ ] Verify Prometheus targets page shows both sifaka jobs as UP - [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135	2026-02-09 17:44:05 -08:00
Erich Blume	7f41621c7f	Migrate Ansible op calls to op read URI syntax (#125 ) ## Summary - Convert all 12 `op item get ... --fields ... --reveal` calls in Ansible to the newer `op read "op://vault/item/field"` syntax - Remove the `regex_replace` workaround on the Fly deploy token (no longer needed since `op read` returns clean unquoted values) - Covers `ansible/playbooks/indri.yml`, `ansible/roles/caddy/tasks/main.yml`, `ansible/roles/jellyfin_metrics/tasks/main.yml`, and `ansible/roles/alloy/tasks/main.yml` ## Test plan - [x] `mise run provision-indri -- --check --diff` dry run passes (ok=67, failed=0) - [x] No `op item get` calls remain in `ansible/` directory - [x] All pre-commit hooks pass (yaml, ansible-lint, TruffleHog, etc.) - [ ] Full provision run after merge to confirm secrets resolve correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/125	2026-02-08 10:52:43 -08:00
Erich Blume	6b1167a345	Fix Fly.io deploy token quoting (#121 ) ## Summary - Strip literal quotes from Fly.io deploy token when syncing to Forgejo Actions secrets - The `op` CLI wraps values containing spaces in quotes; the Fly.io token format is `FlyV1 <key>` (contains a space) - This caused the CI deploy workflow to fail with a 401 auth error ## Test plan - [x] `mise run provision-indri -- --tags forgejo_actions_secrets` succeeds - [ ] Re-trigger deploy-fly workflow after merge — should authenticate successfully 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/121	2026-02-08 02:43:45 -08:00
Erich Blume	64a78422b1	Add Fly.io public reverse proxy for docs.eblu.me (#120 ) Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 9s Details ## Summary - Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale - First service exposed: `docs.eblu.me` — the Quartz static docs site - Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME - Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow ## Key details - Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed - Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts - nginx caches aggressively for the static site; health check is on the default_server block - ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only - DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev` ## Test plan - [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok` - [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status` - [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert - [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected) - [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120	2026-02-08 02:36:19 -08:00
Erich Blume	74bd5abe54	Add IaC for Forgejo Actions secrets via Ansible (#107 ) ## Summary - New `forgejo_actions_secrets` Ansible role syncs repository-level Actions secrets from 1Password to Forgejo via the Forgejo API - Replaces manual process of copying secrets from 1Password to Forgejo UI - Documents the one-time PAT setup requirement in forgejo.md ## Manual Setup Required Before this role can run, a Forgejo PAT must be created: 1. Go to https://forge.ops.eblu.me/user/settings/applications 2. Create a new token with `write:repository` scope 3. Store it in 1Password → "Forgejo Secrets" item → `api-token` field This has already been done. ## Test Plan - [x] Ran `mise run provision-indri -- --tags forgejo_actions_secrets` successfully - [x] Verified secret synced (API returned 204 = updated existing) - [x] Ansible-lint passes 🤖 Generated with [Claude Code](https://claude.ai/code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/107	2026-02-04 09:11:01 -08:00
Erich Blume	72f9f21d46	Remove iCloud Photos from borgmatic backup (#100 ) ## Summary - Remove ~/Pictures from borgmatic source directories - Update borgmatic and backup policy documentation - Add Sifaka-Native Data section to clarify that photos (via Immich), music (via Navidrome), and video (via Jellyfin) are stored directly on Sifaka ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags borgmatic --check --diff` to preview changes - [ ] Run `mise run provision-indri -- --tags borgmatic` to apply - [ ] Verify borgmatic config no longer includes ~/Pictures 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/100	2026-02-04 07:09:28 -08:00
Erich Blume	1c86134a62	Phase 1b: Deploy docs hosting with Quartz (#85 ) ## Summary - Add ArgoCD Application and manifests for `quartz` service - Add `docs.ops.eblu.me` to Caddy reverse proxy configuration - ConfigMap points to blumeops v1.0.0 release tarball - Tailscale ingress with homepage annotations for auto-discovery ## Deployment and Testing Pre-deployment (container build): - [ ] Build and tag quartz container: `mise run container-tag-and-release quartz v1.0.0` K8s deployment: - [ ] Sync apps: `argocd app sync apps` - [ ] Point quartz at feature branch: `argocd app set quartz --revision feature/docs-phase-1b-hosting` - [ ] Sync quartz: `argocd app sync quartz` - [ ] Verify pod is running: `kubectl --context=minikube-indri get pods -n quartz` - [ ] Verify Tailscale ingress: `kubectl --context=minikube-indri get ingress -n quartz` Caddy deployment: - [ ] Dry run: `mise run provision-indri -- --tags caddy --check --diff` - [ ] Apply: `mise run provision-indri -- --tags caddy` Verification: - [ ] Test https://docs.tail8d86e.ts.net - [ ] Test https://docs.ops.eblu.me - [ ] Verify homepage dashboard shows docs link Post-merge: - [ ] Reset to main: `argocd app set quartz --revision main && argocd app sync quartz` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/85	2026-02-03 10:52:20 -08:00
Erich Blume	ade21cc49e	Add Navidrome music streaming server (#79 ) ## Summary - Deploy Navidrome music streaming server to k8s - NFS mount for music library from sifaka:/volume1/music (read-only) - Local PVC for SQLite database and config (10Gi) - Tailscale ingress for dj.tail8d86e.ts.net - Caddy reverse proxy for dj.ops.eblu.me - Homepage annotations for dashboard discovery in Media group ## Deployment and Testing - [ ] Sync `apps` application to pick up new Application definition - [ ] Set navidrome app to feature branch and sync - [ ] Verify NFS mount with `kubectl exec` - [ ] Provision Caddy for dj.ops.eblu.me - [ ] Access https://dj.ops.eblu.me and create initial admin user - [ ] Verify Homepage shows DJ in Media group - [ ] Reset to main and resync after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/79	2026-01-31 20:19:31 -08:00
Erich Blume	b8b33b76c8	Remove Plex media server (#78 ) ## Summary - Remove plex_metrics ansible role - Remove Plex Grafana dashboard - Remove Plex log collection from Alloy config - Update indri-services-check to check Jellyfin instead of Plex ## Deployment and Testing - [x] Unloaded plex-metrics LaunchAgent on indri - [x] Deleted plex-metrics plist and script - [x] Deleted plex.prom textfile - [ ] Deploy Alloy config update - [ ] Sync grafana-config to remove dashboard 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/78	2026-01-30 17:06:00 -08:00
Erich Blume	bcc8685316	Add Jellyfin media server deployment (#77 ) ## Summary - Add Jellyfin ansible role for native macOS deployment via Homebrew cask - Add jellyfin_metrics role for Prometheus textfile metrics collection - Add Caddy routing for jellyfin.ops.eblu.me - Add Alloy log collection for Jellyfin stdout/stderr - Add Grafana dashboard for Jellyfin monitoring ## Architecture Jellyfin runs natively on indri (not in k8s) for full VideoToolbox hardware transcoding support. The M1 Mac Mini can handle ~3 concurrent 4K HDR→SDR transcoding streams. ## Deployment and Testing - [ ] Deploy Jellyfin: `mise run provision-indri -- --tags jellyfin,jellyfin_metrics,caddy,alloy` - [ ] Sync Grafana dashboard: `argocd app sync grafana-config` - [ ] Complete Jellyfin setup wizard at https://jellyfin.ops.eblu.me - [ ] Generate API key and save to `~/.jellyfin-api-key` - [ ] Add media libraries (/Volumes/allisonflix/Movies, /Volumes/allisonflix/TV) - [ ] Enable VideoToolbox hardware transcoding - [ ] Verify metrics in Grafana dashboard - [ ] Verify logs in Loki: `{service="jellyfin"}` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/77	2026-01-30 16:57:26 -08:00
Erich Blume	d1164c8aac	Add Hajimari service dashboard (#73 ) ## Summary - Add Hajimari as a service dashboard/start page at `go.ops.eblu.me` - Auto-discovers k8s services from ingress annotations - Custom apps for non-k8s services: Forgejo, Registry, Sifaka NAS - Add `nas.ops.eblu.me` Caddy proxy to Synology dashboard ## Services Configured Auto-discovered (k8s ingresses with hajimari.io annotations): - Grafana, ArgoCD, Prometheus, Loki (Observability) - Miniflux, Kiwix, Transmission, TeslaMate, Immich (Apps) - PyPI/devpi (Infrastructure) Custom apps (non-k8s): - Forgejo (forge.ops.eblu.me) - Registry (registry.ops.eblu.me) - Sifaka NAS (nas.ops.eblu.me) Bookmarks: - Tailscale Admin, 1Password, Pulumi ## Deployment and Testing - [ ] Sync `apps` application to pick up new Hajimari Application - [ ] Sync `hajimari` application - [ ] Run `mise run provision-indri -- --tags caddy` for go/nas proxy entries - [ ] Re-sync all k8s apps with hajimari annotations (or wait for natural drift) - [ ] Verify https://go.ops.eblu.me shows dashboard with all services 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/73	2026-01-29 15:51:42 -08:00
Erich Blume	2bc826c31f	Move metrics scripts from ~/bin to ~/.local/bin (#70 ) ## Summary - Update all metrics role defaults to install scripts to ~/.local/bin following XDG conventions - Scripts already manually moved on indri from ~/bin to ~/.local/bin - Cleaned up orphaned scripts (devpi-metrics, transmission-metrics, mcquack) and plist files ## Deployment and Testing - [x] Manually moved scripts on indri - [x] Deleted orphaned plist files (devpi-metrics, devpi, kiwix-serve, transmission-metrics) - [x] Deleted orphaned scripts (devpi-metrics, transmission-metrics, mcquack) - [x] Verified no metrics dependencies on orphaned scripts (checked alloy config and textfile directory) - [ ] Run ansible to update LaunchAgent plist files with new paths - [ ] Verify metrics collection continues working 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/70	2026-01-29 09:59:38 -08:00
Erich Blume	3971670832	Remove immich-sync ansible role (#65 ) ## Summary - Remove immich_sync ansible role (server-side photo sync via osxphotos) - The Immich iOS app has built-in automatic backup that replaces this functionality - iOS app supports foreground/background backup and can sync iCloud photos directly ## Deployment and Testing - [ ] Clean up files on indri (see manual cleanup commands below) - [ ] Configure Immich iOS app for automatic backup ### Manual cleanup on indri: ```bash # Unload and remove LaunchAgent launchctl unload ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist rm ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist # Remove script and credentials rm ~/bin/immich-sync.sh rm ~/.immich-api-key # Remove logs rm ~/Library/Logs/mcquack.immich-sync.*.log # Optionally remove export directory (check if empty first) ls ~/Pictures/immich-export # rm -r ~/Pictures/immich-export ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/65	2026-01-28 08:49:22 -08:00
Erich Blume	54945be0e3	Add immich-sync ansible role for photo library sync (#63 ) ## Summary - Add `immich_sync` role that syncs macOS Photos Library to Immich - Uses osxphotos to export photos with metadata to staging directory - Uses immich-cli (via Docker) to upload to Immich server - LaunchAgent schedules hourly syncs following mcquack pattern - API key fetched from 1Password in playbook pre_tasks ## Architecture ``` Photos Library → osxphotos export → ~/Pictures/immich-export/ → immich-cli upload → Immich ``` ## Prerequisites (manual) - Install osxphotos on indri: Add `"pipx:osxphotos" = "latest"` to `~/.config/mise/config.toml`, run `mise install` - Docker is already installed on indri ## Deployment and Testing - [ ] Dry run: `mise run provision-indri -- --tags immich_sync --check --diff` - [ ] Deploy: `mise run provision-indri -- --tags immich_sync` - [ ] Verify LaunchAgent: `ssh indri 'launchctl list \| grep immich'` - [ ] Test manual sync: `ssh indri '~/bin/immich-sync.sh'` - [ ] Check logs: `ssh indri 'tail -50 ~/Library/Logs/mcquack.immich-sync.out.log'` - [ ] Verify photos in Immich at https://photos.ops.eblu.me 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/63	2026-01-26 12:38:38 -08:00
Erich Blume	8621996343	Add Immich photo management + migrate forge URLs (#62 ) ## Summary - Migrate all ArgoCD app repo URLs from `indri.tail8d86e.ts.net:2200` to `forge.ops.eblu.me:2222` - Add Immich self-hosted photo management service with: - Helm chart deployment via ArgoCD - PostgreSQL cluster with pgvecto.rs for AI vector search (immich-pg) - NFS storage on sifaka for photo library (2Ti) - Tailscale Ingress + Caddy proxy for `photos.ops.eblu.me` - Machine learning service for face/object recognition ## Deployment and Testing - [x] Update ArgoCD repo-creds-forge secret with new URL (one-time manual step) - [ ] Sync `apps` to pick up new applications - [ ] Sync all existing apps to verify new forge URL works - [ ] Sync `blumeops-pg` to deploy immich-pg cluster - [ ] Wait for immich-pg to be healthy - [ ] Create immich-db secret from auto-generated password - [ ] Sync `immich-storage` (PV, PVC, Ingress) - [ ] Sync `immich` (Helm chart) - [ ] Run `mise run provision-indri -- --tags caddy` to add photos.ops.eblu.me - [ ] Verify Immich UI is accessible 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/62	2026-01-26 11:20:11 -08:00
Erich Blume	ea42362b6f	Migrate Forgejo runner to Kubernetes with DinD (#60 ) ## Summary - Deploy Forgejo runner to k8s with Docker-in-Docker sidecar - Add job execution image with Node.js and Docker CLI - Retire host-mode runner on indri - All CI jobs now run containerized in k8s ## Components Added - `containers/forgejo-runner/Dockerfile` - Job execution image - `argocd/apps/forgejo-runner.yaml` - ArgoCD Application - `argocd/manifests/forgejo-runner/` - Kubernetes manifests ## Components Removed - `ansible/roles/forgejo_runner/` - No longer needed ## Changes to Existing Files - `.forgejo/workflows/build-container.yaml` - Use `k8s` runner with `DOCKER_HOST` env - `.github/actionlint.yaml` - Only `k8s` label now valid ## Deployment 1. Apply secret: `op inject -i argocd/manifests/forgejo-runner/secret.yaml.tpl \| kubectl --context=minikube-indri apply -f -` 2. Sync ArgoCD: `argocd app sync forgejo-runner` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/60	2026-01-25 19:56:17 -08:00
Erich Blume	66badfafd1	Migrate k8s services to Caddy (.ops.eblu.me) (#59 ) All checks were successful Build Container / build (push) Successful in 13s Details ## Summary - Add Caddy reverse proxy routes for all k8s services (grafana, argocd, prometheus, loki, miniflux, devpi, kiwix, torrent, teslamate) - Add PostgreSQL via Caddy L4 TCP proxy on port 5432 - Caddy proxies to existing Tailscale endpoints - traffic stays local on indri - Both `.ops.eblu.me` and `.tail8d86e.ts.net` URLs continue to work ## Updated References - Alloy: prometheus/loki push endpoints → `.ops.eblu.me` - Borgmatic: PostgreSQL backup host → `pg.ops.eblu.me` - Devpi: DEVPI_OUTSIDE_URL → `pypi.ops.eblu.me` - indri-services-check: health check URLs - CLAUDE.md: argocd login command ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags caddy` to deploy new Caddy config - [ ] Test HTTP services: `curl https://grafana.ops.eblu.me/api/health` - [ ] Test PostgreSQL: `pg_isready -h pg.ops.eblu.me -p 5432` - [ ] Run `mise run provision-indri -- --tags alloy` to update Alloy endpoints - [ ] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic - [ ] Sync devpi in ArgoCD: `argocd app sync devpi` - [ ] Re-login to ArgoCD: `argocd login argocd.ops.eblu.me ...` - [ ] Run `mise run indri-services-check` to verify all services 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/59	2026-01-25 12:56:31 -08:00
Erich Blume	d6e6b48f6a	Migrate registry to Caddy (registry.ops.eblu.me) (#58 ) ## Summary - Update all references from `registry.tail8d86e.ts.net` to `registry.ops.eblu.me` - Remove `tailscale_serve` ansible role (no longer needed - all services migrated to Caddy) - Update minikube containerd config for new registry URL - Update devpi manifest, CI actions, and mise tasks ## Deployment and Testing - [ ] Run `mise run provision-indri -- --check --diff` (dry run) - [ ] Run `mise run provision-indri -- --tags minikube` to update containerd config - [ ] Sync devpi ArgoCD app: `argocd app sync devpi` - [ ] Manually remove old Tailscale serve entry: `ssh indri 'tailscale serve --service=svc:registry off'` - [ ] Test registry access: `curl https://registry.ops.eblu.me/v2/_catalog` - [ ] Run `mise run indri-services-check` to verify all services healthy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/58	2026-01-25 12:06:15 -08:00
Erich Blume	1184b4de1d	Add Caddy layer4 for Forgejo SSH (#56 ) ## Summary - Add layer4 TCP proxy configuration to Caddyfile template for SSH services - Configure Forgejo SSH on port 2222 → localhost:2200 - Switch HTTPS from port 8443 (testing) to 443 (production) - Requires Caddy rebuilt with `github.com/mholt/caddy-l4` plugin ## What This Enables Git+SSH access via `forge.ops.eblu.me:2222` is now accessible from: - Tailnet clients (gilbert) - Docker containers on indri - Kubernetes pods in minikube This solves the DNS resolution issues where containers couldn't reach Tailscale MagicDNS names. ## Testing Done - [x] Caddy rebuilt with layer4 plugin - [x] Validated Caddyfile syntax - [x] Cleared `svc:forge` from tailscale serve - [x] Verified HTTPS works: `curl https://forge.ops.eblu.me` - [x] Verified SSH works: `ssh -p 2222 forgejo@forge.ops.eblu.me` - [x] Verified git clone works via new endpoint - [x] Verified minikube pods can reach both HTTPS and SSH endpoints ## Deployment Caddy is already running with the new config on indri. This PR captures the ansible changes. ## Next Steps - Update zk docs with new git remote format - Migrate registry and other services to Caddy - Retire tailscale_services ansible role 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/56	2026-01-25 11:37:23 -08:00
Erich Blume	682a68dc9c	Add Caddy reverse proxy for blumeops services (#55 ) ## Summary - Add Caddy ansible role following zot pattern (manual build, ansible deploy) - Caddy built with Gandi DNS plugin for ACME DNS-01 challenges - Gandi PAT fetched from 1Password and written to secured file on indri - Configure wildcard TLS for `*.ops.eblu.me` - Initial services: forge, registry (indri-local) - Uses port 8443 during testing to avoid Tailscale serve conflicts ## Build Instructions (already done) On indri: ```bash cd ~/code/3rd/caddy && mise run build ``` ## Deployment and Testing - [ ] Review Caddyfile configuration - [ ] Run `mise run provision-indri -- --tags caddy` to deploy - [ ] Test: `curl -v https://forge.ops.eblu.me:8443` (should get TLS cert) - [ ] Test: `curl -v https://registry.ops.eblu.me:8443/v2/` (should return `{}`) - [ ] Once verified, switch to port 443 and migrate services from Tailscale serve ## Files Changed - `ansible/playbooks/indri.yml` - Add pre_task for Gandi PAT, add caddy role - `ansible/roles/caddy/` - New role with Caddyfile and LaunchAgent templates 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/55	2026-01-25 09:35:06 -08:00
Erich Blume	8ca8798121	Switch to Buildah for container builds (#51 ) All checks were successful Test CI / test (push) Successful in 4s Details ## Summary - Replace Docker with Buildah for container image builds - No Docker socket required - buildah is daemonless - Cleaner security model (no privileged containers or socket mounting) - Remove Docker-related security context from deployment ## Changes - Update Dockerfile to install buildah/podman instead of docker-cli - Configure buildah storage with overlay driver and fuse-overlayfs - Update composite action to use `buildah bud` and `buildah push` - Add `imagePullPolicy: Always` to ensure fresh image pulls - Update test workflow to verify buildah/podman ## Testing - [ ] Runner pod starts successfully - [ ] Buildah is available in runner - [ ] Test workflow verifies buildah/podman versions - [ ] Container build workflow builds and pushes to zot 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/51	2026-01-24 13:30:26 -08:00
Erich Blume	7893c41020	Enable Forgejo Actions (Phase 1) (#48 ) All checks were successful Test CI / test (push) Successful in 0s Details ## Summary - Refactor Forgejo app.ini to be managed by ansible with secrets from 1Password - Enable Forgejo Actions in config (`[actions] ENABLED = true`) - Add `repo.actions` to DEFAULT_REPO_UNITS - Clean up unused MySQL database fields (we use SQLite) ## Phase 1 Progress This PR covers the first part of Phase 1 (ci-cd-bootstrap plan): - [x] Refactor app.ini to ansible template - [x] Store secrets in 1Password - [x] Enable Actions in config - [ ] Deploy config changes (pending review) - [ ] Create runner registration token - [ ] Deploy runner to k8s - [ ] Test with simple workflow ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags forgejo` to deploy - [ ] Verify Forgejo restarts correctly - [ ] Verify Actions tab appears in repo settings 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/48	2026-01-23 17:00:12 -08:00
Erich Blume	272ddb213b	Add TeslaMate deployment for Tesla Model Y data logging (#47 ) ## Summary - Add TeslaMate k8s deployment with Tailscale ingress at tesla.tail8d86e.ts.net - Add teslamate user to CloudNativePG blumeops-pg cluster - Add TeslaMate PostgreSQL datasource to Grafana - Import 18 TeslaMate Grafana dashboards for charging, drives, efficiency, etc. - Add teslamate database to borgmatic backup configuration ## Deployment and Testing - [ ] Create 1Password items: "TeslaMate DB Password" and "TeslaMate Encryption Key" - [ ] Apply database user secret: `op inject -i argocd/manifests/databases/secret-teslamate.yaml.tpl \| kubectl apply -f -` - [ ] Sync blumeops-pg: `argocd app sync blumeops-pg` - [ ] Create teslamate database - [ ] Apply teslamate secrets (encryption key, db connection) - [ ] Apply Grafana datasource secret: `op inject -i argocd/manifests/grafana-config/secret-teslamate-datasource.yaml.tpl \| kubectl apply -f -` - [ ] Sync apps and teslamate: `argocd app sync apps teslamate grafana grafana-config` - [ ] Complete Tesla API OAuth flow at https://tesla.tail8d86e.ts.net - [ ] Verify data collection starts - [ ] Verify Grafana dashboards show data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/47	2026-01-22 21:25:44 -08:00
Erich Blume	16bfe06b7b	Fix LaunchDaemon check to use become: true LaunchDaemons run in the system domain and require sudo to query. Without become: true, the check always fails and tries to reload. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 17:34:23 -08:00
Erich Blume	57bf8512dc	Log filtering cleanup and observability improvements (#45 ) ## Summary - Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue) - Disable thermal collector on indri Alloy (not supported on macOS M1) - Add macOS power/thermal metrics collection via powermetrics LaunchDaemon - Add Power & Thermal section to macOS Grafana dashboard - Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.) - Extract more fields from JSON logs (zot compatibility - uses "message" not "msg") - Silence logfmt parse errors for non-logfmt logs - Fix JSON escaping in devpi dashboard ## Deployment and Testing - [x] Deployed Alloy config changes to indri via ansible - [x] Synced alloy-k8s and grafana-config via ArgoCD - [x] Verified power metrics appearing in Prometheus - [x] Verified thermal collector errors stopped - [x] Verified logfmt parse errors silenced - [x] Verified devpi dashboard loads correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45	2026-01-22 17:30:08 -08:00
Erich Blume	e4a8405de7	Observability cleanup and k8s service monitoring (#43 ) (#43 ) ## Summary - Remove stale `/opt/homebrew/var/loki` from borgmatic backup (Loki migrated to k8s) - Add Alloy k8s DaemonSet for automatic pod log collection with auto-discovery - Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd - Add transmission-exporter sidecar for full metrics (speed, torrent counts, ratios) - Replace stale devpi dashboard with probe-based metrics (status, response time, uptime) - Add unified "K8s Services Health" dashboard for service uptime/response monitoring ## Manual cleanup already performed - Deleted stale textfile metrics on indri: `devpi.prom`, `transmission.prom` - Deleted stale data directories on indri: `/opt/homebrew/var/loki/`, `/opt/homebrew/var/prometheus/` ## Deployment and Testing - [x] Sync `apps` application to pick up new alloy-k8s app - [x] Deploy alloy-k8s on feature branch: `argocd app set alloy-k8s --revision feature/observability-cleanup && argocd app sync alloy-k8s` - [x] Deploy torrent on feature branch (for transmission exporter): `argocd app set torrent --revision feature/observability-cleanup && argocd app sync torrent` - [x] Deploy prometheus on feature branch (for new scrape config): `argocd app set prometheus --revision feature/observability-cleanup && argocd app sync prometheus` - [x] Deploy grafana-config on feature branch (for dashboards): `argocd app set grafana-config --revision feature/observability-cleanup && argocd app sync grafana-config` - [x] Verify pod logs appear in Loki/Grafana - [x] Verify transmission metrics appear in Prometheus - [x] Verify service probe metrics appear in Prometheus - [x] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic config - [ ] After merge, reset apps to main and resync 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/43	2026-01-22 13:51:01 -08:00
Erich Blume	17023085cb	Migrate observability stack to Kubernetes (#42 ) Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack. Summary - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal) - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net) - Add ACL rule for port 9187 (CNPG metrics) - Delete obsolete ansible roles for prometheus and loki Changes - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints - pulumi/policy.hujson - ACL for port 9187 - Deleted ansible/roles/prometheus/ and ansible/roles/loki/ Deployment and Testing - Stop prometheus and loki on indri - Sync ArgoCD apps (apps, prometheus, loki, grafana) - Run mise run provision-indri -- --tags alloy - Verify Grafana dashboards show data 🤖 Generated with https://claude.ai/claude-code Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42	2026-01-22 12:06:02 -08:00
Erich Blume	5a829e0afd	Remove unused indri tags and ansible roles (#41 ) ## Summary - Remove ansible roles for services migrated to k8s: devpi, kiwix, transmission - Also remove unused node_exporter and podman ansible roles - Remove service tags from indri for k8s-hosted services (grafana, kiwix, devpi, pg, feed) - Update indri description to reflect current architecture ## Changes Ansible roles removed (34 files, ~1000 lines): - devpi, devpi_metrics - kiwix - transmission, transmission_metrics - node_exporter - podman Pulumi indri tags removed: - tag:grafana, tag:kiwix, tag:devpi, tag:pg, tag:feed These services now run in k8s with their own Tailscale devices via tailscale-operator. ## Deployment and Testing - [x] Verified remaining ansible roles match indri.yml - [x] Verified no playbooks or role dependencies reference removed roles - [ ] Run `pulumi preview` to verify tag changes - [ ] Run `pulumi up` to apply tag changes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/41	2026-01-21 20:18:53 -08:00
Erich Blume	7ec98210a9	P6: Migrate Kiwix and Transmission to Kubernetes (#39 ) ## Summary - Add Transmission BitTorrent daemon to k8s (torrent namespace) - Add Kiwix ZIM archive server to k8s (kiwix namespace) - NFS storage from sifaka for shared torrent/ZIM data - Torrent-sync sidecar in kiwix deployment to manage declarative ZIM list - ZIM-watcher CronJob to auto-restart kiwix when new archives appear - Remove transmission, transmission_metrics, and kiwix ansible roles from indri - Remove svc:kiwix from tailscale_serve defaults ## Key Decisions - Direct NFS mount for kiwix (no PVC) since it shares storage with transmission - Shell wrapper for kiwix-serve command (glob expansion) - Accept HTTP 409 as "ready" in torrent sync (transmission session ID mechanism) - Completed downloads stored in `/downloads/complete/` on sifaka ## Deployment and Testing - [x] Deployed transmission to k8s - [x] Verified transmission web UI at torrent.tail8d86e.ts.net - [x] Moved existing ZIM files to complete folder - [x] Deployed kiwix to k8s - [x] Verified kiwix web UI at kiwix.tail8d86e.ts.net - [x] Stopped old services on indri - [x] Cleared svc:kiwix from Tailscale serve on indri - [x] Updated zk documentation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/39	2026-01-21 18:07:40 -08:00
Erich Blume	21848a7919	P5.1: Migrate minikube from podman to QEMU2 driver (#38 ) ## Summary - Migrate minikube from podman driver to qemu2 driver for proper NFS/SMB volume mount support - Update ansible minikube role with qemu installation and containerd runtime - Remove podman role dependency from indri.yml - Add synology user creation steps and post-migration zot reconfiguration notes ## Why Phase 6 (Kiwix/Transmission migration) was blocked because the podman driver lacks kernel capabilities for filesystem mounts. QEMU2 creates an actual VM with full mount support. ## Deployment and Testing - [ ] Create k8s-storage user on Synology DSM - [ ] Store credentials in 1Password (synology-k8s-storage) - [ ] Export current k8s state - [ ] Stop and delete podman-based minikube cluster - [ ] Run ansible to create QEMU2 cluster - [ ] Test NFS volume mount with test pod - [ ] Redeploy ArgoCD and all apps - [ ] Verify all services healthy - [ ] Reconfigure zot registry mirrors for containerd (post-migration) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/38	2026-01-21 16:03:37 -08:00
Erich Blume	0439fbb704	P5: Migrate devpi to Kubernetes (#34 ) ## Summary - Migrate devpi PyPI caching proxy from indri LaunchAgent to Kubernetes - Custom container image with devpi-server + devpi-web + auto-init - StatefulSet with 50Gi PVC, Tailscale Ingress at pypi.tail8d86e.ts.net - Remove devpi from ansible playbooks and update CLAUDE.md with k8s workflow ## Key Changes - Add CRI-O registry mirror config for registry.tail8d86e.ts.net - Change ArgoCD apps to manual sync (was auto-sync causing issues) - 2Gi memory limit for Whoosh indexer (reclaimed after startup) ## Deployment and Testing - [x] devpi pod healthy in k8s - [x] pip install through proxy works - [x] mcquack 1.0.0 uploaded and installable - [x] Old devpi stopped on indri ## Post-Merge Reset ArgoCD to main: ``` argocd app set apps --revision main && argocd app sync apps argocd app set devpi --revision main && argocd app sync devpi ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/34	2026-01-20 14:55:37 -08:00
Erich Blume	735b643429	P4: Miniflux migration + PostgreSQL consolidation (#33 ) ## Summary - Deploy miniflux in k8s via ArgoCD - Expose via Tailscale Ingress at feed.tail8d86e.ts.net - Retire brew PostgreSQL (no longer needed) - Rename k8s-pg to pg (canonical hostname) - Remove ansible miniflux and postgresql roles - Update borgmatic to backup pg.tail8d86e.ts.net - Update all zk documentation ## Deployment and Testing - [x] Miniflux pod running in k8s - [x] User login works at https://feed.tail8d86e.ts.net - [x] Feeds and entries visible - [x] brew miniflux and postgresql stopped - [x] Tailscale services migrated (feed, pg) - [x] zk documentation updated - [x] Run ansible to apply role removals - [ ] Verify borgmatic backup with new pg hostname 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/33	2026-01-20 09:04:47 -08:00
Erich Blume	eb952aae01	P3: PostgreSQL disaster recovery test and borgmatic k8s-pg backup (#32 ) ## Summary - Fixed borgmatic `borg: command not found` by adding `local_path` config option - Successfully tested disaster recovery: restored miniflux data from borgmatic backup to k8s-pg - Added borgmatic user to k8s-pg via CloudNativePG managed roles - Configured borgmatic to backup both localhost and k8s-pg PostgreSQL databases - Added Tailscale ACL grant for `tag:homelab` → `tag:k8s` on port 5432 - Disabled selfHeal on apps app to allow manual revision changes during development ## Changes - `ansible/roles/borgmatic/` - Added `local_path` and k8s-pg database entry - `ansible/roles/postgresql/tasks/main.yml` - Added k8s-pg to `.pgpass` - `argocd/apps/apps.yaml` - Disabled selfHeal - `argocd/manifests/databases/blumeops-pg.yaml` - Added borgmatic managed role - `argocd/manifests/databases/secret-borgmatic.yaml.tpl` - New secret template - `pulumi/policy.hujson` - Added ACL grant for backup access ## Deployment and Testing - [x] Borgmatic backup runs successfully - [x] Miniflux data restored to k8s-pg (2 users, 2 feeds, 44 entries verified) - [x] borgmatic user created in k8s-pg with pg_read_all_data role - [x] Both localhost and k8s-pg databases in backup archive - [x] zk documentation updated (borgmatic.md, postgresql.md) - [ ] After merge: set blumeops-pg app back to main revision 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/32	2026-01-19 18:00:32 -08:00

1 2

94 commits