blumeops

Author	SHA1	Message	Date
Erich Blume	8765ee8706	Deploy Dex OIDC identity provider on ringtail with Grafana SSO Adds Dex as a central OIDC identity provider running on ringtail's k3s cluster. Grafana is integrated as the first SSO client via generic_oauth. Dex uses Kubernetes CRD storage and ExternalSecrets for all sensitive config (bcrypt hash, client secrets from 1Password). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 19:18:23 -08:00
Erich Blume	16a4a9a616	Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216 ) ## Summary - Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired - Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s - Move ntfy ArgoCD app from indri minikube to ringtail k3s - Update Frigate docs to reflect detector removal and planned RTX 4080 migration - Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail ## Deployment After merge: 1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps: ``` argocd app sync apps --prune ``` 2. Sync new ringtail apps: ``` argocd app sync mosquitto-ringtail argocd app sync ntfy-ringtail ``` 3. Manually clean up the detector LaunchAgent on indri: ``` ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist' ``` ## Notes - Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns) - ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup - Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed - Frigate + frigate-notify will be ported to ringtail in a follow-up PR 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216	2026-02-19 11:22:44 -08:00
Erich Blume	b475a1fcd7	Fix 1Password secret tasks always reporting changed in ringtail playbook (#213 ) ## Summary - Replace `changed_when: true` with `register` + output inspection on the two 1Password secret tasks in `ringtail.yml` - Tasks now correctly report `ok` when the secret content hasn't changed, and `changed` only when `kubectl apply` outputs `configured` or `created` ## Test plan - [ ] Run `mise run provision-ringtail` twice — second run should show both tasks as `ok` not `changed` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/213	2026-02-19 07:25:24 -08:00
Erich Blume	918df9e642	Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209) ## Summary Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions runner for building containers with Nix. ### K3s cluster - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint) - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`) - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access - `kubectl` added to system packages ### 1Password Connect + External Secrets Operator - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`, `external-secrets-config-ringtail` - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password` namespace in post_tasks ### Systemd Forgejo Actions runner - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env` - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access ### Nix container build workflow - New `.forgejo/workflows/build-container-nix.yaml` triggers on `-nix-v[0-9]` tags (e.g. `nettest-nix-v1.0.0`) - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy` - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering ### Mise task updates - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`) - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`) ## Post-merge 1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild 2. Register k3s cluster in ArgoCD (first time only): ```fish ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' | sed 's|127.0.0.1|ringtail.tail8d86e.ts.net|' > /tmp/k3s-ringtail.yaml set -x KUBECONFIG /tmp/k3s-ringtail.yaml argocd cluster add default --name k3s-ringtail ``` 3. Sync ArgoCD apps in order: 1password-connect-ringtail -> external-secrets-crds-ringtail -> external-secrets-ringtail -> external-secrets-config-ringtail 4. Verify runner: ssh ringtail 'systemctl status gitea-runner-nix-container-builder' 5. Check Forgejo admin panel for ringtail-nix-builder runner online 6. Test: create containers/<name>/default.nix, tag with <name>-nix-v0.1.0 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209	2026-02-18 21:15:30 -08:00
Erich Blume	535f897054	Polish ringtail NixOS config and add documentation (#208 ) ## Summary - Fix Super+Return keybinding to launch wezterm in sway - Set fish as default login shell - Remove `initialPassword` (real password already set) - Add 1Password CLI + GUI, chezmoi, and dev tool packages (neovim, eza, fd, fzf, zoxide, starship, atuin, bat, ripgrep) - Add ringtail reference card, update host inventory and reference index - Changelog fragment ## Post-merge deployment - `mise run provision-ringtail` to rebuild NixOS - On ringtail: launch 1Password GUI, enable CLI integration (Settings > Developer > CLI integration) - Chezmoi needs `.chezmoiignore` updates in the dotfiles repo (separate task) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/208	2026-02-18 17:53:47 -08:00
Erich Blume	b76f2314c2	Add force: true to ringtail git task nixos-rebuild can dirty the tree (e.g. flake.lock updates), which blocks the Ansible git module. Force ensures we always reset to the upstream state. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-18 09:32:23 -08:00
Erich Blume	b9d813cde1	Add NixOS configuration for ringtail workstation (#207 ) ## Summary - NixOS flake for ringtail (gaming/compute workstation, RTX 4080) in `nixos/ringtail/` - Declarative disk partitioning via disko (GPT, 512M EFI + ext4 root on NVMe) - NVIDIA proprietary drivers, sway/Wayland desktop, greetd, PipeWire, Steam - Tailscale integration for tailnet connectivity - Ansible playbook + `mise run provision-ringtail` for ongoing management - Pulumi auth key (`tag:homelab`, `tag:blumeops`) for tailnet bootstrap ## Deployment Order 1. Merge PR 2. `pulumi up` in tailscale stack → creates auth key 3. Retrieve auth key: `pulumi stack output ringtail_authkey --show-secrets` 4. On ringtail NixOS installer: - `nix run github:nix-community/disko -- --mode disko /tmp/disk-config.nix` (or from cloned repo) - `nixos-install --flake github:eblume/blumeops?dir=nixos/ringtail#ringtail` 5. Reboot, `tailscale up --auth-key=<key>` 6. Verify: `tailscale status`, SSH from gilbert ## Test plan - [ ] Review NixOS configuration for completeness - [ ] Verify disko partition layout matches ringtail hardware - [ ] Run `pulumi preview` for tailscale stack - [ ] Install NixOS on ringtail - [ ] Confirm tailscale connectivity - [ ] Confirm sway desktop works - [ ] Test `mise run provision-ringtail` for ongoing management 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/207	2026-02-18 08:24:25 -08:00
Erich Blume	5f9b024b4a	Add Apple Silicon ZMQ detector for Frigate (#206 ) ## Summary - New `frigate_detector` ansible role deploys the [apple-silicon-detector](https://github.com/frigate-nvr/apple-silicon-detector) as a LaunchAgent on indri - Switches Frigate from ONNX CPU detector (~117ms) to ZMQ detector backed by CoreML/Neural Engine (~15ms) - Removes detect FPS cap (no longer needed with fast inference) - Updates Frigate docs and adds changelog fragment ## Deployment ### Phase 1: Deploy detector on indri (one-time setup + ansible) ```fish ssh indri 'git clone https://github.com/frigate-nvr/apple-silicon-detector.git ~/code/3rd/apple-silicon-detector' ssh indri 'cd ~/code/3rd/apple-silicon-detector && make install' mise run provision-indri -- --tags frigate_detector --check --diff # dry run mise run provision-indri -- --tags frigate_detector # apply ssh indri 'launchctl list mcquack.eblume.frigate-detector' # verify running ssh indri 'tail ~/Library/Logs/mcquack.frigate-detector.out.log' # verify bound ``` ### Phase 2: Test connectivity ```fish kubectl --context=minikube-indri -n frigate exec deploy/frigate -- nc -vz host.minikube.internal 5555 ``` ### Phase 3: Deploy Frigate config (branch workflow) ```fish argocd app set frigate --revision feature/frigate-zmq-detector && argocd app sync frigate ``` ### Phase 4: Post-deploy checks - [ ] Pod starts, no config errors - [ ] `/api/stats` shows detector type zmq, inference_speed ~15ms - [ ] detect_fps uncapped - [ ] Recordings and MQTT events flowing - [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/206	2026-02-17 19:03:28 -08:00
Erich Blume	04c7f3c45a	Deploy Frigate NVR stack with Mosquitto, Ntfy, and frigate-notify (#190 ) ## Summary Deploy a cloud-free NVR stack for the GableCam (ReoLink Elite Floodlight at 192.168.1.159): - Mosquitto — shared MQTT broker in `mqtt` namespace (cluster-internal, no auth) - Ntfy — self-hosted push notifications in `ntfy` namespace, exposed at `ntfy.tail8d86e.ts.net` / `ntfy.ops.eblu.me` - Frigate — NVR with GableCam via HTTP-FLV, ONNX CPU detection, NFS recordings on sifaka, exposed at `nvr.tail8d86e.ts.net` / `nvr.ops.eblu.me` - frigate-notify — bridges Frigate detection events (person, car, dog, cat) to Ntfy alerts via MQTT Also includes: - Prometheus scrape target for Frigate metrics - Grafana dashboard for Frigate (status, inference speed, FPS, CPU/memory, storage) - Caddy reverse proxy entries for `nvr.ops.eblu.me` and `ntfy.ops.eblu.me` ## Prerequisites - [ ] Create NFS share `frigate` on sifaka (`/volume1/frigate`, RW for indri) - [ ] Create 1Password item "Reolink Floodlight Camera" in `blumeops` vault with `username` and `password` fields ## Deployment (after merge) ```bash argocd app sync apps argocd app sync mosquitto argocd app sync ntfy argocd app sync frigate argocd app sync grafana-config argocd app sync prometheus mise run provision-indri -- --tags caddy mise run services-check ``` ## Verification - [ ] Mosquitto pod running, accepting connections on 1883 - [ ] Ntfy web UI accessible at `ntfy.ops.eblu.me` - [ ] Frigate web UI at `nvr.ops.eblu.me` showing GableCam live feed - [ ] Object detection working (ONNX, person/car/dog/cat) - [ ] Recordings appearing in NFS share on sifaka - [ ] frigate-notify sending detection alerts to Ntfy - [ ] Prometheus scraping Frigate metrics - [ ] Grafana dashboard showing Frigate data Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/190	2026-02-14 21:27:44 -08:00
Erich Blume	cd25dae8f9	Extend forgejo_actions_secrets role to support multiple repos Uses subelements loop to sync secrets across repos. Adds FORGE_TOKEN to the cv repo for package uploads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 11:15:28 -08:00
Erich Blume	01e19023ee	Add CV/resume web app at cv.ops.eblu.me (#169 ) ## Summary - nginx container (`containers/cv/`) downloads and serves a content tarball at startup (same pattern as quartz) - ArgoCD app + k8s manifests (deployment, service, Tailscale ingress) - Caddy route for `cv.ops.eblu.me` - Deploy workflow: resolves "latest" or specific version from Forgejo packages, updates deployment, syncs ArgoCD - Content is built and released from the separate [cv repo](https://forge.ops.eblu.me/eblume/cv) ## Deployment steps (after merge) 1. `mise run container-tag-and-release cv v1.0.0` 2. Run "Release CV" workflow in cv repo (SPECIFIC_VERSION `v0.1.0`) 3. Run "Deploy CV" workflow in blumeops (default: latest) 4. `mise run provision-indri -- --tags caddy` 5. Verify at `https://cv.ops.eblu.me/` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/169	2026-02-12 11:09:41 -08:00
Erich Blume	f65d11d55b	Update BorgBase repo ID after recreation (#144 ) ## Summary - Previous BorgBase repo (k04ljcd7) had corrupted segments from interrupted backup attempts - Recreated as u3ugi1x1 (same US region, same SSH key, same append-only settings) - Updates repo path in Ansible defaults and known_hosts hostname in tasks ## Post-merge 1. `mise run provision-indri -- --tags borgmatic` 2. `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'` 3. `mise x -- borgmatic create --repository borgbase-offsite --verbosity 1 --progress` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/144	2026-02-10 13:19:15 -08:00
Erich Blume	d045a5d76a	Add BorgBase offsite backup repository (#142 ) ## Summary - Adds BorgBase as a second borgmatic repository for offsite backups (US region, append-only) - SSH key managed via 1Password, deployed to indri by Ansible - Borgmatic `ssh_command` configured to use the dedicated BorgBase key - BorgBase host key pinned in known_hosts via Ansible ## Post-merge deployment steps 1. Provision borgmatic: `mise run provision-indri -- --tags borgmatic` 2. Initialize the BorgBase repo: `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'` 3. Export and store the borg repokey: `ssh indri 'borg key export ssh://k04ljcd7@k04ljcd7.repo.borgbase.com/./repo'` → save to 1Password 4. Verify first backup: `ssh indri 'mise x -- borgmatic create --repository borgbase-offsite --verbosity 1'` ## BorgBase setup (already done) - Account created, API token in 1Password (`borgbase` item in blumeops vault) - SSH keypair generated, stored in 1Password, public key uploaded to BorgBase (ID: 200815) - Repository `indri-borgmatic` created (ID: k04ljcd7, US region, append-only, 2-day alert) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/142	2026-02-10 12:47:02 -08:00
Erich Blume	d76d675b29	Fix minikube role skipping start when kubelet/apiserver are stopped (#137 ) ## Summary - After a power loss, minikube's Docker container (host) restarts but kubelet/apiserver remain stopped - The ansible role's status check used `--format='{{.Host}}'` which only examined the host VM state - When host=Running but kubelet/apiserver=Stopped, the role skipped `minikube start` - Fixed to use full `minikube status` exit code (returns non-zero when any component is unhealthy) - Simplified all downstream conditions to use exit code instead of string matching ## Test plan - [x] Verified the fix correctly skips `minikube start` when cluster is already fully running - [x] Pre-commit hooks pass (ansible-lint, yamllint, etc.) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/137	2026-02-09 23:03:01 -08:00
Erich Blume	85e36cd807	Operations and observability for sifaka NAS (#135 ) ## Summary - Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring - Formalize existing `node_exporter` container under Ansible management - Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus - Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime) - Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS - Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles ## Prerequisites before deploy - [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP) - [ ] Verify `ssh eblume@sifaka 'docker ps'` works - [ ] Run `mise run provision-sifaka` to deploy containers - [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes - [ ] `argocd app sync prometheus` + `argocd app sync grafana-config` ## Test plan - [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics` - [ ] Verify Prometheus targets page shows both sifaka jobs as UP - [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135	2026-02-09 17:44:05 -08:00
Erich Blume	7f41621c7f	Migrate Ansible op calls to op read URI syntax (#125 ) ## Summary - Convert all 12 `op item get ... --fields ... --reveal` calls in Ansible to the newer `op read "op://vault/item/field"` syntax - Remove the `regex_replace` workaround on the Fly deploy token (no longer needed since `op read` returns clean unquoted values) - Covers `ansible/playbooks/indri.yml`, `ansible/roles/caddy/tasks/main.yml`, `ansible/roles/jellyfin_metrics/tasks/main.yml`, and `ansible/roles/alloy/tasks/main.yml` ## Test plan - [x] `mise run provision-indri -- --check --diff` dry run passes (ok=67, failed=0) - [x] No `op item get` calls remain in `ansible/` directory - [x] All pre-commit hooks pass (yaml, ansible-lint, TruffleHog, etc.) - [ ] Full provision run after merge to confirm secrets resolve correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/125	2026-02-08 10:52:43 -08:00
Erich Blume	6b1167a345	Fix Fly.io deploy token quoting (#121 ) ## Summary - Strip literal quotes from Fly.io deploy token when syncing to Forgejo Actions secrets - The `op` CLI wraps values containing spaces in quotes; the Fly.io token format is `FlyV1 <key>` (contains a space) - This caused the CI deploy workflow to fail with a 401 auth error ## Test plan - [x] `mise run provision-indri -- --tags forgejo_actions_secrets` succeeds - [ ] Re-trigger deploy-fly workflow after merge — should authenticate successfully 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/121	2026-02-08 02:43:45 -08:00
Erich Blume	64a78422b1	Add Fly.io public reverse proxy for docs.eblu.me (#120) ## Summary - Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale - First service exposed: `docs.eblu.me` — the Quartz static docs site - Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME - Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow ## Key details - Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed - Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts - nginx caches aggressively for the static site; health check is on the default_server block - ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only - DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev` ## Test plan - [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok` - [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status` - [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert - [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected) - [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120	2026-02-08 02:36:19 -08:00
Erich Blume	74bd5abe54	Add IaC for Forgejo Actions secrets via Ansible (#107 ) ## Summary - New `forgejo_actions_secrets` Ansible role syncs repository-level Actions secrets from 1Password to Forgejo via the Forgejo API - Replaces manual process of copying secrets from 1Password to Forgejo UI - Documents the one-time PAT setup requirement in forgejo.md ## Manual Setup Required Before this role can run, a Forgejo PAT must be created: 1. Go to https://forge.ops.eblu.me/user/settings/applications 2. Create a new token with `write:repository` scope 3. Store it in 1Password → "Forgejo Secrets" item → `api-token` field This has already been done. ## Test Plan - [x] Ran `mise run provision-indri -- --tags forgejo_actions_secrets` successfully - [x] Verified secret synced (API returned 204 = updated existing) - [x] Ansible-lint passes 🤖 Generated with [Claude Code](https://claude.ai/code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/107	2026-02-04 09:11:01 -08:00
Erich Blume	72f9f21d46	Remove iCloud Photos from borgmatic backup (#100 ) ## Summary - Remove ~/Pictures from borgmatic source directories - Update borgmatic and backup policy documentation - Add Sifaka-Native Data section to clarify that photos (via Immich), music (via Navidrome), and video (via Jellyfin) are stored directly on Sifaka ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags borgmatic --check --diff` to preview changes - [ ] Run `mise run provision-indri -- --tags borgmatic` to apply - [ ] Verify borgmatic config no longer includes ~/Pictures 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/100	2026-02-04 07:09:28 -08:00
Erich Blume	1c86134a62	Phase 1b: Deploy docs hosting with Quartz (#85 ) ## Summary - Add ArgoCD Application and manifests for `quartz` service - Add `docs.ops.eblu.me` to Caddy reverse proxy configuration - ConfigMap points to blumeops v1.0.0 release tarball - Tailscale ingress with homepage annotations for auto-discovery ## Deployment and Testing Pre-deployment (container build): - [ ] Build and tag quartz container: `mise run container-tag-and-release quartz v1.0.0` K8s deployment: - [ ] Sync apps: `argocd app sync apps` - [ ] Point quartz at feature branch: `argocd app set quartz --revision feature/docs-phase-1b-hosting` - [ ] Sync quartz: `argocd app sync quartz` - [ ] Verify pod is running: `kubectl --context=minikube-indri get pods -n quartz` - [ ] Verify Tailscale ingress: `kubectl --context=minikube-indri get ingress -n quartz` Caddy deployment: - [ ] Dry run: `mise run provision-indri -- --tags caddy --check --diff` - [ ] Apply: `mise run provision-indri -- --tags caddy` Verification: - [ ] Test https://docs.tail8d86e.ts.net - [ ] Test https://docs.ops.eblu.me - [ ] Verify homepage dashboard shows docs link Post-merge: - [ ] Reset to main: `argocd app set quartz --revision main && argocd app sync quartz` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/85	2026-02-03 10:52:20 -08:00
Erich Blume	ade21cc49e	Add Navidrome music streaming server (#79 ) ## Summary - Deploy Navidrome music streaming server to k8s - NFS mount for music library from sifaka:/volume1/music (read-only) - Local PVC for SQLite database and config (10Gi) - Tailscale ingress for dj.tail8d86e.ts.net - Caddy reverse proxy for dj.ops.eblu.me - Homepage annotations for dashboard discovery in Media group ## Deployment and Testing - [ ] Sync `apps` application to pick up new Application definition - [ ] Set navidrome app to feature branch and sync - [ ] Verify NFS mount with `kubectl exec` - [ ] Provision Caddy for dj.ops.eblu.me - [ ] Access https://dj.ops.eblu.me and create initial admin user - [ ] Verify Homepage shows DJ in Media group - [ ] Reset to main and resync after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/79	2026-01-31 20:19:31 -08:00
Erich Blume	b8b33b76c8	Remove Plex media server (#78 ) ## Summary - Remove plex_metrics ansible role - Remove Plex Grafana dashboard - Remove Plex log collection from Alloy config - Update indri-services-check to check Jellyfin instead of Plex ## Deployment and Testing - [x] Unloaded plex-metrics LaunchAgent on indri - [x] Deleted plex-metrics plist and script - [x] Deleted plex.prom textfile - [ ] Deploy Alloy config update - [ ] Sync grafana-config to remove dashboard 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/78	2026-01-30 17:06:00 -08:00
Erich Blume	bcc8685316	Add Jellyfin media server deployment (#77 ) ## Summary - Add Jellyfin ansible role for native macOS deployment via Homebrew cask - Add jellyfin_metrics role for Prometheus textfile metrics collection - Add Caddy routing for jellyfin.ops.eblu.me - Add Alloy log collection for Jellyfin stdout/stderr - Add Grafana dashboard for Jellyfin monitoring ## Architecture Jellyfin runs natively on indri (not in k8s) for full VideoToolbox hardware transcoding support. The M1 Mac Mini can handle ~3 concurrent 4K HDR→SDR transcoding streams. ## Deployment and Testing - [ ] Deploy Jellyfin: `mise run provision-indri -- --tags jellyfin,jellyfin_metrics,caddy,alloy` - [ ] Sync Grafana dashboard: `argocd app sync grafana-config` - [ ] Complete Jellyfin setup wizard at https://jellyfin.ops.eblu.me - [ ] Generate API key and save to `~/.jellyfin-api-key` - [ ] Add media libraries (/Volumes/allisonflix/Movies, /Volumes/allisonflix/TV) - [ ] Enable VideoToolbox hardware transcoding - [ ] Verify metrics in Grafana dashboard - [ ] Verify logs in Loki: `{service="jellyfin"}` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/77	2026-01-30 16:57:26 -08:00
Erich Blume	d1164c8aac	Add Hajimari service dashboard (#73 ) ## Summary - Add Hajimari as a service dashboard/start page at `go.ops.eblu.me` - Auto-discovers k8s services from ingress annotations - Custom apps for non-k8s services: Forgejo, Registry, Sifaka NAS - Add `nas.ops.eblu.me` Caddy proxy to Synology dashboard ## Services Configured Auto-discovered (k8s ingresses with hajimari.io annotations): - Grafana, ArgoCD, Prometheus, Loki (Observability) - Miniflux, Kiwix, Transmission, TeslaMate, Immich (Apps) - PyPI/devpi (Infrastructure) Custom apps (non-k8s): - Forgejo (forge.ops.eblu.me) - Registry (registry.ops.eblu.me) - Sifaka NAS (nas.ops.eblu.me) Bookmarks: - Tailscale Admin, 1Password, Pulumi ## Deployment and Testing - [ ] Sync `apps` application to pick up new Hajimari Application - [ ] Sync `hajimari` application - [ ] Run `mise run provision-indri -- --tags caddy` for go/nas proxy entries - [ ] Re-sync all k8s apps with hajimari annotations (or wait for natural drift) - [ ] Verify https://go.ops.eblu.me shows dashboard with all services 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/73	2026-01-29 15:51:42 -08:00
Erich Blume	2bc826c31f	Move metrics scripts from ~/bin to ~/.local/bin (#70 ) ## Summary - Update all metrics role defaults to install scripts to ~/.local/bin following XDG conventions - Scripts already manually moved on indri from ~/bin to ~/.local/bin - Cleaned up orphaned scripts (devpi-metrics, transmission-metrics, mcquack) and plist files ## Deployment and Testing - [x] Manually moved scripts on indri - [x] Deleted orphaned plist files (devpi-metrics, devpi, kiwix-serve, transmission-metrics) - [x] Deleted orphaned scripts (devpi-metrics, transmission-metrics, mcquack) - [x] Verified no metrics dependencies on orphaned scripts (checked alloy config and textfile directory) - [ ] Run ansible to update LaunchAgent plist files with new paths - [ ] Verify metrics collection continues working 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/70	2026-01-29 09:59:38 -08:00
Erich Blume	3971670832	Remove immich-sync ansible role (#65 ) ## Summary - Remove immich_sync ansible role (server-side photo sync via osxphotos) - The Immich iOS app has built-in automatic backup that replaces this functionality - iOS app supports foreground/background backup and can sync iCloud photos directly ## Deployment and Testing - [ ] Clean up files on indri (see manual cleanup commands below) - [ ] Configure Immich iOS app for automatic backup ### Manual cleanup on indri: ```bash # Unload and remove LaunchAgent launchctl unload ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist rm ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist # Remove script and credentials rm ~/bin/immich-sync.sh rm ~/.immich-api-key # Remove logs rm ~/Library/Logs/mcquack.immich-sync.*.log # Optionally remove export directory (check if empty first) ls ~/Pictures/immich-export # rm -r ~/Pictures/immich-export ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/65	2026-01-28 08:49:22 -08:00
Erich Blume	54945be0e3	Add immich-sync ansible role for photo library sync (#63) ## Summary - Add `immich_sync` role that syncs macOS Photos Library to Immich - Uses osxphotos to export photos with metadata to staging directory - Uses immich-cli (via Docker) to upload to Immich server - LaunchAgent schedules hourly syncs following mcquack pattern - API key fetched from 1Password in playbook pre_tasks ## Architecture ``` Photos Library → osxphotos export → ~/Pictures/immich-export/ → immich-cli upload → Immich ``` ## Prerequisites (manual) - Install osxphotos on indri: Add `"pipx:osxphotos" = "latest"` to `~/.config/mise/config.toml`, run `mise install` - Docker is already installed on indri ## Deployment and Testing - [ ] Dry run: `mise run provision-indri -- --tags immich_sync --check --diff` - [ ] Deploy: `mise run provision-indri -- --tags immich_sync` - [ ] Verify LaunchAgent: `ssh indri 'launchctl list | grep immich'` - [ ] Test manual sync: `ssh indri '~/bin/immich-sync.sh'` - [ ] Check logs: `ssh indri 'tail -50 ~/Library/Logs/mcquack.immich-sync.out.log'` - [ ] Verify photos in Immich at https://photos.ops.eblu.me 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/63	2026-01-26 12:38:38 -08:00
Erich Blume	8621996343	Add Immich photo management + migrate forge URLs (#62 ) ## Summary - Migrate all ArgoCD app repo URLs from `indri.tail8d86e.ts.net:2200` to `forge.ops.eblu.me:2222` - Add Immich self-hosted photo management service with: - Helm chart deployment via ArgoCD - PostgreSQL cluster with pgvecto.rs for AI vector search (immich-pg) - NFS storage on sifaka for photo library (2Ti) - Tailscale Ingress + Caddy proxy for `photos.ops.eblu.me` - Machine learning service for face/object recognition ## Deployment and Testing - [x] Update ArgoCD repo-creds-forge secret with new URL (one-time manual step) - [ ] Sync `apps` to pick up new applications - [ ] Sync all existing apps to verify new forge URL works - [ ] Sync `blumeops-pg` to deploy immich-pg cluster - [ ] Wait for immich-pg to be healthy - [ ] Create immich-db secret from auto-generated password - [ ] Sync `immich-storage` (PV, PVC, Ingress) - [ ] Sync `immich` (Helm chart) - [ ] Run `mise run provision-indri -- --tags caddy` to add photos.ops.eblu.me - [ ] Verify Immich UI is accessible 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/62	2026-01-26 11:20:11 -08:00
Erich Blume	ea42362b6f	Migrate Forgejo runner to Kubernetes with DinD (#60 ) ## Summary - Deploy Forgejo runner to k8s with Docker-in-Docker sidecar - Add job execution image with Node.js and Docker CLI - Retire host-mode runner on indri - All CI jobs now run containerized in k8s ## Components Added - `containers/forgejo-runner/Dockerfile` - Job execution image - `argocd/apps/forgejo-runner.yaml` - ArgoCD Application - `argocd/manifests/forgejo-runner/` - Kubernetes manifests ## Components Removed - `ansible/roles/forgejo_runner/` - No longer needed ## Changes to Existing Files - `.forgejo/workflows/build-container.yaml` - Use `k8s` runner with `DOCKER_HOST` env - `.github/actionlint.yaml` - Only `k8s` label now valid ## Deployment 1. Apply secret: `op inject -i argocd/manifests/forgejo-runner/secret.yaml.tpl \| kubectl --context=minikube-indri apply -f -` 2. Sync ArgoCD: `argocd app sync forgejo-runner` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/60	2026-01-25 19:56:17 -08:00
Erich Blume	66badfafd1	Migrate k8s services to Caddy (.ops.eblu.me) (#59 ) ## Summary - Add Caddy reverse proxy routes for all k8s services (grafana, argocd, prometheus, loki, miniflux, devpi, kiwix, torrent, teslamate) - Add PostgreSQL via Caddy L4 TCP proxy on port 5432 - Caddy proxies to existing Tailscale endpoints - traffic stays local on indri - Both `.ops.eblu.me` and `.tail8d86e.ts.net` URLs continue to work ## Updated References - Alloy: prometheus/loki push endpoints → `.ops.eblu.me` - Borgmatic: PostgreSQL backup host → `pg.ops.eblu.me` - Devpi: DEVPI_OUTSIDE_URL → `pypi.ops.eblu.me` - indri-services-check: health check URLs - CLAUDE.md: argocd login command ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags caddy` to deploy new Caddy config - [ ] Test HTTP services: `curl https://grafana.ops.eblu.me/api/health` - [ ] Test PostgreSQL: `pg_isready -h pg.ops.eblu.me -p 5432` - [ ] Run `mise run provision-indri -- --tags alloy` to update Alloy endpoints - [ ] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic - [ ] Sync devpi in ArgoCD: `argocd app sync devpi` - [ ] Re-login to ArgoCD: `argocd login argocd.ops.eblu.me ...` - [ ] Run `mise run indri-services-check` to verify all services 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/59	2026-01-25 12:56:31 -08:00
Erich Blume	d6e6b48f6a	Migrate registry to Caddy (registry.ops.eblu.me) (#58 ) ## Summary - Update all references from `registry.tail8d86e.ts.net` to `registry.ops.eblu.me` - Remove `tailscale_serve` ansible role (no longer needed - all services migrated to Caddy) - Update minikube containerd config for new registry URL - Update devpi manifest, CI actions, and mise tasks ## Deployment and Testing - [ ] Run `mise run provision-indri -- --check --diff` (dry run) - [ ] Run `mise run provision-indri -- --tags minikube` to update containerd config - [ ] Sync devpi ArgoCD app: `argocd app sync devpi` - [ ] Manually remove old Tailscale serve entry: `ssh indri 'tailscale serve --service=svc:registry off'` - [ ] Test registry access: `curl https://registry.ops.eblu.me/v2/_catalog` - [ ] Run `mise run indri-services-check` to verify all services healthy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/58	2026-01-25 12:06:15 -08:00
Erich Blume	1184b4de1d	Add Caddy layer4 for Forgejo SSH (#56 ) ## Summary - Add layer4 TCP proxy configuration to Caddyfile template for SSH services - Configure Forgejo SSH on port 2222 → localhost:2200 - Switch HTTPS from port 8443 (testing) to 443 (production) - Requires Caddy rebuilt with `github.com/mholt/caddy-l4` plugin ## What This Enables Git+SSH access via `forge.ops.eblu.me:2222` is now accessible from: - Tailnet clients (gilbert) - Docker containers on indri - Kubernetes pods in minikube This solves the DNS resolution issues where containers couldn't reach Tailscale MagicDNS names. ## Testing Done - [x] Caddy rebuilt with layer4 plugin - [x] Validated Caddyfile syntax - [x] Cleared `svc:forge` from tailscale serve - [x] Verified HTTPS works: `curl https://forge.ops.eblu.me` - [x] Verified SSH works: `ssh -p 2222 forgejo@forge.ops.eblu.me` - [x] Verified git clone works via new endpoint - [x] Verified minikube pods can reach both HTTPS and SSH endpoints ## Deployment Caddy is already running with the new config on indri. This PR captures the ansible changes. ## Next Steps - Update zk docs with new git remote format - Migrate registry and other services to Caddy - Retire tailscale_services ansible role 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/56	2026-01-25 11:37:23 -08:00
Erich Blume	682a68dc9c	Add Caddy reverse proxy for blumeops services (#55 ) ## Summary - Add Caddy ansible role following zot pattern (manual build, ansible deploy) - Caddy built with Gandi DNS plugin for ACME DNS-01 challenges - Gandi PAT fetched from 1Password and written to secured file on indri - Configure wildcard TLS for `*.ops.eblu.me` - Initial services: forge, registry (indri-local) - Uses port 8443 during testing to avoid Tailscale serve conflicts ## Build Instructions (already done) On indri: ```bash cd ~/code/3rd/caddy && mise run build ``` ## Deployment and Testing - [ ] Review Caddyfile configuration - [ ] Run `mise run provision-indri -- --tags caddy` to deploy - [ ] Test: `curl -v https://forge.ops.eblu.me:8443` (should get TLS cert) - [ ] Test: `curl -v https://registry.ops.eblu.me:8443/v2/` (should return `{}`) - [ ] Once verified, switch to port 443 and migrate services from Tailscale serve ## Files Changed - `ansible/playbooks/indri.yml` - Add pre_task for Gandi PAT, add caddy role - `ansible/roles/caddy/` - New role with Caddyfile and LaunchAgent templates 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/55	2026-01-25 09:35:06 -08:00
Erich Blume	8ca8798121	Switch to Buildah for container builds (#51 ) ## Summary - Replace Docker with Buildah for container image builds - No Docker socket required - buildah is daemonless - Cleaner security model (no privileged containers or socket mounting) - Remove Docker-related security context from deployment ## Changes - Update Dockerfile to install buildah/podman instead of docker-cli - Configure buildah storage with overlay driver and fuse-overlayfs - Update composite action to use `buildah bud` and `buildah push` - Add `imagePullPolicy: Always` to ensure fresh image pulls - Update test workflow to verify buildah/podman ## Testing - [ ] Runner pod starts successfully - [ ] Buildah is available in runner - [ ] Test workflow verifies buildah/podman versions - [ ] Container build workflow builds and pushes to zot 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/51	2026-01-24 13:30:26 -08:00
Erich Blume	7893c41020	Enable Forgejo Actions (Phase 1) (#48 ) ## Summary - Refactor Forgejo app.ini to be managed by ansible with secrets from 1Password - Enable Forgejo Actions in config (`[actions] ENABLED = true`) - Add `repo.actions` to DEFAULT_REPO_UNITS - Clean up unused MySQL database fields (we use SQLite) ## Phase 1 Progress This PR covers the first part of Phase 1 (ci-cd-bootstrap plan): - [x] Refactor app.ini to ansible template - [x] Store secrets in 1Password - [x] Enable Actions in config - [ ] Deploy config changes (pending review) - [ ] Create runner registration token - [ ] Deploy runner to k8s - [ ] Test with simple workflow ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags forgejo` to deploy - [ ] Verify Forgejo restarts correctly - [ ] Verify Actions tab appears in repo settings 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/48	2026-01-23 17:00:12 -08:00
Erich Blume	272ddb213b	Add TeslaMate deployment for Tesla Model Y data logging (#47 ) ## Summary - Add TeslaMate k8s deployment with Tailscale ingress at tesla.tail8d86e.ts.net - Add teslamate user to CloudNativePG blumeops-pg cluster - Add TeslaMate PostgreSQL datasource to Grafana - Import 18 TeslaMate Grafana dashboards for charging, drives, efficiency, etc. - Add teslamate database to borgmatic backup configuration ## Deployment and Testing - [ ] Create 1Password items: "TeslaMate DB Password" and "TeslaMate Encryption Key" - [ ] Apply database user secret: `op inject -i argocd/manifests/databases/secret-teslamate.yaml.tpl \| kubectl apply -f -` - [ ] Sync blumeops-pg: `argocd app sync blumeops-pg` - [ ] Create teslamate database - [ ] Apply teslamate secrets (encryption key, db connection) - [ ] Apply Grafana datasource secret: `op inject -i argocd/manifests/grafana-config/secret-teslamate-datasource.yaml.tpl \| kubectl apply -f -` - [ ] Sync apps and teslamate: `argocd app sync apps teslamate grafana grafana-config` - [ ] Complete Tesla API OAuth flow at https://tesla.tail8d86e.ts.net - [ ] Verify data collection starts - [ ] Verify Grafana dashboards show data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/47	2026-01-22 21:25:44 -08:00
Erich Blume	16bfe06b7b	Fix LaunchDaemon check to use become: true LaunchDaemons run in the system domain and require sudo to query. Without become: true, the check always fails and tries to reload. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 17:34:23 -08:00
Erich Blume	57bf8512dc	Log filtering cleanup and observability improvements (#45 ) ## Summary - Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue) - Disable thermal collector on indri Alloy (not supported on macOS M1) - Add macOS power/thermal metrics collection via powermetrics LaunchDaemon - Add Power & Thermal section to macOS Grafana dashboard - Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.) - Extract more fields from JSON logs (zot compatibility - uses "message" not "msg") - Silence logfmt parse errors for non-logfmt logs - Fix JSON escaping in devpi dashboard ## Deployment and Testing - [x] Deployed Alloy config changes to indri via ansible - [x] Synced alloy-k8s and grafana-config via ArgoCD - [x] Verified power metrics appearing in Prometheus - [x] Verified thermal collector errors stopped - [x] Verified logfmt parse errors silenced - [x] Verified devpi dashboard loads correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45	2026-01-22 17:30:08 -08:00
Erich Blume	e4a8405de7	Observability cleanup and k8s service monitoring (#43 ) (#43 ) ## Summary - Remove stale `/opt/homebrew/var/loki` from borgmatic backup (Loki migrated to k8s) - Add Alloy k8s DaemonSet for automatic pod log collection with auto-discovery - Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd - Add transmission-exporter sidecar for full metrics (speed, torrent counts, ratios) - Replace stale devpi dashboard with probe-based metrics (status, response time, uptime) - Add unified "K8s Services Health" dashboard for service uptime/response monitoring ## Manual cleanup already performed - Deleted stale textfile metrics on indri: `devpi.prom`, `transmission.prom` - Deleted stale data directories on indri: `/opt/homebrew/var/loki/`, `/opt/homebrew/var/prometheus/` ## Deployment and Testing - [x] Sync `apps` application to pick up new alloy-k8s app - [x] Deploy alloy-k8s on feature branch: `argocd app set alloy-k8s --revision feature/observability-cleanup && argocd app sync alloy-k8s` - [x] Deploy torrent on feature branch (for transmission exporter): `argocd app set torrent --revision feature/observability-cleanup && argocd app sync torrent` - [x] Deploy prometheus on feature branch (for new scrape config): `argocd app set prometheus --revision feature/observability-cleanup && argocd app sync prometheus` - [x] Deploy grafana-config on feature branch (for dashboards): `argocd app set grafana-config --revision feature/observability-cleanup && argocd app sync grafana-config` - [x] Verify pod logs appear in Loki/Grafana - [x] Verify transmission metrics appear in Prometheus - [x] Verify service probe metrics appear in Prometheus - [x] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic config - [ ] After merge, reset apps to main and resync 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/43	2026-01-22 13:51:01 -08:00
Erich Blume	17023085cb	Migrate observability stack to Kubernetes (#42 ) Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack. Summary - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal) - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net) - Add ACL rule for port 9187 (CNPG metrics) - Delete obsolete ansible roles for prometheus and loki Changes - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints - pulumi/policy.hujson - ACL for port 9187 - Deleted ansible/roles/prometheus/ and ansible/roles/loki/ Deployment and Testing - Stop prometheus and loki on indri - Sync ArgoCD apps (apps, prometheus, loki, grafana) - Run mise run provision-indri -- --tags alloy - Verify Grafana dashboards show data 🤖 Generated with https://claude.ai/claude-code Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42	2026-01-22 12:06:02 -08:00
Erich Blume	5a829e0afd	Remove unused indri tags and ansible roles (#41 ) ## Summary - Remove ansible roles for services migrated to k8s: devpi, kiwix, transmission - Also remove unused node_exporter and podman ansible roles - Remove service tags from indri for k8s-hosted services (grafana, kiwix, devpi, pg, feed) - Update indri description to reflect current architecture ## Changes Ansible roles removed (34 files, ~1000 lines): - devpi, devpi_metrics - kiwix - transmission, transmission_metrics - node_exporter - podman Pulumi indri tags removed: - tag:grafana, tag:kiwix, tag:devpi, tag:pg, tag:feed These services now run in k8s with their own Tailscale devices via tailscale-operator. ## Deployment and Testing - [x] Verified remaining ansible roles match indri.yml - [x] Verified no playbooks or role dependencies reference removed roles - [ ] Run `pulumi preview` to verify tag changes - [ ] Run `pulumi up` to apply tag changes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/41	2026-01-21 20:18:53 -08:00
Erich Blume	7ec98210a9	P6: Migrate Kiwix and Transmission to Kubernetes (#39 ) ## Summary - Add Transmission BitTorrent daemon to k8s (torrent namespace) - Add Kiwix ZIM archive server to k8s (kiwix namespace) - NFS storage from sifaka for shared torrent/ZIM data - Torrent-sync sidecar in kiwix deployment to manage declarative ZIM list - ZIM-watcher CronJob to auto-restart kiwix when new archives appear - Remove transmission, transmission_metrics, and kiwix ansible roles from indri - Remove svc:kiwix from tailscale_serve defaults ## Key Decisions - Direct NFS mount for kiwix (no PVC) since it shares storage with transmission - Shell wrapper for kiwix-serve command (glob expansion) - Accept HTTP 409 as "ready" in torrent sync (transmission session ID mechanism) - Completed downloads stored in `/downloads/complete/` on sifaka ## Deployment and Testing - [x] Deployed transmission to k8s - [x] Verified transmission web UI at torrent.tail8d86e.ts.net - [x] Moved existing ZIM files to complete folder - [x] Deployed kiwix to k8s - [x] Verified kiwix web UI at kiwix.tail8d86e.ts.net - [x] Stopped old services on indri - [x] Cleared svc:kiwix from Tailscale serve on indri - [x] Updated zk documentation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/39	2026-01-21 18:07:40 -08:00
Erich Blume	21848a7919	P5.1: Migrate minikube from podman to QEMU2 driver (#38 ) ## Summary - Migrate minikube from podman driver to qemu2 driver for proper NFS/SMB volume mount support - Update ansible minikube role with qemu installation and containerd runtime - Remove podman role dependency from indri.yml - Add synology user creation steps and post-migration zot reconfiguration notes ## Why Phase 6 (Kiwix/Transmission migration) was blocked because the podman driver lacks kernel capabilities for filesystem mounts. QEMU2 creates an actual VM with full mount support. ## Deployment and Testing - [ ] Create k8s-storage user on Synology DSM - [ ] Store credentials in 1Password (synology-k8s-storage) - [ ] Export current k8s state - [ ] Stop and delete podman-based minikube cluster - [ ] Run ansible to create QEMU2 cluster - [ ] Test NFS volume mount with test pod - [ ] Redeploy ArgoCD and all apps - [ ] Verify all services healthy - [ ] Reconfigure zot registry mirrors for containerd (post-migration) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/38	2026-01-21 16:03:37 -08:00
Erich Blume	0439fbb704	P5: Migrate devpi to Kubernetes (#34 ) ## Summary - Migrate devpi PyPI caching proxy from indri LaunchAgent to Kubernetes - Custom container image with devpi-server + devpi-web + auto-init - StatefulSet with 50Gi PVC, Tailscale Ingress at pypi.tail8d86e.ts.net - Remove devpi from ansible playbooks and update CLAUDE.md with k8s workflow ## Key Changes - Add CRI-O registry mirror config for registry.tail8d86e.ts.net - Change ArgoCD apps to manual sync (was auto-sync causing issues) - 2Gi memory limit for Whoosh indexer (reclaimed after startup) ## Deployment and Testing - [x] devpi pod healthy in k8s - [x] pip install through proxy works - [x] mcquack 1.0.0 uploaded and installable - [x] Old devpi stopped on indri ## Post-Merge Reset ArgoCD to main: ``` argocd app set apps --revision main && argocd app sync apps argocd app set devpi --revision main && argocd app sync devpi ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/34	2026-01-20 14:55:37 -08:00
Erich Blume	735b643429	P4: Miniflux migration + PostgreSQL consolidation (#33 ) ## Summary - Deploy miniflux in k8s via ArgoCD - Expose via Tailscale Ingress at feed.tail8d86e.ts.net - Retire brew PostgreSQL (no longer needed) - Rename k8s-pg to pg (canonical hostname) - Remove ansible miniflux and postgresql roles - Update borgmatic to backup pg.tail8d86e.ts.net - Update all zk documentation ## Deployment and Testing - [x] Miniflux pod running in k8s - [x] User login works at https://feed.tail8d86e.ts.net - [x] Feeds and entries visible - [x] brew miniflux and postgresql stopped - [x] Tailscale services migrated (feed, pg) - [x] zk documentation updated - [x] Run ansible to apply role removals - [ ] Verify borgmatic backup with new pg hostname 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/33	2026-01-20 09:04:47 -08:00
Erich Blume	eb952aae01	P3: PostgreSQL disaster recovery test and borgmatic k8s-pg backup (#32 ) ## Summary - Fixed borgmatic `borg: command not found` by adding `local_path` config option - Successfully tested disaster recovery: restored miniflux data from borgmatic backup to k8s-pg - Added borgmatic user to k8s-pg via CloudNativePG managed roles - Configured borgmatic to backup both localhost and k8s-pg PostgreSQL databases - Added Tailscale ACL grant for `tag:homelab` → `tag:k8s` on port 5432 - Disabled selfHeal on apps app to allow manual revision changes during development ## Changes - `ansible/roles/borgmatic/` - Added `local_path` and k8s-pg database entry - `ansible/roles/postgresql/tasks/main.yml` - Added k8s-pg to `.pgpass` - `argocd/apps/apps.yaml` - Disabled selfHeal - `argocd/manifests/databases/blumeops-pg.yaml` - Added borgmatic managed role - `argocd/manifests/databases/secret-borgmatic.yaml.tpl` - New secret template - `pulumi/policy.hujson` - Added ACL grant for backup access ## Deployment and Testing - [x] Borgmatic backup runs successfully - [x] Miniflux data restored to k8s-pg (2 users, 2 feeds, 44 entries verified) - [x] borgmatic user created in k8s-pg with pg_read_all_data role - [x] Both localhost and k8s-pg databases in backup archive - [x] zk documentation updated (borgmatic.md, postgresql.md) - [ ] After merge: set blumeops-pg app back to main revision 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/32	2026-01-19 18:00:32 -08:00
Erich Blume	f2541c3f77	Fix minikube role idempotency for zot mirror config (#31 ) ## Summary - Fixed trailing newline mismatch in config comparison (ansible command module strips whitespace, slurp preserves it) - Only copy temp file when config actually needs updating (avoids spurious changes) - Task now properly skips when config is already correct ## Deployment and Testing - [x] Verified idempotency: `changed=0` on repeated runs - [x] Verified change detection: corrupted config triggers proper update - [x] ansible-lint passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/31	2026-01-19 16:19:52 -08:00
Erich Blume	130c044523	Fix hanging minikube provision	2026-01-19 15:49:11 -08:00
Erich Blume	7e6742ad24	K8s Migration Phase 2: Grafana to Kubernetes (#30 ) ## Summary - Migrate Grafana from Homebrew/Ansible to Kubernetes deployment - Switch CloudNativePG to use forge-mirrored Helm chart (HTTPS, no auth needed) - Add Grafana Helm chart deployment via ArgoCD with multi-source pattern - Add Grafana config (Tailscale Ingress, 9 dashboard ConfigMaps) - Update Loki to bind 0.0.0.0 for k8s pod access via `host.containers.internal` ## Key Changes - `argocd/apps/grafana.yaml` - Grafana Helm chart Application - `argocd/apps/grafana-config.yaml` - Ingress + dashboard ConfigMaps - `argocd/apps/cloudnative-pg.yaml` - Now uses forge mirror instead of external Helm repo - `ansible/roles/loki/templates/loki-config.yaml.j2` - Bind 0.0.0.0 ## Deployment and Testing - [x] Deploy Loki config change: `mise run provision-indri -- --tags loki` - [x] Create namespace: `ki create namespace monitoring` - [x] Create secret: `op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl \| ki apply -f -` - [x] Sync ArgoCD apps (grafana, grafana-config) - [x] Verify Grafana works at https://grafana.tail8d86e.ts.net - [x] Remove svc:grafana from ansible tailscale_serve - [x] Stop brew grafana: `ssh indri 'brew services stop grafana'` - [x] Delete ansible grafana role 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/30	2026-01-19 14:40:25 -08:00
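
The idempotency fix in commit f2541c3f77 above (a trailing-newline mismatch between a whitespace-stripped command output and a byte-exact file read) can be illustrated with a minimal Python sketch. This is not the Ansible role's actual code; the function name `needs_update` and the sample config text are hypothetical, and the sketch assumes that only trailing whitespace should be ignored when deciding whether to rewrite the file.

```python
def needs_update(current: str, desired: str) -> bool:
    """Return True only if the configs differ beyond trailing whitespace.

    Hypothetical helper: `current` stands in for text read back with its
    trailing newline preserved, `desired` for text whose trailing
    whitespace was already stripped.
    """
    # Normalise both sides the same way so a lone trailing newline
    # never registers as a change.
    return current.rstrip() != desired.rstrip()


if __name__ == "__main__":
    on_disk = "[mirror]\nendpoint = example\n"   # made-up sample content
    rendered = "[mirror]\nendpoint = example"    # same content, no newline
    print(needs_update(on_disk, rendered))
    print(needs_update(on_disk, "[mirror]\nendpoint = other"))
```

Comparing normalised text first, and writing the file only when the comparison reports a difference, is what lets a repeated run report no changes.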


91 commits