Commit graph

91 commits

Author SHA1 Message Date
8765ee8706 Deploy Dex OIDC identity provider on ringtail with Grafana SSO
Adds Dex as a central OIDC identity provider running on ringtail's k3s
cluster. Grafana is integrated as the first SSO client via generic_oauth.
Dex uses Kubernetes CRD storage and ExternalSecrets for all sensitive
config (bcrypt hash, client secrets from 1Password).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 19:18:23 -08:00
16a4a9a616 Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216)
## Summary
- Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired
- Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s
- Move ntfy ArgoCD app from indri minikube to ringtail k3s
- Update Frigate docs to reflect detector removal and planned RTX 4080 migration
- Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail

## Deployment

After merge:
1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps:
   ```
   argocd app sync apps --prune
   ```
2. Sync new ringtail apps:
   ```
   argocd app sync mosquitto-ringtail
   argocd app sync ntfy-ringtail
   ```
3. Manually clean up the detector LaunchAgent on indri:
   ```
   ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist'
   ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist'
   ```

## Notes
- Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns)
- ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup
- Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed
- Frigate + frigate-notify will be ported to ringtail in a follow-up PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216
2026-02-19 11:22:44 -08:00
b475a1fcd7 Fix 1Password secret tasks always reporting changed in ringtail playbook (#213)
## Summary
- Replace `changed_when: true` with `register` + output inspection on the two 1Password secret tasks in `ringtail.yml`
- Tasks now correctly report `ok` when the secret content hasn't changed, and `changed` only when `kubectl apply` outputs `configured` or `created`

## Test plan
- [ ] Run `mise run provision-ringtail` twice — second run should show both tasks as `ok` not `changed`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/213
2026-02-19 07:25:24 -08:00
918df9e642 Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209)
## Summary

  Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions
  runner for building containers with Nix.

  ### K3s cluster
  - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint)
  - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale
  - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`)
  - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access
  - `kubectl` added to system packages

  ### 1Password Connect + External Secrets Operator
  - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`,
  `external-secrets-config-ringtail`
  - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server
  - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password`
  namespace in post_tasks

  ### Systemd Forgejo Actions runner
  - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host
  - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH
  - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env`
  - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access

  ### Nix container build workflow
  - New `.forgejo/workflows/build-container-nix.yaml` triggers on `*-nix-v[0-9]*` tags (e.g. `nettest-nix-v1.0.0`)
  - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy`
  - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering

  ### Mise task updates
  - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`)
  - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`)

  ## Post-merge

  1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild
  2. Register k3s cluster in ArgoCD (first time only):
     ```fish
     ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' | \
       sed 's|127.0.0.1|ringtail.tail8d86e.ts.net|' > /tmp/k3s-ringtail.yaml
     set -x KUBECONFIG /tmp/k3s-ringtail.yaml
     argocd cluster add default --name k3s-ringtail
  3. Sync ArgoCD apps in order: 1password-connect-ringtail -> external-secrets-crds-ringtail -> external-secrets-ringtail ->
  external-secrets-config-ringtail
  4. Verify runner: ssh ringtail 'systemctl status gitea-runner-nix-container-builder'
  5. Check Forgejo admin panel for ringtail-nix-builder runner online
  6. Test: create containers/<name>/default.nix, tag with <name>-nix-v0.1.0

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209
2026-02-18 21:15:30 -08:00
535f897054 Polish ringtail NixOS config and add documentation (#208)
## Summary
- Fix Super+Return keybinding to launch wezterm in sway
- Set fish as default login shell
- Remove `initialPassword` (real password already set)
- Add 1Password CLI + GUI, chezmoi, and dev tool packages (neovim, eza, fd, fzf, zoxide, starship, atuin, bat, ripgrep)
- Add ringtail reference card, update host inventory and reference index
- Changelog fragment

## Post-merge deployment
- `mise run provision-ringtail` to rebuild NixOS
- On ringtail: launch 1Password GUI, enable CLI integration (Settings > Developer > CLI integration)
- Chezmoi needs `.chezmoiignore` updates in the dotfiles repo (separate task)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/208
2026-02-18 17:53:47 -08:00
b76f2314c2 Add force: true to ringtail git task
nixos-rebuild can dirty the tree (e.g. flake.lock updates), which
blocks the Ansible git module. Force ensures we always reset to the
upstream state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-18 09:32:23 -08:00
b9d813cde1 Add NixOS configuration for ringtail workstation (#207)
## Summary
- NixOS flake for ringtail (gaming/compute workstation, RTX 4080) in `nixos/ringtail/`
- Declarative disk partitioning via disko (GPT, 512M EFI + ext4 root on NVMe)
- NVIDIA proprietary drivers, sway/Wayland desktop, greetd, PipeWire, Steam
- Tailscale integration for tailnet connectivity
- Ansible playbook + `mise run provision-ringtail` for ongoing management
- Pulumi auth key (`tag:homelab`, `tag:blumeops`) for tailnet bootstrap

## Deployment Order
1. **Merge PR**
2. `pulumi up` in tailscale stack → creates auth key
3. Retrieve auth key: `pulumi stack output ringtail_authkey --show-secrets`
4. On ringtail NixOS installer:
   - `nix run github:nix-community/disko -- --mode disko /tmp/disk-config.nix` (or from cloned repo)
   - `nixos-install --flake github:eblume/blumeops?dir=nixos/ringtail#ringtail`
5. Reboot, `tailscale up --auth-key=<key>`
6. Verify: `tailscale status`, SSH from gilbert

## Test plan
- [ ] Review NixOS configuration for completeness
- [ ] Verify disko partition layout matches ringtail hardware
- [ ] Run `pulumi preview` for tailscale stack
- [ ] Install NixOS on ringtail
- [ ] Confirm tailscale connectivity
- [ ] Confirm sway desktop works
- [ ] Test `mise run provision-ringtail` for ongoing management

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/207
2026-02-18 08:24:25 -08:00
5f9b024b4a Add Apple Silicon ZMQ detector for Frigate (#206)
## Summary

- New `frigate_detector` ansible role deploys the [apple-silicon-detector](https://github.com/frigate-nvr/apple-silicon-detector) as a LaunchAgent on indri
- Switches Frigate from ONNX CPU detector (~117ms) to ZMQ detector backed by CoreML/Neural Engine (~15ms)
- Removes detect FPS cap (no longer needed with fast inference)
- Updates Frigate docs and adds changelog fragment

## Deployment

### Phase 1: Deploy detector on indri (one-time setup + ansible)
```fish
ssh indri 'git clone https://github.com/frigate-nvr/apple-silicon-detector.git ~/code/3rd/apple-silicon-detector'
ssh indri 'cd ~/code/3rd/apple-silicon-detector && make install'
mise run provision-indri -- --tags frigate_detector --check --diff  # dry run
mise run provision-indri -- --tags frigate_detector                 # apply
ssh indri 'launchctl list mcquack.eblume.frigate-detector'          # verify running
ssh indri 'tail ~/Library/Logs/mcquack.frigate-detector.out.log'    # verify bound
```

### Phase 2: Test connectivity
```fish
kubectl --context=minikube-indri -n frigate exec deploy/frigate -- nc -vz host.minikube.internal 5555
```

### Phase 3: Deploy Frigate config (branch workflow)
```fish
argocd app set frigate --revision feature/frigate-zmq-detector && argocd app sync frigate
```

### Phase 4: Post-deploy checks
- [ ] Pod starts, no config errors
- [ ] `/api/stats` shows detector type zmq, inference_speed ~15ms
- [ ] detect_fps uncapped
- [ ] Recordings and MQTT events flowing
- [ ] After merge: `argocd app set frigate --revision main && argocd app sync frigate`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/206
2026-02-17 19:03:28 -08:00
04c7f3c45a Deploy Frigate NVR stack with Mosquitto, Ntfy, and frigate-notify (#190)
## Summary

Deploy a cloud-free NVR stack for the GableCam (ReoLink Elite Floodlight at 192.168.1.159):

- **Mosquitto** — shared MQTT broker in `mqtt` namespace (cluster-internal, no auth)
- **Ntfy** — self-hosted push notifications in `ntfy` namespace, exposed at `ntfy.tail8d86e.ts.net` / `ntfy.ops.eblu.me`
- **Frigate** — NVR with GableCam via HTTP-FLV, ONNX CPU detection, NFS recordings on sifaka, exposed at `nvr.tail8d86e.ts.net` / `nvr.ops.eblu.me`
- **frigate-notify** — bridges Frigate detection events (person, car, dog, cat) to Ntfy alerts via MQTT

Also includes:
- Prometheus scrape target for Frigate metrics
- Grafana dashboard for Frigate (status, inference speed, FPS, CPU/memory, storage)
- Caddy reverse proxy entries for `nvr.ops.eblu.me` and `ntfy.ops.eblu.me`

## Prerequisites

- [ ] Create NFS share `frigate` on sifaka (`/volume1/frigate`, RW for indri)
- [ ] Create 1Password item "Reolink Floodlight Camera" in `blumeops` vault with `username` and `password` fields

## Deployment (after merge)

```bash
argocd app sync apps
argocd app sync mosquitto
argocd app sync ntfy
argocd app sync frigate
argocd app sync grafana-config
argocd app sync prometheus
mise run provision-indri -- --tags caddy
mise run services-check
```

## Verification

- [ ] Mosquitto pod running, accepting connections on 1883
- [ ] Ntfy web UI accessible at `ntfy.ops.eblu.me`
- [ ] Frigate web UI at `nvr.ops.eblu.me` showing GableCam live feed
- [ ] Object detection working (ONNX, person/car/dog/cat)
- [ ] Recordings appearing in NFS share on sifaka
- [ ] frigate-notify sending detection alerts to Ntfy
- [ ] Prometheus scraping Frigate metrics
- [ ] Grafana dashboard showing Frigate data

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/190
2026-02-14 21:27:44 -08:00
cd25dae8f9 Extend forgejo_actions_secrets role to support multiple repos
Uses subelements loop to sync secrets across repos. Adds FORGE_TOKEN
to the cv repo for package uploads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-12 11:15:28 -08:00
01e19023ee Add CV/resume web app at cv.ops.eblu.me (#169)
## Summary
- nginx container (`containers/cv/`) downloads and serves a content tarball at startup (same pattern as quartz)
- ArgoCD app + k8s manifests (deployment, service, Tailscale ingress)
- Caddy route for `cv.ops.eblu.me`
- Deploy workflow: resolves "latest" or specific version from Forgejo packages, updates deployment, syncs ArgoCD
- Content is built and released from the separate [cv repo](https://forge.ops.eblu.me/eblume/cv)

## Deployment steps (after merge)
1. `mise run container-tag-and-release cv v1.0.0`
2. Run "Release CV" workflow in cv repo (SPECIFIC_VERSION `v0.1.0`)
3. Run "Deploy CV" workflow in blumeops (default: latest)
4. `mise run provision-indri -- --tags caddy`
5. Verify at `https://cv.ops.eblu.me/`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/169
2026-02-12 11:09:41 -08:00
f65d11d55b Update BorgBase repo ID after recreation (#144)
## Summary
- Previous BorgBase repo (k04ljcd7) had corrupted segments from interrupted backup attempts
- Recreated as u3ugi1x1 (same US region, same SSH key, same append-only settings)
- Updates repo path in Ansible defaults and known_hosts hostname in tasks

## Post-merge
1. `mise run provision-indri -- --tags borgmatic`
2. `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'`
3. `mise x -- borgmatic create --repository borgbase-offsite --verbosity 1 --progress`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/144
2026-02-10 13:19:15 -08:00
d045a5d76a Add BorgBase offsite backup repository (#142)
## Summary
- Adds BorgBase as a second borgmatic repository for offsite backups (US region, append-only)
- SSH key managed via 1Password, deployed to indri by Ansible
- Borgmatic `ssh_command` configured to use the dedicated BorgBase key
- BorgBase host key pinned in known_hosts via Ansible

## Post-merge deployment steps
1. Provision borgmatic: `mise run provision-indri -- --tags borgmatic`
2. Initialize the BorgBase repo: `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'`
3. Export and store the borg repokey: `ssh indri 'borg key export ssh://k04ljcd7@k04ljcd7.repo.borgbase.com/./repo'` → save to 1Password
4. Verify first backup: `ssh indri 'mise x -- borgmatic create --repository borgbase-offsite --verbosity 1'`

## BorgBase setup (already done)
- Account created, API token in 1Password (`borgbase` item in blumeops vault)
- SSH keypair generated, stored in 1Password, public key uploaded to BorgBase (ID: 200815)
- Repository `indri-borgmatic` created (ID: k04ljcd7, US region, append-only, 2-day alert)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/142
2026-02-10 12:47:02 -08:00
d76d675b29 Fix minikube role skipping start when kubelet/apiserver are stopped (#137)
## Summary
- After a power loss, minikube's Docker container (host) restarts but kubelet/apiserver remain stopped
- The ansible role's status check used `--format='{{.Host}}'` which only examined the host VM state
- When host=Running but kubelet/apiserver=Stopped, the role skipped `minikube start`
- Fixed to use full `minikube status` exit code (returns non-zero when any component is unhealthy)
- Simplified all downstream conditions to use exit code instead of string matching

## Test plan
- [x] Verified the fix correctly skips `minikube start` when cluster is already fully running
- [x] Pre-commit hooks pass (ansible-lint, yamllint, etc.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/137
2026-02-09 23:03:01 -08:00
85e36cd807 Operations and observability for sifaka NAS (#135)
## Summary
- Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring
- Formalize existing `node_exporter` container under Ansible management
- Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus
- Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime)
- Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS
- Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles

## Prerequisites before deploy
- [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP)
- [ ] Verify `ssh eblume@sifaka 'docker ps'` works
- [ ] Run `mise run provision-sifaka` to deploy containers
- [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes
- [ ] `argocd app sync prometheus` + `argocd app sync grafana-config`

## Test plan
- [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics`
- [ ] Verify Prometheus targets page shows both sifaka jobs as UP
- [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135
2026-02-09 17:44:05 -08:00
7f41621c7f Migrate Ansible op calls to op read URI syntax (#125)
## Summary
- Convert all 12 `op item get ... --fields ... --reveal` calls in Ansible to the newer `op read "op://vault/item/field"` syntax
- Remove the `regex_replace` workaround on the Fly deploy token (no longer needed since `op read` returns clean unquoted values)
- Covers `ansible/playbooks/indri.yml`, `ansible/roles/caddy/tasks/main.yml`, `ansible/roles/jellyfin_metrics/tasks/main.yml`, and `ansible/roles/alloy/tasks/main.yml`

## Test plan
- [x] `mise run provision-indri -- --check --diff` dry run passes (ok=67, failed=0)
- [x] No `op item get` calls remain in `ansible/` directory
- [x] All pre-commit hooks pass (yaml, ansible-lint, TruffleHog, etc.)
- [ ] Full provision run after merge to confirm secrets resolve correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/125
2026-02-08 10:52:43 -08:00
6b1167a345 Fix Fly.io deploy token quoting (#121)
## Summary

- Strip literal quotes from Fly.io deploy token when syncing to Forgejo Actions secrets
- The `op` CLI wraps values containing spaces in quotes; the Fly.io token format is `FlyV1 <key>` (contains a space)
- This caused the CI deploy workflow to fail with a 401 auth error

## Test plan

- [x] `mise run provision-indri -- --tags forgejo_actions_secrets` succeeds
- [ ] Re-trigger deploy-fly workflow after merge — should authenticate successfully

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/121
2026-02-08 02:43:45 -08:00
64a78422b1 Add Fly.io public reverse proxy for docs.eblu.me (#120)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 9s
## Summary

- Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale
- First service exposed: `docs.eblu.me` — the Quartz static docs site
- Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME
- Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow

## Key details

- Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed
- Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts
- nginx caches aggressively for the static site; health check is on the default_server block
- ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only
- DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev`

## Test plan

- [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok`
- [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status`
- [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert
- [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected)
- [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120
2026-02-08 02:36:19 -08:00
74bd5abe54 Add IaC for Forgejo Actions secrets via Ansible (#107)
## Summary
- New `forgejo_actions_secrets` Ansible role syncs repository-level Actions secrets from 1Password to Forgejo via the Forgejo API
- Replaces manual process of copying secrets from 1Password to Forgejo UI
- Documents the one-time PAT setup requirement in forgejo.md

## Manual Setup Required
Before this role can run, a Forgejo PAT must be created:
1. Go to https://forge.ops.eblu.me/user/settings/applications
2. Create a new token with `write:repository` scope
3. Store it in 1Password → "Forgejo Secrets" item → `api-token` field

This has already been done.

## Test Plan
- [x] Ran `mise run provision-indri -- --tags forgejo_actions_secrets` successfully
- [x] Verified secret synced (API returned 204 = updated existing)
- [x] Ansible-lint passes

🤖 Generated with [Claude Code](https://claude.ai/code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/107
2026-02-04 09:11:01 -08:00
72f9f21d46 Remove iCloud Photos from borgmatic backup (#100)
## Summary
- Remove ~/Pictures from borgmatic source directories
- Update borgmatic and backup policy documentation
- Add Sifaka-Native Data section to clarify that photos (via Immich), music (via Navidrome), and video (via Jellyfin) are stored directly on Sifaka

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --tags borgmatic --check --diff` to preview changes
- [ ] Run `mise run provision-indri -- --tags borgmatic` to apply
- [ ] Verify borgmatic config no longer includes ~/Pictures

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/100
2026-02-04 07:09:28 -08:00
1c86134a62 Phase 1b: Deploy docs hosting with Quartz (#85)
## Summary
- Add ArgoCD Application and manifests for `quartz` service
- Add `docs.ops.eblu.me` to Caddy reverse proxy configuration
- ConfigMap points to blumeops v1.0.0 release tarball
- Tailscale ingress with homepage annotations for auto-discovery

## Deployment and Testing

**Pre-deployment (container build):**
- [ ] Build and tag quartz container: `mise run container-tag-and-release quartz v1.0.0`

**K8s deployment:**
- [ ] Sync apps: `argocd app sync apps`
- [ ] Point quartz at feature branch: `argocd app set quartz --revision feature/docs-phase-1b-hosting`
- [ ] Sync quartz: `argocd app sync quartz`
- [ ] Verify pod is running: `kubectl --context=minikube-indri get pods -n quartz`
- [ ] Verify Tailscale ingress: `kubectl --context=minikube-indri get ingress -n quartz`

**Caddy deployment:**
- [ ] Dry run: `mise run provision-indri -- --tags caddy --check --diff`
- [ ] Apply: `mise run provision-indri -- --tags caddy`

**Verification:**
- [ ] Test https://docs.tail8d86e.ts.net
- [ ] Test https://docs.ops.eblu.me
- [ ] Verify homepage dashboard shows docs link

**Post-merge:**
- [ ] Reset to main: `argocd app set quartz --revision main && argocd app sync quartz`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/85
2026-02-03 10:52:20 -08:00
ade21cc49e Add Navidrome music streaming server (#79)
## Summary
- Deploy Navidrome music streaming server to k8s
- NFS mount for music library from sifaka:/volume1/music (read-only)
- Local PVC for SQLite database and config (10Gi)
- Tailscale ingress for dj.tail8d86e.ts.net
- Caddy reverse proxy for dj.ops.eblu.me
- Homepage annotations for dashboard discovery in Media group

## Deployment and Testing
- [ ] Sync `apps` application to pick up new Application definition
- [ ] Set navidrome app to feature branch and sync
- [ ] Verify NFS mount with `kubectl exec`
- [ ] Provision Caddy for dj.ops.eblu.me
- [ ] Access https://dj.ops.eblu.me and create initial admin user
- [ ] Verify Homepage shows DJ in Media group
- [ ] Reset to main and resync after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/79
2026-01-31 20:19:31 -08:00
b8b33b76c8 Remove Plex media server (#78)
## Summary
- Remove plex_metrics ansible role
- Remove Plex Grafana dashboard
- Remove Plex log collection from Alloy config
- Update indri-services-check to check Jellyfin instead of Plex

## Deployment and Testing
- [x] Unloaded plex-metrics LaunchAgent on indri
- [x] Deleted plex-metrics plist and script
- [x] Deleted plex.prom textfile
- [ ] Deploy Alloy config update
- [ ] Sync grafana-config to remove dashboard

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/78
2026-01-30 17:06:00 -08:00
bcc8685316 Add Jellyfin media server deployment (#77)
## Summary
- Add Jellyfin ansible role for native macOS deployment via Homebrew cask
- Add jellyfin_metrics role for Prometheus textfile metrics collection
- Add Caddy routing for jellyfin.ops.eblu.me
- Add Alloy log collection for Jellyfin stdout/stderr
- Add Grafana dashboard for Jellyfin monitoring

## Architecture
Jellyfin runs natively on indri (not in k8s) for full VideoToolbox hardware transcoding support. The M1 Mac Mini can handle ~3 concurrent 4K HDR→SDR transcoding streams.

## Deployment and Testing
- [ ] Deploy Jellyfin: `mise run provision-indri -- --tags jellyfin,jellyfin_metrics,caddy,alloy`
- [ ] Sync Grafana dashboard: `argocd app sync grafana-config`
- [ ] Complete Jellyfin setup wizard at https://jellyfin.ops.eblu.me
- [ ] Generate API key and save to `~/.jellyfin-api-key`
- [ ] Add media libraries (/Volumes/allisonflix/Movies, /Volumes/allisonflix/TV)
- [ ] Enable VideoToolbox hardware transcoding
- [ ] Verify metrics in Grafana dashboard
- [ ] Verify logs in Loki: `{service="jellyfin"}`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/77
2026-01-30 16:57:26 -08:00
d1164c8aac Add Hajimari service dashboard (#73)
## Summary
- Add Hajimari as a service dashboard/start page at `go.ops.eblu.me`
- Auto-discovers k8s services from ingress annotations
- Custom apps for non-k8s services: Forgejo, Registry, Sifaka NAS
- Add `nas.ops.eblu.me` Caddy proxy to Synology dashboard

## Services Configured

**Auto-discovered (k8s ingresses with hajimari.io annotations):**
- Grafana, ArgoCD, Prometheus, Loki (Observability)
- Miniflux, Kiwix, Transmission, TeslaMate, Immich (Apps)
- PyPI/devpi (Infrastructure)

**Custom apps (non-k8s):**
- Forgejo (forge.ops.eblu.me)
- Registry (registry.ops.eblu.me)
- Sifaka NAS (nas.ops.eblu.me)

**Bookmarks:**
- Tailscale Admin, 1Password, Pulumi

## Deployment and Testing
- [ ] Sync `apps` application to pick up new Hajimari Application
- [ ] Sync `hajimari` application
- [ ] Run `mise run provision-indri -- --tags caddy` for go/nas proxy entries
- [ ] Re-sync all k8s apps with hajimari annotations (or wait for natural drift)
- [ ] Verify https://go.ops.eblu.me shows dashboard with all services

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/73
2026-01-29 15:51:42 -08:00
2bc826c31f Move metrics scripts from ~/bin to ~/.local/bin (#70)
## Summary
- Update all metrics role defaults to install scripts to ~/.local/bin following XDG conventions
- Scripts already manually moved on indri from ~/bin to ~/.local/bin
- Cleaned up orphaned scripts (devpi-metrics, transmission-metrics, mcquack) and plist files

## Deployment and Testing
- [x] Manually moved scripts on indri
- [x] Deleted orphaned plist files (devpi-metrics, devpi, kiwix-serve, transmission-metrics)
- [x] Deleted orphaned scripts (devpi-metrics, transmission-metrics, mcquack)
- [x] Verified no metrics dependencies on orphaned scripts (checked alloy config and textfile directory)
- [ ] Run ansible to update LaunchAgent plist files with new paths
- [ ] Verify metrics collection continues working

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/70
2026-01-29 09:59:38 -08:00
3971670832 Remove immich-sync ansible role (#65)
## Summary
- Remove immich_sync ansible role (server-side photo sync via osxphotos)
- The Immich iOS app has built-in automatic backup that replaces this functionality
- iOS app supports foreground/background backup and can sync iCloud photos directly

## Deployment and Testing
- [ ] Clean up files on indri (see manual cleanup commands below)
- [ ] Configure Immich iOS app for automatic backup

### Manual cleanup on indri:
```bash
# Unload and remove LaunchAgent
launchctl unload ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist
rm ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist

# Remove script and credentials
rm ~/bin/immich-sync.sh
rm ~/.immich-api-key

# Remove logs
rm ~/Library/Logs/mcquack.immich-sync.*.log

# Optionally remove export directory (check if empty first)
ls ~/Pictures/immich-export
# rm -r ~/Pictures/immich-export
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/65
2026-01-28 08:49:22 -08:00
54945be0e3 Add immich-sync ansible role for photo library sync (#63)
## Summary
- Add `immich_sync` role that syncs macOS Photos Library to Immich
- Uses osxphotos to export photos with metadata to staging directory
- Uses immich-cli (via Docker) to upload to Immich server
- LaunchAgent schedules hourly syncs following mcquack pattern
- API key fetched from 1Password in playbook pre_tasks

## Architecture
```
Photos Library → osxphotos export → ~/Pictures/immich-export/ → immich-cli upload → Immich
```

## Prerequisites (manual)
- Install osxphotos on indri: Add `"pipx:osxphotos" = "latest"` to `~/.config/mise/config.toml`, run `mise install`
- Docker is already installed on indri

## Deployment and Testing
- [ ] Dry run: `mise run provision-indri -- --tags immich_sync --check --diff`
- [ ] Deploy: `mise run provision-indri -- --tags immich_sync`
- [ ] Verify LaunchAgent: `ssh indri 'launchctl list | grep immich'`
- [ ] Test manual sync: `ssh indri '~/bin/immich-sync.sh'`
- [ ] Check logs: `ssh indri 'tail -50 ~/Library/Logs/mcquack.immich-sync.out.log'`
- [ ] Verify photos in Immich at https://photos.ops.eblu.me

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/63
2026-01-26 12:38:38 -08:00
8621996343 Add Immich photo management + migrate forge URLs (#62)
## Summary
- Migrate all ArgoCD app repo URLs from `indri.tail8d86e.ts.net:2200` to `forge.ops.eblu.me:2222`
- Add Immich self-hosted photo management service with:
  - Helm chart deployment via ArgoCD
  - PostgreSQL cluster with pgvecto.rs for AI vector search (immich-pg)
  - NFS storage on sifaka for photo library (2Ti)
  - Tailscale Ingress + Caddy proxy for `photos.ops.eblu.me`
  - Machine learning service for face/object recognition

## Deployment and Testing
- [x] Update ArgoCD repo-creds-forge secret with new URL (one-time manual step)
- [ ] Sync `apps` to pick up new applications
- [ ] Sync all existing apps to verify new forge URL works
- [ ] Sync `blumeops-pg` to deploy immich-pg cluster
- [ ] Wait for immich-pg to be healthy
- [ ] Create immich-db secret from auto-generated password
- [ ] Sync `immich-storage` (PV, PVC, Ingress)
- [ ] Sync `immich` (Helm chart)
- [ ] Run `mise run provision-indri -- --tags caddy` to add photos.ops.eblu.me
- [ ] Verify Immich UI is accessible

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/62
2026-01-26 11:20:11 -08:00
ea42362b6f Migrate Forgejo runner to Kubernetes with DinD (#60)
## Summary
- Deploy Forgejo runner to k8s with Docker-in-Docker sidecar
- Add job execution image with Node.js and Docker CLI
- Retire host-mode runner on indri
- All CI jobs now run containerized in k8s

## Components Added
- `containers/forgejo-runner/Dockerfile` - Job execution image
- `argocd/apps/forgejo-runner.yaml` - ArgoCD Application
- `argocd/manifests/forgejo-runner/` - Kubernetes manifests

## Components Removed
- `ansible/roles/forgejo_runner/` - No longer needed

## Changes to Existing Files
- `.forgejo/workflows/build-container.yaml` - Use `k8s` runner with `DOCKER_HOST` env
- `.github/actionlint.yaml` - Only `k8s` label now valid

## Deployment
1. Apply secret: `op inject -i argocd/manifests/forgejo-runner/secret.yaml.tpl | kubectl --context=minikube-indri apply -f -`
2. Sync ArgoCD: `argocd app sync forgejo-runner`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/60
2026-01-25 19:56:17 -08:00
66badfafd1 Migrate k8s services to Caddy (*.ops.eblu.me) (#59)
All checks were successful
Build Container / build (push) Successful in 13s
## Summary
- Add Caddy reverse proxy routes for all k8s services (grafana, argocd, prometheus, loki, miniflux, devpi, kiwix, torrent, teslamate)
- Add PostgreSQL via Caddy L4 TCP proxy on port 5432
- Caddy proxies to existing Tailscale endpoints - traffic stays local on indri
- Both `*.ops.eblu.me` and `*.tail8d86e.ts.net` URLs continue to work

## Updated References
- Alloy: prometheus/loki push endpoints → `*.ops.eblu.me`
- Borgmatic: PostgreSQL backup host → `pg.ops.eblu.me`
- Devpi: DEVPI_OUTSIDE_URL → `pypi.ops.eblu.me`
- indri-services-check: health check URLs
- CLAUDE.md: argocd login command

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --tags caddy` to deploy new Caddy config
- [ ] Test HTTP services: `curl https://grafana.ops.eblu.me/api/health`
- [ ] Test PostgreSQL: `pg_isready -h pg.ops.eblu.me -p 5432`
- [ ] Run `mise run provision-indri -- --tags alloy` to update Alloy endpoints
- [ ] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic
- [ ] Sync devpi in ArgoCD: `argocd app sync devpi`
- [ ] Re-login to ArgoCD: `argocd login argocd.ops.eblu.me ...`
- [ ] Run `mise run indri-services-check` to verify all services

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/59
2026-01-25 12:56:31 -08:00
d6e6b48f6a Migrate registry to Caddy (registry.ops.eblu.me) (#58)
## Summary
- Update all references from `registry.tail8d86e.ts.net` to `registry.ops.eblu.me`
- Remove `tailscale_serve` ansible role (no longer needed - all services migrated to Caddy)
- Update minikube containerd config for new registry URL
- Update devpi manifest, CI actions, and mise tasks

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --check --diff` (dry run)
- [ ] Run `mise run provision-indri -- --tags minikube` to update containerd config
- [ ] Sync devpi ArgoCD app: `argocd app sync devpi`
- [ ] Manually remove old Tailscale serve entry: `ssh indri 'tailscale serve --service=svc:registry off'`
- [ ] Test registry access: `curl https://registry.ops.eblu.me/v2/_catalog`
- [ ] Run `mise run indri-services-check` to verify all services healthy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/58
2026-01-25 12:06:15 -08:00
1184b4de1d Add Caddy layer4 for Forgejo SSH (#56)
## Summary
- Add layer4 TCP proxy configuration to Caddyfile template for SSH services
- Configure Forgejo SSH on port 2222 → localhost:2200
- Switch HTTPS from port 8443 (testing) to 443 (production)
- Requires Caddy rebuilt with `github.com/mholt/caddy-l4` plugin

## What This Enables
Git+SSH access via `forge.ops.eblu.me:2222` is now accessible from:
- Tailnet clients (gilbert)
- Docker containers on indri
- Kubernetes pods in minikube

This solves the DNS resolution issues where containers couldn't reach Tailscale MagicDNS names.

## Testing Done
- [x] Caddy rebuilt with layer4 plugin
- [x] Validated Caddyfile syntax
- [x] Cleared `svc:forge` from tailscale serve
- [x] Verified HTTPS works: `curl https://forge.ops.eblu.me`
- [x] Verified SSH works: `ssh -p 2222 forgejo@forge.ops.eblu.me`
- [x] Verified git clone works via new endpoint
- [x] Verified minikube pods can reach both HTTPS and SSH endpoints

## Deployment
Caddy is already running with the new config on indri. This PR captures the ansible changes.

## Next Steps
- Update zk docs with new git remote format
- Migrate registry and other services to Caddy
- Retire tailscale_services ansible role

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/56
2026-01-25 11:37:23 -08:00
682a68dc9c Add Caddy reverse proxy for blumeops services (#55)
## Summary
- Add Caddy ansible role following zot pattern (manual build, ansible deploy)
- Caddy built with Gandi DNS plugin for ACME DNS-01 challenges
- Gandi PAT fetched from 1Password and written to secured file on indri
- Configure wildcard TLS for `*.ops.eblu.me`
- Initial services: forge, registry (indri-local)
- Uses port 8443 during testing to avoid Tailscale serve conflicts

## Build Instructions (already done)
On indri:
```bash
cd ~/code/3rd/caddy && mise run build
```

## Deployment and Testing
- [ ] Review Caddyfile configuration
- [ ] Run `mise run provision-indri -- --tags caddy` to deploy
- [ ] Test: `curl -v https://forge.ops.eblu.me:8443` (should get TLS cert)
- [ ] Test: `curl -v https://registry.ops.eblu.me:8443/v2/` (should return `{}`)
- [ ] Once verified, switch to port 443 and migrate services from Tailscale serve

## Files Changed
- `ansible/playbooks/indri.yml` - Add pre_task for Gandi PAT, add caddy role
- `ansible/roles/caddy/` - New role with Caddyfile and LaunchAgent templates

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/55
2026-01-25 09:35:06 -08:00
8ca8798121 Switch to Buildah for container builds (#51)
All checks were successful
Test CI / test (push) Successful in 4s
## Summary
- Replace Docker with Buildah for container image builds
- No Docker socket required - buildah is daemonless
- Cleaner security model (no privileged containers or socket mounting)
- Remove Docker-related security context from deployment

## Changes
- Update Dockerfile to install buildah/podman instead of docker-cli
- Configure buildah storage with overlay driver and fuse-overlayfs
- Update composite action to use `buildah bud` and `buildah push`
- Add `imagePullPolicy: Always` to ensure fresh image pulls
- Update test workflow to verify buildah/podman

## Testing
- [ ] Runner pod starts successfully
- [ ] Buildah is available in runner
- [ ] Test workflow verifies buildah/podman versions
- [ ] Container build workflow builds and pushes to zot

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/51
2026-01-24 13:30:26 -08:00
7893c41020 Enable Forgejo Actions (Phase 1) (#48)
All checks were successful
Test CI / test (push) Successful in 0s
## Summary
- Refactor Forgejo app.ini to be managed by ansible with secrets from 1Password
- Enable Forgejo Actions in config (`[actions] ENABLED = true`)
- Add `repo.actions` to DEFAULT_REPO_UNITS
- Clean up unused MySQL database fields (we use SQLite)

## Phase 1 Progress
This PR covers the first part of Phase 1 (ci-cd-bootstrap plan):
- [x] Refactor app.ini to ansible template
- [x] Store secrets in 1Password
- [x] Enable Actions in config
- [ ] Deploy config changes (pending review)
- [ ] Create runner registration token
- [ ] Deploy runner to k8s
- [ ] Test with simple workflow

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --tags forgejo` to deploy
- [ ] Verify Forgejo restarts correctly
- [ ] Verify Actions tab appears in repo settings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/48
2026-01-23 17:00:12 -08:00
272ddb213b Add TeslaMate deployment for Tesla Model Y data logging (#47)
## Summary
- Add TeslaMate k8s deployment with Tailscale ingress at tesla.tail8d86e.ts.net
- Add teslamate user to CloudNativePG blumeops-pg cluster
- Add TeslaMate PostgreSQL datasource to Grafana
- Import 18 TeslaMate Grafana dashboards for charging, drives, efficiency, etc.
- Add teslamate database to borgmatic backup configuration

## Deployment and Testing
- [ ] Create 1Password items: "TeslaMate DB Password" and "TeslaMate Encryption Key"
- [ ] Apply database user secret: `op inject -i argocd/manifests/databases/secret-teslamate.yaml.tpl | kubectl apply -f -`
- [ ] Sync blumeops-pg: `argocd app sync blumeops-pg`
- [ ] Create teslamate database
- [ ] Apply teslamate secrets (encryption key, db connection)
- [ ] Apply Grafana datasource secret: `op inject -i argocd/manifests/grafana-config/secret-teslamate-datasource.yaml.tpl | kubectl apply -f -`
- [ ] Sync apps and teslamate: `argocd app sync apps teslamate grafana grafana-config`
- [ ] Complete Tesla API OAuth flow at https://tesla.tail8d86e.ts.net
- [ ] Verify data collection starts
- [ ] Verify Grafana dashboards show data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/47
2026-01-22 21:25:44 -08:00
16bfe06b7b Fix LaunchDaemon check to use become: true
LaunchDaemons run in the system domain and require sudo to query.
Without become: true, the check always fails and tries to reload.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 17:34:23 -08:00
57bf8512dc Log filtering cleanup and observability improvements (#45)
## Summary
- Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue)
- Disable thermal collector on indri Alloy (not supported on macOS M1)
- Add macOS power/thermal metrics collection via powermetrics LaunchDaemon
- Add Power & Thermal section to macOS Grafana dashboard
- Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.)
- Extract more fields from JSON logs (zot compatibility - uses "message" not "msg")
- Silence logfmt parse errors for non-logfmt logs
- Fix JSON escaping in devpi dashboard

## Deployment and Testing
- [x] Deployed Alloy config changes to indri via ansible
- [x] Synced alloy-k8s and grafana-config via ArgoCD
- [x] Verified power metrics appearing in Prometheus
- [x] Verified thermal collector errors stopped
- [x] Verified logfmt parse errors silenced
- [x] Verified devpi dashboard loads correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45
2026-01-22 17:30:08 -08:00
e4a8405de7 Observability cleanup and k8s service monitoring (#43) (#43)
## Summary
- Remove stale `/opt/homebrew/var/loki` from borgmatic backup (Loki migrated to k8s)
- Add Alloy k8s DaemonSet for automatic pod log collection with auto-discovery
- Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd
- Add transmission-exporter sidecar for full metrics (speed, torrent counts, ratios)
- Replace stale devpi dashboard with probe-based metrics (status, response time, uptime)
- Add unified "K8s Services Health" dashboard for service uptime/response monitoring

## Manual cleanup already performed
- Deleted stale textfile metrics on indri: `devpi.prom`, `transmission.prom`
- Deleted stale data directories on indri: `/opt/homebrew/var/loki/`, `/opt/homebrew/var/prometheus/`

## Deployment and Testing
- [x] Sync `apps` application to pick up new alloy-k8s app
- [x] Deploy alloy-k8s on feature branch: `argocd app set alloy-k8s --revision feature/observability-cleanup && argocd app sync alloy-k8s`
- [x] Deploy torrent on feature branch (for transmission exporter): `argocd app set torrent --revision feature/observability-cleanup && argocd app sync torrent`
- [x] Deploy prometheus on feature branch (for new scrape config): `argocd app set prometheus --revision feature/observability-cleanup && argocd app sync prometheus`
- [x] Deploy grafana-config on feature branch (for dashboards): `argocd app set grafana-config --revision feature/observability-cleanup && argocd app sync grafana-config`
- [x] Verify pod logs appear in Loki/Grafana
- [x] Verify transmission metrics appear in Prometheus
- [x] Verify service probe metrics appear in Prometheus
- [x] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic config
- [ ] After merge, reset apps to main and resync

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/43
2026-01-22 13:51:01 -08:00
17023085cb Migrate observability stack to Kubernetes (#42)
Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack.

Summary

  - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal)
  - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses
  - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics
  - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net)
  - Add ACL rule for port 9187 (CNPG metrics)
  - Delete obsolete ansible roles for prometheus and loki

Changes

  - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress
  - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress
  - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications
  - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS
  - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint
  - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics
  - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints
  - pulumi/policy.hujson - ACL for port 9187
  - Deleted ansible/roles/prometheus/ and ansible/roles/loki/

Deployment and Testing

  - Stop prometheus and loki on indri
  - Sync ArgoCD apps (apps, prometheus, loki, grafana)
  - Run mise run provision-indri -- --tags alloy
  - Verify Grafana dashboards show data

🤖 Generated with https://claude.ai/claude-code

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42
2026-01-22 12:06:02 -08:00
5a829e0afd Remove unused indri tags and ansible roles (#41)
## Summary
- Remove ansible roles for services migrated to k8s: devpi, kiwix, transmission
- Also remove unused node_exporter and podman ansible roles
- Remove service tags from indri for k8s-hosted services (grafana, kiwix, devpi, pg, feed)
- Update indri description to reflect current architecture

## Changes
**Ansible roles removed** (34 files, ~1000 lines):
- devpi, devpi_metrics
- kiwix
- transmission, transmission_metrics
- node_exporter
- podman

**Pulumi indri tags removed**:
- tag:grafana, tag:kiwix, tag:devpi, tag:pg, tag:feed

These services now run in k8s with their own Tailscale devices via tailscale-operator.

## Deployment and Testing
- [x] Verified remaining ansible roles match indri.yml
- [x] Verified no playbooks or role dependencies reference removed roles
- [ ] Run `pulumi preview` to verify tag changes
- [ ] Run `pulumi up` to apply tag changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/41
2026-01-21 20:18:53 -08:00
7ec98210a9 P6: Migrate Kiwix and Transmission to Kubernetes (#39)
## Summary
- Add Transmission BitTorrent daemon to k8s (torrent namespace)
- Add Kiwix ZIM archive server to k8s (kiwix namespace)
- NFS storage from sifaka for shared torrent/ZIM data
- Torrent-sync sidecar in kiwix deployment to manage declarative ZIM list
- ZIM-watcher CronJob to auto-restart kiwix when new archives appear
- Remove transmission, transmission_metrics, and kiwix ansible roles from indri
- Remove svc:kiwix from tailscale_serve defaults

## Key Decisions
- Direct NFS mount for kiwix (no PVC) since it shares storage with transmission
- Shell wrapper for kiwix-serve command (glob expansion)
- Accept HTTP 409 as "ready" in torrent sync (transmission session ID mechanism)
- Completed downloads stored in `/downloads/complete/` on sifaka

## Deployment and Testing
- [x] Deployed transmission to k8s
- [x] Verified transmission web UI at torrent.tail8d86e.ts.net
- [x] Moved existing ZIM files to complete folder
- [x] Deployed kiwix to k8s
- [x] Verified kiwix web UI at kiwix.tail8d86e.ts.net
- [x] Stopped old services on indri
- [x] Cleared svc:kiwix from Tailscale serve on indri
- [x] Updated zk documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/39
2026-01-21 18:07:40 -08:00
21848a7919 P5.1: Migrate minikube from podman to QEMU2 driver (#38)
## Summary
- Migrate minikube from podman driver to qemu2 driver for proper NFS/SMB volume mount support
- Update ansible minikube role with qemu installation and containerd runtime
- Remove podman role dependency from indri.yml
- Add synology user creation steps and post-migration zot reconfiguration notes

## Why
Phase 6 (Kiwix/Transmission migration) was blocked because the podman driver lacks kernel capabilities for filesystem mounts. QEMU2 creates an actual VM with full mount support.

## Deployment and Testing
- [ ] Create k8s-storage user on Synology DSM
- [ ] Store credentials in 1Password (synology-k8s-storage)
- [ ] Export current k8s state
- [ ] Stop and delete podman-based minikube cluster
- [ ] Run ansible to create QEMU2 cluster
- [ ] Test NFS volume mount with test pod
- [ ] Redeploy ArgoCD and all apps
- [ ] Verify all services healthy
- [ ] Reconfigure zot registry mirrors for containerd (post-migration)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/38
2026-01-21 16:03:37 -08:00
0439fbb704 P5: Migrate devpi to Kubernetes (#34)
## Summary
- Migrate devpi PyPI caching proxy from indri LaunchAgent to Kubernetes
- Custom container image with devpi-server + devpi-web + auto-init
- StatefulSet with 50Gi PVC, Tailscale Ingress at pypi.tail8d86e.ts.net
- Remove devpi from ansible playbooks and update CLAUDE.md with k8s workflow

## Key Changes
- Add CRI-O registry mirror config for registry.tail8d86e.ts.net
- Change ArgoCD apps to manual sync (was auto-sync causing issues)
- 2Gi memory limit for Whoosh indexer (reclaimed after startup)

## Deployment and Testing
- [x] devpi pod healthy in k8s
- [x] pip install through proxy works
- [x] mcquack 1.0.0 uploaded and installable
- [x] Old devpi stopped on indri

## Post-Merge
Reset ArgoCD to main:
```
argocd app set apps --revision main && argocd app sync apps
argocd app set devpi --revision main && argocd app sync devpi
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/34
2026-01-20 14:55:37 -08:00
735b643429 P4: Miniflux migration + PostgreSQL consolidation (#33)
## Summary
- Deploy miniflux in k8s via ArgoCD
- Expose via Tailscale Ingress at feed.tail8d86e.ts.net
- Retire brew PostgreSQL (no longer needed)
- Rename k8s-pg to pg (canonical hostname)
- Remove ansible miniflux and postgresql roles
- Update borgmatic to backup pg.tail8d86e.ts.net
- Update all zk documentation

## Deployment and Testing
- [x] Miniflux pod running in k8s
- [x] User login works at https://feed.tail8d86e.ts.net
- [x] Feeds and entries visible
- [x] brew miniflux and postgresql stopped
- [x] Tailscale services migrated (feed, pg)
- [x] zk documentation updated
- [x] Run ansible to apply role removals
- [ ] Verify borgmatic backup with new pg hostname

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/33
2026-01-20 09:04:47 -08:00
eb952aae01 P3: PostgreSQL disaster recovery test and borgmatic k8s-pg backup (#32)
## Summary
- Fixed borgmatic `borg: command not found` by adding `local_path` config option
- Successfully tested disaster recovery: restored miniflux data from borgmatic backup to k8s-pg
- Added borgmatic user to k8s-pg via CloudNativePG managed roles
- Configured borgmatic to backup both localhost and k8s-pg PostgreSQL databases
- Added Tailscale ACL grant for `tag:homelab` → `tag:k8s` on port 5432
- Disabled selfHeal on apps app to allow manual revision changes during development

## Changes
- `ansible/roles/borgmatic/` - Added `local_path` and k8s-pg database entry
- `ansible/roles/postgresql/tasks/main.yml` - Added k8s-pg to `.pgpass`
- `argocd/apps/apps.yaml` - Disabled selfHeal
- `argocd/manifests/databases/blumeops-pg.yaml` - Added borgmatic managed role
- `argocd/manifests/databases/secret-borgmatic.yaml.tpl` - New secret template
- `pulumi/policy.hujson` - Added ACL grant for backup access

## Deployment and Testing
- [x] Borgmatic backup runs successfully
- [x] Miniflux data restored to k8s-pg (2 users, 2 feeds, 44 entries verified)
- [x] borgmatic user created in k8s-pg with pg_read_all_data role
- [x] Both localhost and k8s-pg databases in backup archive
- [x] zk documentation updated (borgmatic.md, postgresql.md)
- [ ] After merge: set blumeops-pg app back to main revision

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/32
2026-01-19 18:00:32 -08:00
f2541c3f77 Fix minikube role idempotency for zot mirror config (#31)
## Summary
- Fixed trailing newline mismatch in config comparison (ansible command module strips whitespace, slurp preserves it)
- Only copy temp file when config actually needs updating (avoids spurious changes)
- Task now properly skips when config is already correct

## Deployment and Testing
- [x] Verified idempotency: `changed=0` on repeated runs
- [x] Verified change detection: corrupted config triggers proper update
- [x] ansible-lint passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/31
2026-01-19 16:19:52 -08:00
130c044523 Fix hanging minikube provision 2026-01-19 15:49:11 -08:00
7e6742ad24 K8s Migration Phase 2: Grafana to Kubernetes (#30)
## Summary
- Migrate Grafana from Homebrew/Ansible to Kubernetes deployment
- Switch CloudNativePG to use forge-mirrored Helm chart (HTTPS, no auth needed)
- Add Grafana Helm chart deployment via ArgoCD with multi-source pattern
- Add Grafana config (Tailscale Ingress, 9 dashboard ConfigMaps)
- Update Loki to bind 0.0.0.0 for k8s pod access via `host.containers.internal`

## Key Changes
- `argocd/apps/grafana.yaml` - Grafana Helm chart Application
- `argocd/apps/grafana-config.yaml` - Ingress + dashboard ConfigMaps
- `argocd/apps/cloudnative-pg.yaml` - Now uses forge mirror instead of external Helm repo
- `ansible/roles/loki/templates/loki-config.yaml.j2` - Bind 0.0.0.0

## Deployment and Testing
- [x] Deploy Loki config change: `mise run provision-indri -- --tags loki`
- [x] Create namespace: `ki create namespace monitoring`
- [x] Create secret: `op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl | ki apply -f -`
- [x] Sync ArgoCD apps (grafana, grafana-config)
- [x] Verify Grafana works at https://grafana.tail8d86e.ts.net
- [x] Remove svc:grafana from ansible tailscale_serve
- [x] Stop brew grafana: `ssh indri 'brew services stop grafana'`
- [x] Delete ansible grafana role

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/30
2026-01-19 14:40:25 -08:00