Commit graph

50 commits

Author SHA1 Message Date
0395ca068e Add Jellyfin media server deployment
Deploy Jellyfin natively on indri via Ansible for VideoToolbox
hardware transcoding. Includes metrics collection, log forwarding
to Loki, and Grafana dashboard.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 15:53:53 -08:00
23b8897c1f Add provider field to OpenWeatherMap widget
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 15:11:37 -08:00
e830a0f26b Homepage dashboard improvements (#76)
## Summary
- Fix ArgoCD icon (use `argo-cd.png` per Dashboard Icons naming)
- Add Borgmatic backup metrics widget (time since last backup, archive size)
- Add Sifaka NAS disk usage widget (used/total space)
- Create `[[grafana]]` zk card with management notes

## What didn't work
Attempted Grafana iframe embedding for a metrics panel but reverted:
- Homepage iframe widget only supports height classes, not width
- Some panels fail to load even with anonymous auth enabled
- Documented in grafana zk card for future reference

## Deployment and Testing
- [x] ArgoCD icon displays correctly
- [x] Borgmatic metrics show time since backup and archive size
- [x] NAS disk usage shows used/total bytes
- [x] Grafana reverted to authenticated-only access

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/76
2026-01-30 15:05:02 -08:00
38538ad5f0 Replace hajimari with gethomepage (#75)
## Summary
- Remove hajimari (unmaintained since Oct 2022, broken helm deps)
- Add gethomepage (28k stars, actively maintained, monthly releases)
- Migrate custom apps, bookmarks, and search config
- Enable k8s RBAC for service autodiscovery
- Configure Tailscale ingress at go.tail8d86e.ts.net

## Why the switch
Hajimari hasn't released since October 2022. The helm chart has a broken
dependency (bjw-s/common URL is 404), and unreleased code on main has bugs.
gethomepage has similar k8s autodiscovery via ingress annotations and is
very actively maintained.

## Deployment and Testing
- [ ] Delete hajimari app from ArgoCD
- [ ] Delete hajimari namespace
- [ ] Sync apps to pick up new homepage app
- [ ] Sync homepage app
- [ ] Verify go.ops.eblu.me loads

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/75
2026-01-30 13:21:12 -08:00
9fd70c3892 Switch Hajimari to custom fork
- Use chart from forge.ops.eblu.me/eblume/hajimari fork
- Use custom image from registry.ops.eblu.me/blumeops/hajimari
- Enables future customizations (search auto-focus, weather widget)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 11:26:20 -08:00
4c852751db Update forgejo-runner to v2.2.0 (adds skopeo)
All checks were successful
Build Container / build (push) Successful in 13s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 11:13:54 -08:00
316a4c4e42 Shorten Hajimari info descriptions and hide URLs
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:34:46 -08:00
1c63220dcd Rename bookmarks group from Docs to Admin
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:16:39 -08:00
a40e1dad1b Rename customApps group to "Host Services"
Avoids duplicate "Infrastructure" groups since Hajimari doesn't
merge customApps with discovered apps.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:14:47 -08:00
e9c5153a75 Disable Hajimari for Loki (no useful landing page)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:11:09 -08:00
303fb0ba05 Use simple-icons:immich for Immich dashboard icon
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:08:54 -08:00
08e5546c0b Configure Hajimari with Kagi search and app groups
- Set Kagi as default search provider with simple-icons:kagi
- Add Google and DuckDuckGo as alternative search providers
- Explicitly enable showAppGroups, showAppUrls, showAppInfo

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-29 16:04:55 -08:00
d1164c8aac Add Hajimari service dashboard (#73)
## Summary
- Add Hajimari as a service dashboard/start page at `go.ops.eblu.me`
- Auto-discovers k8s services from ingress annotations
- Custom apps for non-k8s services: Forgejo, Registry, Sifaka NAS
- Add `nas.ops.eblu.me` Caddy proxy to Synology dashboard

## Services Configured

**Auto-discovered (k8s ingresses with hajimari.io annotations):**
- Grafana, ArgoCD, Prometheus, Loki (Observability)
- Miniflux, Kiwix, Transmission, TeslaMate, Immich (Apps)
- PyPI/devpi (Infrastructure)

**Custom apps (non-k8s):**
- Forgejo (forge.ops.eblu.me)
- Registry (registry.ops.eblu.me)
- Sifaka NAS (nas.ops.eblu.me)

**Bookmarks:**
- Tailscale Admin, 1Password, Pulumi

## Deployment and Testing
- [ ] Sync `apps` application to pick up new Hajimari Application
- [ ] Sync `hajimari` application
- [ ] Run `mise run provision-indri -- --tags caddy` for go/nas proxy entries
- [ ] Re-sync all k8s apps with hajimari annotations (or wait for natural drift)
- [ ] Verify https://go.ops.eblu.me shows dashboard with all services

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/73
2026-01-29 15:51:42 -08:00
dae438a5ed Update Immich to v2.5.2 (#72)
## Summary
- Update Immich from v2.5.0 to v2.5.2

## Deployment and Testing
- [ ] Sync apps application: `argocd app sync apps`
- [ ] Point immich at feature branch: `argocd app set immich --revision update-immich-2.5.2`
- [ ] Sync immich: `argocd app sync immich`
- [ ] Verify pods restart and photos.ops.eblu.me is accessible
- [ ] After merge, reset to main: `argocd app set immich --revision main && argocd app sync immich`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/72
2026-01-29 11:25:22 -08:00
0e2df9645d Fix ArgoCD sync drift for apps and immich (#71)
## Summary
- Fix immich app to track `main` branch instead of `feature/immich` for values
- The tailscale-operator ignoreDifferences schema drift will be fixed by syncing the `apps` app

## Deployment and Testing
- [ ] Sync `apps` to fix tailscale-operator schema drift
- [ ] Sync `immich` to pick up correct image versions from main

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/71
2026-01-29 10:24:26 -08:00
0d8eb651d4 Fix XID Age graph to show threshold context (#69)
All checks were successful
Build Container / build (push) Successful in 1m10s
## Summary
- Add fixed Y-axis (0-220M) so the 200M autovacuum threshold is always visible
- Add dashed threshold lines at 150M (yellow warning) and 200M (red danger)
- Update title to clarify the threshold

## Context
The raw XID age naturally trends upward between vacuum freezes, which looked alarming without context. Current values (~143K-216K) are at 0.1% of the threshold - completely healthy.

## Deployment and Testing
- [ ] Sync grafana-config app to feature branch
- [ ] Verify threshold lines appear on PostgreSQL dashboard

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/69
2026-01-29 07:08:21 -08:00
0604877db2 Add 'Tesla' prefix to all TeslaMate dashboard titles (#68)
## Summary
- Renamed all 18 TeslaMate Grafana dashboards to include "Tesla" prefix
- Improves organization and discoverability in the dashboard list

## Deployment and Testing
- [ ] Sync grafana-config app: `argocd app set grafana-config --revision feature/rename-tesla-dashboards && argocd app sync grafana-config`
- [ ] Verify dashboards display with "Tesla" prefix in Grafana

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/68
2026-01-29 06:55:44 -08:00
8f4660915d Fix argocd SSH key format for 1Password Connect
1Password Connect doesn't support ?ssh-format=openssh, so we need a
separate Secure Note item with the OpenSSH-formatted key.

Created new 1Password item: argocd-forge-ssh-key

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:31:16 -08:00
9114aac8f6 Switch all ExternalSecrets to creationPolicy: Owner
ESO now has full ownership of these secrets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:27:16 -08:00
dd6cf20d51 Remove obsolete secret templates
- Delete 13 .yaml.tpl files replaced by ExternalSecrets
- Update immich/README.md with direct CNPG secret copy instructions
- Update miniflux/README.md with context flag and ESO note

Only 1password-connect/secret-credentials.yaml.tpl remains (bootstrap).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 20:26:37 -08:00
351528474c Add ExternalSecrets for remaining k8s secrets
Migrate 10 secret templates to ESO ExternalSecrets with 1Password Connect:
- databases: eblume, borgmatic, teslamate passwords
- tailscale-operator: OAuth client credentials
- grafana-config: admin password, teslamate datasource
- teslamate: db password, encryption key
- forgejo-runner: runner registration token
- argocd: forge SSH credentials

All use creationPolicy: Merge for safe migration from existing secrets.

Skipped:
- miniflux/secret-db: Uses CNPG secret, not 1Password directly
- immich/secret-db: Requires 1Password item creation first
- 1password-connect: Bootstrap secret, must stay as template

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-28 19:50:38 -08:00
482414346e Add External Secrets Operator with 1Password Connect (#66) (#66)
## Summary
- Add 1Password Connect server for secrets automation API
- Add External Secrets Operator (ESO) to sync secrets from 1Password to K8s
- Add ClusterSecretStore connecting ESO to 1Password Connect
- Convert devpi secret to ExternalSecret as proof of concept

## Architecture
```
1Password Cloud → 1Password Connect (k8s) → ESO → Native K8s Secrets
```

## Deployment and Testing
- [ ] Mirror Helm charts to forge (connect-helm-charts, external-secrets) - DONE
- [ ] Create 1Password Connect credentials (`op connect server create`)
- [ ] Store credentials in 1Password item "1Password Connect"
- [ ] Bootstrap secret: `op inject -i argocd/manifests/1password-connect/secret-credentials.yaml.tpl | kubectl apply -f -`
- [ ] Deploy 1password-connect: `argocd app sync 1password-connect`
- [ ] Deploy external-secrets: `argocd app sync external-secrets`
- [ ] Deploy external-secrets-config: `argocd app sync external-secrets-config`
- [ ] Test devpi ExternalSecret: `argocd app sync devpi`
- [ ] Verify secret synced: `kubectl get externalsecret -n devpi`

## Future Work
After PoC validated, migrate remaining 12 secret templates to ExternalSecrets:
- databases (3), tailscale-operator (1), grafana-config (2), teslamate (2)
- forgejo-runner (1), argocd (1), immich (1), 1password-connect (1 - self-bootstrap)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/66
2026-01-28 19:30:10 -08:00
aa4464f84e Upgrade Immich from v2.4.1 to v2.5.0 (#64)
## Summary
- Upgrades Immich image tag from v2.4.1 to v2.5.0

## Deployment and Testing
- [ ] Point immich ArgoCD app at feature branch and sync
- [ ] Verify pods come up healthy
- [ ] Verify Immich web UI accessible
- [ ] Reset to main and sync after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/64
2026-01-27 20:51:09 -08:00
8621996343 Add Immich photo management + migrate forge URLs (#62)
## Summary
- Migrate all ArgoCD app repo URLs from `indri.tail8d86e.ts.net:2200` to `forge.ops.eblu.me:2222`
- Add Immich self-hosted photo management service with:
  - Helm chart deployment via ArgoCD
  - PostgreSQL cluster with pgvecto.rs for AI vector search (immich-pg)
  - NFS storage on sifaka for photo library (2Ti)
  - Tailscale Ingress + Caddy proxy for `photos.ops.eblu.me`
  - Machine learning service for face/object recognition

## Deployment and Testing
- [x] Update ArgoCD repo-creds-forge secret with new URL (one-time manual step)
- [ ] Sync `apps` to pick up new applications
- [ ] Sync all existing apps to verify new forge URL works
- [ ] Sync `blumeops-pg` to deploy immich-pg cluster
- [ ] Wait for immich-pg to be healthy
- [ ] Create immich-db secret from auto-generated password
- [ ] Sync `immich-storage` (PV, PVC, Ingress)
- [ ] Sync `immich` (Helm chart)
- [ ] Run `mise run provision-indri -- --tags caddy` to add photos.ops.eblu.me
- [ ] Verify Immich UI is accessible

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/62
2026-01-26 11:20:11 -08:00
c8b655f177 Build local containers for k8s services (#61)
## Summary
- Move devpi Dockerfile from argocd/manifests to containers/devpi/
- Add containers for: transmission, teslamate, miniflux, kiwix-serve, kubectl
- Update all k8s deployments to use local images (registry.ops.eblu.me/blumeops/*)
- All containers use v1.0.0 tag for initial release

## Containers Added
| Container | Source | Notes |
|-----------|--------|-------|
| devpi | python:3.12-slim | Existing, moved to containers/ |
| kubectl | alpine + download | For zim-watcher CronJob |
| miniflux | Go build from source | v2.2.16 |
| kiwix-serve | Download pre-built binary | v3.8.1 |
| transmission | alpine + apk install | Simpler than linuxserver image |
| teslamate | Elixir build from source | v2.2.0 |

## Deployment and Testing
- [ ] Build and tag devpi-v1.0.0
- [ ] Build and tag kubectl-v1.0.0
- [ ] Build and tag miniflux-v1.0.0
- [ ] Build and tag kiwix-serve-v1.0.0
- [ ] Build and tag transmission-v1.0.0
- [ ] Build and tag teslamate-v1.0.0
- [ ] Sync ArgoCD apps and verify services

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/61
2026-01-25 21:35:57 -08:00
ea42362b6f Migrate Forgejo runner to Kubernetes with DinD (#60)
## Summary
- Deploy Forgejo runner to k8s with Docker-in-Docker sidecar
- Add job execution image with Node.js and Docker CLI
- Retire host-mode runner on indri
- All CI jobs now run containerized in k8s

## Components Added
- `containers/forgejo-runner/Dockerfile` - Job execution image
- `argocd/apps/forgejo-runner.yaml` - ArgoCD Application
- `argocd/manifests/forgejo-runner/` - Kubernetes manifests

## Components Removed
- `ansible/roles/forgejo_runner/` - No longer needed

## Changes to Existing Files
- `.forgejo/workflows/build-container.yaml` - Use `k8s` runner with `DOCKER_HOST` env
- `.github/actionlint.yaml` - Only `k8s` label now valid

## Deployment
1. Apply secret: `op inject -i argocd/manifests/forgejo-runner/secret.yaml.tpl | kubectl --context=minikube-indri apply -f -`
2. Sync ArgoCD: `argocd app sync forgejo-runner`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/60
2026-01-25 19:56:17 -08:00
66badfafd1 Migrate k8s services to Caddy (*.ops.eblu.me) (#59)
All checks were successful
Build Container / build (push) Successful in 13s
## Summary
- Add Caddy reverse proxy routes for all k8s services (grafana, argocd, prometheus, loki, miniflux, devpi, kiwix, torrent, teslamate)
- Add PostgreSQL via Caddy L4 TCP proxy on port 5432
- Caddy proxies to existing Tailscale endpoints - traffic stays local on indri
- Both `*.ops.eblu.me` and `*.tail8d86e.ts.net` URLs continue to work

## Updated References
- Alloy: prometheus/loki push endpoints → `*.ops.eblu.me`
- Borgmatic: PostgreSQL backup host → `pg.ops.eblu.me`
- Devpi: DEVPI_OUTSIDE_URL → `pypi.ops.eblu.me`
- indri-services-check: health check URLs
- CLAUDE.md: argocd login command

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --tags caddy` to deploy new Caddy config
- [ ] Test HTTP services: `curl https://grafana.ops.eblu.me/api/health`
- [ ] Test PostgreSQL: `pg_isready -h pg.ops.eblu.me -p 5432`
- [ ] Run `mise run provision-indri -- --tags alloy` to update Alloy endpoints
- [ ] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic
- [ ] Sync devpi in ArgoCD: `argocd app sync devpi`
- [ ] Re-login to ArgoCD: `argocd login argocd.ops.eblu.me ...`
- [ ] Run `mise run indri-services-check` to verify all services

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/59
2026-01-25 12:56:31 -08:00
d6e6b48f6a Migrate registry to Caddy (registry.ops.eblu.me) (#58)
## Summary
- Update all references from `registry.tail8d86e.ts.net` to `registry.ops.eblu.me`
- Remove `tailscale_serve` ansible role (no longer needed - all services migrated to Caddy)
- Update minikube containerd config for new registry URL
- Update devpi manifest, CI actions, and mise tasks

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --check --diff` (dry run)
- [ ] Run `mise run provision-indri -- --tags minikube` to update containerd config
- [ ] Sync devpi ArgoCD app: `argocd app sync devpi`
- [ ] Manually remove old Tailscale serve entry: `ssh indri 'tailscale serve --service=svc:registry off'`
- [ ] Test registry access: `curl https://registry.ops.eblu.me/v2/_catalog`
- [ ] Run `mise run indri-services-check` to verify all services healthy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/58
2026-01-25 12:06:15 -08:00
1184b4de1d Add Caddy layer4 for Forgejo SSH (#56)
## Summary
- Add layer4 TCP proxy configuration to Caddyfile template for SSH services
- Configure Forgejo SSH on port 2222 → localhost:2200
- Switch HTTPS from port 8443 (testing) to 443 (production)
- Requires Caddy rebuilt with `github.com/mholt/caddy-l4` plugin

## What This Enables
Git+SSH access via `forge.ops.eblu.me:2222` is now accessible from:
- Tailnet clients (gilbert)
- Docker containers on indri
- Kubernetes pods in minikube

This solves the DNS resolution issues where containers couldn't reach Tailscale MagicDNS names.

## Testing Done
- [x] Caddy rebuilt with layer4 plugin
- [x] Validated Caddyfile syntax
- [x] Cleared `svc:forge` from tailscale serve
- [x] Verified HTTPS works: `curl https://forge.ops.eblu.me`
- [x] Verified SSH works: `ssh -p 2222 forgejo@forge.ops.eblu.me`
- [x] Verified git clone works via new endpoint
- [x] Verified minikube pods can reach both HTTPS and SSH endpoints

## Deployment
Caddy is already running with the new config on indri. This PR captures the ansible changes.

## Next Steps
- Update zk docs with new git remote format
- Migrate registry and other services to Caddy
- Retire tailscale_services ansible role

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/56
2026-01-25 11:37:23 -08:00
8ca8798121 Switch to Buildah for container builds (#51)
All checks were successful
Test CI / test (push) Successful in 4s
## Summary
- Replace Docker with Buildah for container image builds
- No Docker socket required - buildah is daemonless
- Cleaner security model (no privileged containers or socket mounting)
- Remove Docker-related security context from deployment

## Changes
- Update Dockerfile to install buildah/podman instead of docker-cli
- Configure buildah storage with overlay driver and fuse-overlayfs
- Update composite action to use `buildah bud` and `buildah push`
- Add `imagePullPolicy: Always` to ensure fresh image pulls
- Update test workflow to verify buildah/podman

## Testing
- [ ] Runner pod starts successfully
- [ ] Buildah is available in runner
- [ ] Test workflow verifies buildah/podman versions
- [ ] Container build workflow builds and pushes to zot

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/51
2026-01-24 13:30:26 -08:00
5fcd122494 Reorganize CI/CD bootstrap phases and add custom runner Dockerfile (#50)
All checks were successful
Test CI / test (push) Successful in 2s
## Summary
- Reorder CI/CD bootstrap phases to address chicken-and-egg problem
- P2 is now "Custom Runner Image" (stock runner lacks Node.js)
- Add P3 for "Mirror Forgejo & Build from Source"
- Rename P3 -> P4 (Self-Deploy), P4 -> P5 (Container Builds)
- Add Dockerfile for custom runner with Node.js, npm, docker, build tools
- Update overview with new phase structure, host mode notes, and cross-compilation challenge

## Key Changes

### Phase Reordering
| Old | New | Name |
|-----|-----|------|
| P1 | P1 | Enable Actions (complete) |
| P2 | P2 | **Custom Runner Image** (new focus) |
| - | P3 | **Mirror Forgejo & Build** (new) |
| P3 | P4 | Self-Deploy |
| P4 | P5 | Container Builds |

### Custom Runner Dockerfile
The stock `forgejo/runner:3.5.1` image lacks Node.js, so `actions/checkout@v4` doesn't work. The new Dockerfile adds:
- Node.js + npm (for GitHub Actions)
- Docker CLI (for container builds)
- Build tools (gcc, make, curl, jq)

### Bootstrap Strategy
1. Build custom runner image manually on gilbert (podman build)
2. Push to zot registry
3. Update deployment to use custom image
4. Then enable auto-build workflow for runner

## Deployment and Testing
- [x] Review plan changes
- [x] Build custom runner image manually and verify
- [x] Update runner deployment
- [x] Test `actions/checkout@v4` works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/50
2026-01-23 18:50:27 -08:00
7893c41020 Enable Forgejo Actions (Phase 1) (#48)
All checks were successful
Test CI / test (push) Successful in 0s
## Summary
- Refactor Forgejo app.ini to be managed by ansible with secrets from 1Password
- Enable Forgejo Actions in config (`[actions] ENABLED = true`)
- Add `repo.actions` to DEFAULT_REPO_UNITS
- Clean up unused MySQL database fields (we use SQLite)

## Phase 1 Progress
This PR covers the first part of Phase 1 (ci-cd-bootstrap plan):
- [x] Refactor app.ini to ansible template
- [x] Store secrets in 1Password
- [x] Enable Actions in config
- [ ] Deploy config changes (pending review)
- [ ] Create runner registration token
- [ ] Deploy runner to k8s
- [ ] Test with simple workflow

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --tags forgejo` to deploy
- [ ] Verify Forgejo restarts correctly
- [ ] Verify Actions tab appears in repo settings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/48
2026-01-23 17:00:12 -08:00
272ddb213b Add TeslaMate deployment for Tesla Model Y data logging (#47)
## Summary
- Add TeslaMate k8s deployment with Tailscale ingress at tesla.tail8d86e.ts.net
- Add teslamate user to CloudNativePG blumeops-pg cluster
- Add TeslaMate PostgreSQL datasource to Grafana
- Import 18 TeslaMate Grafana dashboards for charging, drives, efficiency, etc.
- Add teslamate database to borgmatic backup configuration

## Deployment and Testing
- [ ] Create 1Password items: "TeslaMate DB Password" and "TeslaMate Encryption Key"
- [ ] Apply database user secret: `op inject -i argocd/manifests/databases/secret-teslamate.yaml.tpl | kubectl apply -f -`
- [ ] Sync blumeops-pg: `argocd app sync blumeops-pg`
- [ ] Create teslamate database
- [ ] Apply teslamate secrets (encryption key, db connection)
- [ ] Apply Grafana datasource secret: `op inject -i argocd/manifests/grafana-config/secret-teslamate-datasource.yaml.tpl | kubectl apply -f -`
- [ ] Sync apps and teslamate: `argocd app sync apps teslamate grafana grafana-config`
- [ ] Complete Tesla API OAuth flow at https://tesla.tail8d86e.ts.net
- [ ] Verify data collection starts
- [ ] Verify Grafana dashboards show data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/47
2026-01-22 21:25:44 -08:00
11075d4517 Remove logfmt parsing stage from Alloy k8s config
The stage.match selector wasn't preventing Alloy from logging decode
errors internally. Removing logfmt parsing entirely - JSON parsing
handles most structured logs, and plain text logs still get collected.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 18:06:34 -08:00
e6de7ba391 Fix Alloy logfmt decode errors for JSON logs (#46)
## Summary
- Use `stage.match` to conditionally apply logfmt parsing only to lines that don't start with `{`
- This prevents error spam like `"failed to decode logfmt" component_path=/ component_id=loki.process.pods component=stage type=logfmt err="logfmt syntax error at pos 2 on line 1: unexpected '\"'"` when JSON-formatted logs hit the logfmt parser

## Deployment and Testing
- [ ] Sync alloy-k8s app to feature branch and verify errors stop appearing
- [ ] Verify JSON logs are still parsed correctly
- [ ] Verify logfmt logs (from Loki, Prometheus etc.) are still parsed correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/46
2026-01-22 18:00:34 -08:00
57bf8512dc Log filtering cleanup and observability improvements (#45)
## Summary
- Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue)
- Disable thermal collector on indri Alloy (not supported on macOS M1)
- Add macOS power/thermal metrics collection via powermetrics LaunchDaemon
- Add Power & Thermal section to macOS Grafana dashboard
- Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.)
- Extract more fields from JSON logs (zot compatibility - uses "message" not "msg")
- Silence logfmt parse errors for non-logfmt logs
- Fix JSON escaping in devpi dashboard

## Deployment and Testing
- [x] Deployed Alloy config changes to indri via ansible
- [x] Synced alloy-k8s and grafana-config via ArgoCD
- [x] Verified power metrics appearing in Prometheus
- [x] Verified thermal collector errors stopped
- [x] Verified logfmt parse errors silenced
- [x] Verified devpi dashboard loads correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45
2026-01-22 17:30:08 -08:00
af39067e1f Pin ArgoCD to v3.2.6 (#44)
## Summary
- Pin ArgoCD kustomization to v3.2.6 tag instead of `stable` branch
- This gives intentional control over ArgoCD version upgrades

## Deployment and Testing
- [ ] Sync the `apps` application: `argocd app sync apps`
- [ ] Point argocd at feature branch: `argocd app set argocd --revision feature/pin-argocd-v3.2.6`
- [ ] Sync argocd: `argocd app sync argocd`
- [ ] Verify ArgoCD is running v3.2.6
- [ ] After merge, reset to main: `argocd app set argocd --revision main && argocd app sync argocd`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/44
2026-01-22 16:38:27 -08:00
e4a8405de7 Observability cleanup and k8s service monitoring (#43) (#43)
## Summary
- Remove stale `/opt/homebrew/var/loki` from borgmatic backup (Loki migrated to k8s)
- Add Alloy k8s DaemonSet for automatic pod log collection with auto-discovery
- Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd
- Add transmission-exporter sidecar for full metrics (speed, torrent counts, ratios)
- Replace stale devpi dashboard with probe-based metrics (status, response time, uptime)
- Add unified "K8s Services Health" dashboard for service uptime/response monitoring

## Manual cleanup already performed
- Deleted stale textfile metrics on indri: `devpi.prom`, `transmission.prom`
- Deleted stale data directories on indri: `/opt/homebrew/var/loki/`, `/opt/homebrew/var/prometheus/`

## Deployment and Testing
- [x] Sync `apps` application to pick up new alloy-k8s app
- [x] Deploy alloy-k8s on feature branch: `argocd app set alloy-k8s --revision feature/observability-cleanup && argocd app sync alloy-k8s`
- [x] Deploy torrent on feature branch (for transmission exporter): `argocd app set torrent --revision feature/observability-cleanup && argocd app sync torrent`
- [x] Deploy prometheus on feature branch (for new scrape config): `argocd app set prometheus --revision feature/observability-cleanup && argocd app sync prometheus`
- [x] Deploy grafana-config on feature branch (for dashboards): `argocd app set grafana-config --revision feature/observability-cleanup && argocd app sync grafana-config`
- [x] Verify pod logs appear in Loki/Grafana
- [x] Verify transmission metrics appear in Prometheus
- [x] Verify service probe metrics appear in Prometheus
- [x] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic config
- [ ] After merge, reset apps to main and resync

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/43
2026-01-22 13:51:01 -08:00
17023085cb Migrate observability stack to Kubernetes (#42)
Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack.

Summary

  - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal)
  - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses
  - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics
  - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net)
  - Add ACL rule for port 9187 (CNPG metrics)
  - Delete obsolete ansible roles for prometheus and loki

Changes

  - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress
  - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress
  - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications
  - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS
  - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint
  - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics
  - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints
  - pulumi/policy.hujson - ACL for port 9187
  - Deleted ansible/roles/prometheus/ and ansible/roles/loki/

Deployment and Testing

  - Stop prometheus and loki on indri
  - Sync ArgoCD apps (apps, prometheus, loki, grafana)
  - Run mise run provision-indri -- --tags alloy
  - Verify Grafana dashboards show data

🤖 Generated with https://claude.ai/claude-code

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42
2026-01-22 12:06:02 -08:00
5a829e0afd Remove unused indri tags and ansible roles (#41)
## Summary
- Remove ansible roles for services migrated to k8s: devpi, kiwix, transmission
- Also remove unused node_exporter and podman ansible roles
- Remove service tags from indri for k8s-hosted services (grafana, kiwix, devpi, pg, feed)
- Update indri description to reflect current architecture

## Changes
**Ansible roles removed** (34 files, ~1000 lines):
- devpi, devpi_metrics
- kiwix
- transmission, transmission_metrics
- node_exporter
- podman

**Pulumi indri tags removed**:
- tag:grafana, tag:kiwix, tag:devpi, tag:pg, tag:feed

These services now run in k8s with their own Tailscale devices via tailscale-operator.

## Deployment and Testing
- [x] Verified remaining ansible roles match indri.yml
- [x] Verified no playbooks or role dependencies reference removed roles
- [ ] Run `pulumi preview` to verify tag changes
- [ ] Run `pulumi up` to apply tag changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/41
2026-01-21 20:18:53 -08:00
7ec98210a9 P6: Migrate Kiwix and Transmission to Kubernetes (#39)
## Summary
- Add Transmission BitTorrent daemon to k8s (torrent namespace)
- Add Kiwix ZIM archive server to k8s (kiwix namespace)
- NFS storage from sifaka for shared torrent/ZIM data
- Torrent-sync sidecar in kiwix deployment to manage declarative ZIM list
- ZIM-watcher CronJob to auto-restart kiwix when new archives appear
- Remove transmission, transmission_metrics, and kiwix ansible roles from indri
- Remove svc:kiwix from tailscale_serve defaults

## Key Decisions
- Direct NFS mount for kiwix (no PVC) since it shares storage with transmission
- Shell wrapper for kiwix-serve command (glob expansion)
- Accept HTTP 409 as "ready" in torrent sync (transmission session ID mechanism)
- Completed downloads stored in `/downloads/complete/` on sifaka

## Deployment and Testing
- [x] Deployed transmission to k8s
- [x] Verified transmission web UI at torrent.tail8d86e.ts.net
- [x] Moved existing ZIM files to complete folder
- [x] Deployed kiwix to k8s
- [x] Verified kiwix web UI at kiwix.tail8d86e.ts.net
- [x] Stopped old services on indri
- [x] Cleared svc:kiwix from Tailscale serve on indri
- [x] Updated zk documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/39
2026-01-21 18:07:40 -08:00
21848a7919 P5.1: Migrate minikube from podman to QEMU2 driver (#38)
## Summary
- Migrate minikube from podman driver to qemu2 driver for proper NFS/SMB volume mount support
- Update ansible minikube role with qemu installation and containerd runtime
- Remove podman role dependency from indri.yml
- Add synology user creation steps and post-migration zot reconfiguration notes

## Why
Phase 6 (Kiwix/Transmission migration) was blocked because the podman driver lacks kernel capabilities for filesystem mounts. QEMU2 creates an actual VM with full mount support.

## Deployment and Testing
- [ ] Create k8s-storage user on Synology DSM
- [ ] Store credentials in 1Password (synology-k8s-storage)
- [ ] Export current k8s state
- [ ] Stop and delete podman-based minikube cluster
- [ ] Run ansible to create QEMU2 cluster
- [ ] Test NFS volume mount with test pod
- [ ] Redeploy ArgoCD and all apps
- [ ] Verify all services healthy
- [ ] Reconfigure zot registry mirrors for containerd (post-migration)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/38
2026-01-21 16:03:37 -08:00
0439fbb704 P5: Migrate devpi to Kubernetes (#34)
## Summary
- Migrate devpi PyPI caching proxy from indri LaunchAgent to Kubernetes
- Custom container image with devpi-server + devpi-web + auto-init
- StatefulSet with 50Gi PVC, Tailscale Ingress at pypi.tail8d86e.ts.net
- Remove devpi from ansible playbooks and update CLAUDE.md with k8s workflow

## Key Changes
- Add CRI-O registry mirror config for registry.tail8d86e.ts.net
- Change ArgoCD apps to manual sync (was auto-sync causing issues)
- 2Gi memory limit for Whoosh indexer (reclaimed after startup)

## Deployment and Testing
- [x] devpi pod healthy in k8s
- [x] pip install through proxy works
- [x] mcquack 1.0.0 uploaded and installable
- [x] Old devpi stopped on indri

## Post-Merge
Reset ArgoCD to main:
```
argocd app set apps --revision main && argocd app sync apps
argocd app set devpi --revision main && argocd app sync devpi
```

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/34
2026-01-20 14:55:37 -08:00
735b643429 P4: Miniflux migration + PostgreSQL consolidation (#33)
## Summary
- Deploy miniflux in k8s via ArgoCD
- Expose via Tailscale Ingress at feed.tail8d86e.ts.net
- Retire brew PostgreSQL (no longer needed)
- Rename k8s-pg to pg (canonical hostname)
- Remove ansible miniflux and postgresql roles
- Update borgmatic to backup pg.tail8d86e.ts.net
- Update all zk documentation

## Deployment and Testing
- [x] Miniflux pod running in k8s
- [x] User login works at https://feed.tail8d86e.ts.net
- [x] Feeds and entries visible
- [x] brew miniflux and postgresql stopped
- [x] Tailscale services migrated (feed, pg)
- [x] zk documentation updated
- [x] Run ansible to apply role removals
- [ ] Verify borgmatic backup with new pg hostname

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/33
2026-01-20 09:04:47 -08:00
0c6f0a13c3 Add CNPG default values to prevent ArgoCD drift
CloudNativePG operator fills in connectionLimit, ensure, and inherit
defaults on managed roles. Adding these explicitly keeps ArgoCD in sync.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 18:02:42 -08:00
eb952aae01 P3: PostgreSQL disaster recovery test and borgmatic k8s-pg backup (#32)
## Summary
- Fixed borgmatic `borg: command not found` by adding `local_path` config option
- Successfully tested disaster recovery: restored miniflux data from borgmatic backup to k8s-pg
- Added borgmatic user to k8s-pg via CloudNativePG managed roles
- Configured borgmatic to backup both localhost and k8s-pg PostgreSQL databases
- Added Tailscale ACL grant for `tag:homelab` → `tag:k8s` on port 5432
- Disabled selfHeal on apps app to allow manual revision changes during development

## Changes
- `ansible/roles/borgmatic/` - Added `local_path` and k8s-pg database entry
- `ansible/roles/postgresql/tasks/main.yml` - Added k8s-pg to `.pgpass`
- `argocd/apps/apps.yaml` - Disabled selfHeal
- `argocd/manifests/databases/blumeops-pg.yaml` - Added borgmatic managed role
- `argocd/manifests/databases/secret-borgmatic.yaml.tpl` - New secret template
- `pulumi/policy.hujson` - Added ACL grant for backup access

## Deployment and Testing
- [x] Borgmatic backup runs successfully
- [x] Miniflux data restored to k8s-pg (2 users, 2 feeds, 44 entries verified)
- [x] borgmatic user created in k8s-pg with pg_read_all_data role
- [x] Both localhost and k8s-pg databases in backup archive
- [x] zk documentation updated (borgmatic.md, postgresql.md)
- [ ] After merge: set blumeops-pg app back to main revision

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/32
2026-01-19 18:00:32 -08:00
258c88f2f7 Fix kustomize patch: remove namespace for proper matching
Kustomize matches patches before namespace transformation, so the
patch file shouldn't specify namespace (kustomization.yaml adds it).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 15:00:33 -08:00
623b122f58 Fix kustomization: known_hosts as resource not patch
The argocd-ssh-known-hosts-cm ConfigMap needs to be a resource,
not a patch, because the upstream install.yaml includes it inline
in a way kustomize can't patch.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-19 14:45:33 -08:00
7e6742ad24 K8s Migration Phase 2: Grafana to Kubernetes (#30)
## Summary
- Migrate Grafana from Homebrew/Ansible to Kubernetes deployment
- Switch CloudNativePG to use forge-mirrored Helm chart (HTTPS, no auth needed)
- Add Grafana Helm chart deployment via ArgoCD with multi-source pattern
- Add Grafana config (Tailscale Ingress, 9 dashboard ConfigMaps)
- Update Loki to bind 0.0.0.0 for k8s pod access via `host.containers.internal`

## Key Changes
- `argocd/apps/grafana.yaml` - Grafana Helm chart Application
- `argocd/apps/grafana-config.yaml` - Ingress + dashboard ConfigMaps
- `argocd/apps/cloudnative-pg.yaml` - Now uses forge mirror instead of external Helm repo
- `ansible/roles/loki/templates/loki-config.yaml.j2` - Bind 0.0.0.0

## Deployment and Testing
- [x] Deploy Loki config change: `mise run provision-indri -- --tags loki`
- [x] Create namespace: `ki create namespace monitoring`
- [x] Create secret: `op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl | ki apply -f -`
- [x] Sync ArgoCD apps (grafana, grafana-config)
- [x] Verify Grafana works at https://grafana.tail8d86e.ts.net
- [x] Remove svc:grafana from ansible tailscale_serve
- [x] Stop brew grafana: `ssh indri 'brew services stop grafana'`
- [x] Delete ansible grafana role

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/30
2026-01-19 14:40:25 -08:00
a8f4d00294 K8s Migration Phase 1: Infrastructure Setup (#29)
## Summary
- Split k8s migration plan into phases folder for easier navigation
- Added `tag:k8s` to Pulumi ACLs for Kubernetes workloads
- Phase 1 work in progress

## Phase 1 Goals
- Tailscale Kubernetes Operator
- CloudNativePG Operator
- PostgreSQL cluster for future app migrations

## Deployment and Testing
- [ ] Review Phase 1 plan
- [ ] `mise run tailnet-preview` to verify ACL changes
- [ ] `mise run tailnet-up` to apply ACL changes
- [ ] Create Tailscale OAuth client (manual)
- [ ] Deploy operators and PostgreSQL cluster

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/29
2026-01-19 09:49:52 -08:00