blumeops

Author	SHA1	Message	Date
Erich Blume	4c249ff116	Add docker group (GID 999) to runner security context	2026-01-23 19:44:43 -08:00
Erich Blume	4a3219648d	Add container build workflows with composite action - Create composite action: .forgejo/actions/build-push-image - Add build-runner.yaml workflow (triggers on Dockerfile changes) - Add build-devpi.yaml workflow (triggers on Dockerfile/start.sh changes) - Mount Docker socket in runner deployment for container builds - Run runner as root to access Docker socket Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 19:42:47 -08:00
Erich Blume	5fcd122494	Reorganize CI/CD bootstrap phases and add custom runner Dockerfile (#50 ) ## Summary - Reorder CI/CD bootstrap phases to address chicken-and-egg problem - P2 is now "Custom Runner Image" (stock runner lacks Node.js) - Add P3 for "Mirror Forgejo & Build from Source" - Rename P3 -> P4 (Self-Deploy), P4 -> P5 (Container Builds) - Add Dockerfile for custom runner with Node.js, npm, docker, build tools - Update overview with new phase structure, host mode notes, and cross-compilation challenge ## Key Changes ### Phase Reordering | Old | New | Name | |-----|-----|------| | P1 | P1 | Enable Actions (complete) | | P2 | P2 | Custom Runner Image (new focus) | | - | P3 | Mirror Forgejo & Build (new) | | P3 | P4 | Self-Deploy | | P4 | P5 | Container Builds | ### Custom Runner Dockerfile The stock `forgejo/runner:3.5.1` image lacks Node.js, so `actions/checkout@v4` doesn't work. The new Dockerfile adds: - Node.js + npm (for GitHub Actions) - Docker CLI (for container builds) - Build tools (gcc, make, curl, jq) ### Bootstrap Strategy 1. Build custom runner image manually on gilbert (podman build) 2. Push to zot registry 3. Update deployment to use custom image 4. Then enable auto-build workflow for runner ## Deployment and Testing - [x] Review plan changes - [x] Build custom runner image manually and verify - [x] Update runner deployment - [x] Test `actions/checkout@v4` works 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/50	2026-01-23 18:50:27 -08:00
Erich Blume	3bcad4189f	Add actionlint pre-commit hook for workflow validation (#49 ) ## Summary - Fix workflow to use `github.` context variables (Forgejo schema validator only recognizes GitHub Actions syntax, not `gitea.` aliases) - Pass untrusted inputs through environment variables (security best practice per actionlint) - Add actionlint to Brewfile and pre-commit config to catch workflow validation errors locally ## Deployment and Testing - [x] Pre-commit hooks all pass - [x] actionlint validates `.forgejo/workflows/test.yaml` successfully - [ ] Verify workflow runs without errors on Forge after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/49	2026-01-23 17:56:24 -08:00
Erich Blume	6a436d141a	Update CI/CD plan: mark Phase 1 complete, add runner observability - Mark Phase 1 (Enable Actions) as completed with date - Check off all verification items in P1 - Add Step 6 to Phase 4 for runner logging and metrics - Update overview table with status column Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 17:10:14 -08:00
Erich Blume	7893c41020	Enable Forgejo Actions (Phase 1) (#48 ) ## Summary - Refactor Forgejo app.ini to be managed by ansible with secrets from 1Password - Enable Forgejo Actions in config (`[actions] ENABLED = true`) - Add `repo.actions` to DEFAULT_REPO_UNITS - Clean up unused MySQL database fields (we use SQLite) ## Phase 1 Progress This PR covers the first part of Phase 1 (ci-cd-bootstrap plan): - [x] Refactor app.ini to ansible template - [x] Store secrets in 1Password - [x] Enable Actions in config - [ ] Deploy config changes (pending review) - [ ] Create runner registration token - [ ] Deploy runner to k8s - [ ] Test with simple workflow ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags forgejo` to deploy - [ ] Verify Forgejo restarts correctly - [ ] Verify Actions tab appears in repo settings 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/48	2026-01-23 17:00:12 -08:00
Erich Blume	016f1043c8	Retire k8s-migration plan and create ci-cd-bootstrap plan	2026-01-23 14:13:01 -08:00
Erich Blume	25fa2ea665	Update indri-services-check	2026-01-22 21:31:11 -08:00
Erich Blume	272ddb213b	Add TeslaMate deployment for Tesla Model Y data logging (#47 ) ## Summary - Add TeslaMate k8s deployment with Tailscale ingress at tesla.tail8d86e.ts.net - Add teslamate user to CloudNativePG blumeops-pg cluster - Add TeslaMate PostgreSQL datasource to Grafana - Import 18 TeslaMate Grafana dashboards for charging, drives, efficiency, etc. - Add teslamate database to borgmatic backup configuration ## Deployment and Testing - [ ] Create 1Password items: "TeslaMate DB Password" and "TeslaMate Encryption Key" - [ ] Apply database user secret: `op inject -i argocd/manifests/databases/secret-teslamate.yaml.tpl | kubectl apply -f -` - [ ] Sync blumeops-pg: `argocd app sync blumeops-pg` - [ ] Create teslamate database - [ ] Apply teslamate secrets (encryption key, db connection) - [ ] Apply Grafana datasource secret: `op inject -i argocd/manifests/grafana-config/secret-teslamate-datasource.yaml.tpl | kubectl apply -f -` - [ ] Sync apps and teslamate: `argocd app sync apps teslamate grafana grafana-config` - [ ] Complete Tesla API OAuth flow at https://tesla.tail8d86e.ts.net - [ ] Verify data collection starts - [ ] Verify Grafana dashboards show data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/47	2026-01-22 21:25:44 -08:00
Erich Blume	11075d4517	Remove logfmt parsing stage from Alloy k8s config The stage.match selector wasn't preventing Alloy from logging decode errors internally. Removing logfmt parsing entirely - JSON parsing handles most structured logs, and plain text logs still get collected. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 18:06:34 -08:00
Erich Blume	e6de7ba391	Fix Alloy logfmt decode errors for JSON logs (#46 ) ## Summary - Use `stage.match` to conditionally apply logfmt parsing only to lines that don't start with `{` - This prevents error spam like `"failed to decode logfmt" component_path=/ component_id=loki.process.pods component=stage type=logfmt err="logfmt syntax error at pos 2 on line 1: unexpected '\"'"` when JSON-formatted logs hit the logfmt parser ## Deployment and Testing - [ ] Sync alloy-k8s app to feature branch and verify errors stop appearing - [ ] Verify JSON logs are still parsed correctly - [ ] Verify logfmt logs (from Loki, Prometheus etc.) are still parsed correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/46	2026-01-22 18:00:34 -08:00
Erich Blume	16bfe06b7b	Fix LaunchDaemon check to use become: true LaunchDaemons run in the system domain and require sudo to query. Without become: true, the check always fails and tries to reload. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 17:34:23 -08:00
Erich Blume	57bf8512dc	Log filtering cleanup and observability improvements (#45 ) ## Summary - Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue) - Disable thermal collector on indri Alloy (not supported on macOS M1) - Add macOS power/thermal metrics collection via powermetrics LaunchDaemon - Add Power & Thermal section to macOS Grafana dashboard - Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.) - Extract more fields from JSON logs (zot compatibility - uses "message" not "msg") - Silence logfmt parse errors for non-logfmt logs - Fix JSON escaping in devpi dashboard ## Deployment and Testing - [x] Deployed Alloy config changes to indri via ansible - [x] Synced alloy-k8s and grafana-config via ArgoCD - [x] Verified power metrics appearing in Prometheus - [x] Verified thermal collector errors stopped - [x] Verified logfmt parse errors silenced - [x] Verified devpi dashboard loads correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45	2026-01-22 17:30:08 -08:00
Erich Blume	af39067e1f	Pin ArgoCD to v3.2.6 (#44 ) ## Summary - Pin ArgoCD kustomization to v3.2.6 tag instead of `stable` branch - This gives intentional control over ArgoCD version upgrades ## Deployment and Testing - [ ] Sync the `apps` application: `argocd app sync apps` - [ ] Point argocd at feature branch: `argocd app set argocd --revision feature/pin-argocd-v3.2.6` - [ ] Sync argocd: `argocd app sync argocd` - [ ] Verify ArgoCD is running v3.2.6 - [ ] After merge, reset to main: `argocd app set argocd --revision main && argocd app sync argocd` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/44	2026-01-22 16:38:27 -08:00
Erich Blume	e4a8405de7	Observability cleanup and k8s service monitoring (#43 ) ## Summary - Remove stale `/opt/homebrew/var/loki` from borgmatic backup (Loki migrated to k8s) - Add Alloy k8s DaemonSet for automatic pod log collection with auto-discovery - Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd - Add transmission-exporter sidecar for full metrics (speed, torrent counts, ratios) - Replace stale devpi dashboard with probe-based metrics (status, response time, uptime) - Add unified "K8s Services Health" dashboard for service uptime/response monitoring ## Manual cleanup already performed - Deleted stale textfile metrics on indri: `devpi.prom`, `transmission.prom` - Deleted stale data directories on indri: `/opt/homebrew/var/loki/`, `/opt/homebrew/var/prometheus/` ## Deployment and Testing - [x] Sync `apps` application to pick up new alloy-k8s app - [x] Deploy alloy-k8s on feature branch: `argocd app set alloy-k8s --revision feature/observability-cleanup && argocd app sync alloy-k8s` - [x] Deploy torrent on feature branch (for transmission exporter): `argocd app set torrent --revision feature/observability-cleanup && argocd app sync torrent` - [x] Deploy prometheus on feature branch (for new scrape config): `argocd app set prometheus --revision feature/observability-cleanup && argocd app sync prometheus` - [x] Deploy grafana-config on feature branch (for dashboards): `argocd app set grafana-config --revision feature/observability-cleanup && argocd app sync grafana-config` - [x] Verify pod logs appear in Loki/Grafana - [x] Verify transmission metrics appear in Prometheus - [x] Verify service probe metrics appear in Prometheus - [x] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic config - [ ] After merge, reset apps to main and resync 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/43	2026-01-22 13:51:01 -08:00
Erich Blume	17023085cb	Migrate observability stack to Kubernetes (#42 ) Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack. Summary - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal) - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net) - Add ACL rule for port 9187 (CNPG metrics) - Delete obsolete ansible roles for prometheus and loki Changes - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints - pulumi/policy.hujson - ACL for port 9187 - Deleted ansible/roles/prometheus/ and ansible/roles/loki/ Deployment and Testing - Stop prometheus and loki on indri - Sync ArgoCD apps (apps, prometheus, loki, grafana) - Run mise run provision-indri -- --tags alloy - Verify Grafana dashboards show data 🤖 Generated with https://claude.ai/claude-code Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42	2026-01-22 12:06:02 -08:00
Erich Blume	5a829e0afd	Remove unused indri tags and ansible roles (#41 ) ## Summary - Remove ansible roles for services migrated to k8s: devpi, kiwix, transmission - Also remove unused node_exporter and podman ansible roles - Remove service tags from indri for k8s-hosted services (grafana, kiwix, devpi, pg, feed) - Update indri description to reflect current architecture ## Changes Ansible roles removed (34 files, ~1000 lines): - devpi, devpi_metrics - kiwix - transmission, transmission_metrics - node_exporter - podman Pulumi indri tags removed: - tag:grafana, tag:kiwix, tag:devpi, tag:pg, tag:feed These services now run in k8s with their own Tailscale devices via tailscale-operator. ## Deployment and Testing - [x] Verified remaining ansible roles match indri.yml - [x] Verified no playbooks or role dependencies reference removed roles - [ ] Run `pulumi preview` to verify tag changes - [ ] Run `pulumi up` to apply tag changes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/41	2026-01-21 20:18:53 -08:00
Erich Blume	6a140107c6	P7 forgejo plan updated	2026-01-21 20:04:18 -08:00
Erich Blume	4dd74dfff8	complete P6	2026-01-21 19:16:04 -08:00
Erich Blume	2e7ca8a5ff	Add mise task to list unresolved PR comments (#40 ) ## Summary - New `pr-comments` mise task queries Forge API for unresolved review comments on a PR - Task takes a PR number as argument and displays all comments without a resolver - Updated CLAUDE.md to include using this task after user reviews PRs ## Deployment and Testing - [x] Tested task on PR #39 (shows no unresolved comments since all were resolved) - [x] Tested error handling with non-existent PR #9999 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/40	2026-01-21 19:14:27 -08:00
Erich Blume	7ec98210a9	P6: Migrate Kiwix and Transmission to Kubernetes (#39 ) ## Summary - Add Transmission BitTorrent daemon to k8s (torrent namespace) - Add Kiwix ZIM archive server to k8s (kiwix namespace) - NFS storage from sifaka for shared torrent/ZIM data - Torrent-sync sidecar in kiwix deployment to manage declarative ZIM list - ZIM-watcher CronJob to auto-restart kiwix when new archives appear - Remove transmission, transmission_metrics, and kiwix ansible roles from indri - Remove svc:kiwix from tailscale_serve defaults ## Key Decisions - Direct NFS mount for kiwix (no PVC) since it shares storage with transmission - Shell wrapper for kiwix-serve command (glob expansion) - Accept HTTP 409 as "ready" in torrent sync (transmission session ID mechanism) - Completed downloads stored in `/downloads/complete/` on sifaka ## Deployment and Testing - [x] Deployed transmission to k8s - [x] Verified transmission web UI at torrent.tail8d86e.ts.net - [x] Moved existing ZIM files to complete folder - [x] Deployed kiwix to k8s - [x] Verified kiwix web UI at kiwix.tail8d86e.ts.net - [x] Stopped old services on indri - [x] Cleared svc:kiwix from Tailscale serve on indri - [x] Updated zk documentation 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/39	2026-01-21 18:07:40 -08:00
Erich Blume	89eff26301	complete P5.1	2026-01-21 16:32:01 -08:00
Erich Blume	21848a7919	P5.1: Migrate minikube from podman to QEMU2 driver (#38 ) ## Summary - Migrate minikube from podman driver to qemu2 driver for proper NFS/SMB volume mount support - Update ansible minikube role with qemu installation and containerd runtime - Remove podman role dependency from indri.yml - Add synology user creation steps and post-migration zot reconfiguration notes ## Why Phase 6 (Kiwix/Transmission migration) was blocked because the podman driver lacks kernel capabilities for filesystem mounts. QEMU2 creates an actual VM with full mount support. ## Deployment and Testing - [ ] Create k8s-storage user on Synology DSM - [ ] Store credentials in 1Password (synology-k8s-storage) - [ ] Export current k8s state - [ ] Stop and delete podman-based minikube cluster - [ ] Run ansible to create QEMU2 cluster - [ ] Test NFS volume mount with test pod - [ ] Redeploy ArgoCD and all apps - [ ] Verify all services healthy - [ ] Reconfigure zot registry mirrors for containerd (post-migration) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/38	2026-01-21 16:03:37 -08:00
Erich Blume	7b60cca31e	Document P6 blocker and add P5.1 QEMU2 migration plan (#37 ) ## Summary - Document P6 (Kiwix/Transmission) blocker: podman driver cannot mount external volumes - Add P5.1 plan to migrate minikube from podman to QEMU2 driver - Update overview with corrected phase statuses and driver information ## Background P6 implementation (`feature/p6-kiwix-transmission`) was completed but blocked because all volume mount approaches failed with the podman driver: | Approach | Result | |----------|--------| | NFS volume | Failed - CAP_SYS_ADMIN required | | SMB CSI driver | Failed - EPERM in rootless container | | `minikube mount` (9p) | Failed - permission denied | | hostPath | Failed - path doesn't exist in container | Root cause: Podman driver runs minikube in a rootless container lacking kernel capabilities for filesystem mounts. ## What's Next 1. Merge this documentation PR 2. Execute P5.1 (QEMU2 migration) in a fresh session 3. Retry P6 with the QEMU2 driver ## Deployment and Testing - [x] No deployment needed - documentation only - [x] ArgoCD apps reset to main - [x] Cluster healthy (except kiwix/transmission intentionally offline) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/37	2026-01-20 20:49:48 -08:00
Erich Blume	b97d461a5a	P6: Kiwix and Transmission migration planning (#35 ) ## Summary - Detailed planning document for Phase 6 of k8s migration - Transmission as standalone general-purpose torrent service with web UI at torrent.tail8d86e.ts.net - NFS storage on sifaka (/volume1/torrents) shared between both services - Declarative ZIM torrent list in kiwix's ConfigMap, synced to transmission via sidecar - ZIM watcher CronJob for automatic kiwix restart when new archives complete - Supports both GitOps (declarative) and interactive (web UI) torrent management ## Architecture Highlights - torrent namespace: Standalone transmission with Tailscale ingress - kiwix namespace: kiwix-serve with torrent-sync sidecar - Shared NFS PV: Single PV referenced by PVCs in both namespaces - No backup needed: Sifaka is RAID 5/6 and already the backup target ## Deployment and Testing - [ ] Review plan document - [ ] Verify NFS export on sifaka is feasible - [ ] Approve architecture decisions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/35	2026-01-20 18:42:11 -08:00
Erich Blume	f98103a58d	P5 done	2026-01-20 15:04:46 -08:00
Erich Blume	0439fbb704	P5: Migrate devpi to Kubernetes (#34 ) ## Summary - Migrate devpi PyPI caching proxy from indri LaunchAgent to Kubernetes - Custom container image with devpi-server + devpi-web + auto-init - StatefulSet with 50Gi PVC, Tailscale Ingress at pypi.tail8d86e.ts.net - Remove devpi from ansible playbooks and update CLAUDE.md with k8s workflow ## Key Changes - Add CRI-O registry mirror config for registry.tail8d86e.ts.net - Change ArgoCD apps to manual sync (was auto-sync causing issues) - 2Gi memory limit for Whoosh indexer (reclaimed after startup) ## Deployment and Testing - [x] devpi pod healthy in k8s - [x] pip install through proxy works - [x] mcquack 1.0.0 uploaded and installable - [x] Old devpi stopped on indri ## Post-Merge Reset ArgoCD to main: ``` argocd app set apps --revision main && argocd app sync apps argocd app set devpi --revision main && argocd app sync devpi ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/34	2026-01-20 14:55:37 -08:00
Erich Blume	b2307412fc	Add P4 implementation notes and mark complete	2026-01-20 09:10:23 -08:00
Erich Blume	735b643429	P4: Miniflux migration + PostgreSQL consolidation (#33 ) ## Summary - Deploy miniflux in k8s via ArgoCD - Expose via Tailscale Ingress at feed.tail8d86e.ts.net - Retire brew PostgreSQL (no longer needed) - Rename k8s-pg to pg (canonical hostname) - Remove ansible miniflux and postgresql roles - Update borgmatic to backup pg.tail8d86e.ts.net - Update all zk documentation ## Deployment and Testing - [x] Miniflux pod running in k8s - [x] User login works at https://feed.tail8d86e.ts.net - [x] Feeds and entries visible - [x] brew miniflux and postgresql stopped - [x] Tailscale services migrated (feed, pg) - [x] zk documentation updated - [x] Run ansible to apply role removals - [ ] Verify borgmatic backup with new pg hostname 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/33	2026-01-20 09:04:47 -08:00
Erich Blume	463f476374	P3 done Updated P3_postgresql.complete.md with full implementation notes including: - borgmatic borg path fix - Disaster recovery testing - CloudNativePG managed roles for borgmatic user - Dual database backup configuration - ACL grant for homelab → k8s - ArgoCD selfHeal disabled for feature branch workflow - CNPG default values to prevent drift Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 18:19:33 -08:00
Erich Blume	e69a3df2d4	P3 done	2026-01-19 18:03:48 -08:00
Erich Blume	0c6f0a13c3	Add CNPG default values to prevent ArgoCD drift CloudNativePG operator fills in connectionLimit, ensure, and inherit defaults on managed roles. Adding these explicitly keeps ArgoCD in sync. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 18:02:42 -08:00
Erich Blume	eb952aae01	P3: PostgreSQL disaster recovery test and borgmatic k8s-pg backup (#32 ) ## Summary - Fixed borgmatic `borg: command not found` by adding `local_path` config option - Successfully tested disaster recovery: restored miniflux data from borgmatic backup to k8s-pg - Added borgmatic user to k8s-pg via CloudNativePG managed roles - Configured borgmatic to backup both localhost and k8s-pg PostgreSQL databases - Added Tailscale ACL grant for `tag:homelab` → `tag:k8s` on port 5432 - Disabled selfHeal on apps app to allow manual revision changes during development ## Changes - `ansible/roles/borgmatic/` - Added `local_path` and k8s-pg database entry - `ansible/roles/postgresql/tasks/main.yml` - Added k8s-pg to `.pgpass` - `argocd/apps/apps.yaml` - Disabled selfHeal - `argocd/manifests/databases/blumeops-pg.yaml` - Added borgmatic managed role - `argocd/manifests/databases/secret-borgmatic.yaml.tpl` - New secret template - `pulumi/policy.hujson` - Added ACL grant for backup access ## Deployment and Testing - [x] Borgmatic backup runs successfully - [x] Miniflux data restored to k8s-pg (2 users, 2 feeds, 44 entries verified) - [x] borgmatic user created in k8s-pg with pg_read_all_data role - [x] Both localhost and k8s-pg databases in backup archive - [x] zk documentation updated (borgmatic.md, postgresql.md) - [ ] After merge: set blumeops-pg app back to main revision 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/32	2026-01-19 18:00:32 -08:00
Erich Blume	f2541c3f77	Fix minikube role idempotency for zot mirror config (#31 ) ## Summary - Fixed trailing newline mismatch in config comparison (ansible command module strips whitespace, slurp preserves it) - Only copy temp file when config actually needs updating (avoids spurious changes) - Task now properly skips when config is already correct ## Deployment and Testing - [x] Verified idempotency: `changed=0` on repeated runs - [x] Verified change detection: corrupted config triggers proper update - [x] ansible-lint passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/31	2026-01-19 16:19:52 -08:00
Erich Blume	130c044523	Fix hanging minikube provision	2026-01-19 15:49:11 -08:00
Erich Blume	f0c28a3cdd	Rename P2 plan to .complete.md	2026-01-19 15:06:27 -08:00
Erich Blume	45dfefa8df	Mark P2 complete with implementation notes Documents lessons learned: - SSH credential template for all forge repos - Kustomize patches must omit namespace for matching - Tailscale hostname cutover requires manual admin console deletion - ArgoCD workflow: all apps target main, manual sync for control Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 15:06:14 -08:00
Erich Blume	258c88f2f7	Fix kustomize patch: remove namespace for proper matching Kustomize matches patches before namespace transformation, so the patch file shouldn't specify namespace (kustomization.yaml adds it). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 15:00:33 -08:00
Erich Blume	623b122f58	Fix kustomization: known_hosts as resource not patch The argocd-ssh-known-hosts-cm ConfigMap needs to be a resource, not a patch, because the upstream install.yaml includes it inline in a way kustomize can't patch. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-19 14:45:33 -08:00
Erich Blume	7e6742ad24	K8s Migration Phase 2: Grafana to Kubernetes (#30 ) ## Summary - Migrate Grafana from Homebrew/Ansible to Kubernetes deployment - Switch CloudNativePG to use forge-mirrored Helm chart (HTTPS, no auth needed) - Add Grafana Helm chart deployment via ArgoCD with multi-source pattern - Add Grafana config (Tailscale Ingress, 9 dashboard ConfigMaps) - Update Loki to bind 0.0.0.0 for k8s pod access via `host.containers.internal` ## Key Changes - `argocd/apps/grafana.yaml` - Grafana Helm chart Application - `argocd/apps/grafana-config.yaml` - Ingress + dashboard ConfigMaps - `argocd/apps/cloudnative-pg.yaml` - Now uses forge mirror instead of external Helm repo - `ansible/roles/loki/templates/loki-config.yaml.j2` - Bind 0.0.0.0 ## Deployment and Testing - [x] Deploy Loki config change: `mise run provision-indri -- --tags loki` - [x] Create namespace: `ki create namespace monitoring` - [x] Create secret: `op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl | ki apply -f -` - [x] Sync ArgoCD apps (grafana, grafana-config) - [x] Verify Grafana works at https://grafana.tail8d86e.ts.net - [x] Remove svc:grafana from ansible tailscale_serve - [x] Stop brew grafana: `ssh indri 'brew services stop grafana'` - [x] Delete ansible grafana role 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/30	2026-01-19 14:40:25 -08:00
Erich Blume	4c1c4b92e1	Scan full repo history in trufflehog	2026-01-19 10:12:56 -08:00
Erich Blume	680ad1095b	Rename P1 to complete	2026-01-19 10:03:52 -08:00
Erich Blume	a8f4d00294	K8s Migration Phase 1: Infrastructure Setup (#29 ) ## Summary - Split k8s migration plan into phases folder for easier navigation - Added `tag:k8s` to Pulumi ACLs for Kubernetes workloads - Phase 1 work in progress ## Phase 1 Goals - Tailscale Kubernetes Operator - CloudNativePG Operator - PostgreSQL cluster for future app migrations ## Deployment and Testing - [ ] Review Phase 1 plan - [ ] `mise run tailnet-preview` to verify ACL changes - [ ] `mise run tailnet-up` to apply ACL changes - [ ] Create Tailscale OAuth client (manual) - [ ] Deploy operators and PostgreSQL cluster 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/29	2026-01-19 09:49:52 -08:00
Erich Blume	61dced048b	Fix borgmatic-metrics script PATH issue (#28 ) ## Summary - Fixed borgmatic-metrics script failing in LaunchAgent context - Changed from `mise x -- borg` to absolute paths (`/opt/homebrew/bin/borg`, `/opt/homebrew/bin/jq`) - This fixes the Grafana dashboard showing "DOWN" for Repository Status and missing time series data ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags borgmatic-metrics` to deploy the fix - [ ] Wait for the hourly metrics collection (or manually run `ssh indri '~/bin/borgmatic-metrics'`) - [ ] Verify Grafana dashboard shows "UP" status and populated graphs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/28	2026-01-18 14:57:35 -08:00
Erich Blume	3679124ebd	Expose Kubernetes API as Tailscale service (Step 0.14) (#27 ) ## Summary - Add `tag:k8s-api` to Pulumi ACLs and indri device tags - Configure Tailscale serve with TCP passthrough for k8s API at `k8s.tail8d86e.ts.net` - Update minikube role to include `k8s.tail8d86e.ts.net` in certificate SANs - Add `apiserver_port` config option (internal port 6443, dynamic host port with podman driver) - Document Step 0.14 in k8s-migration plan (added post-Phase 0 completion) The Kubernetes API is now accessible at `https://k8s.tail8d86e.ts.net` using TCP passthrough to preserve mTLS authentication. ## Deployment and Testing - [x] Pulumi ACLs applied - [x] Tailscale service created and approved in admin console - [x] Minikube cluster recreated with new cert SANs - [x] tailscale serve configured with TCP passthrough - [x] 1Password credentials updated with new certs - [x] Kubeconfig updated on gilbert - [x] `mise run indri-services-check` passes - [x] `kubectl --context=minikube-indri get nodes` works via Tailscale 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/27	2026-01-18 12:49:20 -08:00
Erich Blume	19a82373d5	K8s Migration Phase 0: Foundation Infrastructure (#26 ) ## Summary - Step 0.1: Update Pulumi ACLs with tag:registry - Step 0.3: Create Zot registry ansible role with mcquack LaunchAgent - Step 0.4: Add Zot to Tailscale Serve configuration - Step 0.5: Create Zot metrics role for Prometheus scraping - Step 0.6: Add Zot log collection to Alloy - Step 0.7: Update indri-services-check with zot checks - Step 0.8: Add podman role for container runtime - Step 0.9: Add minikube role for Kubernetes cluster - Step 0.10: Configure remote kubectl access with 1Password credentials ## Remaining Steps - [ ] Step 0.11: Add minikube to indri-services-check - [ ] Step 0.12: Create zettelkasten documentation - [ ] Step 0.13: Verify main playbook (already done - roles added) ## Deployment and Testing - [x] Zot registry deployed and accessible at https://registry.tail8d86e.ts.net - [x] Podman machine running on indri - [x] Minikube cluster running on indri - [x] kubectl access from gilbert working with 1Password credentials - [ ] indri-services-check passes all checks 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/26	2026-01-18 12:06:28 -08:00
Erich Blume	ee196b0c10	Fix Phase 0 plan based on review feedback (#25 ) ## Summary - Step 0.3: Use launchctl unload/load pattern for handlers (consistent with existing handlers) - Step 0.6: Correct file path - add zot logs to alloy defaults/main.yml - Step 0.9: Use cri-o runtime instead of containerd - Step 0.10: Simplify kubeconfig instructions - focus on goal not implementation ## Deployment and Testing - [x] Documentation-only change, no deployment needed 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/25	2026-01-17 20:07:10 -08:00
Erich Blume	c8433467c1	Add Kubernetes migration plan documentation (#24 ) ## Summary - Comprehensive phased plan for migrating blumeops services to minikube - Technical decisions documented: Zot registry, Podman driver, CloudNativePG, Tailscale Operator - 9 migration phases with verification and rollback procedures - LaunchAgent absolute path requirements documented - Observability requirements (zk docs, logging, metrics, dashboards) for new services ## Deployment and Testing - [x] Plan document created at `docs/k8s-migration.md` - [ ] Review plan phases for completeness - [ ] Validate technical decisions align with requirements 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/24	2026-01-17 17:34:53 -08:00
Erich Blume	e6d302b40b	Harden Tailscale ACL policy with least-privilege grants (#23 ) ## Summary - Replace permissive wildcard ACL with specific service grants - Admin: full access to all services including NAS - Member: user-facing services only (no Grafana/Loki/NAS) - Add device tagging for gilbert (workstation) and sifaka (NAS) via Pulumi - SSH hardening: remove root access, use "check" action with MFA - Add ACL tests to validate policy behavior ## Deployment and Testing - [x] Pulumi preview passes - [x] HuJSON syntax validated - [x] ACL tests defined and passing - [ ] Deploy with `mise run tailnet-up` - [ ] Verify SSH access from gilbert to indri - [ ] Verify Allison cannot access Grafana/Loki/NAS 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/23	2026-01-17 11:58:04 -08:00
Erich Blume	0918764e93	Rename Node Exporter dashboard to macOS (#22 ) ## Summary - Renamed dashboard from "Node Exporter - macOS" to just "macOS" since it now uses Alloy - Updated filename, title, uid, and tags to reflect the change ## Deployment and Testing - [ ] Deploy with `mise run provision-indri -- --tags grafana` - [ ] Verify dashboard accessible at https://grafana.tail8d86e.ts.net 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/22	2026-01-17 09:29:19 -08:00

104 commits