blumeops

Author	SHA1	Message	Date
Erich Blume	16bfe06b7b	Fix LaunchDaemon check to use become: true LaunchDaemons run in the system domain and require sudo to query. Without become: true, the check always fails and tries to reload. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 17:34:23 -08:00
Erich Blume	57bf8512dc	Log filtering cleanup and observability improvements (#45) ## Summary - Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue) - Disable thermal collector on indri Alloy (not supported on macOS M1) - Add macOS power/thermal metrics collection via powermetrics LaunchDaemon - Add Power & Thermal section to macOS Grafana dashboard - Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.) - Extract more fields from JSON logs (zot compatibility - uses "message" not "msg") - Silence logfmt parse errors for non-logfmt logs - Fix JSON escaping in devpi dashboard ## Deployment and Testing - [x] Deployed Alloy config changes to indri via ansible - [x] Synced alloy-k8s and grafana-config via ArgoCD - [x] Verified power metrics appearing in Prometheus - [x] Verified thermal collector errors stopped - [x] Verified logfmt parse errors silenced - [x] Verified devpi dashboard loads correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45	2026-01-22 17:30:08 -08:00
Erich Blume	17023085cb	Migrate observability stack to Kubernetes (#42) Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack. Summary - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal) - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net) - Add ACL rule for port 9187 (CNPG metrics) - Delete obsolete ansible roles for prometheus and loki Changes - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints - pulumi/policy.hujson - ACL for port 9187 - Deleted ansible/roles/prometheus/ and ansible/roles/loki/ Deployment and Testing - Stop prometheus and loki on indri - Sync ArgoCD apps (apps, prometheus, loki, grafana) - Run mise run provision-indri -- --tags alloy - Verify Grafana dashboards show data 🤖 Generated with https://claude.ai/claude-code Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42	2026-01-22 12:06:02 -08:00
Erich Blume	0439fbb704	P5: Migrate devpi to Kubernetes (#34) ## Summary - Migrate devpi PyPI caching proxy from indri LaunchAgent to Kubernetes - Custom container image with devpi-server + devpi-web + auto-init - StatefulSet with 50Gi PVC, Tailscale Ingress at pypi.tail8d86e.ts.net - Remove devpi from ansible playbooks and update CLAUDE.md with k8s workflow ## Key Changes - Add CRI-O registry mirror config for registry.tail8d86e.ts.net - Change ArgoCD apps to manual sync (was auto-sync causing issues) - 2Gi memory limit for Whoosh indexer (reclaimed after startup) ## Deployment and Testing - [x] devpi pod healthy in k8s - [x] pip install through proxy works - [x] mcquack 1.0.0 uploaded and installable - [x] Old devpi stopped on indri ## Post-Merge Reset ArgoCD to main: ``` argocd app set apps --revision main && argocd app sync apps argocd app set devpi --revision main && argocd app sync devpi ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/34	2026-01-20 14:55:37 -08:00
Erich Blume	735b643429	P4: Miniflux migration + PostgreSQL consolidation (#33) ## Summary - Deploy miniflux in k8s via ArgoCD - Expose via Tailscale Ingress at feed.tail8d86e.ts.net - Retire brew PostgreSQL (no longer needed) - Rename k8s-pg to pg (canonical hostname) - Remove ansible miniflux and postgresql roles - Update borgmatic to backup pg.tail8d86e.ts.net - Update all zk documentation ## Deployment and Testing - [x] Miniflux pod running in k8s - [x] User login works at https://feed.tail8d86e.ts.net - [x] Feeds and entries visible - [x] brew miniflux and postgresql stopped - [x] Tailscale services migrated (feed, pg) - [x] zk documentation updated - [x] Run ansible to apply role removals - [ ] Verify borgmatic backup with new pg hostname 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/33	2026-01-20 09:04:47 -08:00
Erich Blume	19a82373d5	K8s Migration Phase 0: Foundation Infrastructure (#26) ## Summary - Step 0.1: Update Pulumi ACLs with tag:registry - Step 0.3: Create Zot registry ansible role with mcquack LaunchAgent - Step 0.4: Add Zot to Tailscale Serve configuration - Step 0.5: Create Zot metrics role for Prometheus scraping - Step 0.6: Add Zot log collection to Alloy - Step 0.7: Update indri-services-check with zot checks - Step 0.8: Add podman role for container runtime - Step 0.9: Add minikube role for Kubernetes cluster - Step 0.10: Configure remote kubectl access with 1Password credentials ## Remaining Steps - [ ] Step 0.11: Add minikube to indri-services-check - [ ] Step 0.12: Create zettelkasten documentation - [ ] Step 0.13: Verify main playbook (already done - roles added) ## Deployment and Testing - [x] Zot registry deployed and accessible at https://registry.tail8d86e.ts.net - [x] Podman machine running on indri - [x] Minikube cluster running on indri - [x] kubectl access from gilbert working with 1Password credentials - [ ] indri-services-check passes all checks 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/26	2026-01-18 12:06:28 -08:00
Erich Blume	75426be1dc	Remove ansible role meta dependencies to fix duplicate execution (#20) ## Summary - Remove all `meta/main.yml` dependencies from ansible roles - Role ordering is now controlled entirely by `indri.yml` playbook - Fix incorrect roles path in CLAUDE.md (`playbooks/roles` → `roles`) ## Why Ansible's tag accumulation behavior prevents proper role deduplication when using meta dependencies. When a role is pulled in as a dependency, the parent role's tags are added to the dependency's tags (e.g., `[loki]` becomes `[alloy, loki]`), making them appear as different invocations to Ansible and causing roles to run multiple times. ## Deployment and Testing - [x] Verified with `ansible-playbook --list-tasks` that each role now appears exactly once - [x] Run full provision to verify no regressions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/20	2026-01-16 22:50:34 -08:00
Erich Blume	9931829d03	Add pre-commit hooks for code quality (#19) ## Summary - Add pre-commit framework with hooks for YAML, Ansible, Python, shell, TOML, JSON, and secret detection - Fix all 91+ ansible-lint violations (variable naming, handler capitalization, changed_when) - Fix shellcheck warnings in mise-tasks scripts - Document pre-commit setup in README.md ## Deployment and Testing - [x] All pre-commit hooks pass (`uvx pre-commit run --all-files`) - [x] Test ansible playbook with `--check` mode - [x] Run `mise run indri-services-check` after deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/19	2026-01-16 19:33:02 -08:00
Erich Blume	adf6f4fbe9	Add PostgreSQL and Miniflux services to tailnet (#16) ## Summary - Add PostgreSQL 18 as a new service at `pg.tail8d86e.ts.net:5432` - Add Miniflux RSS/Atom feed reader at `feed.tail8d86e.ts.net` - Both services managed via homebrew/brew services - Pulumi ACL tags added (tag:pg, tag:feed) - Alloy log collection configured for both services - Zettelkasten documentation updated ## Manual Setup Required Before running ansible, the following steps are needed on indri: ### 1. Apply Pulumi tags ```bash mise run tailnet-up ``` Then apply tags to indri in Tailscale admin console. ### 2. Create 1Password entries - miniflux PostgreSQL user password - miniflux admin password (for first run) ### 3. Set PostgreSQL user password (after ansible installs postgres) ```bash ssh indri '/opt/homebrew/opt/postgresql@18/bin/psql -c "ALTER USER miniflux PASSWORD '\''your-password'\'';"' ``` ### 4. Create password files on indri ```bash ssh indri 'echo "your-db-password" > ~/.miniflux-db-password && chmod 600 ~/.miniflux-db-password' ssh indri 'echo "your-admin-password" > ~/.miniflux-admin-password && chmod 600 ~/.miniflux-admin-password' ``` ### 5. Create ~/.pgpass for borgmatic ```bash ssh indri 'echo "localhost:5432:miniflux:miniflux:YOUR_PASSWORD" > ~/.pgpass && chmod 600 ~/.pgpass' ``` ### 6. Run ansible with first-run admin creation ```bash mise run provision-indri -- -e miniflux_create_admin=1 ``` ### 7. Update borgmatic config Add to `~/.config/borgmatic/config.yaml` on indri: ```yaml postgresql_databases: - name: miniflux hostname: localhost port: 5432 username: miniflux ``` ### 8. Cleanup after first run ```bash ssh indri 'rm ~/.miniflux-admin-password' ``` ## Test plan - [ ] Run `mise run tailnet-up` and verify Pulumi changes - [ ] Apply tags to indri in Tailscale admin - [ ] Run `mise run provision-indri -- --check --diff` for dry run - [ ] Run `mise run provision-indri -- -e miniflux_create_admin=1` - [ ] Approve services in Tailscale admin - [ ] Verify PostgreSQL: `ssh indri '/opt/homebrew/opt/postgresql@18/bin/pg_isready'` - [ ] Verify Miniflux: `curl https://feed.tail8d86e.ts.net/healthcheck` - [ ] Run `mise run indri-services-check` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/16	2026-01-16 12:30:20 -08:00
Erich Blume	ae1513e7e9	Add Plex Media Server observability (#13) ## Summary - Add `plex_metrics` ansible role with textfile collector for Prometheus metrics - Add Plex log collection to Alloy (forwards to Loki) - Add Grafana dashboard for Plex monitoring (status, library counts, sessions, transcoding, logs) ## Metrics Collected - `plex_up` - server health - `plex_version_info` - server version - `plex_sessions_total/playing/paused` - active sessions - `plex_transcode_sessions_total/video/audio` - transcoding status - `plex_library_items{library,type}` - library item counts ## Prerequisites Plex token must be stored at `~/.plex-token` on indri (already done). ## Test plan - [x] Dry-run passed (`mise run provision-indri -- --check --diff`) - [ ] Apply changes (`mise run provision-indri`) - [ ] Verify metrics: `ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/plex.prom'` - [ ] Verify logs in Grafana Explore: `{service="plex"}` - [ ] Check Plex dashboard in Grafana 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/13	2026-01-15 15:27:59 -08:00
Erich Blume	2a1359a3b6	Fix ansible handler timeouts for alloy and loki restarts (#12) ## Summary - Use async with poll: 0 for alloy and loki restart handlers - Fire-and-forget approach prevents ansible from hanging on graceful shutdown ## Test plan - [x] Manually verified `brew services restart grafana-alloy` works - [x] Run full ansible playbook and verify it completes without timeout 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/12	2026-01-15 13:56:11 -08:00
Erich Blume	ba5cd75ee2	Fix ansible handler timeouts for alloy and loki restarts Use async with poll: 0 to fire-and-forget service restarts. These services have graceful shutdown periods that can exceed ansible's default command timeout. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-15 12:39:28 -08:00
Erich Blume	242c1880de	Add Grafana Alloy and Loki for unified observability (#11) ## Summary - Add Grafana Alloy to replace node_exporter for metrics collection - Add Loki for log aggregation and storage - Configure Alloy to collect logs from all services (grafana, forgejo, prometheus, tailscale, transmission, devpi, kiwix, borgmatic) - Update Prometheus to accept metrics via remote_write - Add Loki datasource to Grafana ## Test plan - [ ] Run `mise run provision-indri -- --check --diff` to verify changes - [ ] Apply with `mise run provision-indri` - [ ] Verify services: `mise run indri-services-check` - [ ] Check Grafana Explore with Loki datasource - [ ] Query logs: `{service="grafana"}` - [ ] Verify metrics still flowing to Prometheus dashboards 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/11	2026-01-15 12:24:13 -08:00

13 commits