blumeops

Author	SHA1	Message	Date
Erich Blume	d7af004842	Add Forgejo metrics + upstream latency histogram to Fly proxy dashboard - Enable Forgejo /metrics endpoint (app.ini [metrics] section) - Add Alloy scrape target for Forgejo metrics on indri - Add upstream_response_time histogram to Fly proxy Alloy config - Replace single p95 panel with p50/p90/p99 + upstream breakdown filtered to forge.eblu.me host Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:05:59 -07:00
Erich Blume	ab8ea6f301	Bump Grafana Alloy to v1.14.0 (#292 ) ## Summary - Bump alloy-k8s, alloy-ringtail, and alloy-tracing-ringtail image tags from v1.13.1 to v1.14.0 - Mark indri alloy (ansible) as reviewed at v1.14.0 — source rebuild from forge mirror needed - Add missing alloy-ringtail entry to service-versions.yaml - Update alloy reference doc ## Breaking changes reviewed - `loki.secretfilter` options removed — not used in our configs - OTel Collector upgraded to v0.142.0 — Kafka receiver changes don't affect us - Exporter queue default changes — our tracing pipeline (Beyla → batch → otlphttp) uses simple config, low risk ## Deployment and Testing - [ ] Sync alloy-k8s: `argocd app set alloy-k8s --revision bump/alloy-v1.14.0 && argocd app sync alloy-k8s` - [ ] Sync alloy-ringtail: `argocd app set alloy-ringtail --revision bump/alloy-v1.14.0 --server ringtail-argocd && argocd app sync alloy-ringtail` - [ ] Sync alloy-tracing-ringtail similarly - [ ] Verify metrics flowing in Grafana - [ ] Verify traces flowing to Tempo (ringtail) - [ ] Rebuild indri alloy from source (`v1.14.0` tag on forge mirror), SCP to indri, restart - [ ] After merge: reset ArgoCD revisions to main, re-sync Reviewed-on: #292	2026-03-13 16:25:27 -07:00
Erich Blume	5f9bc20345	Fix mirror org refs in ArgoCD apps and widen credential template (#266 ) ## Summary - Widen `repo-creds-forge` URL prefix from `/eblume/` to host-wide `/` so it matches repos in all forge orgs (fixes `mirrors/` repos not getting SSH credentials) - Update 8 ArgoCD app definitions from `eblume/<mirror>` → `mirrors/<mirror>` (immich-charts, cloudnative-pg-charts, external-secrets, connect-helm-charts) - Fix stale alloy clone comment in Ansible defaults - Bump immich v2.5.2 → v2.5.6 (bug-fix patches only) - Update ArgoCD README bootstrap command and credential docs ## Context Mirrors were migrated from `forge.ops.eblu.me/eblume/` to `forge.ops.eblu.me/mirrors/` in commit ``cd57814``. Container Dockerfiles and image tags were updated, but ArgoCD app definitions and the repo credential template were missed, causing `ComparisonError` on apps that source Helm charts from mirrored repos. ## Deployment 1. Sync the ArgoCD `argocd` app first (picks up the widened credential template) 2. Sync the `apps` app (picks up new repo URLs for all 8 apps) 3. Verify immich resolves its ComparisonError: `argocd app get immich` 4. Sync immich to deploy v2.5.6: `argocd app sync immich` 5. Spot-check: `argocd app get external-secrets`, `argocd app get cloudnative-pg`, `argocd app get 1password-connect` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/266	2026-02-25 06:55:53 -08:00
Erich Blume	b8b33b76c8	Remove Plex media server (#78 ) ## Summary - Remove plex_metrics ansible role - Remove Plex Grafana dashboard - Remove Plex log collection from Alloy config - Update indri-services-check to check Jellyfin instead of Plex ## Deployment and Testing - [x] Unloaded plex-metrics LaunchAgent on indri - [x] Deleted plex-metrics plist and script - [x] Deleted plex.prom textfile - [ ] Deploy Alloy config update - [ ] Sync grafana-config to remove dashboard 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/78	2026-01-30 17:06:00 -08:00
Erich Blume	bcc8685316	Add Jellyfin media server deployment (#77 ) ## Summary - Add Jellyfin ansible role for native macOS deployment via Homebrew cask - Add jellyfin_metrics role for Prometheus textfile metrics collection - Add Caddy routing for jellyfin.ops.eblu.me - Add Alloy log collection for Jellyfin stdout/stderr - Add Grafana dashboard for Jellyfin monitoring ## Architecture Jellyfin runs natively on indri (not in k8s) for full VideoToolbox hardware transcoding support. The M1 Mac Mini can handle ~3 concurrent 4K HDR→SDR transcoding streams. ## Deployment and Testing - [ ] Deploy Jellyfin: `mise run provision-indri -- --tags jellyfin,jellyfin_metrics,caddy,alloy` - [ ] Sync Grafana dashboard: `argocd app sync grafana-config` - [ ] Complete Jellyfin setup wizard at https://jellyfin.ops.eblu.me - [ ] Generate API key and save to `~/.jellyfin-api-key` - [ ] Add media libraries (/Volumes/allisonflix/Movies, /Volumes/allisonflix/TV) - [ ] Enable VideoToolbox hardware transcoding - [ ] Verify metrics in Grafana dashboard - [ ] Verify logs in Loki: `{service="jellyfin"}` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/77	2026-01-30 16:57:26 -08:00
Erich Blume	66badfafd1	Migrate k8s services to Caddy (.ops.eblu.me) (#59 ) ## Summary - Add Caddy reverse proxy routes for all k8s services (grafana, argocd, prometheus, loki, miniflux, devpi, kiwix, torrent, teslamate) - Add PostgreSQL via Caddy L4 TCP proxy on port 5432 - Caddy proxies to existing Tailscale endpoints - traffic stays local on indri - Both `.ops.eblu.me` and `.tail8d86e.ts.net` URLs continue to work ## Updated References - Alloy: prometheus/loki push endpoints → `.ops.eblu.me` - Borgmatic: PostgreSQL backup host → `pg.ops.eblu.me` - Devpi: DEVPI_OUTSIDE_URL → `pypi.ops.eblu.me` - indri-services-check: health check URLs - CLAUDE.md: argocd login command ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags caddy` to deploy new Caddy config - [ ] Test HTTP services: `curl https://grafana.ops.eblu.me/api/health` - [ ] Test PostgreSQL: `pg_isready -h pg.ops.eblu.me -p 5432` - [ ] Run `mise run provision-indri -- --tags alloy` to update Alloy endpoints - [ ] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic - [ ] Sync devpi in ArgoCD: `argocd app sync devpi` - [ ] Re-login to ArgoCD: `argocd login argocd.ops.eblu.me ...` - [ ] Run `mise run indri-services-check` to verify all services 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/59	2026-01-25 12:56:31 -08:00
Erich Blume	1184b4de1d	Add Caddy layer4 for Forgejo SSH (#56 ) ## Summary - Add layer4 TCP proxy configuration to Caddyfile template for SSH services - Configure Forgejo SSH on port 2222 → localhost:2200 - Switch HTTPS from port 8443 (testing) to 443 (production) - Requires Caddy rebuilt with `github.com/mholt/caddy-l4` plugin ## What This Enables Git+SSH access via `forge.ops.eblu.me:2222` is now accessible from: - Tailnet clients (gilbert) - Docker containers on indri - Kubernetes pods in minikube This solves the DNS resolution issues where containers couldn't reach Tailscale MagicDNS names. ## Testing Done - [x] Caddy rebuilt with layer4 plugin - [x] Validated Caddyfile syntax - [x] Cleared `svc:forge` from tailscale serve - [x] Verified HTTPS works: `curl https://forge.ops.eblu.me` - [x] Verified SSH works: `ssh -p 2222 forgejo@forge.ops.eblu.me` - [x] Verified git clone works via new endpoint - [x] Verified minikube pods can reach both HTTPS and SSH endpoints ## Deployment Caddy is already running with the new config on indri. This PR captures the ansible changes. ## Next Steps - Update zk docs with new git remote format - Migrate registry and other services to Caddy - Retire tailscale_services ansible role 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/56	2026-01-25 11:37:23 -08:00
Erich Blume	57bf8512dc	Log filtering cleanup and observability improvements (#45 ) ## Summary - Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue) - Disable thermal collector on indri Alloy (not supported on macOS M1) - Add macOS power/thermal metrics collection via powermetrics LaunchDaemon - Add Power & Thermal section to macOS Grafana dashboard - Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.) - Extract more fields from JSON logs (zot compatibility - uses "message" not "msg") - Silence logfmt parse errors for non-logfmt logs - Fix JSON escaping in devpi dashboard ## Deployment and Testing - [x] Deployed Alloy config changes to indri via ansible - [x] Synced alloy-k8s and grafana-config via ArgoCD - [x] Verified power metrics appearing in Prometheus - [x] Verified thermal collector errors stopped - [x] Verified logfmt parse errors silenced - [x] Verified devpi dashboard loads correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45	2026-01-22 17:30:08 -08:00
Erich Blume	17023085cb	Migrate observability stack to Kubernetes (#42 ) Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack. Summary - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal) - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net) - Add ACL rule for port 9187 (CNPG metrics) - Delete obsolete ansible roles for prometheus and loki Changes - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints - pulumi/policy.hujson - ACL for port 9187 - Deleted ansible/roles/prometheus/ and ansible/roles/loki/ Deployment and Testing - Stop prometheus and loki on indri - Sync ArgoCD apps (apps, prometheus, loki, grafana) - Run mise run provision-indri -- --tags alloy - Verify Grafana dashboards show data 🤖 Generated with https://claude.ai/claude-code Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42	2026-01-22 12:06:02 -08:00
Erich Blume	0439fbb704	P5: Migrate devpi to Kubernetes (#34 ) ## Summary - Migrate devpi PyPI caching proxy from indri LaunchAgent to Kubernetes - Custom container image with devpi-server + devpi-web + auto-init - StatefulSet with 50Gi PVC, Tailscale Ingress at pypi.tail8d86e.ts.net - Remove devpi from ansible playbooks and update CLAUDE.md with k8s workflow ## Key Changes - Add CRI-O registry mirror config for registry.tail8d86e.ts.net - Change ArgoCD apps to manual sync (was auto-sync causing issues) - 2Gi memory limit for Whoosh indexer (reclaimed after startup) ## Deployment and Testing - [x] devpi pod healthy in k8s - [x] pip install through proxy works - [x] mcquack 1.0.0 uploaded and installable - [x] Old devpi stopped on indri ## Post-Merge Reset ArgoCD to main: ``` argocd app set apps --revision main && argocd app sync apps argocd app set devpi --revision main && argocd app sync devpi ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/34	2026-01-20 14:55:37 -08:00
Erich Blume	735b643429	P4: Miniflux migration + PostgreSQL consolidation (#33 ) ## Summary - Deploy miniflux in k8s via ArgoCD - Expose via Tailscale Ingress at feed.tail8d86e.ts.net - Retire brew PostgreSQL (no longer needed) - Rename k8s-pg to pg (canonical hostname) - Remove ansible miniflux and postgresql roles - Update borgmatic to backup pg.tail8d86e.ts.net - Update all zk documentation ## Deployment and Testing - [x] Miniflux pod running in k8s - [x] User login works at https://feed.tail8d86e.ts.net - [x] Feeds and entries visible - [x] brew miniflux and postgresql stopped - [x] Tailscale services migrated (feed, pg) - [x] zk documentation updated - [x] Run ansible to apply role removals - [ ] Verify borgmatic backup with new pg hostname 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/33	2026-01-20 09:04:47 -08:00
Erich Blume	19a82373d5	K8s Migration Phase 0: Foundation Infrastructure (#26 ) ## Summary - Step 0.1: Update Pulumi ACLs with tag:registry - Step 0.3: Create Zot registry ansible role with mcquack LaunchAgent - Step 0.4: Add Zot to Tailscale Serve configuration - Step 0.5: Create Zot metrics role for Prometheus scraping - Step 0.6: Add Zot log collection to Alloy - Step 0.7: Update indri-services-check with zot checks - Step 0.8: Add podman role for container runtime - Step 0.9: Add minikube role for Kubernetes cluster - Step 0.10: Configure remote kubectl access with 1Password credentials ## Remaining Steps - [ ] Step 0.11: Add minikube to indri-services-check - [ ] Step 0.12: Create zettelkasten documentation - [ ] Step 0.13: Verify main playbook (already done - roles added) ## Deployment and Testing - [x] Zot registry deployed and accessible at https://registry.tail8d86e.ts.net - [x] Podman machine running on indri - [x] Minikube cluster running on indri - [x] kubectl access from gilbert working with 1Password credentials - [ ] indri-services-check passes all checks 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/26	2026-01-18 12:06:28 -08:00
Erich Blume	adf6f4fbe9	Add PostgreSQL and Miniflux services to tailnet (#16 ) ## Summary - Add PostgreSQL 18 as a new service at `pg.tail8d86e.ts.net:5432` - Add Miniflux RSS/Atom feed reader at `feed.tail8d86e.ts.net` - Both services managed via homebrew/brew services - Pulumi ACL tags added (tag:pg, tag:feed) - Alloy log collection configured for both services - Zettelkasten documentation updated ## Manual Setup Required Before running ansible, the following steps are needed on indri: ### 1. Apply Pulumi tags ```bash mise run tailnet-up ``` Then apply tags to indri in Tailscale admin console. ### 2. Create 1Password entries - miniflux PostgreSQL user password - miniflux admin password (for first run) ### 3. Set PostgreSQL user password (after ansible installs postgres) ```bash ssh indri '/opt/homebrew/opt/postgresql@18/bin/psql -c "ALTER USER miniflux PASSWORD '\''your-password'\'';"' ``` ### 4. Create password files on indri ```bash ssh indri 'echo "your-db-password" > ~/.miniflux-db-password && chmod 600 ~/.miniflux-db-password' ssh indri 'echo "your-admin-password" > ~/.miniflux-admin-password && chmod 600 ~/.miniflux-admin-password' ``` ### 5. Create ~/.pgpass for borgmatic ```bash ssh indri 'echo "localhost:5432:miniflux:miniflux:YOUR_PASSWORD" > ~/.pgpass && chmod 600 ~/.pgpass' ``` ### 6. Run ansible with first-run admin creation ```bash mise run provision-indri -- -e miniflux_create_admin=1 ``` ### 7. Update borgmatic config Add to `~/.config/borgmatic/config.yaml` on indri: ```yaml postgresql_databases: - name: miniflux hostname: localhost port: 5432 username: miniflux ``` ### 8. Cleanup after first run ```bash ssh indri 'rm ~/.miniflux-admin-password' ``` ## Test plan - [ ] Run `mise run tailnet-up` and verify Pulumi changes - [ ] Apply tags to indri in Tailscale admin - [ ] Run `mise run provision-indri -- --check --diff` for dry run - [ ] Run `mise run provision-indri -- -e miniflux_create_admin=1` - [ ] Approve services in Tailscale admin - [ ] Verify PostgreSQL: `ssh indri '/opt/homebrew/opt/postgresql@18/bin/pg_isready'` - [ ] Verify Miniflux: `curl https://feed.tail8d86e.ts.net/healthcheck` - [ ] Run `mise run indri-services-check` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/16	2026-01-16 12:30:20 -08:00
Erich Blume	ae1513e7e9	Add Plex Media Server observability (#13 ) ## Summary - Add `plex_metrics` ansible role with textfile collector for Prometheus metrics - Add Plex log collection to Alloy (forwards to Loki) - Add Grafana dashboard for Plex monitoring (status, library counts, sessions, transcoding, logs) ## Metrics Collected - `plex_up` - server health - `plex_version_info` - server version - `plex_sessions_total/playing/paused` - active sessions - `plex_transcode_sessions_total/video/audio` - transcoding status - `plex_library_items{library,type}` - library item counts ## Prerequisites Plex token must be stored at `~/.plex-token` on indri (already done). ## Test plan - [x] Dry-run passed (`mise run provision-indri -- --check --diff`) - [ ] Apply changes (`mise run provision-indri`) - [ ] Verify metrics: `ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/plex.prom'` - [ ] Verify logs in Grafana Explore: `{service="plex"}` - [ ] Check Plex dashboard in Grafana 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/13	2026-01-15 15:27:59 -08:00
Erich Blume	242c1880de	Add Grafana Alloy and Loki for unified observability (#11 ) ## Summary - Add Grafana Alloy to replace node_exporter for metrics collection - Add Loki for log aggregation and storage - Configure Alloy to collect logs from all services (grafana, forgejo, prometheus, tailscale, transmission, devpi, kiwix, borgmatic) - Update Prometheus to accept metrics via remote_write - Add Loki datasource to Grafana ## Test plan - [ ] Run `mise run provision-indri -- --check --diff` to verify changes - [ ] Apply with `mise run provision-indri` - [ ] Verify services: `mise run indri-services-check` - [ ] Check Grafana Explore with Loki datasource - [ ] Query logs: `{service="grafana"}` - [ ] Verify metrics still flowing to Prometheus dashboards 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/11	2026-01-15 12:24:13 -08:00

15 commits