blumeops

Author	SHA1	Message	Date
Erich Blume	85e36cd807	Operations and observability for sifaka NAS (#135 ) ## Summary - Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring - Formalize existing `node_exporter` container under Ansible management - Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus - Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime) - Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS - Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles ## Prerequisites before deploy - [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP) - [ ] Verify `ssh eblume@sifaka 'docker ps'` works - [ ] Run `mise run provision-sifaka` to deploy containers - [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes - [ ] `argocd app sync prometheus` + `argocd app sync grafana-config` ## Test plan - [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics` - [ ] Verify Prometheus targets page shows both sifaka jobs as UP - [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135	2026-02-09 17:44:05 -08:00
Erich Blume	3415cad38c	Log real client IPs via Fly-Client-IP header (#130 ) ## Summary - Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header - Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki - Keeps `remote_addr` (the internal proxy IP) for debugging Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs. ## Deployment and Testing - [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory - [ ] Verify in Grafana that new log lines include `client_ip` with real IPs - [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130	2026-02-09 11:02:06 -08:00
Erich Blume	cc54b4f565	Add Fly.io proxy observability via embedded Alloy (#123 ) ## Summary - Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus) - Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting) - Create two Grafana dashboards: Docs APM (per-service view filtered by `host="docs.eblu.me"`) and Fly.io Proxy Health (aggregate proxy health across all upstream services) ## Changed Files \| File \| Change \| \|------\|--------\| \| `fly/nginx.conf` \| Add JSON `log_format` + `access_log`, add `stub_status` endpoint \| \| `fly/Dockerfile` \| COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config \| \| `fly/alloy.river` \| New — Alloy config: log tailing, metric extraction, remote_write \| \| `fly/start.sh` \| Start Alloy after Tailscale, before nginx \| \| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` \| New — Docs APM dashboard \| \| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` \| New — Fly.io Proxy Health dashboard \| \| `argocd/manifests/grafana-config/kustomization.yaml` \| Register new dashboard configmaps \| \| `docs/reference/services/flyio-proxy.md` \| Document observability setup \| ## Deployment and Testing - [ ] `mise run fly-deploy` — rebuild container with Alloy - [ ] `curl https://docs.eblu.me/` — generate traffic - [ ] `fly logs -a blumeops-proxy` — verify Alloy startup - [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}` - [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}` - [ ] `argocd app sync grafana-config` — deploy dashboards - [ ] Verify dashboards show data in Grafana - [ ] `mise run services-check` — no regressions Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123	2026-02-08 10:05:38 -08:00
Erich Blume	737371ab59	Add pod state observability to minikube dashboard (#83 ) ## Summary - Add "Unhealthy Pods" stat panel showing count of pods in error states (ImagePullBackOff, CrashLoopBackOff, etc.) with red background when > 0 - Add "Pods by Waiting Reason" time series chart showing container waiting states over time - Provides visibility into stuck pods that ArgoCD doesn't track (since it manages CronJobs, not the Jobs/Pods they spawn) ## Context This addresses the issue where a `zim-watcher` cronjob pod was stuck in `ImagePullBackOff` for 11 days without any alerting. ArgoCD showed the CronJob as "Synced, Healthy" because it only manages the CronJob resource, not its spawned Jobs/Pods. ## Deployment and Testing - [ ] Sync grafana-config app to test branch - [ ] Verify dashboard renders correctly - [ ] Confirm "Unhealthy Pods" shows 0 (green) when no issues - [ ] Reset to main after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/83	2026-02-03 07:20:05 -08:00
Erich Blume	b8b33b76c8	Remove Plex media server (#78 ) ## Summary - Remove plex_metrics ansible role - Remove Plex Grafana dashboard - Remove Plex log collection from Alloy config - Update indri-services-check to check Jellyfin instead of Plex ## Deployment and Testing - [x] Unloaded plex-metrics LaunchAgent on indri - [x] Deleted plex-metrics plist and script - [x] Deleted plex.prom textfile - [ ] Deploy Alloy config update - [ ] Sync grafana-config to remove dashboard 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/78	2026-01-30 17:06:00 -08:00
Erich Blume	bcc8685316	Add Jellyfin media server deployment (#77 ) ## Summary - Add Jellyfin ansible role for native macOS deployment via Homebrew cask - Add jellyfin_metrics role for Prometheus textfile metrics collection - Add Caddy routing for jellyfin.ops.eblu.me - Add Alloy log collection for Jellyfin stdout/stderr - Add Grafana dashboard for Jellyfin monitoring ## Architecture Jellyfin runs natively on indri (not in k8s) for full VideoToolbox hardware transcoding support. The M1 Mac Mini can handle ~3 concurrent 4K HDR→SDR transcoding streams. ## Deployment and Testing - [ ] Deploy Jellyfin: `mise run provision-indri -- --tags jellyfin,jellyfin_metrics,caddy,alloy` - [ ] Sync Grafana dashboard: `argocd app sync grafana-config` - [ ] Complete Jellyfin setup wizard at https://jellyfin.ops.eblu.me - [ ] Generate API key and save to `~/.jellyfin-api-key` - [ ] Add media libraries (/Volumes/allisonflix/Movies, /Volumes/allisonflix/TV) - [ ] Enable VideoToolbox hardware transcoding - [ ] Verify metrics in Grafana dashboard - [ ] Verify logs in Loki: `{service="jellyfin"}` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/77	2026-01-30 16:57:26 -08:00
Erich Blume	0d8eb651d4	Fix XID Age graph to show threshold context (#69 ) ## Summary - Add fixed Y-axis (0-220M) so the 200M autovacuum threshold is always visible - Add dashed threshold lines at 150M (yellow warning) and 200M (red danger) - Update title to clarify the threshold ## Context The raw XID age naturally trends upward between vacuum freezes, which looked alarming without context. Current values (~143K-216K) are at 0.1% of the threshold - completely healthy. ## Deployment and Testing - [ ] Sync grafana-config app to feature branch - [ ] Verify threshold lines appear on PostgreSQL dashboard 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/69	2026-01-29 07:08:21 -08:00
Erich Blume	0604877db2	Add 'Tesla' prefix to all TeslaMate dashboard titles (#68 ) ## Summary - Renamed all 18 TeslaMate Grafana dashboards to include "Tesla" prefix - Improves organization and discoverability in the dashboard list ## Deployment and Testing - [ ] Sync grafana-config app: `argocd app set grafana-config --revision feature/rename-tesla-dashboards && argocd app sync grafana-config` - [ ] Verify dashboards display with "Tesla" prefix in Grafana 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/68	2026-01-29 06:55:44 -08:00
Erich Blume	272ddb213b	Add TeslaMate deployment for Tesla Model Y data logging (#47 ) ## Summary - Add TeslaMate k8s deployment with Tailscale ingress at tesla.tail8d86e.ts.net - Add teslamate user to CloudNativePG blumeops-pg cluster - Add TeslaMate PostgreSQL datasource to Grafana - Import 18 TeslaMate Grafana dashboards for charging, drives, efficiency, etc. - Add teslamate database to borgmatic backup configuration ## Deployment and Testing - [ ] Create 1Password items: "TeslaMate DB Password" and "TeslaMate Encryption Key" - [ ] Apply database user secret: `op inject -i argocd/manifests/databases/secret-teslamate.yaml.tpl \| kubectl apply -f -` - [ ] Sync blumeops-pg: `argocd app sync blumeops-pg` - [ ] Create teslamate database - [ ] Apply teslamate secrets (encryption key, db connection) - [ ] Apply Grafana datasource secret: `op inject -i argocd/manifests/grafana-config/secret-teslamate-datasource.yaml.tpl \| kubectl apply -f -` - [ ] Sync apps and teslamate: `argocd app sync apps teslamate grafana grafana-config` - [ ] Complete Tesla API OAuth flow at https://tesla.tail8d86e.ts.net - [ ] Verify data collection starts - [ ] Verify Grafana dashboards show data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/47	2026-01-22 21:25:44 -08:00
Erich Blume	57bf8512dc	Log filtering cleanup and observability improvements (#45 ) ## Summary - Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue) - Disable thermal collector on indri Alloy (not supported on macOS M1) - Add macOS power/thermal metrics collection via powermetrics LaunchDaemon - Add Power & Thermal section to macOS Grafana dashboard - Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.) - Extract more fields from JSON logs (zot compatibility - uses "message" not "msg") - Silence logfmt parse errors for non-logfmt logs - Fix JSON escaping in devpi dashboard ## Deployment and Testing - [x] Deployed Alloy config changes to indri via ansible - [x] Synced alloy-k8s and grafana-config via ArgoCD - [x] Verified power metrics appearing in Prometheus - [x] Verified thermal collector errors stopped - [x] Verified logfmt parse errors silenced - [x] Verified devpi dashboard loads correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45	2026-01-22 17:30:08 -08:00
Erich Blume	e4a8405de7	Observability cleanup and k8s service monitoring (#43 ) (#43 ) ## Summary - Remove stale `/opt/homebrew/var/loki` from borgmatic backup (Loki migrated to k8s) - Add Alloy k8s DaemonSet for automatic pod log collection with auto-discovery - Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd - Add transmission-exporter sidecar for full metrics (speed, torrent counts, ratios) - Replace stale devpi dashboard with probe-based metrics (status, response time, uptime) - Add unified "K8s Services Health" dashboard for service uptime/response monitoring ## Manual cleanup already performed - Deleted stale textfile metrics on indri: `devpi.prom`, `transmission.prom` - Deleted stale data directories on indri: `/opt/homebrew/var/loki/`, `/opt/homebrew/var/prometheus/` ## Deployment and Testing - [x] Sync `apps` application to pick up new alloy-k8s app - [x] Deploy alloy-k8s on feature branch: `argocd app set alloy-k8s --revision feature/observability-cleanup && argocd app sync alloy-k8s` - [x] Deploy torrent on feature branch (for transmission exporter): `argocd app set torrent --revision feature/observability-cleanup && argocd app sync torrent` - [x] Deploy prometheus on feature branch (for new scrape config): `argocd app set prometheus --revision feature/observability-cleanup && argocd app sync prometheus` - [x] Deploy grafana-config on feature branch (for dashboards): `argocd app set grafana-config --revision feature/observability-cleanup && argocd app sync grafana-config` - [x] Verify pod logs appear in Loki/Grafana - [x] Verify transmission metrics appear in Prometheus - [x] Verify service probe metrics appear in Prometheus - [x] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic config - [ ] After merge, reset apps to main and resync 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/43	2026-01-22 13:51:01 -08:00
Erich Blume	17023085cb	Migrate observability stack to Kubernetes (#42 ) Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack. Summary - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal) - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net) - Add ACL rule for port 9187 (CNPG metrics) - Delete obsolete ansible roles for prometheus and loki Changes - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints - pulumi/policy.hujson - ACL for port 9187 - Deleted ansible/roles/prometheus/ and ansible/roles/loki/ Deployment and Testing - Stop prometheus and loki on indri - Sync ArgoCD apps (apps, prometheus, loki, grafana) - Run mise run provision-indri -- --tags alloy - Verify Grafana dashboards show data 🤖 Generated with https://claude.ai/claude-code Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42	2026-01-22 12:06:02 -08:00
Erich Blume	7e6742ad24	K8s Migration Phase 2: Grafana to Kubernetes (#30 ) ## Summary - Migrate Grafana from Homebrew/Ansible to Kubernetes deployment - Switch CloudNativePG to use forge-mirrored Helm chart (HTTPS, no auth needed) - Add Grafana Helm chart deployment via ArgoCD with multi-source pattern - Add Grafana config (Tailscale Ingress, 9 dashboard ConfigMaps) - Update Loki to bind 0.0.0.0 for k8s pod access via `host.containers.internal` ## Key Changes - `argocd/apps/grafana.yaml` - Grafana Helm chart Application - `argocd/apps/grafana-config.yaml` - Ingress + dashboard ConfigMaps - `argocd/apps/cloudnative-pg.yaml` - Now uses forge mirror instead of external Helm repo - `ansible/roles/loki/templates/loki-config.yaml.j2` - Bind 0.0.0.0 ## Deployment and Testing - [x] Deploy Loki config change: `mise run provision-indri -- --tags loki` - [x] Create namespace: `ki create namespace monitoring` - [x] Create secret: `op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl \| ki apply -f -` - [x] Sync ArgoCD apps (grafana, grafana-config) - [x] Verify Grafana works at https://grafana.tail8d86e.ts.net - [x] Remove svc:grafana from ansible tailscale_serve - [x] Stop brew grafana: `ssh indri 'brew services stop grafana'` - [x] Delete ansible grafana role 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/30	2026-01-19 14:40:25 -08:00

13 commits