blumeops

Author	SHA1	Message	Date
Erich Blume	a0b076172f	Fix Immich/Homepage Ingress host matching, add missing service checks (#127) ## Summary - Fix Immich Ingress `host: photos` causing 404 with ProxyGroup (same FQDN mismatch as Prometheus/Loki) - Migrate Homepage from old per-service Tailscale proxy to shared ProxyGroup (was the last holdout) - Add Immich and Navidrome to `services-check` HTTP endpoints ## Deployment Notes - Already tested on branch: Immich and Homepage both return 200 via Caddy - Homepage's old Helm-managed Ingress was deleted manually; ArgoCD may recreate it on sync — prune with `argocd app sync homepage --prune` after merge - Old per-service `ts-homepage-*` pod in tailscale namespace can be cleaned up after confirming ProxyGroup works ## Test Plan - [x] `curl https://photos.ops.eblu.me/` returns 200 - [x] `curl https://go.ops.eblu.me/` returns 200 - [ ] `mise run services-check` fully passes after merge Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/127	2026-02-08 22:12:50 -08:00
Erich Blume	e6cf7e47e0	Restrict flyio-proxy ACLs to dedicated tag:flyio-target endpoints (#126) ## Summary - Introduce `tag:flyio-target` so services must explicitly opt in to be reachable by the fly.io proxy - Replace broad `tag:k8s` and `tag:homelab` grants with the new tag in the ACL rule and test - Add `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to docs, loki, and prometheus Ingresses - Switch Alloy push endpoints from `.ops.eblu.me` (Caddy) to `.tail8d86e.ts.net` (Tailscale Ingress) - Update docs: flyio-proxy, caddy, tailscale, forgejo (future public access + security checklist), expose-service-publicly ## Manual step (not in PR) Update the k8s operator OAuth client in the Tailscale admin console to include `tag:flyio-target` in its scope. Without this, the operator cannot assign the new tag to Ingress proxy nodes. ## Deployment order 1. Pulumi ACLs — `mise run tailnet-preview && mise run tailnet-up` 2. OAuth client — Manual update in Tailscale admin console 3. K8s Ingresses — `argocd app sync apps && argocd app sync docs loki prometheus` 4. Fly.io proxy — `mise run fly-deploy` 5. Verify — `mise run services-check`, check Grafana dashboards ## Test plan - [ ] `mise run tailnet-preview` shows clean diff - [ ] `argocd app diff docs`, `argocd app diff loki`, `argocd app diff prometheus` show only annotation additions - [ ] After deploy: Grafana dashboards show continued log/metric flow - [ ] `curl -sf https://docs.eblu.me` returns 200 - [ ] `mise run services-check` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/126	2026-02-08 21:54:18 -08:00
Forgejo Actions	c8d0af6644	Update docs release to v1.5.1 - Built changelog from towncrier fragments [skip ci]	2026-02-08 18:06:46 +00:00
Erich Blume	cc54b4f565	Add Fly.io proxy observability via embedded Alloy (#123) ## Summary - Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus) - Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting) - Create two Grafana dashboards: Docs APM (per-service view filtered by `host="docs.eblu.me"`) and Fly.io Proxy Health (aggregate proxy health across all upstream services) ## Changed Files - `fly/nginx.conf`: add JSON `log_format` + `access_log`, add `stub_status` endpoint - `fly/Dockerfile`: COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config - `fly/alloy.river`: new; Alloy config for log tailing, metric extraction, remote_write - `fly/start.sh`: start Alloy after Tailscale, before nginx - `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml`: new; Docs APM dashboard - `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml`: new; Fly.io Proxy Health dashboard - `argocd/manifests/grafana-config/kustomization.yaml`: register new dashboard configmaps - `docs/reference/services/flyio-proxy.md`: document observability setup ## Deployment and Testing - [ ] `mise run fly-deploy` — rebuild container with Alloy - [ ] `curl https://docs.eblu.me/` — generate traffic - [ ] `fly logs -a blumeops-proxy` — verify Alloy startup - [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}` - [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}` - [ ] `argocd app sync grafana-config` — deploy dashboards - [ ] Verify dashboards show data in Grafana - [ ] `mise run services-check` — no regressions Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123	2026-02-08 10:05:38 -08:00
Forgejo Actions	c46d55060d	Update docs release to v1.5.0 - Built changelog from towncrier fragments [skip ci]	2026-02-08 10:37:30 +00:00
Erich Blume	64a78422b1	Add Fly.io public reverse proxy for docs.eblu.me (#120) ## Summary - Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale - First service exposed: `docs.eblu.me` — the Quartz static docs site - Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME - Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow ## Key details - Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed - Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts - nginx caches aggressively for the static site; health check is on the default_server block - ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only - DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev` ## Test plan - [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok` - [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status` - [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert - [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected) - [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120	2026-02-08 02:36:19 -08:00
Forgejo Actions	11c76d4768	Update docs release to v1.4.2 - Built changelog from towncrier fragments [skip ci]	2026-02-08 05:45:40 +00:00
Forgejo Actions	ab7efd8c1c	Update docs release to v1.4.1 - Built changelog from towncrier fragments [skip ci]	2026-02-08 05:27:23 +00:00
Forgejo Actions	3f5017f732	Update docs release to v1.4.0 - Built changelog from towncrier fragments [skip ci]	2026-02-08 05:03:34 +00:00
Erich Blume	3b4ff91469	Fix homepage Admin bookmark icons (#110) ## Summary - Fix broken Pulumi icon: changed `pulumi` to `si-pulumi` (Simple Icons prefix required) - Fix broken ArgoCD icon: changed `argocd` to `argo-cd` (Dashboard Icons uses hyphenated name) ## Deployment and Testing - [ ] Sync homepage app in ArgoCD - [ ] Verify icons appear on go.ops.eblu.me Admin bookmarks section Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/110	2026-02-05 06:29:39 -08:00
Forgejo Actions	808bc507d8	Update docs release to v1.3.4 - Built changelog from towncrier fragments [skip ci]	2026-02-05 01:22:10 +00:00
Forgejo Actions	a03a9faaad	Update docs release to v1.3.3 - Built changelog from towncrier fragments [skip ci]	2026-02-04 22:40:18 +00:00
Forgejo Actions	e15caec898	Update docs release to v1.3.2 - Built changelog from towncrier fragments [skip ci]	2026-02-04 16:47:27 +00:00
Forgejo Actions	4aeade1543	Update docs release to v1.3.1 - Built changelog from towncrier fragments [skip ci]	2026-02-04 16:26:24 +00:00
Forgejo Actions	1835e3e80e	Update docs release to v1.3.0 - Built changelog from towncrier fragments [skip ci]	2026-02-04 16:14:08 +00:00
Erich Blume	1e13d4b83d	Fix Navidrome automatic library scan schedule (#101) ## Summary - Fix env var name from `ND_SCANSCHEDULE` to `ND_SCANNER_SCHEDULE` (Navidrome uses viper config where dots become underscores) - Use explicit `@every 1h` format for clarity - Reorder CLAUDE.md rules to emphasize running zk-docs first ## Root Cause Navidrome logs showed "Periodic scan is DISABLED" at startup despite the env var being set. The config key is `scanner.schedule`, which translates to `ND_SCANNER_SCHEDULE` (not `ND_SCANSCHEDULE`). ## Deployment and Testing - [ ] Sync navidrome app: `argocd app sync navidrome` - [ ] Verify pod restarts with new env var - [ ] Check logs for "Scheduling scanner" message instead of "Periodic scan is DISABLED" - [ ] Wait ~1 hour and confirm scan runs automatically 🤖 Generated with [Claude Code](https://claude.ai/code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/101	2026-02-04 07:23:12 -08:00
Forgejo Actions	e405a48881	Update docs release to v1.2.1 - Built changelog from towncrier fragments [skip ci]	2026-02-04 05:18:37 +00:00
Forgejo Actions	f88da51e23	Update docs release to v1.2.0 - Built changelog from towncrier fragments [skip ci]	2026-02-04 04:53:30 +00:00
Forgejo Actions	16cdffaebf	Update docs release to v1.1.5 - Built changelog from towncrier fragments [skip ci]	2026-02-04 04:34:31 +00:00
Forgejo Actions	e426473c59	Update docs release to v1.1.4 - Built changelog from towncrier fragments [skip ci]	2026-02-04 04:18:04 +00:00
Forgejo Actions	672dbda9d7	Update docs release to v1.1.3 - Built changelog from towncrier fragments [skip ci]	2026-02-04 03:07:15 +00:00
Forgejo Actions	f279891575	Update docs release to v1.1.2 - Built changelog from towncrier fragments [skip ci]	2026-02-04 03:02:13 +00:00
Forgejo Actions	81d99b689d	Update docs release to v1.1.1 - Built changelog from towncrier fragments [skip ci]	2026-02-04 02:53:17 +00:00
Forgejo Actions	bf03d71780	Update docs release to v1.1.0 - Built changelog from towncrier fragments [skip ci]	2026-02-04 01:27:09 +00:00
Erich Blume	82bcd935cd	Move DOCS_RELEASE_URL from ConfigMap to Deployment This ensures ArgoCD sync triggers a pod rollout when the URL changes, since ConfigMap data changes don't restart pods automatically. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:23:52 -08:00
Forgejo Actions	103cc0deab	Update docs release to v1.0.14 - Updated configmap with new DOCS_RELEASE_URL - Built changelog from towncrier fragments [skip ci]	2026-02-04 01:18:33 +00:00
Erich Blume	aaf5090509	Remove ARGOCD_AUTH_TOKEN from external secret Workflow secrets come from Forgejo's secret store, not runner env. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:17:53 -08:00
Forgejo Actions	492aa9a104	Update docs release to v1.0.13 - Updated configmap with new DOCS_RELEASE_URL - Built changelog from towncrier fragments [skip ci]	2026-02-04 01:15:22 +00:00
Erich Blume	3a26d7e49a	Update forgejo-runner image to v2.5.0 Fixes argocd CLI download. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:13:37 -08:00
Forgejo Actions	4d3222d91b	Update docs release to v1.0.12 - Updated configmap with new DOCS_RELEASE_URL - Built changelog from towncrier fragments [skip ci]	2026-02-04 01:07:05 +00:00
Erich Blume	f08595a3c0	Update forgejo-runner image to v2.4.0 Includes uv and argocd CLI for auto-deploy workflow. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:05:09 -08:00
Erich Blume	1f73eb675d	Auto-deploy docs from build workflow (#93) ## Summary - Add `uv` and `argocd` CLI to forgejo-runner container image - Add `workflow-bot` ArgoCD account with sync permissions (declarative via kustomize patches) - Add `ARGOCD_AUTH_TOKEN` to forgejo-runner external secret for workflow auth - Update build workflow to auto-deploy docs after release: - Update configmap with new release URL - Commit changelog and configmap changes - Sync docs app via ArgoCD ## Deployment and Testing Manual steps required before this can work: 1. [ ] Build and push new forgejo-runner image (v2.4.0) 2. [ ] Sync argocd app to create workflow-bot account 3. [ ] Generate token: `argocd account generate-token --account workflow-bot` 4. [ ] Store token in 1Password under "Forgejo Secrets" with field `argocd_token` 5. [ ] Sync forgejo-runner app to pick up new external secret 6. [ ] Update forgejo-runner deployment to use new image version 7. [ ] Test by running workflow manually 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/93	2026-02-03 16:58:03 -08:00
Erich Blume	7d5e6b032b	Update docs release to v1.0.11 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 16:40:06 -08:00
Erich Blume	31564d1d9a	Update docs release to v1.0.10 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 16:32:17 -08:00
Erich Blume	d359583d0a	Update docs release to v1.0.9 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 16:23:02 -08:00
Erich Blume	46a5c3a20f	Update docs release to v1.0.8 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 16:07:54 -08:00
Erich Blume	8d7863e61d	Update docs release to v1.0.7 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 15:58:01 -08:00
Erich Blume	6162179ac9	Update docs release to v1.0.6 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 15:35:14 -08:00
Erich Blume	8f427beeab	Update docs release to v1.0.5 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 15:07:07 -08:00
Erich Blume	ae64021224	Update docs release to v1.0.4 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 14:43:37 -08:00
Erich Blume	9904429562	Update docs release to v1.0.3 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 14:37:18 -08:00
Erich Blume	1c86134a62	Phase 1b: Deploy docs hosting with Quartz (#85) ## Summary - Add ArgoCD Application and manifests for `quartz` service - Add `docs.ops.eblu.me` to Caddy reverse proxy configuration - ConfigMap points to blumeops v1.0.0 release tarball - Tailscale ingress with homepage annotations for auto-discovery ## Deployment and Testing Pre-deployment (container build): - [ ] Build and tag quartz container: `mise run container-tag-and-release quartz v1.0.0` K8s deployment: - [ ] Sync apps: `argocd app sync apps` - [ ] Point quartz at feature branch: `argocd app set quartz --revision feature/docs-phase-1b-hosting` - [ ] Sync quartz: `argocd app sync quartz` - [ ] Verify pod is running: `kubectl --context=minikube-indri get pods -n quartz` - [ ] Verify Tailscale ingress: `kubectl --context=minikube-indri get ingress -n quartz` Caddy deployment: - [ ] Dry run: `mise run provision-indri -- --tags caddy --check --diff` - [ ] Apply: `mise run provision-indri -- --tags caddy` Verification: - [ ] Test https://docs.tail8d86e.ts.net - [ ] Test https://docs.ops.eblu.me - [ ] Verify homepage dashboard shows docs link Post-merge: - [ ] Reset to main: `argocd app set quartz --revision main && argocd app sync quartz` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/85	2026-02-03 10:52:20 -08:00
Erich Blume	9719fc05f7	Update forgejo-runner to v2.3.0 (Node.js 24) Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 09:29:50 -08:00
Erich Blume	737371ab59	Add pod state observability to minikube dashboard (#83) ## Summary - Add "Unhealthy Pods" stat panel showing count of pods in error states (ImagePullBackOff, CrashLoopBackOff, etc.) with red background when > 0 - Add "Pods by Waiting Reason" time series chart showing container waiting states over time - Provides visibility into stuck pods that ArgoCD doesn't track (since it manages CronJobs, not the Jobs/Pods they spawn) ## Context This addresses the issue where a `zim-watcher` cronjob pod was stuck in `ImagePullBackOff` for 11 days without any alerting. ArgoCD showed the CronJob as "Synced, Healthy" because it only manages the CronJob resource, not its spawned Jobs/Pods. ## Deployment and Testing - [ ] Sync grafana-config app to test branch - [ ] Verify dashboard renders correctly - [ ] Confirm "Unhealthy Pods" shows 0 (green) when no issues - [ ] Reset to main after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/83	2026-02-03 07:20:05 -08:00
Erich Blume	4d97ac4c26	Expand homepage widgets and info panels (#81) ## Summary - Add greeting and datetime info widgets to homepage header - Add Miniflux widget showing unread/read counts (via existing API key in 1Password) - Add Grafana widget showing dashboards/datasources/alerts (via existing credentials in 1Password) - Add ArgoCD to bookmarks section - Add TODO comments for widgets needing additional setup (Forgejo, Caddy, UniFi, Glances, Navidrome, Transmission, Immich) ## Deployment and Testing - [ ] Sync homepage app to deploy new ExternalSecrets - [ ] Verify greeting and datetime appear in header - [ ] Verify Miniflux widget shows unread/read counts - [ ] Verify Grafana widget shows dashboard stats - [ ] Check that services without credentials still display (just without widgets) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/81	2026-02-02 16:11:20 -08:00
Erich Blume	9db4c9d9ae	Replace homepage search widget with Quick Launch Use Quick Launch settings for Kagi search with suggestions instead of the search widget, which is the proper way to configure keyboard-driven search in homepage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-02 15:08:35 -08:00
Erich Blume	fc0ce955e4	Move DJ to Apps group on Homepage (#80) ## Summary - Change navidrome homepage group from Media to Apps 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/80	2026-01-31 20:29:43 -08:00
Erich Blume	ade21cc49e	Add Navidrome music streaming server (#79) ## Summary - Deploy Navidrome music streaming server to k8s - NFS mount for music library from sifaka:/volume1/music (read-only) - Local PVC for SQLite database and config (10Gi) - Tailscale ingress for dj.tail8d86e.ts.net - Caddy reverse proxy for dj.ops.eblu.me - Homepage annotations for dashboard discovery in Media group ## Deployment and Testing - [ ] Sync `apps` application to pick up new Application definition - [ ] Set navidrome app to feature branch and sync - [ ] Verify NFS mount with `kubectl exec` - [ ] Provision Caddy for dj.ops.eblu.me - [ ] Access https://dj.ops.eblu.me and create initial admin user - [ ] Verify Homepage shows DJ in Media group - [ ] Reset to main and resync after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/79	2026-01-31 20:19:31 -08:00
Erich Blume	b8b33b76c8	Remove Plex media server (#78) ## Summary - Remove plex_metrics ansible role - Remove Plex Grafana dashboard - Remove Plex log collection from Alloy config - Update indri-services-check to check Jellyfin instead of Plex ## Deployment and Testing - [x] Unloaded plex-metrics LaunchAgent on indri - [x] Deleted plex-metrics plist and script - [x] Deleted plex.prom textfile - [ ] Deploy Alloy config update - [ ] Sync grafana-config to remove dashboard 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/78	2026-01-30 17:06:00 -08:00
Erich Blume	bcc8685316	Add Jellyfin media server deployment (#77) ## Summary - Add Jellyfin ansible role for native macOS deployment via Homebrew cask - Add jellyfin_metrics role for Prometheus textfile metrics collection - Add Caddy routing for jellyfin.ops.eblu.me - Add Alloy log collection for Jellyfin stdout/stderr - Add Grafana dashboard for Jellyfin monitoring ## Architecture Jellyfin runs natively on indri (not in k8s) for full VideoToolbox hardware transcoding support. The M1 Mac Mini can handle ~3 concurrent 4K HDR→SDR transcoding streams. ## Deployment and Testing - [ ] Deploy Jellyfin: `mise run provision-indri -- --tags jellyfin,jellyfin_metrics,caddy,alloy` - [ ] Sync Grafana dashboard: `argocd app sync grafana-config` - [ ] Complete Jellyfin setup wizard at https://jellyfin.ops.eblu.me - [ ] Generate API key and save to `~/.jellyfin-api-key` - [ ] Add media libraries (/Volumes/allisonflix/Movies, /Volumes/allisonflix/TV) - [ ] Enable VideoToolbox hardware transcoding - [ ] Verify metrics in Grafana dashboard - [ ] Verify logs in Loki: `{service="jellyfin"}` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/77	2026-01-30 16:57:26 -08:00
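
Commit 1e13d4b83d (#101) traced the disabled Navidrome scan to an environment-variable naming rule: the dotted config key `scanner.schedule` becomes `ND_SCANNER_SCHEDULE`, not `ND_SCANSCHEDULE`. A minimal Python sketch of that mapping follows, assuming the `ND_` prefix, dot-to-underscore substitution, and upper-casing exactly as the commit describes them. `navidrome_env_var` is a hypothetical helper for illustration, not a Navidrome or blumeops function.

```python
# Hypothetical helper, not part of blumeops or Navidrome. It only restates the
# naming rule described in #101: replace dots with underscores, upper-case the
# result, and prepend the ND_ prefix.
def navidrome_env_var(config_key: str) -> str:
    """Return the environment variable name for a dotted Navidrome config key."""
    return "ND_" + config_key.replace(".", "_").upper()


if __name__ == "__main__":
    # The config key named in the root-cause note of #101.
    print(navidrome_env_var("scanner.schedule"))  # ND_SCANNER_SCHEDULE
```

Under this rule, `ND_SCANSCHEDULE` corresponds to an undotted key, which is why setting it left `scanner.schedule` unset.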

99 commits