blumeops

Author	SHA1	Message	Date
Erich Blume	07e9c810ca	Add RuntimeDefault seccomp profiles to all managed workloads Addresses 32 CIS Kubernetes Benchmark failures from Prowler scan (core_seccomp_profile_docker_default). Applied pod-level seccomp RuntimeDefault to 18 deployments/statefulsets and 2 cronjobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:19:40 -07:00
Erich Blume	6d65e6928c	C2: Deploy infrastructure alerting pipeline (#303 ) ## Summary Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications. Design: - Grafana Unified Alerting evaluates rules against Prometheus/Loki - ntfy webhook contact point delivers iOS notifications - Anti-noise policy: page once per 24h per alert group - Every alert links to a runbook in `docs/how-to/alerts/` - services-check eventually queries the alerting API instead of doing its own probes Chain (bottom-up): 1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy 2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure 3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks 4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API 5. `deploy-infra-alerting` — goal card 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #303	2026-03-22 14:52:56 -07:00
Erich Blume	3d2a97aaf9	Update kustomization tags to OCI-labeled builds (`613f05d`) Point all services at the `613f05d` images which carry the new consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4 version, no matching build) and docs/quartz (no `613f05d` build). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 06:34:12 -07:00
Erich Blume	86220b7b88	Update Prometheus deployment to v3.10.0-0d27797 C0 fix-forward: update kustomization newTag and mark service reviewed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 08:46:07 -07:00
Erich Blume	21ddc74cdc	Revert PVC size changes, add hostpath comment StatefulSet volumeClaimTemplates are immutable and minikube's hostpath provisioner doesn't enforce PVC size limits anyway. Add comments noting the data grows freely on the 1.8TB backing disk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:46:17 -07:00
Erich Blume	ef199b70f0	Increase Prometheus and Loki data retention Prometheus: 15d → 10y (3650d), PVC 20Gi → 200Gi Loki: 31d (744h) → 365d (8760h), PVC 20Gi → 50Gi Indri has 1.6 TB free on the minikube backing disk — the previous 15-day Prometheus retention was losing valuable long-term metrics data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:44:00 -07:00
Erich Blume	4dc3e5cae2	Add UnPoller for UniFi network metrics (#298) ## Summary - Deploy UnPoller as a k8s service on indri to export UniFi controller metrics to Prometheus - Custom-built container from forge mirror (`containers/unpoller/Dockerfile`) - Credentials pulled from 1Password via external-secrets - Prometheus scrape job added, docs and service-versions updated ## Test plan - [ ] Build container: `mise run container-release unpoller v2.34.0` - [ ] Update kustomization tag with built image tag - [ ] Deploy from branch: `argocd app set unpoller --revision feature/unpoller && argocd app sync unpoller` - [ ] Verify pod connects to UX7 controller (check logs) - [ ] Confirm `unpoller` target appears in Prometheus - [ ] Query `unifi_` metrics in Grafana 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #298	2026-03-16 15:52:45 -07:00
Erich Blume	6e8d11c6bb	Add :kustomized sentinel tag to manifest images, review devpi Bare image references in manifests were ambiguous — unclear whether the tag was intentionally omitted or managed by kustomize. Add :kustomized sentinel to all 37 image refs overridden by kustomize images transformer. Add sync notes for tailscale-operator proxyclass (CRD fields not processed by kustomize). Mark devpi reviewed (6.19.1 is current). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 08:15:06 -08:00
Erich Blume	c281fb5403	Add OpenTelemetry distributed tracing (Tempo + Beyla eBPF) (#286 ) ## Summary Adds the third observability pillar — distributed tracing — alongside existing metrics (Prometheus) and logs (Loki). - Grafana Tempo 2.10.1 on minikube-indri for trace storage with 7d retention, OTLP receivers, and `metrics_generator` that remote-writes span-metrics (RED) to Prometheus - Beyla eBPF auto-instrumentation via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes - Grafana integration — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking - Prometheus scrapes Tempo operational metrics ### Architecture ``` ringtail (k3s) indri (minikube) ┌──────────────────────┐ ┌─────────────────────┐ │ Alloy+Beyla (eBPF) │──OTLP HTTP────────→ │ Tempo │ │ ↳ Frigate, ntfy, │ via tailnet │ ↳ trace storage │ │ Ollama, Immich │ │ ↳ RED → Prometheus │ └──────────────────────┘ │ │ │ Grafana │ │ ↳ Tempo datasource │ └─────────────────────┘ ``` ### New files (12) - `docs/reference/services/tempo.md` — reference doc - `docs/changelog.d/feature-otel-tracing.feature.md` - `argocd/apps/tempo.yaml` + `argocd/manifests/tempo/` (6 files) - `argocd/apps/alloy-tracing-ringtail.yaml` + `argocd/manifests/alloy-tracing-ringtail/` (4 files) ### Modified files (6) - `argocd/manifests/grafana/datasources.yaml` — Tempo datasource + Loki derivedFields - `argocd/manifests/prometheus/prometheus.yml` — Tempo scrape target - `service-versions.yaml` — tempo + alloy-tracing-ringtail entries - `docs/reference/services/grafana.md` — Tempo in datasources table - `docs/reference/reference.md` — Tempo in services index - `docs/reference/operations/observability.md` — Tempo in components list ## Deployment and Testing - [ ] Sync `apps` app to pick up new Application definitions - [ ] `argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo` - [ ] Verify Tempo pod: 
`kubectl --context=minikube-indri get pods -n monitoring -l app=tempo` - [ ] Verify Tempo ready: port-forward 3200 and `curl localhost:3200/ready` - [ ] Verify Tailscale ingresses: `kubectl --context=minikube-indri get ingress -n monitoring` - [ ] `argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail` - [ ] Check Beyla discovery in alloy-tracing logs on ringtail - [ ] Sync grafana-config for updated datasources - [ ] Sync prometheus for updated scrape config - [ ] Test Grafana Tempo datasource connection - [ ] Generate test traffic and search traces in Grafana Explore → Tempo - [ ] After merge: reset all ArgoCD app revisions back to main Reviewed-on: #286	2026-03-05 10:51:07 -08:00
Erich Blume	95c8424e62	Add Transmission metrics exporter and Grafana dashboard (#271 ) ## Summary - Add `metalmatze/transmission-exporter` as a sidecar container in the torrent deployment, exposing Prometheus metrics on port 19091 - Add metrics port to the torrent service for Prometheus scraping - Add Prometheus scrape job targeting the transmission exporter - Create Grafana dashboard with: - Overview stats (download/upload speed, active/total torrents) - Transfer speed timeseries (download + upload over time) - Transfer volume stats (total downloaded/uploaded in selected range) - Per-torrent download and upload rate timeseries - Per-torrent details table (ratio, uploaded, percent done) ## Deployment and Testing - [ ] Sync ArgoCD `torrent` app from branch — verify exporter sidecar starts - [ ] Verify exporter metrics: `kubectl exec` into pod, `curl localhost:19091/metrics` - [ ] Verify Prometheus scrapes it: check targets at prometheus.ops.eblu.me - [ ] Open Grafana, find "Transmission" dashboard, verify panels populate - [ ] Sync ArgoCD `prometheus` app from branch - [ ] Sync ArgoCD `grafana-config` app from branch Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/271	2026-02-25 22:23:33 -08:00
Erich Blume	03d71544ec	Add multi-cluster observability with ringtail metrics and dashboards (#270 ) ## Summary - Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs - Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests) - Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki - Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with: - Kubernetes Clusters dashboard — multi-cluster with `cluster` and `namespace` template variables - Ringtail (k3s) dashboard — dedicated ringtail view with GPU usage panels ## Deployment and Testing 1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`) 2. Sync `prometheus` → verify `cluster` label on scraped metrics 3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs 4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs 5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail 6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}` 7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values 8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods ## Notes - Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later - DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve - The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270	2026-02-25 22:01:00 -08:00
Erich Blume	4f8f2985c1	Update prometheus and teslamate image tags after mirror migration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 21:18:15 -08:00
Erich Blume	9b44a8ec51	Add kustomize images: and configMapGenerator: across services (#264 ) ## Summary - Move hardcoded image tags to kustomization.yaml `images:` transformer across 22 services — image names in manifests become version-agnostic templates, with tags centralized in one place per service - Replace hand-written ConfigMap manifests with `configMapGenerator:` in 12 services — config data extracted to standalone files, generated ConfigMaps include content hashes that trigger automatic pod rollouts on changes - Create new `kustomization.yaml` for forgejo-runner and nvidia-device-plugin (switches ArgoCD from directory mode to kustomize mode, rendered output identical) ### Services modified Images only (8): cv, devpi, docs, kube-state-metrics, miniflux, navidrome, teslamate, torrent Images + configMapGenerator (10): alloy-k8s, forgejo-runner, frigate, grafana, homepage, kiwix, loki, mosquitto, ntfy, prometheus Images only, no configMapGenerator (4): authentik (skip blueprints — special YAML tags), tailscale-operator-base (Deployment only, CRD image fields left as-is) Skipped entirely (6): argocd (remote upstream), databases (no image fields), external-secrets, grafana-config (cross-kustomization dashboards), immich (Helm-managed), 1password-connect/cloudnative-pg (no kustomization.yaml) ### What changes at deploy time - images: — no functional diff, `kustomize build` produces identical output with tags - configMapGenerator: — ConfigMap names gain hash suffixes (e.g., `prometheus-config` → `prometheus-config-6f42fhctcb`) and all Deployment/StatefulSet/DaemonSet references are updated automatically. 
Pods will restart once per service on first sync due to the name change ## Test plan - [x] `kubectl kustomize` builds all 30 service directories successfully - [x] Image tags verified in rendered output for all modified services - [x] ConfigMap hash suffixes verified in rendered output - [x] ConfigMap references in Deployments/StatefulSets confirmed to use hashed names - [x] All pre-commit hooks pass (yamllint, shellcheck, prettier, etc.) - [ ] `argocd app diff` each service to confirm only expected ConfigMap name changes - [ ] Deploy from branch starting with a low-risk service (e.g., mosquitto) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/264	2026-02-24 14:25:19 -08:00
Erich Blume	4acd2e58d4	Update prometheus and grafana to main-SHA container tags Prometheus: v3.9.1-74029e1 [branch] -> v3.9.1-2ba5d8a [main] Grafana: v12.3.3-09ac36b [branch] -> v12.3.3-d05d2fb [main] These images were built during PR development and referenced branch commits that won't survive branch cleanup. The [main] tags are identical rebuilds from the squash-merge commit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 09:58:09 -08:00
Erich Blume	2ba5d8a8aa	Port Prometheus to local container build (#262) ## Summary - Add three-stage Dockerfile for Prometheus v3.9.1 (Node UI → Go binaries → Alpine runtime) - Produces `prometheus` and `promtool` binaries with embedded web UI assets - Follows navidrome/ntfy pattern for supply chain control via Zot registry ## Deployment and Testing - [ ] `dagger call build --src=. --container-name=prometheus` succeeds - [ ] Container reports correct version via `prometheus --version` - [ ] `promtool --version` works - [ ] Update statefulset image reference after successful build - [ ] Deploy from branch: `argocd app set prometheus --revision <branch> && argocd app sync prometheus` - [ ] Health probes pass (`/-/healthy`, `/-/ready`) - [ ] Web UI loads, scrape targets work, remote write functions Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/262	2026-02-24 09:15:57 -08:00
Erich Blume	2c6c6a244a	Fix Frigate Prometheus metrics & rebuild Grafana dashboard (#252 ) ## Summary - Prometheus scrape target: Changed from `frigate.frigate.svc.cluster.local:5000` (broken after ringtail migration) to `nvr.ops.eblu.me` via HTTPS through Caddy on indri - Grafana dashboard: Rebuilt for Frigate 0.17 metrics — 12 panels total: - Row 1 (stats): Uptime, Inference Speed, Camera FPS, Detection FPS, GPU Usage, GPU Temp - Row 2 (timeseries): CPU Usage, Memory Usage - Row 3 (timeseries): Camera FPS + Skipped FPS, GPU Usage + Memory over time - Row 4 (timeseries): Storage Usage, Detection Events (rate by camera/label) ## Deployment and Testing 1. Sync prometheus app on branch: ``` argocd app set prometheus --revision fix/frigate-metrics-dashboard && argocd app sync prometheus ``` 2. Check `prometheus.ops.eblu.me/targets` — frigate job should show UP 3. Sync grafana-config: ``` argocd app sync grafana-config ``` 4. Check `grafana.ops.eblu.me` — Frigate NVR dashboard should show live data 5. After merge: reset both apps to `--revision main` and sync Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/252	2026-02-22 18:14:17 -08:00
Erich Blume	04c7f3c45a	Deploy Frigate NVR stack with Mosquitto, Ntfy, and frigate-notify (#190 ) ## Summary Deploy a cloud-free NVR stack for the GableCam (ReoLink Elite Floodlight at 192.168.1.159): - Mosquitto — shared MQTT broker in `mqtt` namespace (cluster-internal, no auth) - Ntfy — self-hosted push notifications in `ntfy` namespace, exposed at `ntfy.tail8d86e.ts.net` / `ntfy.ops.eblu.me` - Frigate — NVR with GableCam via HTTP-FLV, ONNX CPU detection, NFS recordings on sifaka, exposed at `nvr.tail8d86e.ts.net` / `nvr.ops.eblu.me` - frigate-notify — bridges Frigate detection events (person, car, dog, cat) to Ntfy alerts via MQTT Also includes: - Prometheus scrape target for Frigate metrics - Grafana dashboard for Frigate (status, inference speed, FPS, CPU/memory, storage) - Caddy reverse proxy entries for `nvr.ops.eblu.me` and `ntfy.ops.eblu.me` ## Prerequisites - [ ] Create NFS share `frigate` on sifaka (`/volume1/frigate`, RW for indri) - [ ] Create 1Password item "Reolink Floodlight Camera" in `blumeops` vault with `username` and `password` fields ## Deployment (after merge) ```bash argocd app sync apps argocd app sync mosquitto argocd app sync ntfy argocd app sync frigate argocd app sync grafana-config argocd app sync prometheus mise run provision-indri -- --tags caddy mise run services-check ``` ## Verification - [ ] Mosquitto pod running, accepting connections on 1883 - [ ] Ntfy web UI accessible at `ntfy.ops.eblu.me` - [ ] Frigate web UI at `nvr.ops.eblu.me` showing GableCam live feed - [ ] Object detection working (ONNX, person/car/dog/cat) - [ ] Recordings appearing in NFS share on sifaka - [ ] frigate-notify sending detection alerts to Ntfy - [ ] Prometheus scraping Frigate metrics - [ ] Grafana dashboard showing Frigate data Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/190	2026-02-14 21:27:44 -08:00
Erich Blume	b3747f6c95	Tier 1 version bumps (#186) ## Summary Audit and upgrade of all deployed images, helm charts, and custom container Dockerfiles to latest stable versions. This PR covers Tier 1 (low-risk minor/patch bumps only). ### Upstream images \| Image \| Old \| New \| \|-------\|-----\|-----\| \| kube-state-metrics \| v2.13.0 \| v2.18.0 \| \| prometheus \| v3.2.1 \| v3.9.1 \| \| loki \| 3.3.2 \| 3.6.5 \| \| alloy \| v1.5.1 \| v1.13.1 \| \| tailscale (proxy + operator) \| v1.92.5 \| v1.94.1 \| \| navidrome \| :latest \| v0.60.3 (pinned) \| ### Helm charts \| Chart \| Old \| New \| \|-------\|-----\|-----\| \| CloudNativePG \| v0.27.0 \| v0.27.1 \| \| 1Password Connect \| 2.2.1 \| 2.3.0 \| ### Custom containers (Dockerfiles updated, images not yet tagged) \| Container \| Changes \| New tag \| \|-----------\|---------\|---------\| \| miniflux \| 2.2.16→2.2.17 (security), alpine 3.22 \| v1.1.0 \| \| kubectl \| v1.34.1→v1.34.4, alpine 3.22 \| v1.1.0 \| \| kiwix-serve \| alpine 3.22 \| v1.1.0 \| \| nettest \| alpine 3.22 \| v0.14.0 \| \| transmission \| alpine 3.22, pkg 4.0.6-r4 \| v1.1.0 \| All custom containers verified with local `dagger call build`. ### Deferred to Tier 2 (separate PRs) - Forgejo runner 6→12 (major version scheme change) - Docker DinD 27→29 - Grafana chart 8→11 (repo migration) - External Secrets 1→2 (breaking changes) - Python 3.12→3.13, Elixir 1.18→1.19, Node 22→24 - Transmission 4.0.6→4.1.0 (not in Alpine yet) ## Deployment After merge: 1. Tag custom containers: `mise run container-tag-and-release <name> <version>` for each 2. Wait for CI builds to complete 3. `argocd app sync apps` then sync individual apps, or let ArgoCD auto-detect Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/186	2026-02-13 17:16:37 -08:00
Erich Blume	48ce5b4120	Recategorize homepage into Content and Misc groups (#179 ) ## Summary - Replace the three homepage groups (Apps, Observability, Infrastructure) with two cleaner groups - Content: Immich, Kiwix, Miniflux, DJ, Grafana - Misc: CV, TeslaMate, Transmission, Docs, Prometheus, PyPI ## Deployment and Testing - [ ] Sync affected ingresses via ArgoCD (all 11 services) - [ ] Verify homepage shows the two new groups correctly Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/179	2026-02-13 09:09:22 -08:00
Erich Blume	85e36cd807	Operations and observability for sifaka NAS (#135 ) ## Summary - Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring - Formalize existing `node_exporter` container under Ansible management - Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus - Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime) - Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS - Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles ## Prerequisites before deploy - [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP) - [ ] Verify `ssh eblume@sifaka 'docker ps'` works - [ ] Run `mise run provision-sifaka` to deploy containers - [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes - [ ] `argocd app sync prometheus` + `argocd app sync grafana-config` ## Test plan - [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics` - [ ] Verify Prometheus targets page shows both sifaka jobs as UP - [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135	2026-02-09 17:44:05 -08:00
Erich Blume	e6cf7e47e0	Restrict flyio-proxy ACLs to dedicated tag:flyio-target endpoints (#126) ## Summary - Introduce `tag:flyio-target` so services must explicitly opt in to be reachable by the fly.io proxy - Replace broad `tag:k8s` and `tag:homelab` grants with the new tag in the ACL rule and test - Add `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to docs, loki, and prometheus Ingresses - Switch Alloy push endpoints from `.ops.eblu.me` (Caddy) to `.tail8d86e.ts.net` (Tailscale Ingress) - Update docs: flyio-proxy, caddy, tailscale, forgejo (future public access + security checklist), expose-service-publicly ## Manual step (not in PR) Update the k8s operator OAuth client in the Tailscale admin console to include `tag:flyio-target` in its scope. Without this, the operator cannot assign the new tag to Ingress proxy nodes. ## Deployment order 1. Pulumi ACLs — `mise run tailnet-preview && mise run tailnet-up` 2. OAuth client — Manual update in Tailscale admin console 3. K8s Ingresses — `argocd app sync apps && argocd app sync docs loki prometheus` 4. Fly.io proxy — `mise run fly-deploy` 5. Verify — `mise run services-check`, check Grafana dashboards ## Test plan - [ ] `mise run tailnet-preview` shows clean diff - [ ] `argocd app diff docs`, `argocd app diff loki`, `argocd app diff prometheus` show only annotation additions - [ ] After deploy: Grafana dashboards show continued log/metric flow - [ ] `curl -sf https://docs.eblu.me` returns 200 - [ ] `mise run services-check` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/126	2026-02-08 21:54:18 -08:00
Erich Blume	38538ad5f0	Replace hajimari with gethomepage (#75 ) ## Summary - Remove hajimari (unmaintained since Oct 2022, broken helm deps) - Add gethomepage (28k stars, actively maintained, monthly releases) - Migrate custom apps, bookmarks, and search config - Enable k8s RBAC for service autodiscovery - Configure Tailscale ingress at go.tail8d86e.ts.net ## Why the switch Hajimari hasn't released since October 2022. The helm chart has a broken dependency (bjw-s/common URL is 404), and unreleased code on main has bugs. gethomepage has similar k8s autodiscovery via ingress annotations and is very actively maintained. ## Deployment and Testing - [ ] Delete hajimari app from ArgoCD - [ ] Delete hajimari namespace - [ ] Sync apps to pick up new homepage app - [ ] Sync homepage app - [ ] Verify go.ops.eblu.me loads 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/75	2026-01-30 13:21:12 -08:00
Erich Blume	316a4c4e42	Shorten Hajimari info descriptions and hide URLs Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-29 16:34:46 -08:00
Erich Blume	d1164c8aac	Add Hajimari service dashboard (#73 ) ## Summary - Add Hajimari as a service dashboard/start page at `go.ops.eblu.me` - Auto-discovers k8s services from ingress annotations - Custom apps for non-k8s services: Forgejo, Registry, Sifaka NAS - Add `nas.ops.eblu.me` Caddy proxy to Synology dashboard ## Services Configured Auto-discovered (k8s ingresses with hajimari.io annotations): - Grafana, ArgoCD, Prometheus, Loki (Observability) - Miniflux, Kiwix, Transmission, TeslaMate, Immich (Apps) - PyPI/devpi (Infrastructure) Custom apps (non-k8s): - Forgejo (forge.ops.eblu.me) - Registry (registry.ops.eblu.me) - Sifaka NAS (nas.ops.eblu.me) Bookmarks: - Tailscale Admin, 1Password, Pulumi ## Deployment and Testing - [ ] Sync `apps` application to pick up new Hajimari Application - [ ] Sync `hajimari` application - [ ] Run `mise run provision-indri -- --tags caddy` for go/nas proxy entries - [ ] Re-sync all k8s apps with hajimari annotations (or wait for natural drift) - [ ] Verify https://go.ops.eblu.me shows dashboard with all services 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/73	2026-01-29 15:51:42 -08:00
Erich Blume	e4a8405de7	Observability cleanup and k8s service monitoring (#43) ## Summary - Remove stale `/opt/homebrew/var/loki` from borgmatic backup (Loki migrated to k8s) - Add Alloy k8s DaemonSet for automatic pod log collection with auto-discovery - Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd - Add transmission-exporter sidecar for full metrics (speed, torrent counts, ratios) - Replace stale devpi dashboard with probe-based metrics (status, response time, uptime) - Add unified "K8s Services Health" dashboard for service uptime/response monitoring ## Manual cleanup already performed - Deleted stale textfile metrics on indri: `devpi.prom`, `transmission.prom` - Deleted stale data directories on indri: `/opt/homebrew/var/loki/`, `/opt/homebrew/var/prometheus/` ## Deployment and Testing - [x] Sync `apps` application to pick up new alloy-k8s app - [x] Deploy alloy-k8s on feature branch: `argocd app set alloy-k8s --revision feature/observability-cleanup && argocd app sync alloy-k8s` - [x] Deploy torrent on feature branch (for transmission exporter): `argocd app set torrent --revision feature/observability-cleanup && argocd app sync torrent` - [x] Deploy prometheus on feature branch (for new scrape config): `argocd app set prometheus --revision feature/observability-cleanup && argocd app sync prometheus` - [x] Deploy grafana-config on feature branch (for dashboards): `argocd app set grafana-config --revision feature/observability-cleanup && argocd app sync grafana-config` - [x] Verify pod logs appear in Loki/Grafana - [x] Verify transmission metrics appear in Prometheus - [x] Verify service probe metrics appear in Prometheus - [x] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic config - [ ] After merge, reset apps to main and resync 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/43	2026-01-22 13:51:01 -08:00
Erich Blume	17023085cb	Migrate observability stack to Kubernetes (#42 ) Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack. Summary - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal) - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net) - Add ACL rule for port 9187 (CNPG metrics) - Delete obsolete ansible roles for prometheus and loki Changes - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints - pulumi/policy.hujson - ACL for port 9187 - Deleted ansible/roles/prometheus/ and ansible/roles/loki/ Deployment and Testing - Stop prometheus and loki on indri - Sync ArgoCD apps (apps, prometheus, loki, grafana) - Run mise run provision-indri -- --tags alloy - Verify Grafana dashboards show data 🤖 Generated with https://claude.ai/claude-code Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42	2026-01-22 12:06:02 -08:00

26 commits
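
The retention figures quoted in commit ef199b70f0 mix days, hours, and years (15d → 10y (3650d) for Prometheus; 31d (744h) → 365d (8760h) for Loki). A minimal Python sketch, assuming a 365-day year with no leap days (which is how "10y (3650d)" appears to be computed) and using hypothetical helper names that are not part of the repository, checks that the quoted pairs agree:

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumption: leap days ignored, matching "10y (3650d)"


def days_to_hours(days: int) -> int:
    """Convert a retention period in days to hours."""
    return days * HOURS_PER_DAY


def years_to_days(years: int) -> int:
    """Convert a retention period in years to days under the 365-day assumption."""
    return years * DAYS_PER_YEAR


# (days, hours) pairs as quoted in the commit message for Loki.
LOKI_RETENTION_PAIRS = {
    "loki_old": (31, 744),
    "loki_new": (365, 8760),
}


def check_pairs() -> dict:
    """Return, per label, whether the quoted hours match the quoted days."""
    return {
        name: days_to_hours(days) == hours
        for name, (days, hours) in LOKI_RETENTION_PAIRS.items()
    }


if __name__ == "__main__":
    print(check_pairs())
    print(years_to_days(10))  # Prometheus retention in days
```

Both Loki pairs and the Prometheus 10y figure are consistent under that assumption; how the values are actually written in the Prometheus and Loki configuration is not shown in this log.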