blumeops

Author	SHA1	Message	Date
Erich Blume	46cc3fbc2e	Update forgejo-runner job image to v0.20.0-448689b Built locally to break the chicken-and-egg: the old runner couldn't build its own replacement because it needed Dagger 0.20.0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 11:05:21 -08:00
Erich Blume	c281fb5403	Add OpenTelemetry distributed tracing (Tempo + Beyla eBPF) (#286) ## Summary Adds the third observability pillar — distributed tracing — alongside existing metrics (Prometheus) and logs (Loki). - Grafana Tempo 2.10.1 on minikube-indri for trace storage with 7d retention, OTLP receivers, and `metrics_generator` that remote-writes span-metrics (RED) to Prometheus - Beyla eBPF auto-instrumentation via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes - Grafana integration — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking - Prometheus scrapes Tempo operational metrics ### Architecture ``` ringtail (k3s) indri (minikube) ┌──────────────────────┐ ┌─────────────────────┐ │ Alloy+Beyla (eBPF) │──OTLP HTTP────────→ │ Tempo │ │ ↳ Frigate, ntfy, │ via tailnet │ ↳ trace storage │ │ Ollama, Immich │ │ ↳ RED → Prometheus │ └──────────────────────┘ │ │ │ Grafana │ │ ↳ Tempo datasource │ └─────────────────────┘ ``` ### New files (12) - `docs/reference/services/tempo.md` — reference doc - `docs/changelog.d/feature-otel-tracing.feature.md` - `argocd/apps/tempo.yaml` + `argocd/manifests/tempo/` (6 files) - `argocd/apps/alloy-tracing-ringtail.yaml` + `argocd/manifests/alloy-tracing-ringtail/` (4 files) ### Modified files (6) - `argocd/manifests/grafana/datasources.yaml` — Tempo datasource + Loki derivedFields - `argocd/manifests/prometheus/prometheus.yml` — Tempo scrape target - `service-versions.yaml` — tempo + alloy-tracing-ringtail entries - `docs/reference/services/grafana.md` — Tempo in datasources table - `docs/reference/reference.md` — Tempo in services index - `docs/reference/operations/observability.md` — Tempo in components list ## Deployment and Testing - [ ] Sync `apps` app to pick up new Application definitions - [ ] `argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo` - [ ] Verify Tempo pod: `kubectl --context=minikube-indri get pods -n monitoring -l app=tempo` - [ ] Verify Tempo ready: port-forward 3200 and `curl localhost:3200/ready` - [ ] Verify Tailscale ingresses: `kubectl --context=minikube-indri get ingress -n monitoring` - [ ] `argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail` - [ ] Check Beyla discovery in alloy-tracing logs on ringtail - [ ] Sync grafana-config for updated datasources - [ ] Sync prometheus for updated scrape config - [ ] Test Grafana Tempo datasource connection - [ ] Generate test traffic and search traces in Grafana Explore → Tempo - [ ] After merge: reset all ArgoCD app revisions back to main Reviewed-on: #286	2026-03-05 10:51:07 -08:00
Erich Blume	7bddc78c8a	Add ExternalSecret default fields to prevent ArgoCD drift The external-secrets operator adds conversionStrategy, decodingStrategy, and metadataPolicy defaults to the live object, causing perpetual OutOfSync in ArgoCD. Declare them explicitly to match. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 09:11:23 -08:00
Erich Blume	405fc59c12	Add Authentik OIDC login for ArgoCD (#284 ) ## Summary - Add Authentik OAuth2 provider + application blueprint for ArgoCD (ringtail side) - Add OIDC config to ArgoCD ConfigMap with Authentik as identity provider (indri side) - Map Authentik `admins` group to ArgoCD `role:admin` via RBAC policy - ExternalSecrets on both sides pull `argocd-client-secret` from 1Password - Local admin password remains as break-glass — both login methods coexist ## Pre-deployment manual step Add `argocd-client-secret` field to "Authentik (blumeops)" in 1Password with a random value (e.g., `openssl rand -hex 32`). ## Deployment order 1. Sync Authentik app on ringtail first (blueprint + secret + worker env var) 2. Sync ArgoCD app on indri second (cm, rbac, ExternalSecret) ## Verification - [ ] `argocd-client-secret` field added to 1Password - [ ] Authentik app synced on ringtail — blueprint applied, provider created - [ ] ArgoCD app synced on indri — OIDC config applied - [ ] SSO login works: visit `https://argocd.ops.eblu.me` → "Log in via Authentik" → admin access - [ ] Break-glass: local admin/password login still works Reviewed-on: #284	2026-03-05 09:07:25 -08:00
Erich Blume	91c755ddd6	Pin kiwix-serve image tag to v3.8.2-f6f0f79 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 08:17:40 -08:00
Erich Blume	75814e032c	Pin transmission-exporter image tag to v1.0.1-c93448f Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 08:05:17 -08:00
Erich Blume	797133b28e	Fix per-torrent rate panels showing cumulative bytes instead of rates Dashboard "Download/Upload Rate by Torrent" panels were querying transmission_torrent_download_bytes (total_size * percent_done) and transmission_torrent_upload_bytes (uploaded_ever) — cumulative byte gauges, not rates. Added new metrics using Transmission's native rate_download/rate_upload and updated dashboard queries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 08:01:37 -08:00
Erich Blume	6ae18cde1e	Pin transmission-exporter image tag to v1.0.0-f2704b2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 21:55:59 -08:00
Erich Blume	f2704b26da	Replace transmission-exporter with homegrown Python exporter (#283) ## Summary - Replace unmaintained `metalmatze/transmission-exporter:master` sidecar with a homegrown Python exporter - Uses `prometheus_client` + `transmission-rpc` with collect-on-scrape pattern (fresh metrics per scrape, no stale labels) - Same metric names so existing Grafana Transmission dashboard works unchanged - Container built with `uv` for dependency management, follows `grafana-sidecar` Dockerfile pattern ## Changes - New: `containers/transmission-exporter/exporter.py` — single-file exporter (~130 lines) - New: `containers/transmission-exporter/Dockerfile` — multi-stage Alpine build with uv - Modified: `argocd/manifests/torrent/deployment.yaml` — swap sidecar image reference - Modified: `argocd/manifests/torrent/kustomization.yaml` — add image tag entry - Modified: `service-versions.yaml` — add transmission-exporter entry ## Deployment and Testing - [ ] Build container: `mise run container-build-and-release transmission-exporter` - [ ] Update kustomization.yaml newTag with build SHA - [ ] Branch deploy: `argocd app set torrent --revision feature/transmission-exporter-python && argocd app sync torrent` - [ ] Verify metrics: `kubectl -n torrent --context=minikube-indri port-forward svc/transmission 19091:19091` then `curl localhost:19091/metrics | grep transmission_` - [ ] Verify Grafana Transmission dashboard panels populate - [ ] After merge: `argocd app set torrent --revision main && argocd app sync torrent` Reviewed-on: #283	2026-03-04 21:55:00 -08:00
Erich Blume	91d84e54d5	Replace OOMKilled stat with detail table, shrink waiting reason panel The count-only stat wasn't actionable. New table shows pod name, container, restart count, and memory limit for each OOMKilled container. Waiting reason panel narrowed to make room. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 20:58:11 -08:00
Erich Blume	008da43736	Add OOMKill observability to Kubernetes Clusters dashboard OOMKilled containers previously only appeared briefly in "Unhealthy Pods" while dying, then vanished on restart. New panels use persistent metrics (last_terminated_reason) and restart rate tracking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 20:53:07 -08:00
Erich Blume	e90c287504	Add qwen3.5:9b to Ollama model list Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 19:49:39 -08:00
Erich Blume	b460333da0	Upgrade Transmission to 4.1.1 (#282) ## Summary - Upgrade Transmission from 4.0.6-r4 to 4.1.1-r1 - Uses Alpine edge community repo for transmission packages, keeping stable alpine:3.22 base - Fix stale image reference in service doc (was linuxserver, now custom registry image) - Mark transmission as reviewed in service-versions.yaml ## Context Service review found Transmission two minor versions behind (4.0.6 → 4.1.1). Alpine 3.22 only packages 4.0.6, so transmission is installed from edge's community repo with an exact version pin. 4.1.0 added improved µTP performance, IPv6/dual-stack UDP tracker, JSON-RPC 2.0 API. 4.1.1 is a bugfix release (20+ fixes). Dagger test build passed locally. ## Deployment and Testing - [ ] Build container via Forgejo workflow (`mise run container-build-and-release transmission`) - [ ] Update kustomization.yaml with new image tag - [ ] `argocd app set torrent --revision feature/transmission-review && argocd app sync torrent` - [ ] Verify web UI at https://torrent.ops.eblu.me - [ ] Check Grafana Transmission dashboard still receives metrics - [ ] After merge: `argocd app set torrent --revision main && argocd app sync torrent` ## Note The transmission-exporter sidecar (OOMKilling every ~30min, 294 restarts) is being tracked separately as a future replacement project. Reviewed-on: #282	2026-03-04 07:44:33 -08:00
Erich Blume	d7f0aa6f96	Fix Frigate database path to use persistent volume The database was at /config/frigate.db (emptyDir, ephemeral) instead of /db/frigate.db (PVC, persistent). Every pod restart wiped the database, losing all recording history and leaving orphaned files on NFS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 15:18:16 -08:00
Erich Blume	135883079c	Bump frigate memory limit from 2Gi to 3Gi ONNX detector + CUDA ffmpeg + workers consume ~1.9Gi at steady state, causing intermittent OOMKills at the 2Gi limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:57:15 -08:00
Erich Blume	3d065b94f9	Pin grafana-sidecar to main build tag v1.28.0-a2bb9ab (built from merge commit on main). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:51:01 -08:00
Erich Blume	a2bb9abbdb	Home-build grafana-sidecar container (#281) ## Summary - Home-build the k8s-sidecar container (`grafana-sidecar`) from forge mirror, replacing upstream `quay.io/kiwigrid/k8s-sidecar:1.28.0` - Pinned to v1.28.0 — v2.x deferred due to 135% memory regression and readOnlyRootFilesystem crashloop - Adds Dockerfile, service-versions entry, docs, and changelog fragment - Manifest switch to home-built image pending container build ## Deployment and Testing - [ ] `mise run container-build-and-release grafana-sidecar` - [ ] Update kustomization.yaml with built image tag - [ ] `argocd app set grafana --revision feature/grafana-sidecar && argocd app sync grafana` - [ ] Verify sidecar logs and dashboards at https://grafana.ops.eblu.me - [ ] Post-merge: `argocd app set grafana --revision main && argocd app sync grafana` Reviewed-on: #281	2026-03-03 13:48:24 -08:00
Erich Blume	876e51dd77	Allow implicit octals in yamllint and normalize k8s mode values Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:10:44 -08:00
Erich Blume	eceea2126b	Add Gandi bookmark to homepage dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:05:50 -08:00
Erich Blume	51626e6630	Update Loki to v3.6.5-3dc4ed7 container image Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:01:49 -08:00
Erich Blume	3dc4ed730b	Build Loki container image locally (#280) ## Summary - Add two-stage Dockerfile for Loki (Go build → Alpine runtime) in `containers/loki/` - Rewrite kustomize image to `registry.ops.eblu.me/blumeops/loki` - Tag is `v3.6.5-placeholder` until first CI build; will be updated post-build ## Details - UID 10001 matches existing StatefulSet `securityContext` (runAsUser/fsGroup) - CGO_ENABLED=0, ldflags embed version via `github.com/grafana/loki/v3/pkg/util/build` - Clones from `forge.ops.eblu.me/mirrors/loki` (mirror created this session) - Pattern follows miniflux (two-stage Go) + prometheus (ldflags) ## Deployment and Testing - [ ] Trigger container build: `mise run container-build-and-release loki` - [ ] Update kustomize tag to actual build tag - [ ] Deploy from branch: `argocd app set loki --revision feature/loki-container && argocd app sync loki` - [ ] Verify `/ready` endpoint and log ingestion - [ ] After merge: update to `[main]` tag (C0 follow-up) Reviewed-on: #280	2026-03-03 13:00:43 -08:00
Erich Blume	f914a14653	Update teslamate to v3.0.0-eb9bc57 container image Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 12:02:26 -08:00
Erich Blume	01d3b4d1c7	Switch forgejo-runner ArgoCD app to internal SSH repo URL Was the only app still using https://forge.eblu.me (public proxy) for git polling. All other apps already use the internal SSH endpoint at forge.ops.eblu.me. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:43:01 -08:00
Erich Blume	82884436df	Route runner polling through internal forge.ops.eblu.me The k8s and ringtail runners were hitting forge.eblu.me (fly.io proxy) for every FetchTask poll (~every 2s), round-tripping through the public internet unnecessarily. Use forge.ops.eblu.me (Caddy on indri, tailnet) for infrastructure workloads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:33:40 -08:00
Erich Blume	7b68be2e80	Add fly.io proxy observability and app logs to Forgejo dashboard Rename "Forgejo Repository Health" to "Forgejo" and add proxy metrics (request rate, error rate, RPS, latency, bandwidth), proxy access logs, and Forgejo application logs from Loki. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:24:53 -08:00
Erich Blume	86a0dee000	Remove ollama LAN NodePort service The sanctioned ingress is ollama.ops.eblu.me via tailnet. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:00:05 -08:00
Erich Blume	3af346f1cd	Move ollama LAN NodePort to port 80 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 09:37:50 -08:00
Erich Blume	a87c997ee1	Expose Forgejo publicly at forge.eblu.me (#278) ## Summary Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service. - Forgejo hardening: Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO) - Tailscale Ingress: ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint - Fly.io proxy: nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit - Authentik: OAuth callback updated to forge.eblu.me - DNS/TLS: CNAME record in Pulumi, cert in fly-setup - Rename: ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is) ## Deployment Order 1. `mise run provision-indri -- --tags forgejo` (config changes) 2. Verify forge.ops.eblu.me still works 3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator` 4. Verify `curl https://forge.tail8d86e.ts.net` 5. `cd fly && fly deploy` 6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/` 7. `fly certs add forge.eblu.me -a blumeops-proxy` 8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik` 9. `mise run dns-preview && mise run dns-up` 10. Full verification (see below) 11. Rehearse `mise run fly-shutoff` 12. After merge: reset ArgoCD revisions to main, re-sync ## Verification Checklist - [ ] forge.eblu.me loads, shows public repos - [ ] forge.ops.eblu.me still works from tailnet - [ ] SSH clone via forge.ops.eblu.me:2222 works - [ ] HTTPS clone via forge.eblu.me works - [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH - [ ] /swagger returns 403 - [ ] Rapid login attempts trigger 429 rate limit - [ ] fail2ban bans after 5 failed logins in 10 minutes - [ ] ArgoCD can still sync (SSH unaffected) - [ ] `mise run fly-shutoff` stops all public traffic - [ ] `mise run services-check` passes Reviewed-on: #278	2026-03-03 08:40:41 -08:00
Erich Blume	a32c99a252	Limit ollama to one loaded model and one parallel request Prevents OOM when switching between models — only one 14B model fits in 16GB VRAM at a time with KV cache for context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 21:23:12 -08:00
Erich Blume	203e3cd567	Add NodePort service for ollama LAN access Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 20:57:18 -08:00
Erich Blume	31d925814f	Deploy Ollama LLM server on ringtail (#277 ) ## Summary - Deploy Ollama as a new ArgoCD-managed service on ringtail's k3s cluster with GPU acceleration - Declarative model management via `models.txt` + sidecar sync script (mirrors kiwix torrent pattern) - Initial models: `qwen2.5:14b`, `deepseek-r1:14b`, `phi4:14b`, `gemma3:12b` - hostPath PV on `/mnt/storage1/ollama` for fast local model storage (200Gi) - Tailscale ingress at `ollama.ops.eblu.me` for API access from tailnet - Enable GPU time-slicing (`replicas: 2`) on nvidia-device-plugin so Frigate and Ollama share the RTX 4080 ## Deployment and Testing - [ ] Deploy nvidia-device-plugin changes first: `argocd app sync nvidia-device-plugin` - [ ] Verify GPU time-slicing: `kubectl describe node ringtail --context=k3s-ringtail` shows `nvidia.com/gpu: 2` - [ ] Sync `apps` app with `--revision feature/ollama-ringtail` - [ ] Set ollama app to branch: `argocd app set ollama --revision feature/ollama-ringtail && argocd app sync ollama` - [ ] Verify model-sync sidecar pulls models: `kubectl logs -n ollama deploy/ollama -c model-sync --context=k3s-ringtail` - [ ] Test API: `curl https://ollama.ops.eblu.me/api/tags` - [ ] Test inference: `curl https://ollama.ops.eblu.me/api/generate -d '{"model":"qwen2.5:14b","prompt":"Hello"}'` - [ ] Verify Frigate still works after GPU sharing change - [ ] After merge: `argocd app set ollama --revision main && argocd app sync ollama` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/277	2026-03-02 20:39:51 -08:00
Forgejo Actions	0f79c61c42	Update docs release to v1.12.1 - Built changelog from towncrier fragments [skip ci]	2026-03-02 18:17:07 -08:00
Forgejo Actions	847e47eaf3	Update docs release to v1.12.0 - Built changelog from towncrier fragments [skip ci]	2026-03-01 17:24:09 -08:00
Erich Blume	503775085d	Deploy authentik 2026.2.0 with migration ordering fix Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 16:32:10 -08:00
Erich Blume	90621e4155	Deploy authentik 2026.2.0 with entry_points fix Update image tag to v2026.2.0-78027eb-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 16:04:29 -08:00
Erich Blume	e2c650b027	Deploy authentik 2026.2.0 with BASE_DIR fix Update image tag to v2026.2.0-e49d966-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:55:50 -08:00
Erich Blume	c0e29476f3	Deploy authentik 2026.2.0 with TMPDIR fix Update image tag to v2026.2.0-b7bfb0b-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:53:09 -08:00
Erich Blume	38da372f94	Deploy authentik 2026.2.0 with /tmp fix Update image tag to v2026.2.0-2ac353b-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:51:17 -08:00
Erich Blume	098f3e517c	Deploy authentik 2026.2.0 (source-built) to ArgoCD Update image tag to v2026.2.0-efa9806-nix — the first source-built authentik container from the build-authentik-from-source chain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:44:35 -08:00
Erich Blume	02eb169403	Pin blumeops-pg to PostgreSQL 18.3 Replace floating :18 tag with pinned :18.3 (upstream out-of-cycle release fixing 18.2 regressions). Stamps service as reviewed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 16:25:32 -08:00
Erich Blume	776caa87f5	Sync Frigate zone coordinates from live API Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 07:52:09 -08:00
Forgejo Actions	fa223f8e3b	Update docs release to v1.11.5 - Built changelog from towncrier fragments [skip ci]	2026-02-26 07:56:02 -08:00
Erich Blume	be3cdad1cb	Add HA for CV and Docs: zero-downtime deploys (#273 ) ## Summary - Set `replicas: 2` with `maxUnavailable: 0` / `maxSurge: 1` on CV and Docs deployments so rolling updates never drop below 2 ready pods - Add PodDisruptionBudgets (`minAvailable: 1`) to protect against node drains and cluster maintenance - Add Fly.io cache purge step to `cv-deploy.yaml` workflow (docs already had this) so CV deploys don't serve stale cached content ## Deployment and Testing - [ ] `argocd app diff cv` / `argocd app diff docs` from branch - [ ] Deploy from branch: `argocd app set cv --revision feature/ha-cv-docs-zero-downtime && argocd app sync cv` - [ ] Verify 2 pods running: `kubectl get pods -n cv --context=minikube-indri` - [ ] Test rolling restart: `kubectl rollout restart deployment/cv -n cv --context=minikube-indri` - [ ] During rollout, confirm continuous availability via `curl -I https://cv.eblu.me` - [ ] After merge: reset ArgoCD to main, re-sync both apps Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/273	2026-02-26 07:53:21 -08:00
Erich Blume	fb83c5c577	Add explicit ExternalSecret defaults for SSA sync parity The external-secrets webhook injects conversionStrategy, decodingStrategy, and metadataPolicy defaults on admission. Declaring them explicitly prevents ArgoCD SSA from flagging the resource as OutOfSync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 07:02:54 -08:00
Erich Blume	db561c6b0e	Upgrade ArgoCD v3.2.6 → v3.3.2 with Server-Side Apply (#272 ) ## Summary - Upgrade ArgoCD from v3.2.6 to v3.3.2 - Enable `ServerSideApply=true` sync option (required by v3.3 — ApplicationSet CRD exceeds client-side apply annotation limit) - Update service-versions.yaml with review for argocd and 1password-connect ## Breaking changes reviewed - Server-Side Apply required: Added to syncOptions ✅ - Source Hydrator git notes: Not used — N/A - Application path cleaning removed: Not used — N/A - Settings API field restriction: Authenticated access only — N/A ## Deployment and Testing - [ ] Sync the `apps` app first (picks up SSA syncOption change) - [ ] `argocd app set argocd --revision feature/argocd-v3.3.2` - [ ] `argocd app sync argocd` - [ ] Verify all argocd pods running with v3.3.2 images - [ ] Verify other apps still sync correctly - [ ] After merge: `argocd app set argocd --revision main && argocd app sync argocd` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/272	2026-02-26 06:51:50 -08:00
Erich Blume	95c8424e62	Add Transmission metrics exporter and Grafana dashboard (#271 ) ## Summary - Add `metalmatze/transmission-exporter` as a sidecar container in the torrent deployment, exposing Prometheus metrics on port 19091 - Add metrics port to the torrent service for Prometheus scraping - Add Prometheus scrape job targeting the transmission exporter - Create Grafana dashboard with: - Overview stats (download/upload speed, active/total torrents) - Transfer speed timeseries (download + upload over time) - Transfer volume stats (total downloaded/uploaded in selected range) - Per-torrent download and upload rate timeseries - Per-torrent details table (ratio, uploaded, percent done) ## Deployment and Testing - [ ] Sync ArgoCD `torrent` app from branch — verify exporter sidecar starts - [ ] Verify exporter metrics: `kubectl exec` into pod, `curl localhost:19091/metrics` - [ ] Verify Prometheus scrapes it: check targets at prometheus.ops.eblu.me - [ ] Open Grafana, find "Transmission" dashboard, verify panels populate - [ ] Sync ArgoCD `prometheus` app from branch - [ ] Sync ArgoCD `grafana-config` app from branch Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/271	2026-02-25 22:23:33 -08:00
Erich Blume	03d71544ec	Add multi-cluster observability with ringtail metrics and dashboards (#270 ) ## Summary - Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs - Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests) - Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki - Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with: - Kubernetes Clusters dashboard — multi-cluster with `cluster` and `namespace` template variables - Ringtail (k3s) dashboard — dedicated ringtail view with GPU usage panels ## Deployment and Testing 1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`) 2. Sync `prometheus` → verify `cluster` label on scraped metrics 3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs 4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs 5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail 6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}` 7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values 8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods ## Notes - Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later - DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve - The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270	2026-02-25 22:01:00 -08:00
Erich Blume	2243f2e0a1	Filter driveway zone to person/dog/cat only in Frigate Parked car was being re-detected every few minutes at night due to IR illumination noise triggering motion detection. Restrict the driveway zone to [person, dog, cat] so cars and birds no longer create events there. Cars still alert via the driveway_entrance zone. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 20:45:07 -08:00
Erich Blume	de54b4e33d	Port CloudNative-PG off Helm to direct release manifest (#268) ## Summary - Point ArgoCD app directly at forge-mirrored upstream repo (`mirrors/cloudnative-pg`) instead of the Helm charts repo - Use `directory.include` to select the specific release manifest (`cnpg-1.27.1.yaml`) from the `releases/` directory - No vendored files, no Helm — upgrades are a two-line change (`targetRevision` + `directory.include`) - Delete unused `values.yaml` (was empty, all Helm defaults) ## Deployment and Testing - [ ] Register mirror repo in ArgoCD: `argocd repo add ssh://forgejo@forge.ops.eblu.me:2222/mirrors/cloudnative-pg.git --ssh-private-key-path <key>` - [ ] `argocd app set cloudnative-pg --revision feature/cnpg-direct-source && argocd app sync cloudnative-pg` - [ ] Verify operator pod running: `kubectl get pods -n cnpg-system --context=minikube-indri` - [ ] Verify CRDs exist: `kubectl get crd --context=minikube-indri | grep cnpg` - [ ] Verify existing clusters healthy: `kubectl get clusters -A --context=minikube-indri` - [ ] After merge: `argocd app set cloudnative-pg --revision main && argocd app sync cloudnative-pg` ## Notes - The forge mirror was created via `mise run mirror-create` from `https://github.com/cloudnative-pg/cloudnative-pg.git` - ArgoCD may need the mirror repo added to its known repositories if the credential template doesn't already match `mirrors/*` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/268	2026-02-25 17:37:53 -08:00
Erich Blume	285ad4141f	Fix Frigate detection events rate metric name in Grafana dashboard The panel queried frigate_camera_events but the actual metric exposed by Frigate is frigate_camera_events_total with a "camera" label (not "camera_name"). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 16:51:57 -08:00


258 commits