blumeops

Author	SHA1	Message	Date
Erich Blume	797133b28e	Fix per-torrent rate panels showing cumulative bytes instead of rates. Dashboard "Download/Upload Rate by Torrent" panels were querying transmission_torrent_download_bytes (total_size * percent_done) and transmission_torrent_upload_bytes (uploaded_ever): cumulative byte gauges, not rates. Added new metrics using Transmission's native rate_download/rate_upload and updated dashboard queries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-05 08:01:37 -08:00
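This fix turns on the difference between a cumulative byte value and a rate. A cumulative value only becomes a rate once you divide the change between two samples by the elapsed time. PromQL's `rate()` does roughly this for counters, and it also extrapolates to the edges of the window. The helper below is a simplified, hypothetical illustration, not code from this repository, and the sample values are invented. The commit avoids the calculation entirely by exporting Transmission's own `rate_download`/`rate_upload` values.

```python
def per_second_rate(samples):
    """Average per-second rate from (timestamp_s, cumulative_bytes) samples.

    Simplified illustration: sums the increases between consecutive samples,
    treating a drop in the cumulative value as a reset that restarts from zero.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A decrease means the counter reset (e.g. a restart); count from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes 15 s apart of one torrent's cumulative uploaded bytes.
samples = [(0, 1_000_000), (15, 1_300_000), (30, 1_600_000)]
print(per_second_rate(samples))  # 20000.0 bytes/s, whereas the raw gauge reads 1.6 MB
```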
Erich Blume	6ae18cde1e	Pin transmission-exporter image tag to v1.0.0-f2704b2 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 21:55:59 -08:00
Erich Blume	f2704b26da	Replace transmission-exporter with homegrown Python exporter (#283) ## Summary - Replace unmaintained `metalmatze/transmission-exporter:master` sidecar with a homegrown Python exporter - Uses `prometheus_client` + `transmission-rpc` with collect-on-scrape pattern (fresh metrics per scrape, no stale labels) - Same metric names so existing Grafana Transmission dashboard works unchanged - Container built with `uv` for dependency management, follows `grafana-sidecar` Dockerfile pattern ## Changes - New: `containers/transmission-exporter/exporter.py`: single-file exporter (~130 lines) - New: `containers/transmission-exporter/Dockerfile`: multi-stage Alpine build with uv - Modified: `argocd/manifests/torrent/deployment.yaml`: swap sidecar image reference - Modified: `argocd/manifests/torrent/kustomization.yaml`: add image tag entry - Modified: `service-versions.yaml`: add transmission-exporter entry ## Deployment and Testing - [ ] Build container: `mise run container-build-and-release transmission-exporter` - [ ] Update kustomization.yaml newTag with build SHA - [ ] Branch deploy: `argocd app set torrent --revision feature/transmission-exporter-python && argocd app sync torrent` - [ ] Verify metrics: `kubectl -n torrent --context=minikube-indri port-forward svc/transmission 19091:19091` then `curl localhost:19091/metrics | grep transmission_` - [ ] Verify Grafana Transmission dashboard panels populate - [ ] After merge: `argocd app set torrent --revision main && argocd app sync torrent` Reviewed-on: #283	2026-03-04 21:55:00 -08:00
Erich Blume	91d84e54d5	Replace OOMKilled stat with detail table, shrink waiting reason panel. The count-only stat wasn't actionable. New table shows pod name, container, restart count, and memory limit for each OOMKilled container. Waiting reason panel narrowed to make room. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 20:58:11 -08:00
Erich Blume	008da43736	Add OOMKill observability to Kubernetes Clusters dashboard. OOMKilled containers previously only appeared briefly in "Unhealthy Pods" while dying, then vanished on restart. New panels use persistent metrics (last_terminated_reason) and restart rate tracking. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 20:53:07 -08:00
Erich Blume	e90c287504	Add qwen3.5:9b to Ollama model list Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-04 19:49:39 -08:00
Erich Blume	b460333da0	Upgrade Transmission to 4.1.1 (#282) ## Summary - Upgrade Transmission from 4.0.6-r4 to 4.1.1-r1 - Uses Alpine edge community repo for transmission packages, keeping stable alpine:3.22 base - Fix stale image reference in service doc (was linuxserver, now custom registry image) - Mark transmission as reviewed in service-versions.yaml ## Context Service review found Transmission one minor version behind (4.0.6 → 4.1.1). Alpine 3.22 only packages 4.0.6, so transmission is installed from edge's community repo with an exact version pin. 4.1.0 added improved µTP performance, IPv6/dual-stack UDP tracker support, and a JSON-RPC 2.0 API. 4.1.1 is a bugfix release (20+ fixes). Dagger test build passed locally. ## Deployment and Testing - [ ] Build container via Forgejo workflow (`mise run container-build-and-release transmission`) - [ ] Update kustomization.yaml with new image tag - [ ] `argocd app set torrent --revision feature/transmission-review && argocd app sync torrent` - [ ] Verify web UI at https://torrent.ops.eblu.me - [ ] Check Grafana Transmission dashboard still receives metrics - [ ] After merge: `argocd app set torrent --revision main && argocd app sync torrent` ## Note The transmission-exporter sidecar (OOMKilling every ~30min, 294 restarts) is being tracked separately as a future replacement project. Reviewed-on: #282	2026-03-04 07:44:33 -08:00
Erich Blume	d7f0aa6f96	Fix Frigate database path to use persistent volume. The database was at /config/frigate.db (emptyDir, ephemeral) instead of /db/frigate.db (PVC, persistent). Every pod restart wiped the database, losing all recording history and leaving orphaned files on NFS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 15:18:16 -08:00
Erich Blume	135883079c	Bump frigate memory limit from 2Gi to 3Gi. ONNX detector + CUDA ffmpeg + workers consume ~1.9Gi at steady state, causing intermittent OOMKills at the 2Gi limit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:57:15 -08:00
Erich Blume	3d065b94f9	Pin grafana-sidecar to main build tag v1.28.0-a2bb9ab (built from merge commit on main). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:51:01 -08:00
Erich Blume	a2bb9abbdb	Home-build grafana-sidecar container (#281) ## Summary - Home-build the k8s-sidecar container (`grafana-sidecar`) from forge mirror, replacing upstream `quay.io/kiwigrid/k8s-sidecar:1.28.0` - Pinned to v1.28.0; v2.x deferred due to 135% memory regression and readOnlyRootFilesystem crashloop - Adds Dockerfile, service-versions entry, docs, and changelog fragment - Manifest switch to home-built image pending container build ## Deployment and Testing - [ ] `mise run container-build-and-release grafana-sidecar` - [ ] Update kustomization.yaml with built image tag - [ ] `argocd app set grafana --revision feature/grafana-sidecar && argocd app sync grafana` - [ ] Verify sidecar logs and dashboards at https://grafana.ops.eblu.me - [ ] Post-merge: `argocd app set grafana --revision main && argocd app sync grafana` Reviewed-on: #281	2026-03-03 13:48:24 -08:00
Erich Blume	876e51dd77	Allow implicit octals in yamllint and normalize k8s mode values Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:10:44 -08:00
Erich Blume	eceea2126b	Add Gandi bookmark to homepage dashboard Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:05:50 -08:00
Erich Blume	51626e6630	Update Loki to v3.6.5-3dc4ed7 container image Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 13:01:49 -08:00
Erich Blume	3dc4ed730b	Build Loki container image locally (#280) ## Summary - Add two-stage Dockerfile for Loki (Go build → Alpine runtime) in `containers/loki/` - Rewrite kustomize image to `registry.ops.eblu.me/blumeops/loki` - Tag is `v3.6.5-placeholder` until first CI build; will be updated post-build ## Details - UID 10001 matches existing StatefulSet `securityContext` (runAsUser/fsGroup) - CGO_ENABLED=0, ldflags embed version via `github.com/grafana/loki/v3/pkg/util/build` - Clones from `forge.ops.eblu.me/mirrors/loki` (mirror created this session) - Pattern follows miniflux (two-stage Go) + prometheus (ldflags) ## Deployment and Testing - [ ] Trigger container build: `mise run container-build-and-release loki` - [ ] Update kustomize tag to actual build tag - [ ] Deploy from branch: `argocd app set loki --revision feature/loki-container && argocd app sync loki` - [ ] Verify `/ready` endpoint and log ingestion - [ ] After merge: update to `[main]` tag (C0 follow-up) Reviewed-on: #280	2026-03-03 13:00:43 -08:00
Erich Blume	f914a14653	Update teslamate to v3.0.0-eb9bc57 container image Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 12:02:26 -08:00
Erich Blume	01d3b4d1c7	Switch forgejo-runner ArgoCD app to internal SSH repo URL. It was the only app still using https://forge.eblu.me (public proxy) for git polling. All other apps already use the internal SSH endpoint at forge.ops.eblu.me. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:43:01 -08:00
Erich Blume	82884436df	Route runner polling through internal forge.ops.eblu.me. The k8s and ringtail runners were hitting forge.eblu.me (fly.io proxy) for every FetchTask poll (~every 2s), round-tripping through the public internet unnecessarily. Use forge.ops.eblu.me (Caddy on indri, tailnet) for infrastructure workloads. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:33:40 -08:00
Erich Blume	7b68be2e80	Add fly.io proxy observability and app logs to Forgejo dashboard. Rename "Forgejo Repository Health" to "Forgejo" and add proxy metrics (request rate, error rate, RPS, latency, bandwidth), proxy access logs, and Forgejo application logs from Loki. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:24:53 -08:00
Erich Blume	86a0dee000	Remove ollama LAN NodePort service. The sanctioned ingress is ollama.ops.eblu.me via tailnet. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 10:00:05 -08:00
Erich Blume	3af346f1cd	Move ollama LAN NodePort to port 80 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-03 09:37:50 -08:00
Erich Blume	a87c997ee1	Expose Forgejo publicly at forge.eblu.me (#278) ## Summary Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy: the first dynamic, authenticated public-facing service. - Forgejo hardening: Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO) - Tailscale Ingress: ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint - Fly.io proxy: nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit - Authentik: OAuth callback updated to forge.eblu.me - DNS/TLS: CNAME record in Pulumi, cert in fly-setup - Rename: ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is) ## Deployment Order 1. `mise run provision-indri -- --tags forgejo` (config changes) 2. Verify forge.ops.eblu.me still works 3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator` 4. Verify `curl https://forge.tail8d86e.ts.net` 5. `cd fly && fly deploy` 6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/` 7. `fly certs add forge.eblu.me -a blumeops-proxy` 8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik` 9. `mise run dns-preview && mise run dns-up` 10. Full verification (see below) 11. Rehearse `mise run fly-shutoff` 12. After merge: reset ArgoCD revisions to main, re-sync ## Verification Checklist - [ ] forge.eblu.me loads, shows public repos - [ ] forge.ops.eblu.me still works from tailnet - [ ] SSH clone via forge.ops.eblu.me:2222 works - [ ] HTTPS clone via forge.eblu.me works - [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH - [ ] /swagger returns 403 - [ ] Rapid login attempts trigger 429 rate limit - [ ] fail2ban bans after 5 failed logins in 10 minutes - [ ] ArgoCD can still sync (SSH unaffected) - [ ] `mise run fly-shutoff` stops all public traffic - [ ] `mise run services-check` passes Reviewed-on: #278	2026-03-03 08:40:41 -08:00
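The fail2ban item in the Forgejo verification checklist (ban after 5 failed logins in 10 minutes) follows fail2ban's `maxretry`/`findtime` model. A source is banned once it records `maxretry` failures inside a sliding `findtime` window. Below is a simplified, hypothetical model of that rule, assuming `maxretry = 5` and `findtime = 600` seconds. It is not fail2ban's implementation, which parses the nginx logs and then runs the configured ban action.

```python
from collections import defaultdict, deque


class FailureWindow:
    """Ban a source once it reaches max_retry failures within find_time seconds.

    Hypothetical model of fail2ban's maxretry/findtime rule, for illustration.
    """

    def __init__(self, max_retry: int = 5, find_time: float = 600.0):
        self.max_retry = max_retry
        self.find_time = find_time
        self.failures = defaultdict(deque)  # ip -> timestamps of recent failures
        self.banned = set()

    def record_failure(self, ip: str, now: float) -> bool:
        """Record one failed login at time `now`; return True if ip is banned."""
        window = self.failures[ip]
        window.append(now)
        # Forget failures that have slid out of the find_time window.
        while window and now - window[0] > self.find_time:
            window.popleft()
        if len(window) >= self.max_retry:
            self.banned.add(ip)
        return ip in self.banned


fw = FailureWindow()
# Five failures a minute apart: the fifth one triggers the ban.
print([fw.record_failure("203.0.113.7", t) for t in (0, 60, 120, 180, 240)])
```

Failures spaced three minutes apart never reach five inside any 10-minute window, so that source is never banned.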
Erich Blume	a32c99a252	Limit ollama to one loaded model and one parallel request. Prevents OOM when switching between models: only one 14B model fits in 16GB VRAM at a time with KV cache for context. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 21:23:12 -08:00
Erich Blume	203e3cd567	Add NodePort service for ollama LAN access Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-02 20:57:18 -08:00
Erich Blume	31d925814f	Deploy Ollama LLM server on ringtail (#277) ## Summary - Deploy Ollama as a new ArgoCD-managed service on ringtail's k3s cluster with GPU acceleration - Declarative model management via `models.txt` + sidecar sync script (mirrors kiwix torrent pattern) - Initial models: `qwen2.5:14b`, `deepseek-r1:14b`, `phi4:14b`, `gemma3:12b` - hostPath PV on `/mnt/storage1/ollama` for fast local model storage (200Gi) - Tailscale ingress at `ollama.ops.eblu.me` for API access from tailnet - Enable GPU time-slicing (`replicas: 2`) on nvidia-device-plugin so Frigate and Ollama share the RTX 4080 ## Deployment and Testing - [ ] Deploy nvidia-device-plugin changes first: `argocd app sync nvidia-device-plugin` - [ ] Verify GPU time-slicing: `kubectl describe node ringtail --context=k3s-ringtail` shows `nvidia.com/gpu: 2` - [ ] Sync `apps` app with `--revision feature/ollama-ringtail` - [ ] Set ollama app to branch: `argocd app set ollama --revision feature/ollama-ringtail && argocd app sync ollama` - [ ] Verify model-sync sidecar pulls models: `kubectl logs -n ollama deploy/ollama -c model-sync --context=k3s-ringtail` - [ ] Test API: `curl https://ollama.ops.eblu.me/api/tags` - [ ] Test inference: `curl https://ollama.ops.eblu.me/api/generate -d '{"model":"qwen2.5:14b","prompt":"Hello"}'` - [ ] Verify Frigate still works after GPU sharing change - [ ] After merge: `argocd app set ollama --revision main && argocd app sync ollama` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/277	2026-03-02 20:39:51 -08:00
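Declarative model management through `models.txt` and a sidecar is essentially a reconcile step: compare the declared model list with what the server already has. The sketch below computes that diff. It assumes one model per line with optional `#` comments, because the real file format and sync script are not shown in this log. `parse_models` and `plan_sync` are hypothetical names, and whether the real script also removes undeclared models is an assumption. Each entry in `to_pull` would then be passed to `ollama pull`.

```python
def parse_models(text: str) -> list[str]:
    """Parse a models.txt-style list (assumed format): one model per line,
    blank lines and '#' comments ignored, duplicates dropped in order."""
    models: list[str] = []
    for raw in text.splitlines():
        name = raw.split("#", 1)[0].strip()
        if name and name not in models:
            models.append(name)
    return models


def plan_sync(desired: list[str], installed: list[str]) -> tuple[list[str], list[str]]:
    """Return (to_pull, to_remove) that would bring `installed` in line with `desired`."""
    installed_set, desired_set = set(installed), set(desired)
    to_pull = [m for m in desired if m not in installed_set]
    to_remove = sorted(installed_set - desired_set)
    return to_pull, to_remove


models_txt = """\
qwen2.5:14b
deepseek-r1:14b
phi4:14b
gemma3:12b
"""
# "mistral:7b" stands in for a hypothetical model left over from an earlier list.
to_pull, to_remove = plan_sync(parse_models(models_txt), ["qwen2.5:14b", "mistral:7b"])
print(to_pull)    # ['deepseek-r1:14b', 'phi4:14b', 'gemma3:12b']
print(to_remove)  # ['mistral:7b']
```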
Forgejo Actions	0f79c61c42	Update docs release to v1.12.1 - Built changelog from towncrier fragments [skip ci]	2026-03-02 18:17:07 -08:00
Forgejo Actions	847e47eaf3	Update docs release to v1.12.0 - Built changelog from towncrier fragments [skip ci]	2026-03-01 17:24:09 -08:00
Erich Blume	503775085d	Deploy authentik 2026.2.0 with migration ordering fix Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 16:32:10 -08:00
Erich Blume	90621e4155	Deploy authentik 2026.2.0 with entry_points fix. Update image tag to v2026.2.0-78027eb-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 16:04:29 -08:00
Erich Blume	e2c650b027	Deploy authentik 2026.2.0 with BASE_DIR fix. Update image tag to v2026.2.0-e49d966-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:55:50 -08:00
Erich Blume	c0e29476f3	Deploy authentik 2026.2.0 with TMPDIR fix. Update image tag to v2026.2.0-b7bfb0b-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:53:09 -08:00
Erich Blume	38da372f94	Deploy authentik 2026.2.0 with /tmp fix. Update image tag to v2026.2.0-2ac353b-nix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:51:17 -08:00
Erich Blume	098f3e517c	Deploy authentik 2026.2.0 (source-built) to ArgoCD. Update image tag to v2026.2.0-efa9806-nix, the first source-built authentik container from the build-authentik-from-source chain. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 15:44:35 -08:00
Erich Blume	02eb169403	Pin blumeops-pg to PostgreSQL 18.3. Replace floating :18 tag with pinned :18.3 (upstream out-of-cycle release fixing 18.2 regressions). Stamps service as reviewed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 16:25:32 -08:00
Erich Blume	776caa87f5	Sync Frigate zone coordinates from live API Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 07:52:09 -08:00
Forgejo Actions	fa223f8e3b	Update docs release to v1.11.5 - Built changelog from towncrier fragments [skip ci]	2026-02-26 07:56:02 -08:00
Erich Blume	be3cdad1cb	Add HA for CV and Docs: zero-downtime deploys (#273) ## Summary - Set `replicas: 2` with `maxUnavailable: 0` / `maxSurge: 1` on CV and Docs deployments so rolling updates never drop below 2 ready pods - Add PodDisruptionBudgets (`minAvailable: 1`) to protect against node drains and cluster maintenance - Add Fly.io cache purge step to `cv-deploy.yaml` workflow (docs already had this) so CV deploys don't serve stale cached content ## Deployment and Testing - [ ] `argocd app diff cv` / `argocd app diff docs` from branch - [ ] Deploy from branch: `argocd app set cv --revision feature/ha-cv-docs-zero-downtime && argocd app sync cv` - [ ] Verify 2 pods running: `kubectl get pods -n cv --context=minikube-indri` - [ ] Test rolling restart: `kubectl rollout restart deployment/cv -n cv --context=minikube-indri` - [ ] During rollout, confirm continuous availability via `curl -I https://cv.eblu.me` - [ ] After merge: reset ArgoCD to main, re-sync both apps Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/273	2026-02-26 07:53:21 -08:00
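A little arithmetic confirms the "never drop below 2 ready pods" claim. During a rolling update a Deployment keeps at least `replicas - maxUnavailable` pods available and runs at most `replicas + maxSurge` pods. The sketch below uses a hypothetical helper, `rollout_bounds`, that also applies the Kubernetes rounding rule for percentage values: `maxSurge` rounds up and `maxUnavailable` rounds down.

```python
def rollout_bounds(replicas: int, max_surge, max_unavailable) -> tuple[int, int]:
    """Return (min_available, max_total_pods) during a Deployment rolling update.

    Integers are used as-is. Percentage strings such as "25%" are resolved
    against `replicas`, rounding up for maxSurge and down for maxUnavailable.
    """

    def resolve(value, round_up: bool) -> int:
        if isinstance(value, str) and value.endswith("%"):
            scaled = replicas * int(value[:-1])
            # Ceiling or floor division by 100 without floating point.
            return -(-scaled // 100) if round_up else scaled // 100
        return int(value)

    surge = resolve(max_surge, round_up=True)
    unavailable = resolve(max_unavailable, round_up=False)
    return replicas - unavailable, replicas + surge


print(rollout_bounds(2, 1, 0))  # (2, 3)
```

With this commit's values the result is `(2, 3)`. A third pod is started and must become ready before an old pod is terminated, so availability never dips below two pods during a rollout. The PodDisruptionBudget separately covers voluntary disruptions such as node drains.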
Erich Blume	fb83c5c577	Add explicit ExternalSecret defaults for SSA sync parity. The external-secrets webhook injects conversionStrategy, decodingStrategy, and metadataPolicy defaults on admission. Declaring them explicitly prevents ArgoCD SSA from flagging the resource as OutOfSync. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 07:02:54 -08:00
Erich Blume	db561c6b0e	Upgrade ArgoCD v3.2.6 → v3.3.2 with Server-Side Apply (#272) ## Summary - Upgrade ArgoCD from v3.2.6 to v3.3.2 - Enable `ServerSideApply=true` sync option (required by v3.3: the ApplicationSet CRD exceeds the client-side apply annotation limit) - Update service-versions.yaml with review for argocd and 1password-connect ## Breaking changes reviewed - Server-Side Apply required: Added to syncOptions ✅ - Source Hydrator git notes: Not used, N/A - Application path cleaning removed: Not used, N/A - Settings API field restriction: Authenticated access only, N/A ## Deployment and Testing - [ ] Sync the `apps` app first (picks up SSA syncOption change) - [ ] `argocd app set argocd --revision feature/argocd-v3.3.2` - [ ] `argocd app sync argocd` - [ ] Verify all argocd pods running with v3.3.2 images - [ ] Verify other apps still sync correctly - [ ] After merge: `argocd app set argocd --revision main && argocd app sync argocd` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/272	2026-02-26 06:51:50 -08:00
Erich Blume	95c8424e62	Add Transmission metrics exporter and Grafana dashboard (#271) ## Summary - Add `metalmatze/transmission-exporter` as a sidecar container in the torrent deployment, exposing Prometheus metrics on port 19091 - Add metrics port to the torrent service for Prometheus scraping - Add Prometheus scrape job targeting the transmission exporter - Create Grafana dashboard with: - Overview stats (download/upload speed, active/total torrents) - Transfer speed timeseries (download + upload over time) - Transfer volume stats (total downloaded/uploaded in selected range) - Per-torrent download and upload rate timeseries - Per-torrent details table (ratio, uploaded, percent done) ## Deployment and Testing - [ ] Sync ArgoCD `torrent` app from branch; verify exporter sidecar starts - [ ] Verify exporter metrics: `kubectl exec` into pod, `curl localhost:19091/metrics` - [ ] Verify Prometheus scrapes it: check targets at prometheus.ops.eblu.me - [ ] Open Grafana, find "Transmission" dashboard, verify panels populate - [ ] Sync ArgoCD `prometheus` app from branch - [ ] Sync ArgoCD `grafana-config` app from branch Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/271	2026-02-25 22:23:33 -08:00
Erich Blume	03d71544ec	Add multi-cluster observability with ringtail metrics and dashboards (#270) ## Summary - Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs - Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests) - Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki - Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with: - Kubernetes Clusters dashboard: multi-cluster with `cluster` and `namespace` template variables - Ringtail (k3s) dashboard: dedicated ringtail view with GPU usage panels ## Deployment and Testing 1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`) 2. Sync `prometheus` → verify `cluster` label on scraped metrics 3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs 4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs 5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail 6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}` 7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values 8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods ## Notes - Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store); tighten later - DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve - The old services dashboard (blackbox probes) is removed; those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270	2026-02-25 22:01:00 -08:00
Erich Blume	2243f2e0a1	Filter driveway zone to person/dog/cat only in Frigate. A parked car was being re-detected every few minutes at night due to IR illumination noise triggering motion detection. Restrict the driveway zone to [person, dog, cat] so cars and birds no longer create events there. Cars still alert via the driveway_entrance zone. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 20:45:07 -08:00
Erich Blume	de54b4e33d	Port CloudNative-PG off Helm to direct release manifest (#268) ## Summary - Point ArgoCD app directly at forge-mirrored upstream repo (`mirrors/cloudnative-pg`) instead of the Helm charts repo - Use `directory.include` to select the specific release manifest (`cnpg-1.27.1.yaml`) from the `releases/` directory - No vendored files, no Helm; upgrades are a two-line change (`targetRevision` + `directory.include`) - Delete unused `values.yaml` (was empty, all Helm defaults) ## Deployment and Testing - [ ] Register mirror repo in ArgoCD: `argocd repo add ssh://forgejo@forge.ops.eblu.me:2222/mirrors/cloudnative-pg.git --ssh-private-key-path <key>` - [ ] `argocd app set cloudnative-pg --revision feature/cnpg-direct-source && argocd app sync cloudnative-pg` - [ ] Verify operator pod running: `kubectl get pods -n cnpg-system --context=minikube-indri` - [ ] Verify CRDs exist: `kubectl get crd --context=minikube-indri | grep cnpg` - [ ] Verify existing clusters healthy: `kubectl get clusters -A --context=minikube-indri` - [ ] After merge: `argocd app set cloudnative-pg --revision main && argocd app sync cloudnative-pg` ## Notes - The forge mirror was created via `mise run mirror-create` from `https://github.com/cloudnative-pg/cloudnative-pg.git` - ArgoCD may need the mirror repo added to its known repositories if the credential template doesn't already match `mirrors/*` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/268	2026-02-25 17:37:53 -08:00
Erich Blume	285ad4141f	Fix Frigate detection events rate metric name in Grafana dashboard. The panel queried frigate_camera_events, but the actual metric exposed by Frigate is frigate_camera_events_total with a "camera" label (not "camera_name"). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-25 16:51:57 -08:00
Forgejo Actions	4736c7e9bd	Update docs release to v1.11.4 - Built changelog from towncrier fragments [skip ci]	2026-02-25 07:04:23 -08:00
Erich Blume	5f9bc20345	Fix mirror org refs in ArgoCD apps and widen credential template (#266) ## Summary - Widen `repo-creds-forge` URL prefix from `/eblume/` to host-wide `/` so it matches repos in all forge orgs (fixes `mirrors/` repos not getting SSH credentials) - Update 8 ArgoCD app definitions from `eblume/<mirror>` → `mirrors/<mirror>` (immich-charts, cloudnative-pg-charts, external-secrets, connect-helm-charts) - Fix stale alloy clone comment in Ansible defaults - Bump immich v2.5.2 → v2.5.6 (bug-fix patches only) - Update ArgoCD README bootstrap command and credential docs ## Context Mirrors were migrated from `forge.ops.eblu.me/eblume/` to `forge.ops.eblu.me/mirrors/` in commit `cd57814`. Container Dockerfiles and image tags were updated, but ArgoCD app definitions and the repo credential template were missed, causing `ComparisonError` on apps that source Helm charts from mirrored repos. ## Deployment 1. Sync the ArgoCD `argocd` app first (picks up the widened credential template) 2. Sync the `apps` app (picks up new repo URLs for all 8 apps) 3. Verify immich resolves its ComparisonError: `argocd app get immich` 4. Sync immich to deploy v2.5.6: `argocd app sync immich` 5. Spot-check: `argocd app get external-secrets`, `argocd app get cloudnative-pg`, `argocd app get 1password-connect` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/266	2026-02-25 06:55:53 -08:00
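Argo CD applies a credential template to any repository whose URL starts with the template's URL. When several templates match, the longest prefix wins. That is why an `/eblume/` prefix never covered `mirrors/` repos. The sketch below models the matching rule. The exact template URLs are assumptions, modeled on the SSH repo URL used in the CloudNative-PG commit in this log.

```python
from typing import List, Optional


def match_credential_template(repo_url: str, template_urls: List[str]) -> Optional[str]:
    """Return the longest template URL that is a prefix of repo_url, or None."""
    matches = [t for t in template_urls if repo_url.startswith(t)]
    return max(matches, key=len) if matches else None


mirror = "ssh://forgejo@forge.ops.eblu.me:2222/mirrors/cloudnative-pg.git"
old = ["ssh://forgejo@forge.ops.eblu.me:2222/eblume/"]  # assumed old prefix
new = ["ssh://forgejo@forge.ops.eblu.me:2222/"]         # assumed host-wide prefix
print(match_credential_template(mirror, old))  # None: mirrors/ repos get no credentials
print(match_credential_template(mirror, new))  # the host-wide template matches
```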
Erich Blume	4f8f2985c1	Update prometheus and teslamate image tags after mirror migration Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 21:18:15 -08:00
Erich Blume	e0f9ebebdf	Update homepage, navidrome, ntfy, miniflux image tags after mirror migration. Prometheus and teslamate builds still in progress; will update in a follow-up commit once their `33b7f0f` tags land. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 21:06:08 -08:00
Erich Blume	61ee6a4d38	Fix Grafana ConfigMap labels lost in configMapGenerator migration. The hand-written configmap.yaml had app.kubernetes.io/name and app.kubernetes.io/instance labels; configMapGenerator dropped them. Add options.labels to both generator entries to restore parity. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-24 14:46:20 -08:00
Erich Blume	9b44a8ec51	Add kustomize images: and configMapGenerator: across services (#264) ## Summary - Move hardcoded image tags to kustomization.yaml `images:` transformer across 22 services; image names in manifests become version-agnostic templates, with tags centralized in one place per service - Replace hand-written ConfigMap manifests with `configMapGenerator:` in 12 services; config data extracted to standalone files, generated ConfigMaps include content hashes that trigger automatic pod rollouts on changes - Create new `kustomization.yaml` for forgejo-runner and nvidia-device-plugin (switches ArgoCD from directory mode to kustomize mode, rendered output identical) ### Services modified Images only (8): cv, devpi, docs, kube-state-metrics, miniflux, navidrome, teslamate, torrent Images + configMapGenerator (10): alloy-k8s, forgejo-runner, frigate, grafana, homepage, kiwix, loki, mosquitto, ntfy, prometheus Images only, no configMapGenerator (4): authentik (skip blueprints: special YAML tags), tailscale-operator-base (Deployment only, CRD image fields left as-is) Skipped entirely (6): argocd (remote upstream), databases (no image fields), external-secrets, grafana-config (cross-kustomization dashboards), immich (Helm-managed), 1password-connect/cloudnative-pg (no kustomization.yaml) ### What changes at deploy time - images: no functional diff; `kustomize build` produces identical output with tags - configMapGenerator: ConfigMap names gain hash suffixes (e.g., `prometheus-config` → `prometheus-config-6f42fhctcb`) and all Deployment/StatefulSet/DaemonSet references are updated automatically. Pods will restart once per service on first sync due to the name change ## Test plan - [x] `kubectl kustomize` builds all 30 service directories successfully - [x] Image tags verified in rendered output for all modified services - [x] ConfigMap hash suffixes verified in rendered output - [x] ConfigMap references in Deployments/StatefulSets confirmed to use hashed names - [x] All pre-commit hooks pass (yamllint, shellcheck, prettier, etc.) - [ ] `argocd app diff` each service to confirm only expected ConfigMap name changes - [ ] Deploy from branch starting with a low-risk service (e.g., mosquitto) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/264	2026-02-24 14:25:19 -08:00
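`configMapGenerator` triggers rollouts through naming. The generated name embeds a hash of the ConfigMap contents, so editing the data produces a new name. Kustomize then rewrites the reference in the workload's pod template, and the changed pod template starts a rollout. The sketch below illustrates the idea with SHA-256 over sorted JSON. It is a hypothetical illustration, not kustomize's actual hashing or suffix encoding, so its suffixes will not match real ones like `6f42fhctcb`.

```python
import hashlib
import json


def hashed_configmap_name(base: str, data: dict) -> str:
    """Append a short content hash to a ConfigMap name (illustrative scheme only)."""
    # Deterministic serialization: identical data always yields the same name.
    encoded = json.dumps(data, sort_keys=True).encode("utf-8")
    return f"{base}-{hashlib.sha256(encoded).hexdigest()[:10]}"


before = hashed_configmap_name("prometheus-config", {"prometheus.yml": "global:\n  scrape_interval: 15s\n"})
after = hashed_configmap_name("prometheus-config", {"prometheus.yml": "global:\n  scrape_interval: 30s\n"})
# Edited config -> new name -> new pod template -> rollout.
print(before != after)
```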


252 commits