Commit graph

235 commits

Author SHA1 Message Date
82884436df Route runner polling through internal forge.ops.eblu.me
The k8s and ringtail runners were hitting forge.eblu.me (fly.io proxy)
for every FetchTask poll (~every 2s), round-tripping through the public
internet unnecessarily. Use forge.ops.eblu.me (Caddy on indri, tailnet)
for infrastructure workloads.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:33:40 -08:00
7b68be2e80 Add fly.io proxy observability and app logs to Forgejo dashboard
Rename "Forgejo Repository Health" to "Forgejo" and add proxy metrics
(request rate, error rate, RPS, latency, bandwidth), proxy access logs,
and Forgejo application logs from Loki.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:24:53 -08:00
86a0dee000 Remove ollama LAN NodePort service
The sanctioned ingress is ollama.ops.eblu.me via tailnet.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:00:05 -08:00
3af346f1cd Move ollama LAN NodePort to port 80
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 09:37:50 -08:00
a87c997ee1 Expose Forgejo publicly at forge.eblu.me (#278)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m28s
## Summary

Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service.

- **Forgejo hardening:** Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO)
- **Tailscale Ingress:** ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint
- **Fly.io proxy:** nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit
- **Authentik:** OAuth callback updated to forge.eblu.me
- **DNS/TLS:** CNAME record in Pulumi, cert in fly-setup
- **Rename:** ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is)

## Deployment Order

1. `mise run provision-indri -- --tags forgejo` (config changes)
2. Verify forge.ops.eblu.me still works
3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator`
4. Verify `curl https://forge.tail8d86e.ts.net`
5. `cd fly && fly deploy`
6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/`
7. `fly certs add forge.eblu.me -a blumeops-proxy`
8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik`
9. `mise run dns-preview && mise run dns-up`
10. Full verification (see below)
11. Rehearse `mise run fly-shutoff`
12. After merge: reset ArgoCD revisions to main, re-sync

## Verification Checklist

- [ ] forge.eblu.me loads, shows public repos
- [ ] forge.ops.eblu.me still works from tailnet
- [ ] SSH clone via forge.ops.eblu.me:2222 works
- [ ] HTTPS clone via forge.eblu.me works
- [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH
- [ ] /swagger returns 403
- [ ] Rapid login attempts trigger 429 rate limit
- [ ] fail2ban bans after 5 failed logins in 10 minutes
- [ ] ArgoCD can still sync (SSH unaffected)
- [ ] `mise run fly-shutoff` stops all public traffic
- [ ] `mise run services-check` passes

Reviewed-on: #278
2026-03-03 08:40:41 -08:00
a32c99a252 Limit ollama to one loaded model and one parallel request
Prevents OOM when switching between models — only one 14B model
fits in 16GB VRAM at a time with KV cache for context.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 21:23:12 -08:00
203e3cd567 Add NodePort service for ollama LAN access
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-02 20:57:18 -08:00
31d925814f Deploy Ollama LLM server on ringtail (#277)
## Summary
- Deploy Ollama as a new ArgoCD-managed service on ringtail's k3s cluster with GPU acceleration
- Declarative model management via `models.txt` + sidecar sync script (mirrors kiwix torrent pattern)
- Initial models: `qwen2.5:14b`, `deepseek-r1:14b`, `phi4:14b`, `gemma3:12b`
- hostPath PV on `/mnt/storage1/ollama` for fast local model storage (200Gi)
- Tailscale ingress at `ollama.ops.eblu.me` for API access from tailnet
- Enable GPU time-slicing (`replicas: 2`) on nvidia-device-plugin so Frigate and Ollama share the RTX 4080

## Deployment and Testing
- [ ] Deploy nvidia-device-plugin changes first: `argocd app sync nvidia-device-plugin`
- [ ] Verify GPU time-slicing: `kubectl describe node ringtail --context=k3s-ringtail` shows `nvidia.com/gpu: 2`
- [ ] Sync `apps` app with `--revision feature/ollama-ringtail`
- [ ] Set ollama app to branch: `argocd app set ollama --revision feature/ollama-ringtail && argocd app sync ollama`
- [ ] Verify model-sync sidecar pulls models: `kubectl logs -n ollama deploy/ollama -c model-sync --context=k3s-ringtail`
- [ ] Test API: `curl https://ollama.ops.eblu.me/api/tags`
- [ ] Test inference: `curl https://ollama.ops.eblu.me/api/generate -d '{"model":"qwen2.5:14b","prompt":"Hello"}'`
- [ ] Verify Frigate still works after GPU sharing change
- [ ] After merge: `argocd app set ollama --revision main && argocd app sync ollama`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/277
2026-03-02 20:39:51 -08:00
Forgejo Actions
0f79c61c42 Update docs release to v1.12.1
- Built changelog from towncrier fragments

[skip ci]
2026-03-02 18:17:07 -08:00
Forgejo Actions
847e47eaf3 Update docs release to v1.12.0
- Built changelog from towncrier fragments

[skip ci]
2026-03-01 17:24:09 -08:00
503775085d Deploy authentik 2026.2.0 with migration ordering fix
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 16:32:10 -08:00
90621e4155 Deploy authentik 2026.2.0 with entry_points fix
Update image tag to v2026.2.0-78027eb-nix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 16:04:29 -08:00
e2c650b027 Deploy authentik 2026.2.0 with BASE_DIR fix
Update image tag to v2026.2.0-e49d966-nix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:55:50 -08:00
c0e29476f3 Deploy authentik 2026.2.0 with TMPDIR fix
Update image tag to v2026.2.0-b7bfb0b-nix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:53:09 -08:00
38da372f94 Deploy authentik 2026.2.0 with /tmp fix
Update image tag to v2026.2.0-2ac353b-nix.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:51:17 -08:00
098f3e517c Deploy authentik 2026.2.0 (source-built) to ArgoCD
Update image tag to v2026.2.0-efa9806-nix — the first source-built
authentik container from the build-authentik-from-source chain.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-01 15:44:35 -08:00
02eb169403 Pin blumeops-pg to PostgreSQL 18.3
Replace floating :18 tag with pinned :18.3 (upstream out-of-cycle
release fixing 18.2 regressions). Stamps service as reviewed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 16:25:32 -08:00
776caa87f5 Sync Frigate zone coordinates from live API
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-27 07:52:09 -08:00
Forgejo Actions
fa223f8e3b Update docs release to v1.11.5
- Built changelog from towncrier fragments

[skip ci]
2026-02-26 07:56:02 -08:00
be3cdad1cb Add HA for CV and Docs: zero-downtime deploys (#273)
## Summary
- Set `replicas: 2` with `maxUnavailable: 0` / `maxSurge: 1` on CV and Docs deployments so rolling updates never drop below 2 ready pods
- Add PodDisruptionBudgets (`minAvailable: 1`) to protect against node drains and cluster maintenance
- Add Fly.io cache purge step to `cv-deploy.yaml` workflow (docs already had this) so CV deploys don't serve stale cached content

## Deployment and Testing
- [ ] `argocd app diff cv` / `argocd app diff docs` from branch
- [ ] Deploy from branch: `argocd app set cv --revision feature/ha-cv-docs-zero-downtime && argocd app sync cv`
- [ ] Verify 2 pods running: `kubectl get pods -n cv --context=minikube-indri`
- [ ] Test rolling restart: `kubectl rollout restart deployment/cv -n cv --context=minikube-indri`
- [ ] During rollout, confirm continuous availability via `curl -I https://cv.eblu.me`
- [ ] After merge: reset ArgoCD to main, re-sync both apps

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/273
2026-02-26 07:53:21 -08:00
fb83c5c577 Add explicit ExternalSecret defaults for SSA sync parity
The external-secrets webhook injects conversionStrategy, decodingStrategy,
and metadataPolicy defaults on admission. Declaring them explicitly prevents
ArgoCD SSA from flagging the resource as OutOfSync.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 07:02:54 -08:00
db561c6b0e Upgrade ArgoCD v3.2.6 → v3.3.2 with Server-Side Apply (#272)
## Summary
- Upgrade ArgoCD from v3.2.6 to v3.3.2
- Enable `ServerSideApply=true` sync option (required by v3.3 — ApplicationSet CRD exceeds client-side apply annotation limit)
- Update service-versions.yaml with review for argocd and 1password-connect

## Breaking changes reviewed
- **Server-Side Apply required**: Added to syncOptions 
- **Source Hydrator git notes**: Not used — N/A
- **Application path cleaning removed**: Not used — N/A
- **Settings API field restriction**: Authenticated access only — N/A

## Deployment and Testing
- [ ] Sync the `apps` app first (picks up SSA syncOption change)
- [ ] `argocd app set argocd --revision feature/argocd-v3.3.2`
- [ ] `argocd app sync argocd`
- [ ] Verify all argocd pods running with v3.3.2 images
- [ ] Verify other apps still sync correctly
- [ ] After merge: `argocd app set argocd --revision main && argocd app sync argocd`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/272
2026-02-26 06:51:50 -08:00
95c8424e62 Add Transmission metrics exporter and Grafana dashboard (#271)
## Summary
- Add `metalmatze/transmission-exporter` as a sidecar container in the torrent deployment, exposing Prometheus metrics on port 19091
- Add metrics port to the torrent service for Prometheus scraping
- Add Prometheus scrape job targeting the transmission exporter
- Create Grafana dashboard with:
  - Overview stats (download/upload speed, active/total torrents)
  - Transfer speed timeseries (download + upload over time)
  - Transfer volume stats (total downloaded/uploaded in selected range)
  - Per-torrent download and upload rate timeseries
  - Per-torrent details table (ratio, uploaded, percent done)

## Deployment and Testing
- [ ] Sync ArgoCD `torrent` app from branch — verify exporter sidecar starts
- [ ] Verify exporter metrics: `kubectl exec` into pod, `curl localhost:19091/metrics`
- [ ] Verify Prometheus scrapes it: check targets at prometheus.ops.eblu.me
- [ ] Open Grafana, find "Transmission" dashboard, verify panels populate
- [ ] Sync ArgoCD `prometheus` app from branch
- [ ] Sync ArgoCD `grafana-config` app from branch

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/271
2026-02-25 22:23:33 -08:00
03d71544ec Add multi-cluster observability with ringtail metrics and dashboards (#270)
## Summary
- Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs
- Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests)
- Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki
- Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with:
  - **Kubernetes Clusters** dashboard — multi-cluster with `cluster` and `namespace` template variables
  - **Ringtail (k3s)** dashboard — dedicated ringtail view with GPU usage panels

## Deployment and Testing
1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`)
2. Sync `prometheus` → verify `cluster` label on scraped metrics
3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs
4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs
5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail
6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}`
7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values
8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods

## Notes
- Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later
- DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve
- The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270
2026-02-25 22:01:00 -08:00
2243f2e0a1 Filter driveway zone to person/dog/cat only in Frigate
Parked car was being re-detected every few minutes at night due to IR
illumination noise triggering motion detection. Restrict the driveway
zone to [person, dog, cat] so cars and birds no longer create events
there. Cars still alert via the driveway_entrance zone.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 20:45:07 -08:00
de54b4e33d Port CloudNative-PG off Helm to direct release manifest (#268)
## Summary
- Point ArgoCD app directly at forge-mirrored upstream repo (`mirrors/cloudnative-pg`) instead of the Helm charts repo
- Use `directory.include` to select the specific release manifest (`cnpg-1.27.1.yaml`) from the `releases/` directory
- No vendored files, no Helm — upgrades are a two-line change (`targetRevision` + `directory.include`)
- Delete unused `values.yaml` (was empty, all Helm defaults)

## Deployment and Testing
- [ ] Register mirror repo in ArgoCD: `argocd repo add ssh://forgejo@forge.ops.eblu.me:2222/mirrors/cloudnative-pg.git --ssh-private-key-path <key>`
- [ ] `argocd app set cloudnative-pg --revision feature/cnpg-direct-source && argocd app sync cloudnative-pg`
- [ ] Verify operator pod running: `kubectl get pods -n cnpg-system --context=minikube-indri`
- [ ] Verify CRDs exist: `kubectl get crd --context=minikube-indri | grep cnpg`
- [ ] Verify existing clusters healthy: `kubectl get clusters -A --context=minikube-indri`
- [ ] After merge: `argocd app set cloudnative-pg --revision main && argocd app sync cloudnative-pg`

## Notes
- The forge mirror was created via `mise run mirror-create` from `https://github.com/cloudnative-pg/cloudnative-pg.git`
- ArgoCD may need the mirror repo added to its known repositories if the credential template doesn't already match `mirrors/*`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/268
2026-02-25 17:37:53 -08:00
285ad4141f Fix Frigate detection events rate metric name in Grafana dashboard
The panel queried frigate_camera_events but the actual metric exposed
by Frigate is frigate_camera_events_total with a "camera" label
(not "camera_name").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 16:51:57 -08:00
Forgejo Actions
4736c7e9bd Update docs release to v1.11.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-25 07:04:23 -08:00
5f9bc20345 Fix mirror org refs in ArgoCD apps and widen credential template (#266)
## Summary

- Widen `repo-creds-forge` URL prefix from `/eblume/` to host-wide `/` so it matches repos in all forge orgs (fixes `mirrors/` repos not getting SSH credentials)
- Update 8 ArgoCD app definitions from `eblume/<mirror>` → `mirrors/<mirror>` (immich-charts, cloudnative-pg-charts, external-secrets, connect-helm-charts)
- Fix stale alloy clone comment in Ansible defaults
- Bump immich v2.5.2 → v2.5.6 (bug-fix patches only)
- Update ArgoCD README bootstrap command and credential docs

## Context

Mirrors were migrated from `forge.ops.eblu.me/eblume/` to `forge.ops.eblu.me/mirrors/` in commit `cd57814`. Container Dockerfiles and image tags were updated, but ArgoCD app definitions and the repo credential template were missed, causing `ComparisonError` on apps that source Helm charts from mirrored repos.

## Deployment

1. Sync the ArgoCD `argocd` app first (picks up the widened credential template)
2. Sync the `apps` app (picks up new repo URLs for all 8 apps)
3. Verify immich resolves its ComparisonError: `argocd app get immich`
4. Sync immich to deploy v2.5.6: `argocd app sync immich`
5. Spot-check: `argocd app get external-secrets`, `argocd app get cloudnative-pg`, `argocd app get 1password-connect`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/266
2026-02-25 06:55:53 -08:00
4f8f2985c1 Update prometheus and teslamate image tags after mirror migration
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 21:18:15 -08:00
e0f9ebebdf Update homepage, navidrome, ntfy, miniflux image tags after mirror migration
Prometheus and teslamate builds still in progress — will update in a
follow-up commit once their 33b7f0f tags land.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 21:06:08 -08:00
61ee6a4d38 Fix Grafana ConfigMap labels lost in configMapGenerator migration
The hand-written configmap.yaml had app.kubernetes.io/name and
app.kubernetes.io/instance labels; configMapGenerator dropped them.
Add options.labels to both generator entries to restore parity.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 14:46:20 -08:00
9b44a8ec51 Add kustomize images: and configMapGenerator: across services (#264)
## Summary

- Move hardcoded image tags to kustomization.yaml `images:` transformer across **22 services** — image names in manifests become version-agnostic templates, with tags centralized in one place per service
- Replace hand-written ConfigMap manifests with `configMapGenerator:` in **12 services** — config data extracted to standalone files, generated ConfigMaps include content hashes that trigger automatic pod rollouts on changes
- Create new `kustomization.yaml` for **forgejo-runner** and **nvidia-device-plugin** (switches ArgoCD from directory mode to kustomize mode, rendered output identical)

### Services modified

**Images only (8):** cv, devpi, docs, kube-state-metrics, miniflux, navidrome, teslamate, torrent

**Images + configMapGenerator (10):** alloy-k8s, forgejo-runner, frigate, grafana, homepage, kiwix, loki, mosquitto, ntfy, prometheus

**Images only, no configMapGenerator (4):** authentik (skip blueprints — special YAML tags), tailscale-operator-base (Deployment only, CRD image fields left as-is)

**Skipped entirely (6):** argocd (remote upstream), databases (no image fields), external-secrets, grafana-config (cross-kustomization dashboards), immich (Helm-managed), 1password-connect/cloudnative-pg (no kustomization.yaml)

### What changes at deploy time

- **images:** — no functional diff, `kustomize build` produces identical output with tags
- **configMapGenerator:** — ConfigMap names gain hash suffixes (e.g., `prometheus-config` → `prometheus-config-6f42fhctcb`) and all Deployment/StatefulSet/DaemonSet references are updated automatically. Pods will restart once per service on first sync due to the name change

## Test plan

- [x] `kubectl kustomize` builds all 30 service directories successfully
- [x] Image tags verified in rendered output for all modified services
- [x] ConfigMap hash suffixes verified in rendered output
- [x] ConfigMap references in Deployments/StatefulSets confirmed to use hashed names
- [x] All pre-commit hooks pass (yamllint, shellcheck, prettier, etc.)
- [ ] `argocd app diff` each service to confirm only expected ConfigMap name changes
- [ ] Deploy from branch starting with a low-risk service (e.g., mosquitto)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/264
2026-02-24 14:25:19 -08:00
86aeb60ec9 Fix TeslaMate dashboards: add database to PostgreSQL jsonData
Grafana 12.x's grafana-postgresql-datasource plugin requires the
database name in jsonData, not just the top-level database field.
Without it, the frontend blocks all queries with "no default database
configured", causing all TeslaMate panels to show "No Data."

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 13:49:07 -08:00
495c3e8496 Fix Grafana OAuth role mapping from Authentik groups
The INI parser was stripping outer single quotes from
role_attribute_path = 'Admin', causing Grafana to evaluate 'Admin'
as a JMESPath field identifier instead of a string literal. This
resulted in all OAuth users getting the default Viewer role.

Replaced with a proper group-based expression that checks for the
'admins' Authentik group and maps to Admin/Viewer accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 13:41:08 -08:00
4acd2e58d4 Update prometheus and grafana to main-SHA container tags
Prometheus: v3.9.1-74029e1 [branch] -> v3.9.1-2ba5d8a [main]
Grafana: v12.3.3-09ac36b [branch] -> v12.3.3-d05d2fb [main]

These images were built during PR development and referenced branch
commits that won't survive branch cleanup. The [main] tags are
identical rebuilds from the squash-merge commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 09:58:09 -08:00
2ba5d8a8aa Port Prometheus to local container build (#262)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (prometheus) (push) Successful in 2s
Build Container / build (prometheus) (push) Successful in 7s
## Summary
- Add three-stage Dockerfile for Prometheus v3.9.1 (Node UI → Go binaries → Alpine runtime)
- Produces `prometheus` and `promtool` binaries with embedded web UI assets
- Follows navidrome/ntfy pattern for supply chain control via Zot registry

## Deployment and Testing
- [ ] `dagger call build --src=. --container-name=prometheus` succeeds
- [ ] Container reports correct version via `prometheus --version`
- [ ] `promtool --version` works
- [ ] Update statefulset image reference after successful build
- [ ] Deploy from branch: `argocd app set prometheus --revision <branch> && argocd app sync prometheus`
- [ ] Health probes pass (`/-/healthy`, `/-/ready`)
- [ ] Web UI loads, scrape targets work, remote write functions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/262
2026-02-24 09:15:57 -08:00
Forgejo Actions
2f78d180e8 Update docs release to v1.11.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-23 21:04:33 -08:00
d05d2fbaff C2: Upgrade Grafana to 12.x with Nix container and Kustomize (#260)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 1s
Build Container (Nix) / build (grafana) (push) Successful in 2s
Build Container / build (grafana) (push) Successful in 7s
## Summary

Mikado chain to upgrade Grafana from 11.4.0 (Helm chart) to 12.x with:
- Home-built Nix container image (`forge.ops.eblu.me/eblume/grafana`)
- Kustomize manifests replacing the Helm chart
- Single-source ArgoCD app

## Chain

Goal: `upgrade-grafana`
Leaves: `build-grafana-container`, `kustomize-grafana-deployment`

Track with: `mise run docs-mikado upgrade-grafana`

## Test plan
- [ ] Container builds successfully via Nix
- [ ] Container pushed to registry
- [ ] Kustomize manifests produce equivalent resources to current Helm
- [ ] Pod runs, UI loads, OIDC works, datasources healthy
- [ ] `mise run services-check` passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/260
2026-02-23 18:07:18 -08:00
9b419abf24 Update RUNNER_LABELS to use runner-job-image:v0.19.11-4c5e0f0
Now that the image is built under the new name, point the forgejo
runner at it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 17:47:14 -08:00
75fde54355 Fix Grafana TeslaMate dashboard folder provisioning (#253)
## Summary
- `foldersFromFilesStructure` was `false` in Grafana's sidecar provider config, causing Grafana to ignore the subdirectory structure the sidecar creates from `grafana_folder` annotations
- All 18 TeslaMate dashboards were appearing in the root "Dashboards" folder despite having `grafana_folder: "TeslaMate"` annotations on their ConfigMaps
- Flipping to `true` makes Grafana replicate the sidecar's directory structure as UI folders

## Deployment and Testing
- [ ] Sync `grafana` app: `argocd app sync grafana`
- [ ] Verify TeslaMate dashboards appear under a "TeslaMate" folder in Grafana's dashboard list
- [ ] Verify other dashboards remain in the root "Dashboards" folder

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/253
2026-02-22 18:38:51 -08:00
6871bf32a8 Remove unused gpu panel in frigate dashboard 2026-02-22 18:26:47 -08:00
2c6c6a244a Fix Frigate Prometheus metrics & rebuild Grafana dashboard (#252)
## Summary

- **Prometheus scrape target:** Changed from `frigate.frigate.svc.cluster.local:5000` (broken after ringtail migration) to `nvr.ops.eblu.me` via HTTPS through Caddy on indri
- **Grafana dashboard:** Rebuilt for Frigate 0.17 metrics — 12 panels total:
  - Row 1 (stats): Uptime, Inference Speed, Camera FPS, Detection FPS, GPU Usage, GPU Temp
  - Row 2 (timeseries): CPU Usage, Memory Usage
  - Row 3 (timeseries): Camera FPS + Skipped FPS, GPU Usage + Memory over time
  - Row 4 (timeseries): Storage Usage, Detection Events (rate by camera/label)

## Deployment and Testing

1. Sync prometheus app on branch:
   ```
   argocd app set prometheus --revision fix/frigate-metrics-dashboard && argocd app sync prometheus
   ```
2. Check `prometheus.ops.eblu.me/targets` — frigate job should show UP
3. Sync grafana-config:
   ```
   argocd app sync grafana-config
   ```
4. Check `grafana.ops.eblu.me` — Frigate NVR dashboard should show live data
5. After merge: reset both apps to `--revision main` and sync

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/252
2026-02-22 18:14:17 -08:00
Forgejo Actions
dda7d719b3 Update docs release to v1.11.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-22 17:52:05 -08:00
e655f4556e Upgrade k8s forgejo-runner from v6.3.1 to v12.7.0 (#251)
## Summary

Completes the `upgrade-k8s-runner` mikado chain. Both prerequisites (workflow validation in Dagger, config review against v12 defaults) were resolved in #250.

- Bump runner image `code.forgejo.org/forgejo/runner:6.3.1` → `12.7.0`
- Update `service-versions.yaml` to track new version
- Mark goal card complete (remove `status: active`)

## Deployment and Testing

After merge:
1. `argocd app sync forgejo-runner`
2. Verify runner registers in Forgejo admin → runners
3. Trigger a test workflow (e.g. `branch-cleanup.yaml` manual dispatch)

Rollback: revert image tag to `6.3.1`, push, sync.
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/251
2026-02-22 17:43:39 -08:00
0f6a1898f0 Prepare forgejo-runner v12 upgrade (leaf nodes) (#250)
## Summary
- Review runner config against v12.7.0 defaults — added `shutdown_timeout: 3h`, no breaking changes found
- Add `validate_workflows` Dagger function using `forgejo-runner validate --directory .` inside upstream container
- All 6 workflows pass v12.7.0 schema validation
- Wire `mise run validate-workflows` task and pre-commit hook on `.forgejo/workflows/` changes
- Mark both leaf Mikado cards (`review-runner-config-v12`, `validate-workflows-against-v12`) complete

## Mikado State
After merge, `upgrade-k8s-runner` goal card has no unmet dependencies — ready to execute the actual image bump in a follow-up PR.

## Test Plan
- [x] `dagger call validate-workflows --src=.` passes (all 6 workflows OK)
- [x] Pre-commit hooks pass
- [ ] Reviewer: confirm `shutdown_timeout: 3h` addition to ConfigMap looks reasonable

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/250
2026-02-22 17:38:32 -08:00
d51c180fe6 Switch Frigate detection model from YOLO-NAS-S to YOLOv9-c (#246)
## Summary
- Replace abandoned YOLO-NAS-S (320x320, `yolonas`) with YOLOv9-c (640x640, `yolo-generic`)
- YOLOv9-c benefits from CUDA Graphs in Frigate 0.17 on the RTX 4080
- Add `export_yolov9` Dagger pipeline and `frigate-export-model` mise task for reproducible model exports
- Model already deployed to `sifaka:/volume1/frigate/models/yolov9-c-640.onnx`

## Config changes
- `model_type: yolonas` → `yolo-generic`
- `input_dtype: int` → `float`
- `width/height: 320` → `640`
- `path:` → `yolov9-c-640.onnx`

## Deployment and Testing
- [ ] Merge and sync Frigate ArgoCD app: `argocd app sync frigate`
- [ ] Verify Frigate starts and detects objects at https://nvr.ops.eblu.me
- [ ] Confirm GPU inference via Frigate system metrics

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/246
2026-02-22 15:14:45 -08:00
2c081eed28 Add Forgejo repository health metrics and Grafana dashboard (#245)
## Summary
- New `forgejo_metrics` Ansible role that queries the Forgejo REST API every 60s and writes Prometheus textfile metrics (open PRs, issues, languages, releases, commits, Actions runs/duration/success)
- Grafana dashboard "Forgejo Repository Health" with 12 panels across 4 rows: overview stats, CI/CD health, repository info, and staleness tracking
- Deletes superseded `forgejo-actions-dashboard` plan doc (this implementation covers a broader scope)

## Deployment and Testing
- [ ] `mise run provision-indri -- --tags forgejo_metrics` to deploy the collector
- [ ] `ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/forgejo.prom'` to verify metrics
- [ ] `argocd app sync grafana-config` to deploy the dashboard
- [ ] Check Grafana dashboard "Forgejo Repository Health" loads with data
- [ ] `mise run services-check` passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/245
2026-02-22 11:16:03 -08:00
Forgejo Actions
c21cf54847 Update docs release to v1.11.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-22 10:21:19 -08:00
c897fc8e1f Use Zot registry icon on homepage dashboard
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 09:19:46 -08:00