Commit graph

290 commits

Author SHA1 Message Date
b0023fef92 Switch Mealie OIDC to confidential client
Mealie requires OIDC_CLIENT_SECRET even though its docs say "public
client with PKCE". The token exchange happens server-side in Mealie's
Python backend, so the secret never reaches the browser.

- Generate client secret, store in 1Password
- Add to Authentik external-secret and worker env
- Switch blueprint from public to confidential
- Add ExternalSecret for mealie namespace
- Update docs to reflect confidential client

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 21:50:34 -07:00
ac83bd14e3 Add borgmatic backup for Mealie SQLite, set image tag
- Add before_backup hook to borgmatic: kubectl exec + python3 sqlite3
  .backup to safely dump the database, then kubectl cp to host
- Include k8s-dumps directory in borgmatic source_directories
- Generic pattern: borgmatic_k8s_sqlite_dumps list in defaults
- Fix PVC storageClassName: standard (not local-path) on minikube
- Set container image tag to v3.12.0-5c5fd18 from CI build

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 21:36:00 -07:00
30a114462c Allow all Authentik users to access Mealie
Remove admins-only policy binding from Mealie app. Any authenticated
Authentik user can log in (account auto-created). Mealie's
OIDC_ADMIN_GROUP=admins handles admin privilege mapping internally.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 21:19:26 -07:00
5c5fd18cac Add Authentik OIDC integration for Mealie
Configure Mealie as a public PKCE client in Authentik. Mealie's OIDC
flow runs client-side (Vue.js SPA) so it uses PKCE instead of a
client_secret. No 1Password secret or ExternalSecret needed.

- Add mealie.yaml blueprint to Authentik configmap (public client, admins group)
- Add OIDC env vars to Mealie deployment
- Update service docs

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 21:15:21 -07:00
bd2b02f51c Add Mealie recipe manager service
Deploy Mealie on minikube-indri for meal planning and prep automation.
Built from source via forge mirror (mirrors/mealie) with multi-stage
Dockerfile: Node.js frontend + Python/uv backend. Includes ArgoCD app,
k8s manifests, Caddy proxy entry, and service documentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 21:07:25 -07:00
b54d87e071 Fix shell syntax error in unpoller dashboard initcontainer
Comments can't appear inside a for-in list in sh.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 15:59:03 -07:00
b0846ab5fa Update unpoller container tag to main build
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-16 15:54:03 -07:00
4dc3e5cae2 Add UnPoller for UniFi network metrics (#298)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (unpoller) (push) Successful in 2s
Build Container / build (unpoller) (push) Successful in 7s
## Summary
- Deploy UnPoller as a k8s service on indri to export UniFi controller metrics to Prometheus
- Custom-built container from forge mirror (`containers/unpoller/Dockerfile`)
- Credentials pulled from 1Password via external-secrets
- Prometheus scrape job added, docs and service-versions updated

## Test plan
- [ ] Build container: `mise run container-release unpoller v2.34.0`
- [ ] Update kustomization tag with built image tag
- [ ] Deploy from branch: `argocd app set unpoller --revision feature/unpoller && argocd app sync unpoller`
- [ ] Verify pod connects to UX7 controller (check logs)
- [ ] Confirm `unpoller` target appears in Prometheus
- [ ] Query `unifi_` metrics in Grafana

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #298
2026-03-16 15:52:45 -07:00
4ca2e39901 Externalize TeslaMate dashboards to forge mirror (#296)
## Summary
- Replaces 18 TeslaMate dashboard ConfigMaps (713 KB / 22,080 lines) with a Grafana init container
- Init container fetches dashboard JSON directly from `mirrors/teslamate` on forge, pinned to `v3.0.0`
- Grafana's file provider picks them up from `/tmp/dashboards/TeslaMate/` via `foldersFromFilesStructure`
- Non-TeslaMate dashboards remain as ConfigMaps (unchanged)

## How it works
- New `init-teslamate-dashboards` init container uses busybox `wget` to fetch each JSON file from `https://forge.eblu.me/mirrors/teslamate/raw/tag/v3.0.0/grafana/dashboards/`
- Files land in `/tmp/dashboards/TeslaMate/`, same emptyDir volume the sidecar uses
- The sidecar continues to handle ConfigMap-based dashboards; the init container handles TeslaMate
- Version pin is in the init container args (TESLAMATE_VERSION)

## Deployment and Testing
- [ ] Sync `grafana` app from branch — verify init container runs and fetches dashboards
- [ ] Sync `grafana-config` app from branch — verify TeslaMate ConfigMaps are pruned
- [ ] Check Grafana UI: TeslaMate folder should still contain all 18 dashboards
- [ ] Verify non-TeslaMate dashboards are unaffected
- [ ] After merge: sync both apps from main

Reviewed-on: #296
2026-03-15 18:31:19 -07:00
2bea048dbf Externalize Tailscale operator to forge mirror (#295)
## Summary
- Mirrors `tailscale/tailscale` on forge (`mirrors/tailscale`)
- Replaces vendored `operator.yaml` (495 KB / 5,386 lines) with ArgoCD apps sourcing the upstream static manifest, pinned via `targetRevision: v1.94.2`
- Adds `tailscale-operator-base` app for indri and `tailscale-operator-base-ringtail` for ringtail
- Local kustomization retains only ProxyClass and DNSConfig custom resources
- Updates `[[tailscale-operator]]` doc to reflect new sourcing

## Deployment and Testing
- [ ] Register `mirrors/tailscale` repo in ArgoCD (it needs to know about the new repo)
- [ ] Sync `apps` app to pick up the new `tailscale-operator-base` app definitions
- [ ] Sync `tailscale-operator-base` — verify CRDs, RBAC, operator Deployment come up
- [ ] Sync `tailscale-operator` — verify ProxyClass, DNSConfig still apply cleanly
- [ ] Verify existing Tailscale Ingresses still work (ProxyGroup pods healthy)
- [ ] Repeat for ringtail cluster
- [ ] After merge: apps already point at tags, no revision reset needed

Reviewed-on: #295
2026-03-15 17:44:35 -07:00
Forgejo Actions
cb95db0bc9 Update docs release to v1.14.1
- Built changelog from towncrier fragments

[skip ci]
2026-03-14 10:11:06 -07:00
ab8ea6f301 Bump Grafana Alloy to v1.14.0 (#292)
## Summary
- Bump alloy-k8s, alloy-ringtail, and alloy-tracing-ringtail image tags from v1.13.1 to v1.14.0
- Mark indri alloy (ansible) as reviewed at v1.14.0 — source rebuild from forge mirror needed
- Add missing alloy-ringtail entry to service-versions.yaml
- Update alloy reference doc

## Breaking changes reviewed
- `loki.secretfilter` options removed — not used in our configs
- OTel Collector upgraded to v0.142.0 — Kafka receiver changes don't affect us
- Exporter queue default changes — our tracing pipeline (Beyla → batch → otlphttp) uses simple config, low risk

## Deployment and Testing
- [ ] Sync alloy-k8s: `argocd app set alloy-k8s --revision bump/alloy-v1.14.0 && argocd app sync alloy-k8s`
- [ ] Sync alloy-ringtail: `argocd app set alloy-ringtail --revision bump/alloy-v1.14.0 --server ringtail-argocd && argocd app sync alloy-ringtail`
- [ ] Sync alloy-tracing-ringtail similarly
- [ ] Verify metrics flowing in Grafana
- [ ] Verify traces flowing to Tempo (ringtail)
- [ ] Rebuild indri alloy from source (`v1.14.0` tag on forge mirror), SCP to indri, restart
- [ ] After merge: reset ArgoCD revisions to main, re-sync

Reviewed-on: #292
2026-03-13 16:25:27 -07:00
c26026f4e9 Bump Ollama memory to 24Gi and enable flash attention
The 27B Q4_K_M model needs ~7.3 GiB system RAM for CPU-offloaded layers
but only 6.8 GiB was available within the 22Gi cgroup. Bumping to 24Gi
and enabling flash attention (reduces KV cache memory) should provide
enough headroom.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 20:33:22 -07:00
6d4929a66c Add qwen3.5:27b to Ollama and bump memory limit to 22Gi
The 27B Q4_K_M model is ~17 GB, exceeding the 16 GB VRAM on the RTX 4080
by ~1 GB. Ollama will offload a few layers to CPU RAM, so the pod memory
limit needs headroom beyond the previous 16Gi.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:55:51 -07:00
40f1568088 Remove unused Mosquitto MQTT broker from ringtail
Mosquitto has been dormant since frigate-notify switched from MQTT to
webapi polling (529ba10). Tear down live infra (ArgoCD app, namespace)
and remove all manifests, service-versions entry, services-check, and
doc references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:37:31 -07:00
87d4de244b Review jobsync: add to services-check and homepage (#291)
## Summary
- Add jobsync pod check (ringtail k3s) and HTTP endpoint to `services-check`
- Add JobSync entry to homepage dashboard under new "Apps" group
- Mark jobsync as reviewed at v1.1.4 (current with upstream)
- Changelog fragment added

## Deployment and Testing
- [ ] Sync homepage app from branch: `argocd app set homepage --revision review/jobsync && argocd app sync homepage`
- [ ] Verify JobSync appears on go.ops.eblu.me dashboard
- [ ] Run `mise run services-check` to verify new checks pass
- [ ] After merge: `argocd app set homepage --revision main && argocd app sync homepage`

Reviewed-on: #291
2026-03-11 17:36:51 -07:00
Forgejo Actions
ebba3d6e5b Update docs release to v1.14.0
- Built changelog from towncrier fragments

[skip ci]
2026-03-09 12:03:30 -07:00
0ef5fe5792 Update docs container to v1.28.2-4f0476a (SPA disabled)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-09 12:00:54 -07:00
953640d2b7 Deploy docs with fixed robots.txt (v1.28.2-ede9a51)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 20:21:05 -07:00
770a7b2d6a Add JobSync reference card, observability docs, and RAPIDAPI_KEY plumbing (#289)
## Summary
- Add JobSync service reference card (`docs/reference/services/jobsync.md`) with architecture, secrets, observability, and JSearch API docs
- Add JobSync and Ollama to ringtail's workloads table (both were missing)
- Add JobSync to the reference index
- Wire `RAPIDAPI_KEY` through ExternalSecret and deployment env var for JSearch job search automation
- Document Loki log queries for observability (no metrics endpoint exists)
- Update deploy-jobsync how-to with new env var, observability section, and reference card link

## Deployment and Testing
- [ ] Sign up for RapidAPI JSearch API (free tier: 500 req/month)
- [ ] Add `rapidapi_key` field to "JobSync" 1Password item
- [ ] Merge PR
- [ ] `argocd app sync jobsync` to pick up new env var
- [ ] Verify job search works at https://jobsync.ops.eblu.me/dashboard/automations

Reviewed-on: #289
2026-03-08 15:06:52 -07:00
c9270c7645 Update jobsync image to v1.1.4-3a811fb-nix (main build)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-08 11:13:34 -07:00
3a811fb188 Deploy JobSync — job search tracker on ringtail k3s (#288)
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 2s
Build Container / build (jobsync) (push) Successful in 2s
Build Container (Nix) / build (jobsync) (push) Successful in 8s
## Summary

C2 Mikado chain to deploy [JobSync](https://github.com/Gsync/jobsync) — a self-hosted job application tracker — to ringtail's k3s cluster.

### Mikado Graph

```
deploy-jobsync (goal)
├── build-jobsync-container
│   └── mirror-jobsync
└── integrate-jobsync-ollama
```

### What is JobSync?

Next.js app with SQLite for tracking job applications. Features resume management, application pipeline tracking, and AI-powered resume review/job matching.

### Key Decisions

- **Ringtail k3s** (not minikube-indri) — colocates with Ollama for zero-latency AI
- **Nix container** via `buildLayeredImage` — no Dockerfile, mirrors upstream source on forge
- **Ollama for AI** — uses existing deployment, no API keys needed for AI features
- **No upstream fork** — vanilla JobSync, Anthropic AI deferred to future work if needed

### Current Status

Planning phase — cards committed, ready for review before implementation begins.

Reviewed-on: #288
2026-03-08 11:02:05 -07:00
14e931591b Fix 1Password Connect numeric log levels misclassified in Grafana (#287)
## Summary
- 1Password Connect uses non-standard numeric log levels (`1`=error, `2`=warn, `3`=info, `4`=debug, `5`=trace) per [1Password/connect#44](https://github.com/1Password/connect/issues/44)
- Alloy extracts the `level` JSON field as-is, so info-level health checks get `level="3"` in Loki
- Grafana expects string level labels — numeric values are unrecognized, causing misclassified log severity/coloring
- Adds a `stage.match` + `stage.template` in the Alloy pipeline scoped to `{namespace="1password"}` to normalize numeric levels to standard strings
- Other services are completely unaffected (scoped by namespace, not global)

## Deployment and Testing
- [ ] Sync alloy-k8s from branch: `argocd app set alloy-k8s --revision fix/onepassword-numeric-log-levels && argocd app sync alloy-k8s`
- [ ] Wait ~2 minutes for new logs to flow
- [ ] Verify level labels: `curl -sG "http://localhost:3100/loki/api/v1/label/level/values" --data-urlencode 'query={namespace="1password"}'` should show `"info"` and `"warn"` instead of `"3"` and `"2"`
- [ ] Check Grafana log panel for 1password namespace — logs should no longer appear as errors
- [ ] After merge: `argocd app set alloy-k8s --revision main && argocd app sync alloy-k8s`

Reviewed-on: #287
2026-03-07 13:57:04 -08:00
590cb1d25d Document required preview directory for Frigate NFS volume
Frigate 0.17 does not auto-create clips/previews/<camera>/, causing
review page previews to silently fail with 500 errors.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-07 08:46:23 -08:00
Forgejo Actions
2809ba6f50 Update docs release to v1.13.3
- Built changelog from towncrier fragments

[skip ci]
2026-03-06 20:49:01 -08:00
b793299d6d Upgrade Dagger engine from v0.20.0 to v0.20.1
Phase 2 of Dagger upgrade: bump engine version, update runner
deployment to v0.20.1-24f7512, and fix docs reference card version.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 20:41:02 -08:00
Forgejo Actions
e95fb9a555 Update docs release to v1.13.2
- Built changelog from towncrier fragments

[skip ci]
2026-03-06 19:03:24 -08:00
a7c21bd8a6 Update docs quartz container to v1.28.2-b64010b
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 18:58:40 -08:00
Forgejo Actions
8b0ff3d7a5 Update docs release to v1.13.1
- Built changelog from towncrier fragments

[skip ci]
2026-03-06 10:00:42 -08:00
1537412c09 Update docs quartz container to v1.28.2-6636576
Picks up spider-trap nginx guards from 6636576.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 09:52:31 -08:00
6e8d11c6bb Add :kustomized sentinel tag to manifest images, review devpi
Bare image references in manifests were ambiguous — unclear whether the
tag was intentionally omitted or managed by kustomize. Add :kustomized
sentinel to all 37 image refs overridden by kustomize images transformer.
Add sync notes for tailscale-operator proxyclass (CRD fields not processed
by kustomize). Mark devpi reviewed (6.19.1 is current).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 08:15:06 -08:00
Forgejo Actions
d98ef984ea Update docs release to v1.13.0
- Built changelog from towncrier fragments

[skip ci]
2026-03-05 11:11:38 -08:00
46cc3fbc2e Update forgejo-runner job image to v0.20.0-448689b
Built locally to break the chicken-and-egg: the old runner couldn't
build its own replacement because it needed Dagger 0.20.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 11:05:21 -08:00
c281fb5403 Add OpenTelemetry distributed tracing (Tempo + Beyla eBPF) (#286)
## Summary

Adds the third observability pillar — **distributed tracing** — alongside existing metrics (Prometheus) and logs (Loki).

- **Grafana Tempo 2.10.1** on minikube-indri for trace storage with 7d retention, OTLP receivers, and `metrics_generator` that remote-writes span-metrics (RED) to Prometheus
- **Beyla eBPF auto-instrumentation** via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes
- **Grafana integration** — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking
- **Prometheus** scrapes Tempo operational metrics

### Architecture

```
ringtail (k3s)                                indri (minikube)
┌──────────────────────┐                      ┌─────────────────────┐
│ Alloy+Beyla (eBPF)   │──OTLP HTTP────────→ │ Tempo               │
│  ↳ Frigate, ntfy,    │  via tailnet         │  ↳ trace storage    │
│    Ollama, Immich     │                      │  ↳ RED → Prometheus │
└──────────────────────┘                      │                     │
                                              │ Grafana             │
                                              │  ↳ Tempo datasource │
                                              └─────────────────────┘
```

### New files (12)
- `docs/reference/services/tempo.md` — reference doc
- `docs/changelog.d/feature-otel-tracing.feature.md`
- `argocd/apps/tempo.yaml` + `argocd/manifests/tempo/` (6 files)
- `argocd/apps/alloy-tracing-ringtail.yaml` + `argocd/manifests/alloy-tracing-ringtail/` (4 files)

### Modified files (6)
- `argocd/manifests/grafana/datasources.yaml` — Tempo datasource + Loki derivedFields
- `argocd/manifests/prometheus/prometheus.yml` — Tempo scrape target
- `service-versions.yaml` — tempo + alloy-tracing-ringtail entries
- `docs/reference/services/grafana.md` — Tempo in datasources table
- `docs/reference/reference.md` — Tempo in services index
- `docs/reference/operations/observability.md` — Tempo in components list

## Deployment and Testing

- [ ] Sync `apps` app to pick up new Application definitions
- [ ] `argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo`
- [ ] Verify Tempo pod: `kubectl --context=minikube-indri get pods -n monitoring -l app=tempo`
- [ ] Verify Tempo ready: port-forward 3200 and `curl localhost:3200/ready`
- [ ] Verify Tailscale ingresses: `kubectl --context=minikube-indri get ingress -n monitoring`
- [ ] `argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail`
- [ ] Check Beyla discovery in alloy-tracing logs on ringtail
- [ ] Sync grafana-config for updated datasources
- [ ] Sync prometheus for updated scrape config
- [ ] Test Grafana Tempo datasource connection
- [ ] Generate test traffic and search traces in Grafana Explore → Tempo
- [ ] After merge: reset all ArgoCD app revisions back to main

Reviewed-on: #286
2026-03-05 10:51:07 -08:00
7bddc78c8a Add ExternalSecret default fields to prevent ArgoCD drift
The external-secrets operator adds conversionStrategy, decodingStrategy,
and metadataPolicy defaults to the live object, causing perpetual
OutOfSync in ArgoCD. Declare them explicitly to match.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 09:11:23 -08:00
405fc59c12 Add Authentik OIDC login for ArgoCD (#284)
## Summary
- Add Authentik OAuth2 provider + application blueprint for ArgoCD (ringtail side)
- Add OIDC config to ArgoCD ConfigMap with Authentik as identity provider (indri side)
- Map Authentik `admins` group to ArgoCD `role:admin` via RBAC policy
- ExternalSecrets on both sides pull `argocd-client-secret` from 1Password
- Local admin password remains as break-glass — both login methods coexist

## Pre-deployment manual step
Add `argocd-client-secret` field to "Authentik (blumeops)" in 1Password with a random value (e.g., `openssl rand -hex 32`).

## Deployment order
1. Sync Authentik app on ringtail first (blueprint + secret + worker env var)
2. Sync ArgoCD app on indri second (cm, rbac, ExternalSecret)

## Verification
- [ ] `argocd-client-secret` field added to 1Password
- [ ] Authentik app synced on ringtail — blueprint applied, provider created
- [ ] ArgoCD app synced on indri — OIDC config applied
- [ ] SSO login works: visit `https://argocd.ops.eblu.me` → "Log in via Authentik" → admin access
- [ ] Break-glass: local admin/password login still works

Reviewed-on: #284
2026-03-05 09:07:25 -08:00
91c755ddd6 Pin kiwix-serve image tag to v3.8.2-f6f0f79
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:17:40 -08:00
75814e032c Pin transmission-exporter image tag to v1.0.1-c93448f
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:05:17 -08:00
797133b28e Fix per-torrent rate panels showing cumulative bytes instead of rates
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (transmission-exporter) (push) Successful in 2s
Build Container / build (transmission-exporter) (push) Successful in 38s
Dashboard "Download/Upload Rate by Torrent" panels were querying
transmission_torrent_download_bytes (total_size * percent_done) and
transmission_torrent_upload_bytes (uploaded_ever) — cumulative byte
gauges, not rates. Added new metrics using Transmission's native
rate_download/rate_upload and updated dashboard queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:01:37 -08:00
6ae18cde1e Pin transmission-exporter image tag to v1.0.0-f2704b2
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 21:55:59 -08:00
f2704b26da Replace transmission-exporter with homegrown Python exporter (#283)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (transmission-exporter) (push) Successful in 2s
Build Container / build (transmission-exporter) (push) Successful in 19s
## Summary
- Replace unmaintained `metalmatze/transmission-exporter:master` sidecar with a homegrown Python exporter
- Uses `prometheus_client` + `transmission-rpc` with collect-on-scrape pattern (fresh metrics per scrape, no stale labels)
- Same metric names so existing Grafana Transmission dashboard works unchanged
- Container built with `uv` for dependency management, follows `grafana-sidecar` Dockerfile pattern

## Changes
- **New:** `containers/transmission-exporter/exporter.py` — single-file exporter (~130 lines)
- **New:** `containers/transmission-exporter/Dockerfile` — multi-stage Alpine build with uv
- **Modified:** `argocd/manifests/torrent/deployment.yaml` — swap sidecar image reference
- **Modified:** `argocd/manifests/torrent/kustomization.yaml` — add image tag entry
- **Modified:** `service-versions.yaml` — add transmission-exporter entry

## Deployment and Testing
- [ ] Build container: `mise run container-build-and-release transmission-exporter`
- [ ] Update kustomization.yaml newTag with build SHA
- [ ] Branch deploy: `argocd app set torrent --revision feature/transmission-exporter-python && argocd app sync torrent`
- [ ] Verify metrics: `kubectl -n torrent --context=minikube-indri port-forward svc/transmission 19091:19091` then `curl localhost:19091/metrics | grep transmission_`
- [ ] Verify Grafana Transmission dashboard panels populate
- [ ] After merge: `argocd app set torrent --revision main && argocd app sync torrent`

Reviewed-on: #283
2026-03-04 21:55:00 -08:00
91d84e54d5 Replace OOMKilled stat with detail table, shrink waiting reason panel
The count-only stat wasn't actionable. New table shows pod name, container,
restart count, and memory limit for each OOMKilled container. Waiting reason
panel narrowed to make room.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 20:58:11 -08:00
008da43736 Add OOMKill observability to Kubernetes Clusters dashboard
OOMKilled containers previously only appeared briefly in "Unhealthy Pods"
while dying, then vanished on restart. New panels use persistent metrics
(last_terminated_reason) and restart rate tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 20:53:07 -08:00
e90c287504 Add qwen3.5:9b to Ollama model list
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 19:49:39 -08:00
b460333da0 Upgrade Transmission to 4.1.1 (#282)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container (Nix) / detect (push) Successful in 2s
Build Container (Nix) / build (transmission) (push) Successful in 2s
Build Container / build (transmission) (push) Successful in 6s
## Summary
- Upgrade Transmission from 4.0.6-r4 to 4.1.1-r1
- Uses Alpine edge community repo for transmission packages, keeping stable alpine:3.22 base
- Fix stale image reference in service doc (was linuxserver, now custom registry image)
- Mark transmission as reviewed in service-versions.yaml

## Context
Service review found Transmission two minor versions behind (4.0.6 → 4.1.1). Alpine 3.22 only packages 4.0.6, so transmission is installed from edge's community repo with an exact version pin.

4.1.0 added improved µTP performance, IPv6/dual-stack UDP tracker, JSON-RPC 2.0 API. 4.1.1 is a bugfix release (20+ fixes).

Dagger test build passed locally.

## Deployment and Testing
- [ ] Build container via Forgejo workflow (`mise run container-build-and-release transmission`)
- [ ] Update kustomization.yaml with new image tag
- [ ] `argocd app set torrent --revision feature/transmission-review && argocd app sync torrent`
- [ ] Verify web UI at https://torrent.ops.eblu.me
- [ ] Check Grafana Transmission dashboard still receives metrics
- [ ] After merge: `argocd app set torrent --revision main && argocd app sync torrent`

## Note
The transmission-exporter sidecar (OOMKilling every ~30min, 294 restarts) is being tracked separately as a future replacement project.

Reviewed-on: #282
2026-03-04 07:44:33 -08:00
d7f0aa6f96 Fix Frigate database path to use persistent volume
The database was at /config/frigate.db (emptyDir, ephemeral) instead of
/db/frigate.db (PVC, persistent). Every pod restart wiped the database,
losing all recording history and leaving orphaned files on NFS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 15:18:16 -08:00
135883079c Bump frigate memory limit from 2Gi to 3Gi
ONNX detector + CUDA ffmpeg + workers consume ~1.9Gi at steady state,
causing intermittent OOMKills at the 2Gi limit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:57:15 -08:00
3d065b94f9 Pin grafana-sidecar to main build tag
v1.28.0-a2bb9ab (built from merge commit on main).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:51:01 -08:00
a2bb9abbdb Home-build grafana-sidecar container (#281)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (grafana-sidecar) (push) Successful in 2s
Build Container / build (grafana-sidecar) (push) Successful in 6s
## Summary
- Home-build the k8s-sidecar container (`grafana-sidecar`) from forge mirror, replacing upstream `quay.io/kiwigrid/k8s-sidecar:1.28.0`
- Pinned to v1.28.0 — v2.x deferred due to 135% memory regression and readOnlyRootFilesystem crashloop
- Adds Dockerfile, service-versions entry, docs, and changelog fragment
- Manifest switch to home-built image pending container build

## Deployment and Testing
- [ ] `mise run container-build-and-release grafana-sidecar`
- [ ] Update kustomization.yaml with built image tag
- [ ] `argocd app set grafana --revision feature/grafana-sidecar && argocd app sync grafana`
- [ ] Verify sidecar logs and dashboards at https://grafana.ops.eblu.me
- [ ] Post-merge: `argocd app set grafana --revision main && argocd app sync grafana`

Reviewed-on: #281
2026-03-03 13:48:24 -08:00
876e51dd77 Allow implicit octals in yamllint and normalize k8s mode values
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 13:10:44 -08:00