blumeops

Author	SHA1	Message	Date
Erich Blume	5096223b48	C1: clean up cv + docs minikube artifacts (#343) ## Summary Follow-up to #342. The cv and docs services are now live on indri (Caddy file_server backed by ansible-managed tarball extraction) and verified working. This PR removes the dead minikube artifacts and the tooling shims that referenced them. ## Changes Deletions: - ``argocd/apps/{cv,docs}.yaml`` - ``argocd/manifests/{cv,docs}/`` (deployment, service, ingress, pdb, kustomization) - ``containers/{cv,quartz}/`` (Dockerfiles + start scripts) Tooling: - ``mise-tasks/container-version-check``: remove the ``quartz``→``docs`` CONTAINER_TO_SERVICE mapping (containers/quartz no longer exists) - ``service-versions.yaml``: bump ``docs.current-version`` to ``v1.16.0`` (the blumeops docs release tag) and trim the migration-window comment ## Live state context The argocd Applications ``cv`` and ``docs`` were already deleted from the cluster manually as part of the cutover; this PR just removes the YAML files that the ``apps`` app-of-apps was still ingesting. After merge, ``argocd app sync apps`` will reconcile and the ``apps`` Application returns to Synced. The Caddyfile ``handle_errors`` bug that briefly crashed all ``*.ops.eblu.me`` services during cutover is fixed in a separate C0 (``2ee53fe``) on main, not here. ## Test plan - [x] ``mise run container-version-check --all-files`` clean - [x] ``mise run service-review --type ansible`` shows cv at 1.0.3, docs at v1.16.0 - [ ] After merge: ``argocd app sync apps`` returns clean (cv/docs entries gone, no children to reconcile) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #343	2026-04-29 15:18:39 -07:00
Erich Blume	8d634861f6	C1: migrate cv + docs from minikube to indri-native (#342) ## Summary Replace the cv (`cv.eblu.me`) and docs (`docs.eblu.me`) minikube Deployments with indri-native ansible roles. Caddy serves the extracted release tarballs directly via a new `kind: static` service-block — no daemon, no nginx pod, no ProxyGroup ingress on the request path. Mirrors the rationale of the recent devpi migration; part of the broader minikube wind-down. ## What's in this commit - `ansible/roles/{cv,docs}` — sentinel-gated tarball download + extract into `~/{cv,docs}/content/` - `ansible/roles/caddy/` — new `kind: static` branch in the Caddyfile template (encoded gzip, immutable cache headers for fingerprinted assets, optional `try_html` for Quartz-style clean URLs, optional per-path `download_paths` for the resume PDF's `Content-Disposition`) - `ansible/playbooks/indri.yml` — wires `cv` and `docs` roles before `caddy` - `service-versions.yaml` — both services flip to `type: ansible`. `docs.current-version` stays at `1.28.2` for this commit so `container-version-check` keeps passing while `containers/quartz/Dockerfile` still exists; it moves to the docs release tag in the cleanup commit - `.forgejo/workflows/{cv-deploy,build-blumeops}.yaml` — deploy step now bumps `cv_version`/`docs_version` in the role defaults and pushes; running ansible + purging the Fly cache is manual from gilbert (matches devpi) - Docs: `docs/how-to/operations/{cv,docs}-on-indri.md`, updated `docs/reference/services/{cv,docs}.md`, changelog fragment ## What is not in this commit The dead artifacts. After PR review and successful cutover, a follow-up commit deletes: - `argocd/apps/{cv,docs}.yaml` and `argocd/manifests/{cv,docs}/` - `containers/cv/`, `containers/quartz/` - `CONTAINER_TO_SERVICE['quartz']` mapping in `mise-tasks/container-version-check` - bumps `docs.current-version` in `service-versions.yaml` to the release tag ## Cutover plan (manual, from gilbert, after review) 1. Take down old: - Remove the cv and docs Applications: `argocd app delete cv --cascade && argocd app delete docs --cascade` - Verify k8s namespaces gone: `kubectl --context=minikube-indri get ns | grep -E '^(cv|docs)\b'` (should be empty) - Verify tailnet MagicDNS no longer advertises the VIPs: `nslookup cv.tail8d86e.ts.net` and `nslookup docs.tail8d86e.ts.net` should both fail 2. Bring up new: - `mise run provision-indri -- --tags cv,docs,caddy --check --diff` (already validated on branch) - `mise run provision-indri -- --tags cv,docs,caddy` - `fly ssh console -a blumeops-proxy -C "sh -c 'rm -rf /tmp/cache && nginx -s reload'"` 3. Verify: `mise run services-check` and the curl checks listed in `docs/how-to/operations/{cv,docs}-on-indri.md` 4. Cleanup commit + merge. Total expected downtime: minutes (not the few-hour budget you authorized). ## Test plan - [ ] `mise run provision-indri -- --tags cv,docs --check --diff` clean - [ ] `mise run provision-indri -- --tags caddy --check --diff` shows only the cv + docs blocks changing as previewed in the PR thread - [ ] After cutover: `cv.eblu.me`, `cv.ops.eblu.me`, `docs.eblu.me`, `docs.ops.eblu.me` all return 200 - [ ] `cv.eblu.me/resume.pdf` includes `Content-Disposition: attachment` - [ ] A clean Quartz URL (e.g. `docs.eblu.me/explanation/agent-change-process`) resolves to the right page - [ ] `mise run services-check` clean - [ ] `mise run service-review --type ansible` shows cv and docs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #342	2026-04-29 14:55:11 -07:00
Erich Blume	14ca0160ba	Migrate devpi from minikube to indri (launchd) (#341) ## Summary Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent. ## What changed - New ansible role `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box). - Caddy: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`. - Playbook: `indri.yml` gains pre_tasks for the root password and the new role. - service-versions.yaml: devpi flipped from `type: argocd` to `type: ansible`. - ArgoCD: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted. - Docs: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list. ## Already deployed (live on indri) - Service running: `launchctl list mcquack.eblume.devpi` → PID 53888 - `curl https://pypi.ops.eblu.me/+api` returns 200 ✅ - `mise run docs-mikado` works again ✅ - 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/` - Minikube namespace and PVC fully reclaimed ## Test plan - [ ] `mise run services-check` (after merge) - [ ] CI workflows that use devpi succeed - [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines) ## Context This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain. Reviewed-on: #341	2026-04-29 13:38:36 -07:00
Erich Blume	4d76fd5de5	C0: prowler — rebuild image against main HEAD Squash-merge of #340 changed the SHA. Bump prowler tag from v5.23.0-2daf629 (PR branch) to v5.23.0-495e45d (main HEAD) so the Dockerfile changes are present in the image deployed off main. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:49:27 -07:00
Erich Blume	495e45d01d	Address 6 critical Prowler IaC findings (mute + grafana RBAC tighten) (#340) ## Summary The weekly Prowler IaC scan reported 6 critical findings against `argocd/manifests/`. They split cleanly into two patterns: - Legitimate-by-design RBAC → mute with new compensating controls - `external-secrets-controller`, `external-secrets-cert-controller` manage `secrets` (KSV-0041) and the cert-controller mutates its own webhook configurations (KSV-0114). This is what the operator is for. New CC: `operator-purpose-bound-rbac`. - `kube-state-metrics` (both `minikube-indri` and `k3s-ringtail`) holds `list/watch` on secrets to expose `kube_secret_info` and `kube_secret_labels` metrics. KSM's metric schema only reads metadata, never the `data:` field. New CC: `kube-state-metrics-metadata-only`. - Over-broad RBAC → fix - `grafana-clusterrole` had `get/watch/list` on `secrets` because the dashboard-sidecar config used `RESOURCE=both` (ConfigMaps + Secrets). Nothing in the cluster labels Secrets with `grafana_dashboard=1`, so this was unused power. Switched both sidecar instances to `RESOURCE=configmap` and removed `secrets` from the ClusterRole. The IaC cronjob also did not previously pass `--mutelist-file`, which is why every IaC finding reported as unmuted regardless of mutelist configuration. The new `mutelist/iac.yaml` is bundled into the existing `prowler-mutelist` ConfigMap and mounted via `items:` selector. ## Test plan - [ ] `kubectl --context=minikube-indri kustomize argocd/manifests/prowler/` — already passes locally - [ ] `kubectl --context=minikube-indri kustomize argocd/manifests/grafana/` — already passes locally - [ ] Deploy from this branch via `argocd app set prowler --revision prowler-iac-mutelist && argocd app sync prowler` and same for `grafana` - [ ] Manually trigger the IaC cronjob and verify `MUTED=True` on the 6 critical findings (`kubectl --context=minikube-indri -n prowler create job --from=cronjob/prowler-iac-scan prowler-iac-test`) - [ ] Restart grafana pod and confirm dashboards still render (sidecar still finds them via ConfigMap watch) - [ ] After verify, `argocd app set <app> --revision main && argocd app sync <app>` post-merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #340	2026-04-29 10:43:32 -07:00
Erich Blume	7d94b9073a	C0: docs — default argocd login to --sso; drop extraneous --grpc-web Now that argocd's Authentik OAuth2 client is public, `argocd login --sso` works for day-to-day use. Promote it to the default in AGENTS.md, argocd-cli reference, and troubleshooting; keep the admin/password flow documented as a break-glass fallback for when Authentik is unavailable. Also drops --grpc-web from every interactive login command — confirmed extraneous (login succeeds without it). Left in CI workflows and `argocd cluster add` untouched; those are different contexts that I didn't re-test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 10:43:21 -07:00
Erich Blume	86317315ed	C0: remove argocd OIDC client_secret wiring Now that argocd's Authentik OAuth2 client is public (PKCE-only), the client_secret plumbing is dead code: - delete argocd-oidc-authentik ExternalSecret and drop it from kustomization - remove AUTHENTIK_ARGOCD_CLIENT_SECRET env from authentik-worker - remove argocd-client-secret mapping from authentik-config ExternalSecret The argocd-client-secret field in the 1Password "Authentik (blumeops)" item is now unreferenced and can be deleted there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 10:38:26 -07:00
Erich Blume	0e62ad5596	C0: argocd OIDC — switch to public client for CLI SSO Changes argocd's Authentik OAuth2 client from confidential to public and drops the clientSecret from argocd-cm. Public + PKCE works for both the web UI (argocd-server backend) and the argocd CLI (`argocd login --sso`) without a shared secret, matching OAuth 2.1 guidance. Confidential → public was needed because the CLI can't hold a client secret; Authentik's per-app issuer model made the alternative ("cliClientID" pattern with separate public client) awkward since it requires a shared issuer across apps which Authentik doesn't serve. Follow-up: deadcode AUTHENTIK_ARGOCD_CLIENT_SECRET env wiring and the argocd-oidc-authentik ExternalSecret once verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 10:34:39 -07:00
Erich Blume	225b0e7008	C0: allow argocd CLI --sso localhost callback Adds http://localhost:8085/auth/callback to the ArgoCD OAuth2 provider's redirect_uris so `argocd login --sso` works. Loopback redirect is the RFC 8252 pattern for native CLI apps; PKCE (already enabled) covers the code-interception risk. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 10:18:08 -07:00
Erich Blume	a9ef02a602	C0: bump frigate-notify to v0.5.4-e928054-nix (workdir fix) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 09:44:24 -07:00
Erich Blume	c88b6d773c	C0: point frigate-notify at local registry tag v0.5.4-fb4bf5a-nix Built from main in run #516 after #339 merged. Follows the navidrome kustomization convention (deployment image = local ref + :kustomized, kustomization override = newTag only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 09:31:29 -07:00
Erich Blume	fb32cc07c4	chore: repoint runner-job-image tag at CI-built v0.20.6-50f8c2a Swaps the k8s runner label from the local bootstrap tag (v0.20.6-9b6be09) to the equivalent image rebuilt by CI from main. Functionally identical; closes the bootstrap loop. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 08:38:33 -07:00
Erich Blume	50f8c2a33f	Roll k8s runner to runner-job-image v0.20.6-9b6be09 Points the k8s Forgejo runner label at the locally-bootstrapped runner-job-image built from the Alpine container.py on this branch. Once merged, CI will rebuild the same image from the same SHA. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-21 08:28:18 -07:00
Erich Blume	21177ff47f	chore: update forgejo-runner image tag	2026-04-20 09:11:37 -07:00
Erich Blume	1425bf1f5c	Upgrade forgejo-runner to v12.8, adopt server.connections, and clean up docs (#338) ## Summary - consolidate forgejo-runner how-to docs into current cards - upgrade the k8s forgejo-runner deployment to the latest v12.8.x runner image - switch the k8s runner from first-boot register flow to declarative server.connections config - keep the runner image on the native Dagger build path and update the surrounding manifests/secrets ## Notes - PR opened early for C1 review - implementation and deployment verification will follow in subsequent commits Reviewed-on: #338	2026-04-20 09:03:54 -07:00
Erich Blume	55abb17f50	Add resource limits to ArgoCD pods to prevent unbounded consumption All 7 ArgoCD containers had no resource limits, allowing them to consume unlimited CPU/memory during node pressure events. This contributed to cluster-wide probe timeout cascades on minikube-indri. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 13:04:27 -07:00
Forgejo Actions	bdfcb4b677	Update docs release to v1.16.0 - Built changelog from towncrier fragments [skip ci]	2026-04-18 10:00:54 -07:00
Erich Blume	c8da243663	Run alloy-tracing as root for eBPF capabilities The nix-built Alloy image sets User=65534 (nobody). Even with privileged: true, a non-root user gets no effective capabilities (CapEff=0). Override with runAsUser: 0 so Beyla gets CAP_BPF and CAP_SYS_ADMIN needed for eBPF instrumentation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 08:42:26 -07:00
Forgejo Actions	a72a2c2bd4	Update docs release to v1.15.7 - Built changelog from towncrier fragments [skip ci]	2026-04-18 08:14:58 -07:00
Erich Blume	b4472c7849	Deploy devpi 6.19.3 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-18 08:04:23 -07:00
Erich Blume	fe0e913963	Switch Fly proxy to upstream keepalive pools (#337) ## Summary - Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools - Reuses TLS connections through the Tailscale tunnel instead of handshaking per request - Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS) ## Trade-off DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this. ## Still TODO on this branch - [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder) - [ ] Docs pass - [ ] Deploy from branch and verify latency improvement - [ ] Changelog fragment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #337	2026-04-17 16:39:52 -07:00
Erich Blume	1c0ee099fb	Move forge-specific latency panels to Forgejo dashboard Fly.io dashboard keeps aggregate all-hosts p50/p90/p99. Forge-filtered upstream response time panel moves to Forgejo's "Public Proxy" section. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:13:40 -07:00
Erich Blume	d7af004842	Add Forgejo metrics + upstream latency histogram to Fly proxy dashboard - Enable Forgejo /metrics endpoint (app.ini [metrics] section) - Add Alloy scrape target for Forgejo metrics on indri - Add upstream_response_time histogram to Fly proxy Alloy config - Replace single p95 panel with p50/p90/p99 + upstream breakdown filtered to forge.eblu.me host Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 15:05:59 -07:00
Erich Blume	0a98f76068	Update kiwix-serve to Dagger-built container (Alpine 3.23) Points kustomization at v3.8.2-7a42aeb, the first image built from the new container.py (replacing the Dockerfile). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 14:27:42 -07:00
Erich Blume	5ec2411e20	Update navidrome, miniflux, forgejo-runner image tags to Alpine 3.23 builds [main] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-16 15:37:30 -07:00
Erich Blume	fb1e8ff672	Deploy transmission containers from Dagger builds Update kustomization image tags to the new container.py-built images (v4.1.1-r1-2c483ce, v1.0.1-2c483ce). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-15 11:34:28 -07:00
Erich Blume	30ed018fd8	Update prowler image tag to v5.23.0-7c1cd11 [main] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 13:51:26 -07:00
Erich Blume	7c1cd11e45	Upgrade Prowler to 5.23.0, remove registry workaround (#336) ## Summary - Upgrade Prowler from 5.22.0 to 5.23.0 - Remove the `enumerate-images` init container workaround from `cronjob-image-scan.yaml` - Use native `--registry` and `--image-filter` flags now that upstream fix (PR prowler-cloud/prowler#10470) is released The init container was a workaround for prowler-cloud/prowler#10457 where `--registry` args weren't forwarded to the provider constructor. We wrote the fix, it was merged, and v5.23.0 includes it. ## Test plan - [ ] Build new container (`mise run container-release prowler 5.23.0`) - [ ] Update kustomization.yaml with new image tag - [ ] Sync prowler ArgoCD app from branch - [ ] Manually trigger image scan job and verify `--registry` works natively - [ ] Verify CIS and IaC scan cronjobs still work 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #336	2026-04-14 13:45:28 -07:00
Erich Blume	be30668eef	Automate Prowler MANUAL finding verification (#335) ## Summary - Adds automated node-level verification to `review-compliance-reports`: kubelet file perms/ownership, kubelet config args, etcd CA separation, RBAC cluster-admin bindings - Mutes the 14 MANUAL Prowler findings via new `manual-node-checks.yaml` mutelist file - New `node-config-automated-verification` compensating control documents the approach - Script fails loudly (red FAIL + verdict panel) if any check deviates from expected values ## Test plan - [x] `mise run review-compliance-reports` — all 12 node checks PASS - [x] Injected bad expected value (perms 400 vs actual 600) — FAIL rendered correctly - [x] Fixed colon-in-binding-name bug (kubeadm:cluster-admins) with tab-separated jsonpath - [ ] After merge: sync prowler mutelist ConfigMap and verify next scan shows 0 MANUAL findings ## Note Prowler coverage is minikube-indri only — ringtail/k3s is a known gap tracked separately. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #335	2026-04-14 13:00:44 -07:00
Forgejo Actions	8c2f035e6d	Update docs release to v1.15.6 - Built changelog from towncrier fragments [skip ci]	2026-04-14 11:46:42 -07:00
Forgejo Actions	f2514a6f02	Update docs release to v1.15.5 - Built changelog from towncrier fragments [skip ci]	2026-04-14 11:29:27 -07:00
Erich Blume	9d85c97b9b	Update forgejo-runner kustomization tag to main-branch image C0 follow-up: switch from branch-built tag to main-built v12.7.3-0e93cc0. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:10:36 -07:00
Erich Blume	0e93cc08b4	Build forgejo-runner container locally (#334) ## Summary - Add native Dagger `container.py` for forgejo-runner (Go + Alpine runtime, static binary with CGO for SQLite) - Update kustomization to point to local registry image (tag is placeholder until CI builds) - Uses existing `clone_from_forge("forgejo-runner", ...)` mirror ## Test plan - [x] `dagger call build --src=. --container-name=forgejo-runner` passes locally - [ ] CI container build from branch succeeds - [ ] Update kustomization tag to built image, deploy from branch via ArgoCD `--revision` - [ ] Verify runner registers and picks up jobs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #334	2026-04-14 11:06:36 -07:00
Erich Blume	ccaef4c1a7	Document devpi cold cache failure mode and deploy teslamate v3.0.0-08c698e After a DR rebuild, devpi's empty cache causes race conditions under concurrent load — metadata is served but wheel files 404. Also deploys the first container.py-built teslamate image. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:38:06 -07:00
Erich Blume	2d2d495f95	Fix paperless redis: use upstream valkey instead of amd64-only nix image The authentik-redis image is nix-built on ringtail (amd64 only) and was previously running under QEMU emulation on arm64 minikube. Discovered during DR recovery when fresh minikube lacked binfmt registration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 17:48:20 -07:00
Erich Blume	db6d8af8b1	Update grafana-sidecar image tag to v2.6.0-61fcd5d (merge build) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 08:02:39 -07:00
Erich Blume	61fcd5d70a	Upgrade grafana-sidecar 1.28.0 → 2.6.0 + container.py port (#332) ## Summary - Upgrade grafana-sidecar from 1.28.0 to 2.6.0 (the 2.x memory regression #462 is resolved; ~35MB static overhead is acceptable) - Port build from Dockerfile to native Dagger container.py - Add liveness/readiness probes using the new /healthz endpoint on port 8080 - Update docs to reflect container.py migration and remove stale pin note ## Test plan - [ ] Build container: `mise run container-build-and-release grafana-sidecar` - [ ] Update kustomization tag with new image tag - [ ] Deploy from branch: `argocd app set grafana --revision grafana-sidecar-2.6.0 && argocd app sync grafana` - [ ] Verify sidecar health endpoint: `kubectl exec -n monitoring <pod> -c grafana-sc-dashboard -- wget -qO- http://localhost:8080/healthz` - [ ] Verify dashboards load in Grafana UI - [ ] `mise run services-check` Reviewed-on: #332	2026-04-13 07:57:13 -07:00
Erich Blume	a18ec9d958	Update miniflux to main image tag, disable OTEL metrics in Dagger module Point miniflux kustomization at the main-built v2.2.19-138e23d image (replacing the branch tag). Disable the OTLP metrics exporter at module import time to prevent ~11s retry delays in CI — the env var must be set inside the module, not the runner shell, because the SDK runs inside the Dagger engine container. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:59:32 -07:00
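
The import-time override described in the commit above might look roughly like this. It is a sketch under assumptions: it presumes the variable is `OTEL_METRICS_EXPORTER` (the name given in the earlier CI commit `94c937d588`), and the placement at the top of a module file is illustrative rather than a copy of the repository's code.

```python
import os

# Assumption: this runs at the very top of the Dagger module, before any
# import that initialises the OpenTelemetry SDK, so the exporter is never
# started. Setting the variable in the runner shell would not help, because
# the SDK reads the environment inside the Dagger engine container.
os.environ["OTEL_METRICS_EXPORTER"] = "none"
```

An unconditional assignment (rather than `os.environ.setdefault`) is used here so that a value inherited from the engine container cannot re-enable the exporter.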
Erich Blume	138e23d525	Miniflux 2.2.19 + container.py migration + ty typechecker (#331) ## Summary - Upgrade miniflux from 2.2.17 to 2.2.19 (security hardening, performance improvements) - Migrate miniflux from Dockerfile to native Dagger container.py build - Refactor `alpine_runtime()` helper to support existing users (nobody/65534) - Add `ty` (Astral) Python typechecker to prek hooks ## Test plan - [ ] `dagger call build --src=. --container-name=miniflux` succeeds - [ ] `dagger call container-version --container-name=miniflux` returns 2.2.19 - [ ] `mise run container-version-check` passes - [ ] `ty check` passes cleanly - [ ] `prek run --all-files` passes - [ ] CI builds container successfully - [ ] Miniflux healthcheck passes after deploy from branch Reviewed-on: #331	2026-04-12 08:54:32 -07:00
Erich Blume	94c937d588	Disable OTLP metrics exporter in CI, update navidrome to main tag The Dagger Python SDK's OTLP metrics exporter hits a non-functional local endpoint (500s), burning ~9s per retry cycle. Set OTEL_METRICS_EXPORTER=none in the build-dagger CI job. Also update navidrome kustomization to the main-SHA tag (`c86b5d7`). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 17:26:25 -07:00
Erich Blume	c86b5d7772	Native Dagger container builds + Navidrome v0.61.1 (#330) ## Summary - Move Dagger module from `.dagger/` to repo root (`src/blumeops/`), rename `blumeops-ci` → `blumeops` - Replace opaque `docker_build()` with native Dagger pipelines that surface full build errors per step - Migrate navidrome as the first container (`containers/navidrome/container.py`) - Upgrade navidrome from v0.60.3 to v0.61.1 (major artwork overhaul, SQLite FTS5 search, server-managed transcoding) - Add `dagger call container-version` for CI version extraction without Dockerfile parsing - All mise tasks (`container-list`, `container-version-check`, `container-build-and-release`) updated for hybrid mode - Legacy `docker_build()` fallback preserved for all other containers ## Motivation When navidrome v0.61.0 added a new Go build tag (`sqlite_fts5`), `docker_build()` showed only "exit code: 1". We had to run `docker build --progress=plain` manually to find `undefined: buildtags.SQLITE_FTS5`. Native Dagger pipelines show the full error inline. ## Container build dispatch needed After merge, dispatch container build for navidrome: ``` mise run container-build-and-release navidrome --ref `470b4bd` ``` ## Deploy steps 1. Wait for container build to complete 2. Back up navidrome-data PVC (non-reversible DB migrations) 3. `argocd app set navidrome --revision main && argocd app sync navidrome` 4. Verify at https://dj.ops.eblu.me ## Future Remaining containers migrate incrementally in follow-up PRs using the same pattern. Reviewed-on: #330	2026-04-11 17:11:56 -07:00
Erich Blume	5757df115d	Upgrade ollama from 0.17.5 to 0.20.4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 06:42:05 -07:00
Erich Blume	22fc615a28	Update paperless image tag to main build Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 19:01:02 -07:00
Erich Blume	07f52e9488	Deploy Paperless-ngx document management (#328) ## Summary - Add paperless-ngx (v2.20.13) as a new ArgoCD-managed service on indri - Dockerfile built from forge mirror (`mirrors/paperless-ngx`), multi-stage with s6-overlay - PostgreSQL database via `blumeops-pg` CNPG cluster, Redis sidecar for Celery - NFS document storage on sifaka (`/volume1/paperless`) - Authentik OIDC SSO via baked JSON blob from 1Password - Caddy route at `paperless.ops.eblu.me` - 1Password item "Paperless (blumeops)" created with all secrets ## Files - `containers/paperless/Dockerfile` — multi-stage build - `argocd/manifests/paperless/` — full k8s manifest set - `argocd/apps/paperless.yaml` — ArgoCD application - `argocd/manifests/databases/` — CNPG role + ExternalSecret - `ansible/roles/caddy/defaults/main.yml` — Caddy route - `service-versions.yaml` — version tracking entry - `docs/reference/services/paperless.md` — reference card ## Remaining deploy steps 1. Build container: `mise run container-build-and-release paperless` 2. Update kustomization.yaml `newTag` with actual image tag 3. Create Authentik application/provider for paperless 4. Create `paperless` database on blumeops-pg 5. Sync ArgoCD apps, then sync paperless from branch 6. Provision Caddy: `mise run provision-indri -- --tags caddy` 7. Verify at https://paperless.ops.eblu.me 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #328	2026-04-08 17:54:12 -07:00
Erich Blume	22b77ac141	Fix Frigate preview config and services-check NoData detection preview.quality was at the top level (invalid); moved under record with a valid preset (very_low). Also fix services-check to catch Grafana "Alerting (NoData)" state which was silently passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 11:12:42 -07:00
Erich Blume	ec63d560f3	Deploy authentik 2026.2.2 container to ringtail Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 10:56:50 -07:00
Erich Blume	0366a0346b	Set Frigate preview quality to CRF 8 for faster timeline loading Previews are ~4MB/hour at default quality (CRF 1), served over NFS from sifaka. Reducing to CRF 8 shrinks preview files to improve review page load times when scrubbing older footage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 08:43:43 -07:00
Erich Blume	936d29bbe1	Fix UnPoller dashboard UIDs exceeding Grafana 12's 40-char limit Strip redundant "unifi-poller-" prefix from generated slugs, bringing UIDs from 45-48 chars down to 32-35 chars. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 07:03:39 -07:00
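
The prefix-stripping fix in the commit above can be illustrated with a short sketch. The 40-character limit and the `unifi-poller-` prefix come from the commit message; the function name `shorten_uid` and the example slug are hypothetical, and the repository's real generator may differ.

```python
MAX_UID_LEN = 40          # Grafana 12 dashboard UID limit cited in the commit
PREFIX = "unifi-poller-"  # redundant prefix named in the commit (13 chars)


def shorten_uid(slug: str) -> str:
    """Drop the redundant prefix and reject UIDs that are still too long."""
    uid = slug.removeprefix(PREFIX)  # str.removeprefix needs Python 3.9+
    if len(uid) > MAX_UID_LEN:
        raise ValueError(f"UID {uid!r} is {len(uid)} chars, limit is {MAX_UID_LEN}")
    return uid
```

For instance, `shorten_uid("unifi-poller-client-insights-prometheus")` returns `"client-insights-prometheus"`, while a slug without the prefix passes through unchanged as long as it fits the limit.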
Erich Blume	3c894e659d	Pin kube-state-metrics to main-SHA container tags C0 follow-up to #327: update from branch-SHA tags to main-SHA tags after squash-merge rebuild. indri: v2.18.0-f59f885 ringtail: v2.18.0-f59f885-nix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 16:10:14 -07:00
Erich Blume	f59f8859dc	Localize kube-state-metrics container (Dockerfile + nix) (#327) ## Summary - Build kube-state-metrics v2.18.0 locally from forge mirror, replacing upstream `registry.k8s.io` image - Dockerfile (two-stage Go build) for indri/minikube - default.nix (buildGoModule + buildLayeredImage) for ringtail/k3s - Both kustomization files updated with `newName` pointing to local registry ## Verification - [x] Nix build succeeded on ringtail (`nix-build` → 10-layer image) - [x] Dockerfile build succeeded locally (`dagger call build` → ~2min) - [x] `container-version-check --all-files` passes (2.18.0 consistent across Dockerfile, nix, service-versions.yaml) - [ ] CI builds container images from this branch - [ ] Update kustomization `newTag` with SHA-tagged version from CI - [ ] ArgoCD sync on both clusters ## Test plan - Trigger CI build: `mise run container-build-and-release kube-state-metrics` - Verify tags: `mise run container-list kube-state-metrics` - Update newTag in kustomization files with CI-produced tag - Sync ArgoCD on indri: `argocd app sync kube-state-metrics` - Sync ArgoCD on ringtail: `argocd app sync kube-state-metrics --context=k3s-ringtail` (note: argocd uses its own auth, not kubectl context) - Verify metrics still flowing to Prometheus Reviewed-on: #327	2026-04-07 16:09:25 -07:00

426 commits