blumeops

Author	SHA1	Message	Date
Erich Blume	9d85c97b9b	Update forgejo-runner kustomization tag to main-branch image C0 follow-up: switch from branch-built tag to main-built v12.7.3-0e93cc0. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 11:10:36 -07:00
Erich Blume	0e93cc08b4	Build forgejo-runner container locally (#334) ## Summary - Add native Dagger `container.py` for forgejo-runner (Go + Alpine runtime, static binary with CGO for SQLite) - Update kustomization to point to local registry image (tag is placeholder until CI builds) - Uses existing `clone_from_forge("forgejo-runner", ...)` mirror ## Test plan - [x] `dagger call build --src=. --container-name=forgejo-runner` passes locally - [ ] CI container build from branch succeeds - [ ] Update kustomization tag to built image, deploy from branch via ArgoCD `--revision` - [ ] Verify runner registers and picks up jobs 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #334	2026-04-14 11:06:36 -07:00
Erich Blume	223b134776	Document uv.lock as the source of devpi dependency in Dagger builds The lockfile bakes in devpi URLs — Dagger does a locked install, not fresh resolution. This is the mechanism behind the cold-cache failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:41:45 -07:00
Erich Blume	ccaef4c1a7	Document devpi cold cache failure mode and deploy teslamate v3.0.0-08c698e After a DR rebuild, devpi's empty cache causes race conditions under concurrent load — metadata is served but wheel files 404. Also deploys the first container.py-built teslamate image. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:38:06 -07:00
Erich Blume	08c698e833	Migrate teslamate to native Dagger container.py (#333) ## Summary - Replace legacy Dockerfile with native Dagger `container.py` build - Two-stage pipeline: Elixir+Node builder, Debian slim runtime - Uses shared helpers (`clone_from_forge`, `oci_labels`) - Delete old Dockerfile (pipeline auto-discovers container.py) - Update build-container-image docs and mark service reviewed ## Test plan - [x] `dagger call build --src=. --container-name=teslamate` succeeds locally - [ ] CI container build passes - [ ] Deploy from branch and verify teslamate starts cleanly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #333	2026-04-14 07:20:52 -07:00
Erich Blume	4ca0630d76	Review enforce-tag-immutability doc: add review date and zot reference link Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-14 07:00:55 -07:00
Erich Blume	d7c3c687f4	Document DR rebuild procedure and update restart-indri - New how-to: rebuild-minikube-cluster with full bootstrap procedure validated during 2026-04-13 DR event - Update restart-indri: warn about minikube delete, macOS permission dialog on first Tailscale SSH, forgejo_actions_secrets dep cycle - Update disaster-recovery reference: link to rebuild procedure - Update CLAUDE.md: never run minikube delete Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 18:07:54 -07:00
Erich Blume	405dab8b59	Add changelog fragments for DR recovery work Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 17:59:16 -07:00
Erich Blume	cd5b6b63f7	Add paperless DB to borgmatic backups Discovered during DR that paperless was the only service DB not backed up by borgmatic. Uses same blumeops-pg cluster on port 5432. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 17:58:06 -07:00
Erich Blume	2d2d495f95	Fix paperless redis: use upstream valkey instead of amd64-only nix image The authentik-redis image is nix-built on ringtail (amd64 only) and was previously running under QEMU emulation on arm64 minikube. Discovered during DR recovery when fresh minikube lacked binfmt registration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 17:48:20 -07:00
Erich Blume	fca3010042	Hints about service version tracking	2026-04-13 08:40:49 -07:00
Erich Blume	22a417ac3c	Oops, looks like a log file got lost, nbd	2026-04-13 08:36:20 -07:00
Erich Blume	f61bb4f2e7	Add uv.lock for version pinning of dagger pipeline	2026-04-13 08:35:01 -07:00
Erich Blume	b5551e227e	Route Dagger build telemetry to Tempo The Dagger engine's internal OTLP proxy returns 500 on /v1/metrics when there's no real backend, causing ~9s retry warnings per pipeline step. Point OTEL_EXPORTER_OTLP_ENDPOINT at Tempo to give it a real endpoint. Also removes the stale os.environ workaround from main.py (the SDK initializes telemetry before our module loads, so it had no effect). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 08:27:12 -07:00
Erich Blume	ab834b641a	Fix OTEL metrics exporter warnings in Dagger builds The Dagger engine shim sets OTEL_METRICS_EXPORTER before our module loads, so os.environ.setdefault was a no-op. Switch to a hard override. Remove the redundant workflow-level env var since the fix belongs in the module. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 08:11:15 -07:00
Erich Blume	db6d8af8b1	Update grafana-sidecar image tag to v2.6.0-61fcd5d (merge build) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 08:02:39 -07:00
Erich Blume	61fcd5d70a	Upgrade grafana-sidecar 1.28.0 → 2.6.0 + container.py port (#332) ## Summary - Upgrade grafana-sidecar from 1.28.0 to 2.6.0 (the 2.x memory regression #462 is resolved; ~35MB static overhead is acceptable) - Port build from Dockerfile to native Dagger container.py - Add liveness/readiness probes using the new /healthz endpoint on port 8080 - Update docs to reflect container.py migration and remove stale pin note ## Test plan - [ ] Build container: `mise run container-build-and-release grafana-sidecar` - [ ] Update kustomization tag with new image tag - [ ] Deploy from branch: `argocd app set grafana --revision grafana-sidecar-2.6.0 && argocd app sync grafana` - [ ] Verify sidecar health endpoint: `kubectl exec -n monitoring <pod> -c grafana-sc-dashboard -- wget -qO- http://localhost:8080/healthz` - [ ] Verify dashboards load in Grafana UI - [ ] `mise run services-check` Reviewed-on: #332	2026-04-13 07:57:13 -07:00
Erich Blume	6455d93cb3	Review local-registry control: fix inaccurate description, enumerate exceptions The control claimed all images came from the private registry, but 12+ services pull from external public registries. Updated description to reflect reality and catalogued external-image categories in notes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:59:37 -07:00
Erich Blume	6e60287e99	Doc review: delete install-dagger-on-nix-runner, add service-versions ref card Outdated leaf card removed; zot.md now links to new service-versions reference card instead. Added reverse link from review-services. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:52:38 -07:00
Erich Blume	8d80a4a3a5	Rewrite runner-logs: API-based log fetching, multi-repo support Replace broken SSH+filesystem log retrieval with Forgejo web API endpoint. Fix CLI to use run numbers (not task IDs), add --repo for querying any forge repo (e.g. sporks), --limit/-n for listing size. Document runner-logs as the way to verify build success in CLAUDE.md and container build docs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 09:42:58 -07:00
Erich Blume	a18ec9d958	Update miniflux to main image tag, disable OTEL metrics in Dagger module Point miniflux kustomization at the main-built v2.2.19-138e23d image (replacing the branch tag). Disable the OTLP metrics exporter at module import time to prevent ~11s retry delays in CI — the env var must be set inside the module, not the runner shell, because the SDK runs inside the Dagger engine container. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-12 08:59:32 -07:00
Erich Blume	138e23d525	Miniflux 2.2.19 + container.py migration + ty typechecker (#331) ## Summary - Upgrade miniflux from 2.2.17 to 2.2.19 (security hardening, performance improvements) - Migrate miniflux from Dockerfile to native Dagger container.py build - Refactor `alpine_runtime()` helper to support existing users (nobody/65534) - Add `ty` (Astral) Python typechecker to prek hooks ## Test plan - [ ] `dagger call build --src=. --container-name=miniflux` succeeds - [ ] `dagger call container-version --container-name=miniflux` returns 2.2.19 - [ ] `mise run container-version-check` passes - [ ] `ty check` passes cleanly - [ ] `prek run --all-files` passes - [ ] CI builds container successfully - [ ] Miniflux healthcheck passes after deploy from branch Reviewed-on: #331	2026-04-12 08:54:32 -07:00
Erich Blume	dc5bffdd97	Update ringtail flake inputs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 21:14:46 -07:00
Erich Blume	c06eccc61c	Review hosts.md: add last-reviewed, normalize links, add reference tag Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 21:06:53 -07:00
Erich Blume	94c937d588	Disable OTLP metrics exporter in CI, update navidrome to main tag The Dagger Python SDK's OTLP metrics exporter hits a non-functional local endpoint (500s), burning ~9s per retry cycle. Set OTEL_METRICS_EXPORTER=none in the build-dagger CI job. Also update navidrome kustomization to the main-SHA tag (`c86b5d7`). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 17:26:25 -07:00
Erich Blume	c86b5d7772	Native Dagger container builds + Navidrome v0.61.1 (#330) ## Summary - Move Dagger module from `.dagger/` to repo root (`src/blumeops/`), rename `blumeops-ci` → `blumeops` - Replace opaque `docker_build()` with native Dagger pipelines that surface full build errors per step - Migrate navidrome as the first container (`containers/navidrome/container.py`) - Upgrade navidrome from v0.60.3 to v0.61.1 (major artwork overhaul, SQLite FTS5 search, server-managed transcoding) - Add `dagger call container-version` for CI version extraction without Dockerfile parsing - All mise tasks (`container-list`, `container-version-check`, `container-build-and-release`) updated for hybrid mode - Legacy `docker_build()` fallback preserved for all other containers ## Motivation When navidrome v0.61.0 added a new Go build tag (`sqlite_fts5`), `docker_build()` showed only "exit code: 1". We had to run `docker build --progress=plain` manually to find `undefined: buildtags.SQLITE_FTS5`. Native Dagger pipelines show the full error inline. ## Container build dispatch needed After merge, dispatch container build for navidrome: ``` mise run container-build-and-release navidrome --ref `470b4bd` ``` ## Deploy steps 1. Wait for container build to complete 2. Back up navidrome-data PVC (non-reversible DB migrations) 3. `argocd app set navidrome --revision main && argocd app sync navidrome` 4. Verify at https://dj.ops.eblu.me ## Future Remaining containers migrate incrementally in follow-up PRs using the same pattern. Reviewed-on: #330	2026-04-11 17:11:56 -07:00
Erich Blume	4fc0192731	Track Fly.io proxy component versions in service-versions.yaml Add flyio-tailscale (v1.94.1), flyio-nginx (1.29.6-alpine), and flyio-alloy (v1.14.1) entries with new `fly` service type so future upgrades go through the service-review workflow. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:40:57 -07:00
Erich Blume	e02305e72d	Pin Fly.io Tailscale to v1.94.1 to fix MagicDNS regression in v1.96.5 Tailscale :stable pulled v1.96.5 during last deploy, which returns SERVFAIL for tailnet DNS names (no upstream resolvers set). This broke all public routing (forge/docs/cv.eblu.me) through the Fly proxy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:32:38 -07:00
Erich Blume	b08b1a833f	Fix services-check to show all firing alerts per alert name check_alert() used head -1 to display only the first firing instance, silently swallowing additional alerts (e.g. frigate pod-not-ready was hidden behind ollama). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:10:09 -07:00
Erich Blume	a75f28e073	Fix fly.io proxy rate limit to key on real client IP The general rate limit zone used $binary_remote_addr (Fly's internal proxy IP), causing all external clients to share one bucket. Switch to $http_fly_client_ip to match forge_auth's correct behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 19:00:33 -07:00
Erich Blume	40556e5a2d	Review gandi.md: add missing forge.eblu.me CNAME record The Pulumi code has had a forge.eblu.me CNAME since it was added, but the doc's DNS table only listed docs and cv. Also fixed the __main__.py description to mention CNAMEs alongside A records. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 09:54:46 -07:00
Erich Blume	5757df115d	Upgrade ollama from 0.17.5 to 0.20.4 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-09 06:42:05 -07:00
Erich Blume	22fc615a28	Update paperless image tag to main build Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 19:01:02 -07:00
Erich Blume	07f52e9488	Deploy Paperless-ngx document management (#328) ## Summary - Add paperless-ngx (v2.20.13) as a new ArgoCD-managed service on indri - Dockerfile built from forge mirror (`mirrors/paperless-ngx`), multi-stage with s6-overlay - PostgreSQL database via `blumeops-pg` CNPG cluster, Redis sidecar for Celery - NFS document storage on sifaka (`/volume1/paperless`) - Authentik OIDC SSO via baked JSON blob from 1Password - Caddy route at `paperless.ops.eblu.me` - 1Password item "Paperless (blumeops)" created with all secrets ## Files - `containers/paperless/Dockerfile` — multi-stage build - `argocd/manifests/paperless/` — full k8s manifest set - `argocd/apps/paperless.yaml` — ArgoCD application - `argocd/manifests/databases/` — CNPG role + ExternalSecret - `ansible/roles/caddy/defaults/main.yml` — Caddy route - `service-versions.yaml` — version tracking entry - `docs/reference/services/paperless.md` — reference card ## Remaining deploy steps 1. Build container: `mise run container-build-and-release paperless` 2. Update kustomization.yaml `newTag` with actual image tag 3. Create Authentik application/provider for paperless 4. Create `paperless` database on blumeops-pg 5. Sync ArgoCD apps, then sync paperless from branch 6. Provision Caddy: `mise run provision-indri -- --tags caddy` 7. Verify at https://paperless.ops.eblu.me 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #328	2026-04-08 17:54:12 -07:00
Erich Blume	e04455c911	Add changelog fragment for adding-a-service tutorial review Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 11:29:54 -07:00
Erich Blume	d3235c5ca9	Review adding-a-service tutorial: fix ingress, repoURL, add kustomize and reference card steps - Fix Tailscale Ingress: move hostname to tls.hosts, remove from rules (ProxyGroup compat) - Update ArgoCD repoURL to forge.ops.eblu.me:2222 - Add kustomization.yaml section with :kustomized sentinel tag pattern - Add Step 5: Create a Reference Card (keep under 30s reading time) - Set last-reviewed: 2026-04-08 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 11:28:46 -07:00
Erich Blume	22b77ac141	Fix Frigate preview config and services-check NoData detection preview.quality was at the top level (invalid); moved under record with a valid preset (very_low). Also fix services-check to catch Grafana "Alerting (NoData)" state which was silently passing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 11:12:42 -07:00
Erich Blume	ec63d560f3	Deploy authentik 2026.2.2 container to ringtail Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 10:56:50 -07:00
Erich Blume	2eb28301e4	Upgrade authentik 2026.2.0 → 2026.2.2 (patch release) Bug-fix release with web UI fixes, LDAP page size, and SAML SLO redirect. Also bumps client-go to v3.2026.2.1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 10:53:03 -07:00
Erich Blume	0366a0346b	Set Frigate preview quality to CRF 8 for faster timeline loading Previews are ~4MB/hour at default quality (CRF 1), served over NFS from sifaka. Reducing to CRF 8 shrinks preview files to improve review page load times when scrubbing older footage. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 08:43:43 -07:00
Erich Blume	936d29bbe1	Fix UnPoller dashboard UIDs exceeding Grafana 12's 40-char limit Strip redundant "unifi-poller-" prefix from generated slugs, bringing UIDs from 45-48 chars down to 32-35 chars. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-08 07:03:39 -07:00
Erich Blume	3c894e659d	Pin kube-state-metrics to main-SHA container tags C0 follow-up to #327: update from branch-SHA tags to main-SHA tags after squash-merge rebuild. indri: v2.18.0-f59f885 ringtail: v2.18.0-f59f885-nix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 16:10:14 -07:00
Erich Blume	f59f8859dc	Localize kube-state-metrics container (Dockerfile + nix) (#327) ## Summary - Build kube-state-metrics v2.18.0 locally from forge mirror, replacing upstream `registry.k8s.io` image - Dockerfile (two-stage Go build) for indri/minikube - default.nix (buildGoModule + buildLayeredImage) for ringtail/k3s - Both kustomization files updated with `newName` pointing to local registry ## Verification - [x] Nix build succeeded on ringtail (`nix-build` → 10-layer image) - [x] Dockerfile build succeeded locally (`dagger call build` → ~2min) - [x] `container-version-check --all-files` passes (2.18.0 consistent across Dockerfile, nix, service-versions.yaml) - [ ] CI builds container images from this branch - [ ] Update kustomization `newTag` with SHA-tagged version from CI - [ ] ArgoCD sync on both clusters ## Test plan - Trigger CI build: `mise run container-build-and-release kube-state-metrics` - Verify tags: `mise run container-list kube-state-metrics` - Update newTag in kustomization files with CI-produced tag - Sync ArgoCD on indri: `argocd app sync kube-state-metrics` - Sync ArgoCD on ringtail: `argocd app sync kube-state-metrics --context=k3s-ringtail` (note: argocd uses its own auth, not kubectl context) - Verify metrics still flowing to Prometheus Reviewed-on: #327	2026-04-07 16:09:25 -07:00
Erich Blume	84eda0301f	Bump authentik worker memory limit 1Gi → 2Gi (OOMKilled after ringtail restart) Worker forks 4 Dramatiq processes each loading the full Django app (~250MB each), hitting the 1Gi limit on startup. Ringtail has ample RAM headroom. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 15:39:29 -07:00
Erich Blume	efae404d1e	Remove superuser from teslamate PG role, transfer extension ownership teslamate had superuser on the shared blumeops-pg cluster (which also hosts miniflux and authentik). Downgraded to plain database owner with extension ownership (cube, earthdistance) transferred manually so it can still ALTER EXTENSION UPDATE. earthdistance is untrusted in PG so DROP+CREATE would need temporary superuser escalation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 15:36:39 -07:00
Erich Blume	fc34a7da5b	Review postgresql.md: add authentik user/db, immich-pg borgmatic secret Doc review found the authentik database, user, and external secret were missing, along with the immich-pg borgmatic secret. Added Cluster column to Users table for clarity. Set last-reviewed: 2026-04-07. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 15:21:48 -07:00
Erich Blume	1fd8aae8f6	Upgrade ArgoCD v3.3.2 → v3.3.6, SHA-pin install manifest Patch upgrade with bug fixes (diff normalization, installation ID cache). Pin the upstream manifest URL to commit SHA for supply chain integrity. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-07 08:21:11 -07:00
Erich Blume	e85c71e73f	Add changelog fragments for seccomp hardening and bracket fix Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 10:38:41 -07:00
Erich Blume	d3d67272a7	Fix blumeops-tasks swallowing bracket content in descriptions Rich markup parser interprets [text] as style tags, stripping wiki-links like [[review-compensating-controls]] to empty []. Escape description lines with rich.markup.escape(). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 10:37:40 -07:00
Erich Blume	59f3422d3e	Review compensating control: tailscale-network-isolation Verified: tailscale serve status shows only svc:k8s, ACLs restrict tag:flyio-target to port 443 with admin/operator ownership only, indri has no flyio-target tag. All 10 muted findings remain valid. Noted gap: no automated alerting on new flyio-target devices. Tracked in Todoist as MC4 (Manual Compliance Control Check CronJob). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-06 10:35:13 -07:00

863 commits