blumeops

Author	SHA1	Message	Date
Erich Blume	b5551e227e	Route Dagger build telemetry to Tempo The Dagger engine's internal OTLP proxy returns 500 on /v1/metrics when there's no real backend, causing ~9s retry warnings per pipeline step. Point OTEL_EXPORTER_OTLP_ENDPOINT at Tempo to give it a real endpoint. Also removes the stale os.environ workaround from main.py (the SDK initializes telemetry before our module loads, so it had no effect). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 08:27:12 -07:00
Erich Blume	ab834b641a	Fix OTEL metrics exporter warnings in Dagger builds The Dagger engine shim sets OTEL_METRICS_EXPORTER before our module loads, so os.environ.setdefault was a no-op. Switch to a hard override. Remove the redundant workflow-level env var since the fix belongs in the module. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-13 08:11:15 -07:00
Erich Blume	94c937d588	Disable OTLP metrics exporter in CI, update navidrome to main tag The Dagger Python SDK's OTLP metrics exporter hits a non-functional local endpoint (500s), burning ~9s per retry cycle. Set OTEL_METRICS_EXPORTER=none in the build-dagger CI job. Also update navidrome kustomization to the main-SHA tag (`c86b5d7`). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-11 17:26:25 -07:00
Erich Blume	c86b5d7772	Native Dagger container builds + Navidrome v0.61.1 (#330 ) All checks were successful Build Container / detect (push) Successful in 3s Details Build Container / build-dagger (navidrome) (push) Successful in 22m26s Details ## Summary - Move Dagger module from `.dagger/` to repo root (`src/blumeops/`), rename `blumeops-ci` → `blumeops` - Replace opaque `docker_build()` with native Dagger pipelines that surface full build errors per step - Migrate navidrome as the first container (`containers/navidrome/container.py`) - Upgrade navidrome from v0.60.3 to v0.61.1 (major artwork overhaul, SQLite FTS5 search, server-managed transcoding) - Add `dagger call container-version` for CI version extraction without Dockerfile parsing - All mise tasks (`container-list`, `container-version-check`, `container-build-and-release`) updated for hybrid mode - Legacy `docker_build()` fallback preserved for all other containers ## Motivation When navidrome v0.61.0 added a new Go build tag (`sqlite_fts5`), `docker_build()` showed only "exit code: 1". We had to run `docker build --progress=plain` manually to find `undefined: buildtags.SQLITE_FTS5`. Native Dagger pipelines show the full error inline. ## Container build dispatch needed After merge, dispatch container build for navidrome: ``` mise run container-build-and-release navidrome --ref `470b4bd` ``` ## Deploy steps 1. Wait for container build to complete 2. Back up navidrome-data PVC (non-reversible DB migrations) 3. `argocd app set navidrome --revision main && argocd app sync navidrome` 4. Verify at https://dj.ops.eblu.me ## Future Remaining containers migrate incrementally in follow-up PRs using the same pattern. Reviewed-on: #330	2026-04-11 17:11:56 -07:00
Erich Blume	99a1a49175	Revert kingfisher skip in container build workflow Kingfisher will build via Nix on ringtail instead of Dockerfile on indri, so the skip is no longer needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 21:42:38 -07:00
Erich Blume	a842b9c1e8	Skip kingfisher in CI container builds Kingfisher's Rust + Boost/vectorscan build exhausts indri's memory (aws-sdk-ec2 alone needs 2-3GB for rustc). Build locally on Gilbert and push manually until we have a beefier build host. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 20:52:07 -07:00
Erich Blume	fd0bebb0fc	Localize authentik-redis container (#309 ) All checks were successful Build Container / detect (push) Successful in 3s Details Build Container / build-dockerfile (alloy) (push) Successful in 12s Details Build Container / build-dockerfile (ntfy) (push) Successful in 11s Details Build Container / build-nix (alloy) (push) Successful in 20s Details Build Container / build-nix (authentik) (push) Successful in 6m10s Details Build Container / build-nix (authentik-redis) (push) Successful in 20s Details Build Container / build-nix (ntfy) (push) Successful in 6s Details ## Summary - Replace upstream `docker.io/library/redis:7-alpine` (Redis 7.4.8) with a nix-built container using Redis 8.2.3 from nixpkgs - Introduce attached service pattern: `parent` field in service-versions.yaml, `<parent>-<component>` naming convention, and `assert pkgs.redis.version == version` in default.nix to prevent silent version drift on `flake.lock` updates - Document the pattern in [[review-services]] so future attached services slot in cleanly - Backfill `parent: grafana` on existing `grafana-sidecar` entry ## Version drift protection 1. `flake.lock` update bumps nixpkgs redis → `assert` in `default.nix` breaks `nix-build` 2. Developer updates `version` in `default.nix` → prek's `container-version-check` demands matching `service-versions.yaml` update 3. Both must agree before commit succeeds ## Test plan - [ ] Build container from branch on ringtail (`mise run container-build-and-release authentik-redis`) - [ ] Update kustomization `newTag` to branch-built image tag - [ ] Sync authentik ArgoCD app from branch (`argocd app set authentik --revision localize-redis && argocd app sync authentik`) - [ ] Verify Authentik login, session persistence, and task queue still work - [ ] After merge: C0 follow-up to update `newTag` to the main-built image tag 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #309	2026-03-24 13:27:36 -07:00
Erich Blume	0d422f5234	Update tooling dependencies (March 2026) (#307 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 2m51s Details ## Summary Monthly tooling dependency update per [[update-tooling-dependencies]]. - Prek hooks: trufflehog v3.93.4→v3.94.0, ruff v0.15.2→v0.15.7, shfmt v3.12.0-2→v3.13.0-1, ansible-lint floor→26.3.0, ansible-core floor→2.18 - Fly.io proxy: nginx 1.28.2→1.29.6, Grafana Alloy v1.13.1→v1.14.1 - Forgejo workflows: actions/checkout v4.3.1→v6.0.2 (SHA-pinned across all 5 workflows) - Mise tasks: tightened Python lower bounds — rich≥14.0.0, typer≥0.24.0, httpx≥0.28.1, pyyaml≥6.0.2 ## Test plan - [x] `prek run --all-files` passes - [ ] Verify Fly.io deploy succeeds after merge (nginx minor bump + Alloy bump) - [ ] Spot-check a workflow run with the new actions/checkout v6 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #307	2026-03-24 08:11:46 -07:00
Erich Blume	bd0ff30d3f	Unify container build workflows (#306 ) All checks were successful Build Container / detect (push) Successful in 3s Details ## Summary - Merges `build-container.yaml` and `build-container-nix.yaml` into a single workflow - Detect job classifies each changed container by presence of `Dockerfile` and/or `default.nix` - Dockerfile containers build on `k8s` (indri) via Dagger; Nix containers build on `nix-container-builder` (ringtail) via nix-build + skopeo - Containers with both build files (alloy, nettest, ntfy) get built on both runners ## Test plan - [ ] Push a change to a Dockerfile-only container (e.g. grafana) — verify it builds on k8s only - [ ] Push a change to a nix-only container (e.g. jobsync) — verify it builds on nix-container-builder only - [ ] Push a change to a dual container (e.g. ntfy) — verify it builds on both runners - [ ] Test workflow_dispatch with a specific container name 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #306	2026-03-23 20:55:50 -07:00
Erich Blume	a87c997ee1	Expose Forgejo publicly at forge.eblu.me (#278 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m28s Details ## Summary Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service. - Forgejo hardening: Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO) - Tailscale Ingress: ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint - Fly.io proxy: nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit - Authentik: OAuth callback updated to forge.eblu.me - DNS/TLS: CNAME record in Pulumi, cert in fly-setup - Rename: ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is) ## Deployment Order 1. `mise run provision-indri -- --tags forgejo` (config changes) 2. Verify forge.ops.eblu.me still works 3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator` 4. Verify `curl https://forge.tail8d86e.ts.net` 5. `cd fly && fly deploy` 6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/` 7. `fly certs add forge.eblu.me -a blumeops-proxy` 8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik` 9. `mise run dns-preview && mise run dns-up` 10. Full verification (see below) 11. Rehearse `mise run fly-shutoff` 12. After merge: reset ArgoCD revisions to main, re-sync ## Verification Checklist - [ ] forge.eblu.me loads, shows public repos - [ ] forge.ops.eblu.me still works from tailnet - [ ] SSH clone via forge.ops.eblu.me:2222 works - [ ] HTTPS clone via forge.eblu.me works - [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH - [ ] /swagger returns 403 - [ ] Rapid login attempts trigger 429 rate limit - [ ] fail2ban bans after 5 failed logins in 10 minutes - [ ] ArgoCD can still sync (SSH unaffected) - [ ] `mise run fly-shutoff` stops all public traffic - [ ] `mise run services-check` passes Reviewed-on: #278	2026-03-03 08:40:41 -08:00
Erich Blume	be3cdad1cb	Add HA for CV and Docs: zero-downtime deploys (#273 ) ## Summary - Set `replicas: 2` with `maxUnavailable: 0` / `maxSurge: 1` on CV and Docs deployments so rolling updates never drop below 2 ready pods - Add PodDisruptionBudgets (`minAvailable: 1`) to protect against node drains and cluster maintenance - Add Fly.io cache purge step to `cv-deploy.yaml` workflow (docs already had this) so CV deploys don't serve stale cached content ## Deployment and Testing - [ ] `argocd app diff cv` / `argocd app diff docs` from branch - [ ] Deploy from branch: `argocd app set cv --revision feature/ha-cv-docs-zero-downtime && argocd app sync cv` - [ ] Verify 2 pods running: `kubectl get pods -n cv --context=minikube-indri` - [ ] Test rolling restart: `kubectl rollout restart deployment/cv -n cv --context=minikube-indri` - [ ] During rollout, confirm continuous availability via `curl -I https://cv.eblu.me` - [ ] After merge: reset ArgoCD to main, re-sync both apps Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/273	2026-02-26 07:53:21 -08:00
Erich Blume	7641018c6a	Fix container build workflows to checkout dispatch ref When manually dispatching a container build with --ref, the build job now checks out the specified commit instead of the branch HEAD. This allows building containers from feature branches before merging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 17:24:32 -08:00
Erich Blume	84d2cdcf14	Update tooling dependencies (Feb 2026 cycle) Pre-commit: trufflehog v3.93.4, ruff v0.15.2, shellcheck v0.11.0.1, prettier v3.8.1, actionlint v1.7.11 Fly.io: pin nginx 1.28.2-alpine, bump alloy v1.5.1 -> v1.13.1 Forgejo workflows: pin actions/checkout to SHA (v4.3.1) Mise tasks: normalize httpx>=0.28.0, typer>=0.15.0 across all scripts Add how-to doc for the monthly tooling dependency update cycle. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 13:22:09 -08:00
Erich Blume	abb7e6fe88	Add branch-cleanup mise task (#247 ) ## Summary - New `mise run branch-cleanup` task that finds branches merged into main and deletes them locally and on the Forgejo remote - Configurable `--cutoff` (default 30 days) skips branches with recent HEAD commits - Supports `--dry-run`, `--local-only`, `--remote-only` flags - Interactive confirmation before any deletion ## Test plan - [x] `mise run branch-cleanup -- --dry-run` shows correct table of candidates - [ ] Run without `--dry-run` to confirm actual deletion works Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/247	2026-02-22 16:00:09 -08:00
Erich Blume	ff63679efb	Enable zot registry auth + wire CI credentials (#237 ) ## Summary - Enable OIDC + API key authentication on zot registry with three-tier accessControl - `anonymousPolicy: ["read"]` — anyone can pull - `artifact-workloads` group: `["read", "create"]` — CI push, no overwrite/delete - `admins` group: `["read", "create", "update", "delete"]` — break-glass - Wire both CI push paths (Dagger and Nix/skopeo) with `ZOT_CI_API_KEY` credentials - Add `artifact-workloads` PolicyBinding in Authentik blueprint for zot app access - Add `ZOT_CI_API_KEY` to Forgejo Actions secrets via existing ansible role Completes the `wire-ci-registry-auth` and `harden-zot-registry` Mikado cards. ## Manual Deployment Steps (after merge) 1. Deploy Authentik blueprint: `argocd app sync authentik` 2. In Authentik admin UI: set a password for the `zot-ci` service account 3. Deploy zot config: `mise run provision-indri -- --tags zot` 4. Log in to `https://registry.ops.eblu.me` as `zot-ci` via OIDC → generate API key 5. Store API key in 1Password as `zot-ci-apikey` in blumeops vault 6. Sync Forgejo secrets: `mise run provision-indri -- --tags forgejo_actions_secrets` 7. Trigger a test container build to verify CI push 8. Verify anonymous pull: `curl -sf https://registry.ops.eblu.me/v2/_catalog` ## Uncertainties - Zot `accessControl` group matching with OIDC: Groups from Authentik's `profile` scope claim should map to zot policy groups, but the exact claim-to-group matching needs runtime verification - `http.auth.apikey: true`: This config key is documented but needs verification against the specific zot version built from source on indri - API key permissions: Need to confirm zot API keys inherit the generating user's group for accessControl evaluation ## Test Plan - [ ] `mise run provision-indri -- --check --diff --tags zot` shows expected config changes - [ ] Anonymous pull works after deploy - [ ] Unauthenticated push fails (401) - [ ] OIDC browser login redirects to Authentik and back - [ ] API key push works after key generation - [ ] CI push succeeds with both Dagger and skopeo paths - [ ] `mise run services-check` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/237	2026-02-21 12:20:29 -08:00
Erich Blume	02f2be5523	Use nix eval instead of dagger for nix runner version extraction Dagger CLI needs a container runtime (Docker/containerd) to start its engine, which the bare nix runner doesn't have. Use nix eval directly instead — it's already available and more appropriate for a nix host. Reverts the dagger flake input since it's not usable here. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-20 23:21:16 -08:00
Erich Blume	ffa8727660	Adopt commit-based container tags (#232 ) ## Summary - Replace git-tag-triggered container builds with path-based triggers on main and workflow_dispatch - Image tags now encode upstream app version + commit SHA (`vX.Y.Z-<sha>`) for full traceability - Replace `container-tag-and-release` task with `container-build-and-release` (dispatches workflows via Forgejo API) - Update dagger `publish()` to accept `commit_sha` parameter - Update all docs and references to the new workflow ## Deployment and Testing - [ ] Merge to main - [ ] `mise run container-build-and-release <name>` for each container to populate new-format tags - [ ] Verify tags in registry via `mise run container-list` - [ ] Existing images untouched — old tags remain available Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/232	2026-02-20 22:56:20 -08:00
Erich Blume	695089499e	Nix container build for nettest (#214 ) ## Summary - Add `containers/nettest/default.nix` using `dockerTools.buildLayeredImage` with curl, jq, dnsutils, cacert, and bash — equivalent to the existing Dockerfile - Update `container-tag-and-release` to require `--nix` or `--dockerfile` flag when both build types exist for a container - Update `container-list` to show `[dockerfile+nix]` label when both exist ## Deployment and Testing - [ ] SSH to ringtail, run `nix build -f containers/nettest/default.nix -o result` to verify the nix expression builds - [ ] Tag `nettest-nix-v1.0.0`, confirm `build-container-nix` workflow runs on `nix-container-builder` runner and pushes to registry - [ ] Smoke test on ringtail k3s: `kubectl run nettest --image=registry.ops.eblu.me/blumeops/nettest:v1.0.0 --restart=Never && kubectl logs nettest` - [ ] Verify `mise run container-list` shows `[dockerfile+nix]` for nettest - [ ] Verify `mise run container-tag-and-release nettest v1.1.0` prompts for build type Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/214	2026-02-19 08:42:58 -08:00
Erich Blume	918df9e642	Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209 ) ## Summary Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions runner for building containers with Nix. ### K3s cluster - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint) - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`) - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access - `kubectl` added to system packages ### 1Password Connect + External Secrets Operator - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`, `external-secrets-config-ringtail` - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password` namespace in post_tasks ### Systemd Forgejo Actions runner - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env` - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access ### Nix container build workflow - New `.forgejo/workflows/build-container-nix.yaml` triggers on `-nix-v[0-9]` tags (e.g. `nettest-nix-v1.0.0`) - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy` - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering ### Mise task updates - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`) - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`) ## Post-merge 1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild 2. Register k3s cluster in ArgoCD (first time only): ```fish ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' \| \ sed 's\|127.0.0.1\|ringtail.tail8d86e.ts.net\|' > /tmp/k3s-ringtail.yaml set -x KUBECONFIG /tmp/k3s-ringtail.yaml argocd cluster add default --name k3s-ringtail 3. Sync ArgoCD apps in order: 1password-connect-ringtail -> external-secrets-crds-ringtail -> external-secrets-ringtail -> external-secrets-config-ringtail 4. Verify runner: ssh ringtail 'systemctl status gitea-runner-nix-container-builder' 5. Check Forgejo admin panel for ringtail-nix-builder runner online 6. Test: create containers/<name>/default.nix, tag with <name>-nix-v0.1.0 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209	2026-02-18 21:15:30 -08:00
Erich Blume	779b7d6709	Eliminate double towncrier run in release workflow (#199 ) ## Summary - Added a new `build_quartz` Dagger function that builds the Quartz site from a pre-processed source tree (no towncrier) - Reordered the release workflow so towncrier runs once on the runner, then passes the updated working tree to `build-quartz` - `build_docs` and `build_changelog` are preserved for standalone use — `build_docs` now delegates to `build_quartz` internally ## Motivation Previously towncrier ran twice per release: once inside a Dagger container (via `build_docs` → `build_changelog`) and once on the runner to capture CHANGELOG.md changes for the git commit. This was wasteful and fragile — if towncrier behavior changed, the two runs could produce different results. ## Test plan - [ ] Review diff to confirm workflow step ordering is correct - [ ] Trigger a release and confirm towncrier runs only once - [ ] Verify the docs tarball contains the updated CHANGELOG.md - [ ] `dagger call build-quartz --src=. --version=vX.Y.Z` should work standalone Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/199	2026-02-16 21:24:34 -08:00
Erich Blume	2fad8db639	Add yq to forgejo-runner and replace sed YAML edits (#180 ) All checks were successful Build Container / build (push) Successful in 1m31s Details ## Summary - Install yq in the forgejo-runner container image for structured YAML editing - Replace fragile `sed` regex patterns with `yq` in `build-blumeops.yaml` and `cv-deploy.yaml` workflows ## Deployment 1. Merge this PR 2. Tag and release forgejo-runner v3.1.0: `mise run container-tag-and-release forgejo-runner v3.1.0` 3. Update runner label in `argocd/manifests/forgejo-runner/external-secret.yaml` from `v3.0.2` to `v3.1.0` 4. Sync the forgejo-runner app: `argocd app sync forgejo-runner` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/180	2026-02-13 10:20:27 -08:00
Erich Blume	01e19023ee	Add CV/resume web app at cv.ops.eblu.me (#169 ) ## Summary - nginx container (`containers/cv/`) downloads and serves a content tarball at startup (same pattern as quartz) - ArgoCD app + k8s manifests (deployment, service, Tailscale ingress) - Caddy route for `cv.ops.eblu.me` - Deploy workflow: resolves "latest" or specific version from Forgejo packages, updates deployment, syncs ArgoCD - Content is built and released from the separate [cv repo](https://forge.ops.eblu.me/eblume/cv) ## Deployment steps (after merge) 1. `mise run container-tag-and-release cv v1.0.0` 2. Run "Release CV" workflow in cv repo (SPECIFIC_VERSION `v0.1.0`) 3. Run "Deploy CV" workflow in blumeops (default: latest) 4. `mise run provision-indri -- --tags caddy` 5. Verify at `https://cv.ops.eblu.me/` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/169	2026-02-12 11:09:41 -08:00
Erich Blume	95364dcb48	Simplify runner image (Dagger Phase 3) (#162 ) All checks were successful Build Container / build (push) Successful in 1m13s Details ## Summary With Phases 1 and 2 complete, the runner image no longer needs most of its bundled tools. This PR strips it down and adds what was missing. Removed (now inside Dagger containers): - Node.js 24.x - Docker CLI + buildx plugin - skopeo - gnupg, lsb-release, xz-utils Added: - `tzdata` — fixes the TZ env var (#159, #160, #161) so `TZ=America/Los_Angeles` actually works - `flyctl` — was being installed from scratch every release Workflow changes: - Remove "Ensure Dagger CLI" bootstrap steps from both workflows (Dagger is in the image) - Remove "Install flyctl" step from build-blumeops (flyctl is in the image) - Remove job-level `TZ` from build-blumeops (moved to runner configmap `runner.envs`) - Set `TZ: America/Los_Angeles` in runner configmap so all job containers inherit it ## Deployment After merge: 1. Build and release the new runner image: `mise run container-release forgejo-runner v2.0.0` 2. Sync the runner: `argocd app sync forgejo-runner` 3. Verify: `kubectl -n forgejo-runner exec deploy/forgejo-runner -c runner -- date` (but the real test is running a docs release and checking the changelog date) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/162	2026-02-11 17:24:20 -08:00
Erich Blume	e84ffb7d7f	Set TZ on build-blumeops workflow job (#161 ) ## Summary The runner pod's `TZ` env var (#159, #160) doesn't propagate to workflow job containers — jobs run inside Docker containers spawned by the DinD sidecar, not in the runner process itself. Set `TZ: America/Los_Angeles` at the job level so `uvx towncrier build` uses the correct timezone. This is the actual fix for the Feb 12 changelog dates. The runner pod TZ is still useful for runner daemon logs but doesn't affect job execution. Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/161	2026-02-11 17:06:44 -08:00
Erich Blume	b197bd5f58	Adopt Dagger CI for docs build (Phase 2) (#157 ) ## Summary Migrates the docs build pipeline to Dagger (Phase 2 of the Dagger CI adoption plan). - Backfill `date-modified` frontmatter on all 80 docs — Dagger's `--src=.` excludes `.git`, so Quartz can't use git history for page dates. Frontmatter dates work with or without git. - New `docs-check-frontmatter` mise task + pre-commit hook — validates all docs have `title`, `tags`, and `date-modified` - New Dagger functions — `build_changelog` (towncrier in Python container) and `build_docs` (chains changelog → Quartz build in Node container, returns tarball) - Simplified CI workflow — the ~44-line inline Quartz build (clone, npm ci, build, tar, cleanup) is replaced by `dagger call build-docs`. Changelog step remains local on the runner since towncrier needs to modify the host working tree for the git commit. ### Design decisions - Towncrier runs twice in CI: once inside Dagger (for the docs tarball) and once on the runner (for the git commit). This is intentional — Dagger's directory export is additive and can't delete the consumed changelog fragments from the host. - Artifact hosting stays on Forgejo Releases (not migrated to Forgejo Packages as the plan doc originally suggested). That migration can happen independently. - `date-modified` frontmatter preserved even though `build_changelog` installs git — the git there is only for towncrier's `git add` call, not for history. The local iteration story (`dagger call build-docs --src=. --version=dev` with uncommitted changes) depends on frontmatter dates. ### Local iteration ```bash dagger call build-docs --src=. --version=dev export --path=./docs-dev.tar.gz tar tf docs-dev.tar.gz \| head -20 ``` ## Deployment and Testing - [x] `dagger call build-docs --src=. --version=dev` produces valid 1.1MB tarball (149 HTML pages) - [x] Pre-commit hooks pass (including new `docs-check-frontmatter`) - [ ] Full `workflow_dispatch` run after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/157	2026-02-11 16:33:16 -08:00
Erich Blume	1bc2b421a8	Adopt Dagger CI for container builds (Phase 1) (#156 ) All checks were successful Build Container / build (push) Successful in 13s Details ## Summary - Add Dagger Python module (`.dagger/`) with `build` and `publish` functions for container images - Replace Docker buildx + skopeo composite action with `dagger call publish` in `build-container.yaml` - BuildKit's native push is compatible with Zot — skopeo workaround eliminated - Add Dagger CLI (v0.19.11) to forgejo-runner Dockerfile, bump runner to v2.6.0 - Bootstrap step in workflow curl-installs dagger if not in runner (for first build on v2.5.1 runner) - Delete old `.forgejo/actions/build-push-image/` composite action - Add GPLv3 LICENSE ## Verified locally - `dagger call build --src=. --container-name=nettest` — builds ✓ - `dagger call publish --src=. --container-name=nettest --version=dagger-test` — pushed to Zot ✓ - `dagger call build --src=. --container-name=forgejo-runner` — new runner image builds ✓ - Dagger CLI accessible inside built runner image ✓ ## Deployment sequence (after merge) 1. `mise run container-tag-and-release forgejo-runner v2.6.0` — old runner bootstraps dagger via curl, builds new runner 2. `argocd app sync forgejo-runner` — runner restarts with v2.6.0 (dagger baked in) 3. `mise run container-tag-and-release nettest v0.13.0` — end-to-end test of new pipeline 4. `mise run container-list` — verify tags ## Not included (future phases) - Phase 2: docs build + Forgejo packages migration - Phase 3: runner simplification (remove skopeo, Node.js, etc.) - Phase 4: future workflows Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/156	2026-02-11 15:38:31 -08:00
Erich Blume	cef7611cba	Wrap fly ssh cache purge in sh -c for BusyBox fly ssh console -C doesn't run through a shell, so && was passed as literal arguments to rm. Wrap in sh -c to get proper shell parsing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 13:35:11 -08:00
Erich Blume	0efcce2984	Purge Fly.io proxy cache after docs release (#154 ) ## Summary - The Fly.io nginx proxy caches docs responses for 24h (`proxy_cache_valid 200 1d`) - After a release, docs.eblu.me kept serving stale content until the cache expired - This caused v1.5.4 to show v1.5.3 on the CHANGELOG page - Adds `flyctl` install and `fly ssh console` cache purge steps to the build workflow, running after the ArgoCD deploy completes ## Test plan - [ ] Next release should show the correct version on docs.eblu.me/CHANGELOG immediately - [ ] Verify the `fly ssh console` command succeeds in the workflow logs Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/154	2026-02-11 13:33:26 -08:00
Erich Blume	64a78422b1	Add Fly.io public reverse proxy for docs.eblu.me (#120 ) Some checks failed Deploy Fly.io Proxy / deploy (push) Failing after 9s Details ## Summary - Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale - First service exposed: `docs.eblu.me` — the Quartz static docs site - Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME - Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow ## Key details - Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed - Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts - nginx caches aggressively for the static site; health check is on the default_server block - ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only - DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev` ## Test plan - [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok` - [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status` - [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert - [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected) - [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120	2026-02-08 02:36:19 -08:00
Erich Blume	95610d8e54	Fix Quartz build to preserve git history for accurate file dates (#106 ) ## Summary Fixes the "isn't yet tracked by git, dates will be inaccurate" warnings by using Quartz's `-d docs` flag instead of symlinking. ## Problem The previous approach symlinked `content -> docs`, but git doesn't follow symlinks. When Quartz asked git about `content/index.md`, git had no history for that path. ## Solution Use `npx quartz build -d docs` to tell Quartz to read from `docs/` directly. Now when Quartz asks git about `docs/index.md`, git finds the actual file history. - CHANGELOG.md is copied (not symlinked) into `docs/` for the build, then removed - All other files have accurate git-based dates ## Testing Tested locally - build produces no warnings. Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/106	2026-02-04 08:46:47 -08:00
Erich Blume	03bda41de4	Fix Quartz build to preserve git history for accurate file dates (#105 ) ## Summary Fixes the "isn't yet tracked by git, dates will be inaccurate" warnings in the Build docs step by restructuring how Quartz builds the documentation. ## Problem Previously, we copied docs into Quartz's content folder. Since this was inside a fresh Quartz clone with no history of our files, the `CreatedModifiedDate` plugin couldn't determine accurate dates. ## Solution Build Quartz from within the blumeops repo instead: 1. Copy Quartz's build system (quartz/, package.json, etc.) into the workspace 2. Symlink `content` -> `docs` (preserves git history) 3. Symlink `docs/CHANGELOG.md` -> `../CHANGELOG.md` 4. Build from workspace root where git can trace file history 5. Clean up artifacts after creating tarball ## Deployment and Testing - [ ] Run build workflow and verify no "not tracked by git" warnings - [ ] Verify file dates appear correctly on built docs site Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/105	2026-02-04 08:25:46 -08:00
Erich Blume	efdd569285	Improve build workflow with version bump selection and changelog in releases (#104 ) ## Summary - Add `version_type` choice input with options: BUMP_PATCH (default), BUMP_MINOR, BUMP_MAJOR, SPECIFIC_VERSION - Add optional `specific_version` input for explicit version selection - Include changelog content in Forgejo release body under "What's Changed" section - Move CHANGELOG.md to repository root (still copied into docs during Quartz build) - Add CHANGELOG link to docs index page - Update doc-links script to recognize build-time docs from repo root ## Changes Workflow inputs: - Previously: single optional `version` string input - Now: `version_type` choice dropdown (defaults to BUMP_PATCH) + optional `specific_version` for explicit versions Release body: - Previously: just asset download instructions - Now: includes "What's Changed" section with changelog entries for this release CHANGELOG.md location: - Previously: `docs/CHANGELOG.md` - Now: `CHANGELOG.md` (repo root), copied into docs content during build ## Deployment and Testing - [ ] Run build workflow with BUMP_PATCH (default) - [ ] Run build workflow with BUMP_MINOR - [ ] Verify changelog appears in release body - [ ] Verify docs site includes CHANGELOG page Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/104	2026-02-04 08:13:16 -08:00
Erich Blume	82bcd935cd	Move DOCS_RELEASE_URL from ConfigMap to Deployment This ensures ArgoCD sync triggers a pod rollout when the URL changes, since ConfigMap data changes don't restart pods automatically. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 17:23:52 -08:00
Erich Blume	1f73eb675d	Auto-deploy docs from build workflow (#93 ) ## Summary - Add `uv` and `argocd` CLI to forgejo-runner container image - Add `workflow-bot` ArgoCD account with sync permissions (declarative via kustomize patches) - Add `ARGOCD_AUTH_TOKEN` to forgejo-runner external secret for workflow auth - Update build workflow to auto-deploy docs after release: - Update configmap with new release URL - Commit changelog and configmap changes - Sync docs app via ArgoCD ## Deployment and Testing Manual steps required before this can work: 1. [ ] Build and push new forgejo-runner image (v2.4.0) 2. [ ] Sync argocd app to create workflow-bot account 3. [ ] Generate token: `argocd account generate-token --account workflow-bot` 4. [ ] Store token in 1Password under "Forgejo Secrets" with field `argocd_token` 5. [ ] Sync forgejo-runner app to pick up new external secret 6. [ ] Update forgejo-runner deployment to use new image version 7. [ ] Test by running workflow manually 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/93	2026-02-03 16:58:03 -08:00
Erich Blume	9a8587b83f	Add towncrier changelog system (#86 ) ## Summary - Configure towncrier with custom types (feature, bugfix, infra, doc, misc) - Build initial v0.1.0 changelog from zk management log entries - Integrate towncrier into build-blumeops workflow - Update README to mark Phase 1b complete ## How It Works 1. Add changelog fragments to `docs/changelog.d/` as `<id>.<type>.md` 2. When running build-blumeops workflow, towncrier collects fragments 3. CHANGELOG.md is updated and fragments are removed 4. Changes are committed back to main before docs build ## Testing - [x] Tested `uvx towncrier build` locally - [ ] Test workflow execution (after merge) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/86	2026-02-03 11:48:13 -08:00
Erich Blume	95a82321ee	Add authentication and error logging to release creation - Add Authorization header using GITHUB_TOKEN - Remove silent fail flag to see error responses - Log API responses for debugging Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 09:33:47 -08:00
Erich Blume	8780928b9a	Use GitHub upstream for Quartz until mirror is fixed Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 09:17:40 -08:00
Erich Blume	f11f8a4e89	Fix workflow to handle no existing releases Remove -f flag from curl so 404 on /releases/latest doesn't fail the script when there are no releases yet. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-02-03 09:15:53 -08:00
Erich Blume	b8104d75ad	Move zk cards to docs/zk/ for documentation restructuring (#84 ) ## Summary - Move all existing zettelkasten cards from `docs/` to `docs/zk/` as a temporary holding area - Update `zk-docs` mise task to look in the new location - Add `docs/README.md` explaining the Diataxis-based restructuring plan and target audiences ## Context This is phase 1 of a multi-phase documentation restructuring effort. The goal is to reorganize docs to follow the Diataxis framework while serving multiple audiences: 1. Erich (owner) - knowledge graph/zk 2. Claude/AI agents - memory and context enrichment 3. New external readers - high-level overview 4. Potential operators/contributors - onboarding 5. Replicators - people wanting to duplicate the approach ## Testing - [x] Verified `mise run zk-docs` still works with the new path - [x] Updated obsidian.nvim config (in ~/.config/nvim) to point to new path ## Note The obsidian.nvim config change is outside this repo but was made as part of this work. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/84	2026-02-03 09:13:50 -08:00
Erich Blume	fd29244854	Simplify CI: remove Tailscale sidecar, use skopeo for push (#74 ) ## Summary - Remove Tailscale sidecar from build-push-image action - registry.ops.eblu.me is directly reachable from k8s pods via Caddy - Use skopeo for pushing images instead of docker push - Docker 27's manifest format has compatibility issues with zot registry - Remove tailscale_authkey secret requirement from workflows ## Deployment and Testing - [x] Tested with nettest-v0.10.0 tag - build succeeded and image pushed to registry 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/74	2026-01-30 10:18:20 -08:00
Erich Blume	ea42362b6f	Migrate Forgejo runner to Kubernetes with DinD (#60 ) ## Summary - Deploy Forgejo runner to k8s with Docker-in-Docker sidecar - Add job execution image with Node.js and Docker CLI - Retire host-mode runner on indri - All CI jobs now run containerized in k8s ## Components Added - `containers/forgejo-runner/Dockerfile` - Job execution image - `argocd/apps/forgejo-runner.yaml` - ArgoCD Application - `argocd/manifests/forgejo-runner/` - Kubernetes manifests ## Components Removed - `ansible/roles/forgejo_runner/` - No longer needed ## Changes to Existing Files - `.forgejo/workflows/build-container.yaml` - Use `k8s` runner with `DOCKER_HOST` env - `.github/actionlint.yaml` - Only `k8s` label now valid ## Deployment 1. Apply secret: `op inject -i argocd/manifests/forgejo-runner/secret.yaml.tpl \| kubectl --context=minikube-indri apply -f -` 2. Sync ArgoCD: `argocd app sync forgejo-runner` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/60	2026-01-25 19:56:17 -08:00
Erich Blume	424647cd93	Use Tailscale sidecar for container registry push Some checks failed Build Container / build (push) Failing after 1m9s Details Docker Desktop's VM can't resolve tailnet hostnames. Work around this by: 1. Starting a Tailscale container that joins the tailnet 2. Building the image with docker build 3. Saving to tarball with docker save 4. Pushing via skopeo inside the Tailscale container Uses TS_CI_GATEWAY_AUTHKEY repository secret for authentication. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 19:29:01 -08:00
Erich Blume	31697b4d63	Add nettest container for CI/CD network debugging (#52 ) Some checks failed Build Container / build (push) Failing after 18s Details ## Summary - Add `containers/nettest/` with Alpine-based Dockerfile and connectivity test script - Add `.forgejo/workflows/build-nettest.yaml` workflow triggered by `nettest-v*` tags - Test script checks DNS resolution and HTTPS connectivity to forge and registry ## Deployment and Testing - [ ] Merge PR to main - [ ] Run `mise run container-release nettest v0.1.0` to trigger first build - [ ] Verify workflow runs successfully and container can reach tailnet services - [ ] Manually test from minikube: `kubectl run nettest --rm -it --image=registry.tail8d86e.ts.net/blumeops/nettest:v0.1.0` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/52	2026-01-24 16:54:35 -08:00
Erich Blume	8ca8798121	Switch to Buildah for container builds (#51 ) All checks were successful Test CI / test (push) Successful in 4s Details ## Summary - Replace Docker with Buildah for container image builds - No Docker socket required - buildah is daemonless - Cleaner security model (no privileged containers or socket mounting) - Remove Docker-related security context from deployment ## Changes - Update Dockerfile to install buildah/podman instead of docker-cli - Configure buildah storage with overlay driver and fuse-overlayfs - Update composite action to use `buildah bud` and `buildah push` - Add `imagePullPolicy: Always` to ensure fresh image pulls - Update test workflow to verify buildah/podman ## Testing - [ ] Runner pod starts successfully - [ ] Buildah is available in runner - [ ] Test workflow verifies buildah/podman versions - [ ] Container build workflow builds and pushes to zot 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/51	2026-01-24 13:30:26 -08:00
Erich Blume	5fcd122494	Reorganize CI/CD bootstrap phases and add custom runner Dockerfile (#50 ) All checks were successful Test CI / test (push) Successful in 2s Details ## Summary - Reorder CI/CD bootstrap phases to address chicken-and-egg problem - P2 is now "Custom Runner Image" (stock runner lacks Node.js) - Add P3 for "Mirror Forgejo & Build from Source" - Rename P3 -> P4 (Self-Deploy), P4 -> P5 (Container Builds) - Add Dockerfile for custom runner with Node.js, npm, docker, build tools - Update overview with new phase structure, host mode notes, and cross-compilation challenge ## Key Changes ### Phase Reordering \| Old \| New \| Name \| \|-----\|-----\|------\| \| P1 \| P1 \| Enable Actions (complete) \| \| P2 \| P2 \| Custom Runner Image (new focus) \| \| - \| P3 \| Mirror Forgejo & Build (new) \| \| P3 \| P4 \| Self-Deploy \| \| P4 \| P5 \| Container Builds \| ### Custom Runner Dockerfile The stock `forgejo/runner:3.5.1` image lacks Node.js, so `actions/checkout@v4` doesn't work. The new Dockerfile adds: - Node.js + npm (for GitHub Actions) - Docker CLI (for container builds) - Build tools (gcc, make, curl, jq) ### Bootstrap Strategy 1. Build custom runner image manually on gilbert (podman build) 2. Push to zot registry 3. Update deployment to use custom image 4. Then enable auto-build workflow for runner ## Deployment and Testing - [x] Review plan changes - [x] Build custom runner image manually and verify - [x] Update runner deployment - [x] Test `actions/checkout@v4` works 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/50	2026-01-23 18:50:27 -08:00
Erich Blume	3bcad4189f	Add actionlint pre-commit hook for workflow validation (#49 ) All checks were successful Test CI / test (push) Successful in 0s Details ## Summary - Fix workflow to use `github.` context variables (Forgejo schema validator only recognizes GitHub Actions syntax, not `gitea.` aliases) - Pass untrusted inputs through environment variables (security best practice per actionlint) - Add actionlint to Brewfile and pre-commit config to catch workflow validation errors locally ## Deployment and Testing - [x] Pre-commit hooks all pass - [x] actionlint validates `.forgejo/workflows/test.yaml` successfully - [ ] Verify workflow runs without errors on Forge after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/49	2026-01-23 17:56:24 -08:00
Erich Blume	7893c41020	Enable Forgejo Actions (Phase 1) (#48 ) All checks were successful Test CI / test (push) Successful in 0s Details ## Summary - Refactor Forgejo app.ini to be managed by ansible with secrets from 1Password - Enable Forgejo Actions in config (`[actions] ENABLED = true`) - Add `repo.actions` to DEFAULT_REPO_UNITS - Clean up unused MySQL database fields (we use SQLite) ## Phase 1 Progress This PR covers the first part of Phase 1 (ci-cd-bootstrap plan): - [x] Refactor app.ini to ansible template - [x] Store secrets in 1Password - [x] Enable Actions in config - [ ] Deploy config changes (pending review) - [ ] Create runner registration token - [ ] Deploy runner to k8s - [ ] Test with simple workflow ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags forgejo` to deploy - [ ] Verify Forgejo restarts correctly - [ ] Verify Actions tab appears in repo settings 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/48	2026-01-23 17:00:12 -08:00

47 commits