The Docker-based runner with Tailscale sidecar approach was abandoned
in favor of host execution mode. Clean up the unused infrastructure:
- Remove tailscale_ci_gateway role and its reference in indri.yml
- Remove tag:ci-gateway ACL grants and tagOwners from pulumi policy
- Plist already removed from indri
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Docker-based runner had networking issues reaching Forgejo from job
containers. Host execution mode runs the runner daemon directly on indri,
with jobs executing on the host. Actions that need Docker use host
networking to access localhost:3001.
- Runner binary compiled locally at ~/code/3rd/forgejo-runner
- Labels use :host suffix instead of :docker://image
- PATH set in launchd plist for mise-managed tools (node, etc.)
- Container network set to "host" for actions needing Docker
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add --accept-routes to tailscale-ci-gateway for service routing
- Run forgejo-runner as root for docker socket access
- Mount actual docker socket path (not symlink)
- Use gateway network namespace for tailnet connectivity
- Registration uses gateway network for Forgejo access
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Architecture:
- tailscale_ci_gateway role: Runs Tailscale container on tailnet-jobs network
- forgejo_runner role: Runs runner daemon in container on same network
- Job containers also use tailnet-jobs network
This allows the runner and jobs to reach forge.tail8d86e.ts.net via
the Tailscale gateway, avoiding hairpinning issues with localhost.
Changes:
- Add tailscale_ci_gateway role with launchd management
- Refactor forgejo_runner to use containerized daemon
- Runner registers with Tailscale URL instead of localhost
- Job containers run on tailnet-jobs network
- Update playbook role ordering (gateway before runner)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add ci-gateway tag owner (admin and blumeops can assign)
- Grant ci-gateway access to forge:443 for git operations
- Grant ci-gateway access to registry:443 for container push/pull
- Add ACL test for ci-gateway access
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove forgejo-runner k8s manifests and ArgoCD app (runner now on indri)
- Remove build-runner workflow (no longer needed)
- Add ci-base image with Ubuntu 22.04 + common CI tools
- Add build-ci-base workflow to build the image
- Update test workflow to check docker instead of buildah
- Document bootstrap vs production mode for runner labels
- Configure host.docker.internal:5050 for zot access from job containers
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use Docker instead of buildah in composite action
- Build workflows now run on docker-builder label
- Add actionlint config for custom runner labels
- Avoids nested containerization complexity in k8s
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Run forgejo-runner directly on indri using Docker container mode
instead of trying to build containers inside k8s pods. This avoids
nested containerization complexity.
Features:
- Build from source using mise + Go
- Docker container mode for job isolation
- Can build containers via Docker socket
- Labels: docker-builder (distinct from k8s runner)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Buildah needs UID/GID remapping to extract images with files
owned by different users (root, shadow, etc). Configure
subordinate UID/GID ranges for the runner user.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Remove imagePullPolicy: Always (rely on immutable tags)
- Use explicit version tag instead of :latest
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Multi-stage build from mirrored forgejo-runner source
- Create proper runner user with passwd entry (fixes buildah)
- Use named user instead of numeric UID
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Workflows trigger on git tags (e.g. runner-v1.0.0, devpi-v1.0.0)
- Composite action takes explicit version, tags image with version + SHA
- Add mise-tasks/container-list to enumerate containers and recent tags
- Add mise-tasks/container-release to create release tags
- Update CLAUDE.md with container release commands
- TODO: investigate zot tag immutability
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Replace docker-cli with buildah/podman in runner image
- Configure buildah for overlay storage with fuse-overlayfs
- Add registry config for insecure local registry
- Remove Docker socket mount and root security context from deployment
- Update composite action to use buildah bud/push instead of docker
Buildah is daemonless - no Docker socket required, cleaner security model.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Create composite action: .forgejo/actions/build-push-image
- Add build-runner.yaml workflow (triggers on Dockerfile changes)
- Add build-devpi.yaml workflow (triggers on Dockerfile/start.sh changes)
- Mount Docker socket in runner deployment for container builds
- Run runner as root to access Docker socket
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
- Fix workflow to use `github.*` context variables (Forgejo schema validator only recognizes GitHub Actions syntax, not `gitea.*` aliases)
- Pass untrusted inputs through environment variables (security best practice per actionlint)
- Add actionlint to Brewfile and pre-commit config to catch workflow validation errors locally
## Deployment and Testing
- [x] Pre-commit hooks all pass
- [x] actionlint validates `.forgejo/workflows/test.yaml` successfully
- [ ] Verify workflow runs without errors on Forge after merge
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/49
- Mark Phase 1 (Enable Actions) as completed with date
- Check off all verification items in P1
- Add Step 6 to Phase 4 for runner logging and metrics
- Update overview table with status column
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The stage.match selector wasn't preventing Alloy from logging decode
errors internally. Removing logfmt parsing entirely - JSON parsing
handles most structured logs, and plain text logs still get collected.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
## Summary
- Use `stage.match` to conditionally apply logfmt parsing only to lines that don't start with `{`
- This prevents error spam like `"failed to decode logfmt" component_path=/ component_id=loki.process.pods component=stage type=logfmt err="logfmt syntax error at pos 2 on line 1: unexpected '\"'"` when JSON-formatted logs hit the logfmt parser
## Deployment and Testing
- [ ] Sync alloy-k8s app to feature branch and verify errors stop appearing
- [ ] Verify JSON logs are still parsed correctly
- [ ] Verify logfmt logs (from Loki, Prometheus etc.) are still parsed correctly
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/46
LaunchDaemons run in the system domain and require sudo to query.
Without become: true, the check always fails and tries to reload.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack.
Summary
- Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal)
- Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses
- Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics
- Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net)
- Add ACL rule for port 9187 (CNPG metrics)
- Delete obsolete ansible roles for prometheus and loki
Changes
- argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress
- argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress
- argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications
- argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS
- argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint
- argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics
- ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints
- pulumi/policy.hujson - ACL for port 9187
- Deleted ansible/roles/prometheus/ and ansible/roles/loki/
Deployment and Testing
- Stop prometheus and loki on indri
- Sync ArgoCD apps (apps, prometheus, loki, grafana)
- Run mise run provision-indri -- --tags alloy
- Verify Grafana dashboards show data
🤖 Generated with https://claude.ai/claude-code
Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42
## Summary
- Remove ansible roles for services migrated to k8s: devpi, kiwix, transmission
- Also remove unused node_exporter and podman ansible roles
- Remove service tags from indri for k8s-hosted services (grafana, kiwix, devpi, pg, feed)
- Update indri description to reflect current architecture
## Changes
**Ansible roles removed** (34 files, ~1000 lines):
- devpi, devpi_metrics
- kiwix
- transmission, transmission_metrics
- node_exporter
- podman
**Pulumi indri tags removed**:
- tag:grafana, tag:kiwix, tag:devpi, tag:pg, tag:feed
These services now run in k8s with their own Tailscale devices via tailscale-operator.
## Deployment and Testing
- [x] Verified remaining ansible roles match indri.yml
- [x] Verified no playbooks or role dependencies reference removed roles
- [ ] Run `pulumi preview` to verify tag changes
- [ ] Run `pulumi up` to apply tag changes
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/41
## Summary
- New `pr-comments` mise task queries Forge API for unresolved review comments on a PR
- Task takes a PR number as argument and displays all comments without a resolver
- Updated CLAUDE.md to include using this task after user reviews PRs
## Deployment and Testing
- [x] Tested task on PR #39 (shows no unresolved comments since all were resolved)
- [x] Tested error handling with non-existent PR #9999🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/40
## Summary
- Add Transmission BitTorrent daemon to k8s (torrent namespace)
- Add Kiwix ZIM archive server to k8s (kiwix namespace)
- NFS storage from sifaka for shared torrent/ZIM data
- Torrent-sync sidecar in kiwix deployment to manage declarative ZIM list
- ZIM-watcher CronJob to auto-restart kiwix when new archives appear
- Remove transmission, transmission_metrics, and kiwix ansible roles from indri
- Remove svc:kiwix from tailscale_serve defaults
## Key Decisions
- Direct NFS mount for kiwix (no PVC) since it shares storage with transmission
- Shell wrapper for kiwix-serve command (glob expansion)
- Accept HTTP 409 as "ready" in torrent sync (transmission session ID mechanism)
- Completed downloads stored in `/downloads/complete/` on sifaka
## Deployment and Testing
- [x] Deployed transmission to k8s
- [x] Verified transmission web UI at torrent.tail8d86e.ts.net
- [x] Moved existing ZIM files to complete folder
- [x] Deployed kiwix to k8s
- [x] Verified kiwix web UI at kiwix.tail8d86e.ts.net
- [x] Stopped old services on indri
- [x] Cleared svc:kiwix from Tailscale serve on indri
- [x] Updated zk documentation
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/39
## Summary
- Migrate minikube from podman driver to qemu2 driver for proper NFS/SMB volume mount support
- Update ansible minikube role with qemu installation and containerd runtime
- Remove podman role dependency from indri.yml
- Add synology user creation steps and post-migration zot reconfiguration notes
## Why
Phase 6 (Kiwix/Transmission migration) was blocked because the podman driver lacks kernel capabilities for filesystem mounts. QEMU2 creates an actual VM with full mount support.
## Deployment and Testing
- [ ] Create k8s-storage user on Synology DSM
- [ ] Store credentials in 1Password (synology-k8s-storage)
- [ ] Export current k8s state
- [ ] Stop and delete podman-based minikube cluster
- [ ] Run ansible to create QEMU2 cluster
- [ ] Test NFS volume mount with test pod
- [ ] Redeploy ArgoCD and all apps
- [ ] Verify all services healthy
- [ ] Reconfigure zot registry mirrors for containerd (post-migration)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/38