Commit graph

1,012 commits

Author SHA1 Message Date
51a878cddb C0: review navidrome reference doc 2026-04-18 20:25:19 -07:00
deedeecef9 C0: adopt AGENTS.md as canonical agent config 2026-04-18 20:15:30 -07:00
71c1c453d6 Fetch job logs via SSH to indri instead of Forgejo web endpoint
Forgejo's web action routes don't support API token auth for private
repos (only session cookies or public access). Switch log fetching to
read the zstd-compressed log files directly from indri via SSH —
Forgejo stores all runner logs on disk regardless of which runner
executed the job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 17:08:46 -07:00
4f5a963ef6 Add API token auth and git remote detection to runner-logs
runner-logs now always authenticates with the Forgejo API token
(via --token flag, FORGEJO_TOKEN env, or 1Password) so it works on
private repos. The --repo default is auto-detected from the git
remote origin URL instead of being hardcoded.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 16:39:21 -07:00
1d62653871 Fix forge.eblu.me static assets by adding missing Host header
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m26s
The static asset cache block (css/js/png/etc) was missing
proxy_set_header Host, so Caddy received "forge.eblu.me" instead of
"forge.ops.eblu.me" and couldn't route the request. HTML loaded fine
because the main location / block had the header.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 16:00:56 -07:00
55abb17f50 Add resource limits to ArgoCD pods to prevent unbounded consumption
All 7 ArgoCD containers had no resource limits, allowing them to consume
unlimited CPU/memory during node pressure events. This contributed to
cluster-wide probe timeout cascades on minikube-indri.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 13:04:27 -07:00
Forgejo Actions
bdfcb4b677 Update docs release to v1.16.0
- Built changelog from towncrier fragments

[skip ci]
2026-04-18 10:00:54 -07:00
d26a6ae3b2 Update docs for Caddy routing and direct WireGuard peering v1.16.0
Comprehensive docs pass reflecting the new Fly proxy architecture:
- Fly proxy routes through Caddy on indri (not per-service TS Ingress)
- Direct WireGuard peering via --port=41641 pinning
- DERP relay performance lesson in Tailscale docs
- Caddy now in public traffic path
- indri tagged as flyio-target
- Removed fly-reload references
- Updated architecture diagrams and per-service setup guide
- Added changelog fragment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 09:57:30 -07:00
12b2786ca2 Route Fly proxy through Caddy on indri for direct WireGuard peering
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m59s
Tailscale Ingress pods in k8s can't establish direct WireGuard
connections (stuck behind pod-network NAT → DERP relay → 20s latency).
Indri's host-level Tailscale CAN peer directly with Fly.

Change all nginx upstreams to route through Caddy on indri instead of
per-service Tailscale Ingress endpoints. Tag indri as flyio-target in
the Tailscale ACL so the Fly proxy can reach it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 09:40:20 -07:00
bca4c2bede Expose Tailscale WireGuard UDP port on Fly proxy
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 1m33s
Enable direct peer-to-peer WireGuard connections by pinning tailscaled
to port 41641 and exposing it as a UDP service. Without this, all
traffic routes through Tailscale DERP relays causing 20+ second
latency. Requires dedicated IPv4 (allocated: 168.220.82.221).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 09:17:03 -07:00
c8da243663 Run alloy-tracing as root for eBPF capabilities
The nix-built Alloy image sets User=65534 (nobody). Even with
privileged: true, a non-root user gets no effective capabilities
(CapEff=0). Override with runAsUser: 0 so Beyla gets CAP_BPF and
CAP_SYS_ADMIN needed for eBPF instrumentation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:42:26 -07:00
3a2913ba1f Allow BPF in privileged containers on ringtail
NixOS defaults kernel.unprivileged_bpf_disabled=2, which blocks BPF
syscalls outside the init namespace even with CAP_BPF. Set to 1 so
privileged containers (Beyla/Alloy tracing) can create BPF maps.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:32:30 -07:00
24f3f9b24a Raise k3s memlock rlimit for eBPF tracing on ringtail
Beyla (alloy-tracing) has been failing since April 13 with
"failed to set memlock rlimit: operation not permitted" because k3s
inherits the default 8MB memlock limit. Set LimitMEMLOCK=infinity on
the k3s systemd service so privileged containers can use eBPF.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:27:39 -07:00
Forgejo Actions
a72a2c2bd4 Update docs release to v1.15.7
- Built changelog from towncrier fragments

[skip ci]
2026-04-18 08:14:58 -07:00
9bafe85b2b Add teslamate extensions to DR restore procedure v1.15.7
The earthdistance extension (depends on cube) must be created before
restoring the teslamate database — discovered missing after 2026-04-13 DR.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:12:26 -07:00
b4472c7849 Deploy devpi 6.19.3
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:04:23 -07:00
4dab6d11bb Add remote commit check to container-build-and-release
Queries the Forgejo API to verify the target commit exists on the remote
before dispatching a build, preventing wasted CI runs on unpushed commits.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 08:03:21 -07:00
37b8a21524 Migrate devpi to Dagger build and bump to 6.19.3
Replace Dockerfile with container.py for native Dagger builds.
Bump devpi-server 6.19.1→6.19.3, devpi-web 5.0.1→5.0.2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-18 07:57:05 -07:00
fe0e913963 Switch Fly proxy to upstream keepalive pools (#337)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m37s
## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuses TLS connections through the Tailscale tunnel instead of handshaking per request
- Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS)

## Trade-off

DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #337
2026-04-17 16:39:52 -07:00
54b1cee950 Fix Connection header: only send 'upgrade' for WebSocket requests
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 1m35s
Was sending Connection: upgrade on every proxied request, which is
semantically wrong for normal HTTP traffic. Use a map to conditionally
send 'upgrade' only when the client requests a WebSocket switch,
'close' otherwise.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:27:40 -07:00
1c0ee099fb Move forge-specific latency panels to Forgejo dashboard
Fly.io dashboard keeps aggregate all-hosts p50/p90/p99. Forge-filtered
upstream response time panel moves to Forgejo's "Public Proxy" section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:13:40 -07:00
d7af004842 Add Forgejo metrics + upstream latency histogram to Fly proxy dashboard
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m53s
- Enable Forgejo /metrics endpoint (app.ini [metrics] section)
- Add Alloy scrape target for Forgejo metrics on indri
- Add upstream_response_time histogram to Fly proxy Alloy config
- Replace single p95 panel with p50/p90/p99 + upstream breakdown
  filtered to forge.eblu.me host

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:05:59 -07:00
8fccbda573 Extend Fly proxy latency histogram buckets to 60s
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m29s
Previous max bucket was 10s — all slower requests collapsed into +Inf,
making p50/p90/p99 unreadable during the Forgejo archive DoS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:50:28 -07:00
1631e11137 Add /user/ to forge robots.txt exclusion
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m47s
Crawlers follow auth redirects to /user/login which is pointless for them.
Saves round-trips for both sides.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:34:24 -07:00
0a98f76068 Update kiwix-serve to Dagger-built container (Alpine 3.23)
Points kustomization at v3.8.2-7a42aeb, the first image built from the
new container.py (replacing the Dockerfile).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:27:42 -07:00
65bc21b162 Add op-based auth to fly-deploy mise task
The task was missing FLY_API_TOKEN injection, requiring manual fly auth
login. Now uses op read to fetch the deploy token from 1Password.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:26:29 -07:00
7a42aeb77c Mitigate Forgejo archive endpoint DoS from crawler abuse
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m35s
Crawlers hitting /archive/ endpoints with unique commit SHAs generated 54GB
of git bundles in 2 days, pegging Forgejo at 43% CPU. Fix at multiple layers:

- Redirect archive requests to tailnet at Fly proxy (302)
- Expand robots.txt: block /users/, /*/archive/, /*/releases/download/
- Cache release artifact downloads at nginx (immutable, 7d TTL)
- Enable [cron.archive_cleanup] with 2h TTL and run-at-start

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 14:21:22 -07:00
5f38779d52 Migrate kiwix-serve container from Dockerfile to native Dagger build
Replaces the hand-written Dockerfile with container.py using the shared
alpine_runtime helper, which bumps the base image from Alpine 3.22 to 3.23.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 13:56:32 -07:00
fd9e1ac93b Remove nativeMessagingHosts.packages — breaks Firefox wrapper build
The _1password-gui package doesn't export native messaging manifests
in the format the Firefox wrapper expects. The 1Password NixOS module
already handles native messaging host registration separately.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:57:44 -07:00
e60e3d5fc7 Use programs.firefox module with 1Password extension via policy
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:54:33 -07:00
f283f9453d Set Firefox as default browser via home-manager xdg.mimeApps
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:43:35 -07:00
50dfdba4e6 Add Firefox, remove claude-cli:// handler workarounds
The xdg desktop entry and Librewolf user.js prefs didn't fix the
OAuth callback hang. Try stock Firefox instead as a simpler path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:42:36 -07:00
dd1cf4f198 Configure Librewolf to delegate claude-cli:// URIs to xdg-open
The xdg desktop entry and mimeapps were already registered but
Librewolf doesn't delegate unknown URI schemes to the system
handler by default. This adds user.js prefs to complete the chain.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 16:37:16 -07:00
68f845e773 Add changelog fragment for forge robots.txt
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:40:34 -07:00
7f6bbdc82c Add robots.txt to forge.eblu.me blocking crawlers from /mirrors/
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 2m19s
Facebook has been scraping forge mirror repos at ~3-4 req/s, slowing
down the Forgejo instance. Serve robots.txt directly from nginx to
disallow /mirrors/ while leaving eblume/* accessible to crawlers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:39:48 -07:00
5ec2411e20 Update navidrome, miniflux, forgejo-runner image tags to Alpine 3.23 builds [main]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 15:37:30 -07:00
3ecd888537 Switch container builds to manual-only workflow dispatch
Shared Dagger helpers (src/blumeops/) affect all Dagger-built containers,
making path-based auto-triggers unreliable. All builds now go through
`mise run container-build-and-release <name>`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 14:25:14 -07:00
352b95c141 Refactor Dagger go_build() helper and standardize Alpine 3.23
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (miniflux) (push) Successful in 10m2s
Build Container / build-dagger (forgejo-runner) (push) Successful in 10m2s
Extend go_build() with buildmode and extra_env params, migrate miniflux
and forgejo-runner to use it, and bump all Alpine bases from 3.22 to 3.23.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 10:10:46 -07:00
99f78c8745 Register claude-cli:// URI handler on ringtail for Claude Code OAuth
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-16 07:42:52 -07:00
fb1e8ff672 Deploy transmission containers from Dagger builds
Update kustomization image tags to the new container.py-built images
(v4.1.1-r1-2c483ce, v1.0.1-2c483ce).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:34:28 -07:00
2c483cefff Migrate transmission containers from Dockerfile to Dagger builds
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (transmission-exporter) (push) Successful in 2m29s
Build Container / build-dagger (transmission) (push) Successful in 2m29s
Replace Dockerfiles with native container.py for both transmission and
transmission-exporter. Updates base images (Alpine 3.23, Python 3.14),
pins uv to 0.11.6 instead of :latest.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 11:26:00 -07:00
519175c672 Fix borgmatic LaunchAgent TCC dialog hang by removing mise wrapper
LaunchAgents now call borgmatic directly at its mise-installed path
instead of routing through `mise x`, which triggered macOS TCC
permission dialogs (e.g. "mise wants to access Documents") that hung
headless sessions and caused backup failures.

Also adds `mise install` to the ansible role so borgmatic installation
is fully managed, and pins the version in both mise.toml and the role
defaults.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-15 07:23:46 -07:00
30ed018fd8 Update prowler image tag to v5.23.0-7c1cd11 [main]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 13:51:26 -07:00
7c1cd11e45 Upgrade Prowler to 5.23.0, remove registry workaround (#336)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dagger (prowler) (push) Successful in 36s
## Summary

- Upgrade Prowler from 5.22.0 to 5.23.0
- Remove the `enumerate-images` init container workaround from `cronjob-image-scan.yaml`
- Use native `--registry` and `--image-filter` flags now that upstream fix (PR prowler-cloud/prowler#10470) is released

The init container was a workaround for prowler-cloud/prowler#10457 where `--registry` args weren't forwarded to the provider constructor. We wrote the fix, it was merged, and v5.23.0 includes it.

## Test plan

- [ ] Build new container (`mise run container-release prowler 5.23.0`)
- [ ] Update kustomization.yaml with new image tag
- [ ] Sync prowler ArgoCD app from branch
- [ ] Manually trigger image scan job and verify `--registry` works natively
- [ ] Verify CIS and IaC scan cronjobs still work

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #336
2026-04-14 13:45:28 -07:00
6b690eb033 Review CC sso-gated-admin-tools: scope to ArgoCD only
Removed Grafana from the control description — no Prowler finding
references it. Tightened scope to match actual usage (ArgoCD wildcard
RBAC mute). Added workflow-bot scoping note.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 13:07:52 -07:00
be30668eef Automate Prowler MANUAL finding verification (#335)
## Summary
- Adds automated node-level verification to `review-compliance-reports`: kubelet file perms/ownership, kubelet config args, etcd CA separation, RBAC cluster-admin bindings
- Mutes the 14 MANUAL Prowler findings via new `manual-node-checks.yaml` mutelist file
- New `node-config-automated-verification` compensating control documents the approach
- Script fails loudly (red FAIL + verdict panel) if any check deviates from expected values

## Test plan
- [x] `mise run review-compliance-reports` — all 12 node checks PASS
- [x] Injected bad expected value (perms 400 vs actual 600) — FAIL rendered correctly
- [x] Fixed colon-in-binding-name bug (kubeadm:cluster-admins) with tab-separated jsonpath
- [ ] After merge: sync prowler mutelist ConfigMap and verify next scan shows 0 MANUAL findings

## Note
Prowler coverage is minikube-indri only — ringtail/k3s is a known gap tracked separately.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #335
2026-04-14 13:00:44 -07:00
Forgejo Actions
8c2f035e6d Update docs release to v1.15.6
- Built changelog from towncrier fragments

[skip ci]
2026-04-14 11:46:42 -07:00
04b44b350b Add changelog for ArgoCD token rotation after DR v1.15.6
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:45:00 -07:00
Forgejo Actions
f2514a6f02 Update docs release to v1.15.5
- Built changelog from towncrier fragments

[skip ci]
2026-04-14 11:29:27 -07:00
9d85c97b9b Update forgejo-runner kustomization tag to main-branch image v1.15.5
C0 follow-up: switch from branch-built tag to main-built v12.7.3-0e93cc0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-14 11:10:36 -07:00