Commit graph

115 commits

Author SHA1 Message Date
343d066701 Simplify runner image and workflows (Dagger Phase 3)
Remove Node.js, Docker CLI, buildx, skopeo, gnupg, lsb-release, and
xz-utils from the job execution image — all build tools now live inside
Dagger containers. Add tzdata (for TZ env var support) and flyctl.

Remove "Ensure Dagger CLI" bootstrap steps from both workflows and the
"Install flyctl" step from build-blumeops. Set TZ=America/Los_Angeles
in the runner configmap so all job containers inherit it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-11 17:23:37 -08:00
Forgejo Actions
996afbcf6f Update docs release to v1.6.5
[skip ci]
2026-02-11 17:10:29 -08:00
Forgejo Actions
6ce03df819 Update docs release to v1.6.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-12 01:01:23 +00:00
2a04ab26b7 Mount host zoneinfo into runner for TZ support (#160)
## Summary

The `TZ=America/Los_Angeles` env var from #159 has no effect because the `forgejo/runner` image doesn't ship tzdata. Mount the node's `/usr/share/zoneinfo` into the container so the timezone database is available.

## Deployment

After merge, sync forgejo-runner and verify:
```
argocd app sync forgejo-runner
kubectl -n forgejo-runner exec deploy/forgejo-runner -c runner -- date
# Should show PST/PDT, not UTC
```

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/160
2026-02-11 16:57:11 -08:00
42ebc2b122 Fix Forgejo runner timezone (UTC -> America/Los_Angeles) (#159)
## Summary

- Set `TZ=America/Los_Angeles` on the Forgejo runner container

The runner pod defaults to UTC. When releases are cut in the evening PST, towncrier stamps changelog entries with tomorrow's date (e.g., v1.6.2 shows 2026-02-12 despite being released on the evening of Feb 11 PST).

## Deployment

After merge, sync the forgejo-runner ArgoCD app:
```
argocd app sync forgejo-runner
```

The runner pod will restart with the new timezone. Note: the v1.6.2 changelog entry will remain dated 2026-02-12; future entries will use PST dates, so dates may appear non-sequential once.

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/159
2026-02-11 16:53:41 -08:00
Forgejo Actions
e5d1e795e0 Update docs release to v1.6.3
[skip ci]
2026-02-12 00:46:35 +00:00
Forgejo Actions
a75089d8ef Update docs release to v1.6.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-12 00:35:02 +00:00
1bc2b421a8 Adopt Dagger CI for container builds (Phase 1) (#156)
All checks were successful
Build Container / build (push) Successful in 13s
## Summary

- Add Dagger Python module (`.dagger/`) with `build` and `publish` functions for container images
- Replace Docker buildx + skopeo composite action with `dagger call publish` in `build-container.yaml`
- BuildKit's native push is compatible with Zot — **skopeo workaround eliminated**
- Add Dagger CLI (v0.19.11) to forgejo-runner Dockerfile, bump runner to v2.6.0
- Bootstrap step in workflow curl-installs dagger if not in runner (for first build on v2.5.1 runner)
- Delete old `.forgejo/actions/build-push-image/` composite action
- Add GPLv3 LICENSE

## Verified locally

- `dagger call build --src=. --container-name=nettest` — builds ✓
- `dagger call publish --src=. --container-name=nettest --version=dagger-test` — pushed to Zot ✓
- `dagger call build --src=. --container-name=forgejo-runner` — new runner image builds ✓
- Dagger CLI accessible inside built runner image ✓

## Deployment sequence (after merge)

1. `mise run container-tag-and-release forgejo-runner v2.6.0` — old runner bootstraps dagger via curl, builds new runner
2. `argocd app sync forgejo-runner` — runner restarts with v2.6.0 (dagger baked in)
3. `mise run container-tag-and-release nettest v0.13.0` — end-to-end test of new pipeline
4. `mise run container-list` — verify tags

## Not included (future phases)

- Phase 2: docs build + Forgejo packages migration
- Phase 3: runner simplification (remove skopeo, Node.js, etc.)
- Phase 4: future workflows

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/156
2026-02-11 15:38:31 -08:00
Forgejo Actions
362ae22ab7 Update docs release to v1.6.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 21:37:34 +00:00
Forgejo Actions
eca01a9546 Update docs release to v1.6.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 21:33:57 +00:00
Forgejo Actions
ab6661f5dd Update docs release to v1.5.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 20:17:12 +00:00
Forgejo Actions
a106f92c38 Update docs release to v1.5.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-11 15:53:49 +00:00
f0ac04fb8a Bootstrap buildx: revert to docker build, bump runner to v2.5.1 (#148)
All checks were successful
Build Container / build (push) Successful in 1m56s
## Summary
- Temporarily revert composite action to `docker build` so we can build the runner image (chicken-and-egg: current runner v2.5.0 doesn't have buildx)
- Bump runner label to `v2.5.1` so after sync the new runner image (with buildx) gets used

## Deployment plan
1. Merge this PR
2. Tag `forgejo-runner-v2.5.1` — builds with legacy `docker build` (one last time)
3. Sync forgejo-runner in ArgoCD to pick up the v2.5.1 label
4. Follow-up PR: switch action back to `docker buildx build`
5. Tag `nettest-v0.12.0` to verify buildx works end-to-end

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/148
2026-02-10 21:17:14 -08:00
85e36cd807 Operations and observability for sifaka NAS (#135)
## Summary
- Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring
- Formalize existing `node_exporter` container under Ansible management
- Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus
- Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime)
- Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS
- Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles

## Prerequisites before deploy
- [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP)
- [ ] Verify `ssh eblume@sifaka 'docker ps'` works
- [ ] Run `mise run provision-sifaka` to deploy containers
- [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes
- [ ] `argocd app sync prometheus` + `argocd app sync grafana-config`

## Test plan
- [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics`
- [ ] Verify Prometheus targets page shows both sifaka jobs as UP
- [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135
2026-02-09 17:44:05 -08:00
3415cad38c Log real client IPs via Fly-Client-IP header (#130)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 59s
## Summary
- Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header
- Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki
- Keeps `remote_addr` (the internal proxy IP) for debugging

Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs.

## Deployment and Testing
- [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory
- [ ] Verify in Grafana that new log lines include `client_ip` with real IPs
- [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130
2026-02-09 11:02:06 -08:00
Forgejo Actions
92a1081302 Update docs release to v1.5.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-09 15:30:21 +00:00
a0b076172f Fix Immich/Homepage Ingress host matching, add missing service checks (#127)
## Summary

- Fix Immich Ingress `host: photos` causing 404 with ProxyGroup (same FQDN mismatch as Prometheus/Loki)
- Migrate Homepage from old per-service Tailscale proxy to shared ProxyGroup (was the last holdout)
- Add Immich and Navidrome to `services-check` HTTP endpoints

## Deployment Notes

- Already tested on branch: Immich and Homepage both return 200 via Caddy
- Homepage's old Helm-managed Ingress was deleted manually; ArgoCD may recreate it on sync — prune with `argocd app sync homepage --prune` after merge
- Old per-service `ts-homepage-*` pod in tailscale namespace can be cleaned up after confirming ProxyGroup works

## Test Plan

- [x] `curl https://photos.ops.eblu.me/` returns 200
- [x] `curl https://go.ops.eblu.me/` returns 200
- [ ] `mise run services-check` fully passes after merge

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/127
2026-02-08 22:12:50 -08:00
e6cf7e47e0 Restrict flyio-proxy ACLs to dedicated tag:flyio-target endpoints (#126)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m8s
## Summary
- Introduce `tag:flyio-target` so services must explicitly opt in to be reachable by the fly.io proxy
- Replace broad `tag:k8s` and `tag:homelab` grants with the new tag in the ACL rule and test
- Add `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to docs, loki, and prometheus Ingresses
- Switch Alloy push endpoints from `*.ops.eblu.me` (Caddy) to `*.tail8d86e.ts.net` (Tailscale Ingress)
- Update docs: flyio-proxy, caddy, tailscale, forgejo (future public access + security checklist), expose-service-publicly

## Manual step (not in PR)
Update the k8s operator OAuth client in the Tailscale admin console to include `tag:flyio-target` in its scope. Without this, the operator cannot assign the new tag to Ingress proxy nodes.

## Deployment order
1. **Pulumi ACLs** — `mise run tailnet-preview && mise run tailnet-up`
2. **OAuth client** — Manual update in Tailscale admin console
3. **K8s Ingresses** — `argocd app sync apps && argocd app sync docs loki prometheus`
4. **Fly.io proxy** — `mise run fly-deploy`
5. **Verify** — `mise run services-check`, check Grafana dashboards

## Test plan
- [ ] `mise run tailnet-preview` shows clean diff
- [ ] `argocd app diff docs`, `argocd app diff loki`, `argocd app diff prometheus` show only annotation additions
- [ ] After deploy: Grafana dashboards show continued log/metric flow
- [ ] `curl -sf https://docs.eblu.me` returns 200
- [ ] `mise run services-check` passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/126
2026-02-08 21:54:18 -08:00
Forgejo Actions
c8d0af6644 Update docs release to v1.5.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 18:06:46 +00:00
cc54b4f565 Add Fly.io proxy observability via embedded Alloy (#123)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m16s
## Summary

- Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
- Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting)
- Create two Grafana dashboards: **Docs APM** (per-service view filtered by `host="docs.eblu.me"`) and **Fly.io Proxy Health** (aggregate proxy health across all upstream services)

## Changed Files

| File | Change |
|------|--------|
| `fly/nginx.conf` | Add JSON `log_format` + `access_log`, add `stub_status` endpoint |
| `fly/Dockerfile` | COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config |
| `fly/alloy.river` | **New** — Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Start Alloy after Tailscale, before nginx |
| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` | **New** — Docs APM dashboard |
| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` | **New** — Fly.io Proxy Health dashboard |
| `argocd/manifests/grafana-config/kustomization.yaml` | Register new dashboard configmaps |
| `docs/reference/services/flyio-proxy.md` | Document observability setup |

## Deployment and Testing

- [ ] `mise run fly-deploy` — rebuild container with Alloy
- [ ] `curl https://docs.eblu.me/` — generate traffic
- [ ] `fly logs -a blumeops-proxy` — verify Alloy startup
- [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}`
- [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}`
- [ ] `argocd app sync grafana-config` — deploy dashboards
- [ ] Verify dashboards show data in Grafana
- [ ] `mise run services-check` — no regressions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123
2026-02-08 10:05:38 -08:00
Forgejo Actions
c46d55060d Update docs release to v1.5.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 10:37:30 +00:00
64a78422b1 Add Fly.io public reverse proxy for docs.eblu.me (#120)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 9s
## Summary

- Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale
- First service exposed: `docs.eblu.me` — the Quartz static docs site
- Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME
- Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow

## Key details

- Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed
- Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts
- nginx caches aggressively for the static site; health check is on the default_server block
- ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only
- DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev`

## Test plan

- [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok`
- [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status`
- [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert
- [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected)
- [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120
2026-02-08 02:36:19 -08:00
Forgejo Actions
11c76d4768 Update docs release to v1.4.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 05:45:40 +00:00
Forgejo Actions
ab7efd8c1c Update docs release to v1.4.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 05:27:23 +00:00
Forgejo Actions
3f5017f732 Update docs release to v1.4.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 05:03:34 +00:00
3b4ff91469 Fix homepage Admin bookmark icons (#110)
## Summary
- Fix broken Pulumi icon: changed `pulumi` to `si-pulumi` (Simple Icons prefix required)
- Fix broken ArgoCD icon: changed `argocd` to `argo-cd` (Dashboard Icons uses hyphenated name)

## Deployment and Testing
- [ ] Sync homepage app in ArgoCD
- [ ] Verify icons appear on go.ops.eblu.me Admin bookmarks section

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/110
2026-02-05 06:29:39 -08:00
Forgejo Actions
808bc507d8 Update docs release to v1.3.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-05 01:22:10 +00:00
Forgejo Actions
a03a9faaad Update docs release to v1.3.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 22:40:18 +00:00
Forgejo Actions
e15caec898 Update docs release to v1.3.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 16:47:27 +00:00
Forgejo Actions
4aeade1543 Update docs release to v1.3.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 16:26:24 +00:00
Forgejo Actions
1835e3e80e Update docs release to v1.3.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 16:14:08 +00:00
1e13d4b83d Fix Navidrome automatic library scan schedule (#101)
## Summary
- Fix env var name from `ND_SCANSCHEDULE` to `ND_SCANNER_SCHEDULE` (Navidrome uses viper config where dots become underscores)
- Use explicit `@every 1h` format for clarity
- Reorder CLAUDE.md rules to emphasize running zk-docs first

## Root Cause
Navidrome logs showed "Periodic scan is DISABLED" at startup despite the env var being set. The config key is `scanner.schedule`, which translates to `ND_SCANNER_SCHEDULE` (not `ND_SCANSCHEDULE`).

## Deployment and Testing
- [ ] Sync navidrome app: `argocd app sync navidrome`
- [ ] Verify pod restarts with new env var
- [ ] Check logs for "Scheduling scanner" message instead of "Periodic scan is DISABLED"
- [ ] Wait ~1 hour and confirm scan runs automatically

🤖 Generated with [Claude Code](https://claude.ai/code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/101
2026-02-04 07:23:12 -08:00
Forgejo Actions
e405a48881 Update docs release to v1.2.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 05:18:37 +00:00
Forgejo Actions
f88da51e23 Update docs release to v1.2.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 04:53:30 +00:00
Forgejo Actions
16cdffaebf Update docs release to v1.1.5
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 04:34:31 +00:00
Forgejo Actions
e426473c59 Update docs release to v1.1.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 04:18:04 +00:00
Forgejo Actions
672dbda9d7 Update docs release to v1.1.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 03:07:15 +00:00
Forgejo Actions
f279891575 Update docs release to v1.1.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 03:02:13 +00:00
Forgejo Actions
81d99b689d Update docs release to v1.1.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 02:53:17 +00:00
Forgejo Actions
bf03d71780 Update docs release to v1.1.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 01:27:09 +00:00
82bcd935cd Move DOCS_RELEASE_URL from ConfigMap to Deployment
This ensures ArgoCD sync triggers a pod rollout when the URL changes,
since ConfigMap data changes don't restart pods automatically.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:23:52 -08:00
Forgejo Actions
103cc0deab Update docs release to v1.0.14
- Updated configmap with new DOCS_RELEASE_URL
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 01:18:33 +00:00
aaf5090509 Remove ARGOCD_AUTH_TOKEN from external secret
Workflow secrets come from Forgejo's secret store, not runner env.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:17:53 -08:00
Forgejo Actions
492aa9a104 Update docs release to v1.0.13
- Updated configmap with new DOCS_RELEASE_URL
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 01:15:22 +00:00
3a26d7e49a Update forgejo-runner image to v2.5.0
Fixes argocd CLI download.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:13:37 -08:00
Forgejo Actions
4d3222d91b Update docs release to v1.0.12
- Updated configmap with new DOCS_RELEASE_URL
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 01:07:05 +00:00
f08595a3c0 Update forgejo-runner image to v2.4.0
Includes uv and argocd CLI for auto-deploy workflow.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 17:05:09 -08:00
1f73eb675d Auto-deploy docs from build workflow (#93)
## Summary
- Add `uv` and `argocd` CLI to forgejo-runner container image
- Add `workflow-bot` ArgoCD account with sync permissions (declarative via kustomize patches)
- Add `ARGOCD_AUTH_TOKEN` to forgejo-runner external secret for workflow auth
- Update build workflow to auto-deploy docs after release:
  - Update configmap with new release URL
  - Commit changelog and configmap changes
  - Sync docs app via ArgoCD

## Deployment and Testing
Manual steps required before this can work:
1. [ ] Build and push new forgejo-runner image (v2.4.0)
2. [ ] Sync argocd app to create workflow-bot account
3. [ ] Generate token: `argocd account generate-token --account workflow-bot`
4. [ ] Store token in 1Password under "Forgejo Secrets" with field `argocd_token`
5. [ ] Sync forgejo-runner app to pick up new external secret
6. [ ] Update forgejo-runner deployment to use new image version
7. [ ] Test by running workflow manually

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/93
2026-02-03 16:58:03 -08:00
7d5e6b032b Update docs release to v1.0.11
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:40:06 -08:00
31564d1d9a Update docs release to v1.0.10
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 16:32:17 -08:00