diff --git a/plans/ci-cd-bootstrap/00_overview.md b/plans/ci-cd-bootstrap/00_overview.md
index 4b4ed08..1943618 100644
--- a/plans/ci-cd-bootstrap/00_overview.md
+++ b/plans/ci-cd-bootstrap/00_overview.md
@@ -49,12 +49,12 @@ This plan details the setup of Forgejo Actions as the CI/CD system for blumeops,
 
 ## Phases
 
-| Phase | Name | Description |
-|-------|------|-------------|
-| 1 | [Enable Actions](P1_enable_actions.md) | Configure Forgejo for Actions, deploy runner |
-| 2 | [Mirror & Build](P2_mirror_and_build.md) | Mirror upstream Forgejo, create build workflow |
-| 3 | [Self-Deploy](P3_self_deploy.md) | Forgejo deploys itself, transition to mcquack |
-| 4 | [Container Builds](P4_container_builds.md) | Build custom container images (devpi, etc.) |
+| Phase | Name | Description | Status |
+|-------|------|-------------|--------|
+| 1 | [Enable Actions](P1_enable_actions.md) | Configure Forgejo for Actions, deploy runner | ✅ Complete |
+| 2 | [Mirror & Build](P2_mirror_and_build.md) | Mirror upstream Forgejo, create build workflow | Planning |
+| 3 | [Self-Deploy](P3_self_deploy.md) | Forgejo deploys itself, transition to mcquack | Planning |
+| 4 | [Container Builds](P4_container_builds.md) | Build custom container images, runner observability | Planning |
 
 ## The Bootstrap Problem
 
diff --git a/plans/ci-cd-bootstrap/P1_enable_actions.md b/plans/ci-cd-bootstrap/P1_enable_actions.md
index ce1d252..a988348 100644
--- a/plans/ci-cd-bootstrap/P1_enable_actions.md
+++ b/plans/ci-cd-bootstrap/P1_enable_actions.md
@@ -2,7 +2,7 @@
 
 **Goal**: Configure Forgejo to support Actions workflows and deploy a runner in k8s
 
-**Status**: Planning
+**Status**: Completed (2026-01-23)
 
 **Prerequisites**: None (uses existing brew-based Forgejo)
 
@@ -281,13 +281,13 @@ Check https://forge.tail8d86e.ts.net/eblume/blumeops/actions for the workflow ru
 
 ## Verification Checklist
 
-- [ ] Actions enabled in app.ini
-- [ ] Forgejo restarted successfully
-- [ ] Runner token stored in 1Password
-- [ ] Runner deployment created in ArgoCD
-- [ ] Runner pod running in k8s
-- [ ] Runner shows as online in Forgejo admin
-- [ ] Test workflow runs successfully
+- [x] Actions enabled in app.ini
+- [x] Forgejo restarted successfully
+- [x] Runner token stored in 1Password
+- [x] Runner deployment created in ArgoCD
+- [x] Runner pod running in k8s
+- [x] Runner shows as online in Forgejo admin
+- [x] Test workflow runs successfully
 
 ---
 
diff --git a/plans/ci-cd-bootstrap/P4_container_builds.md b/plans/ci-cd-bootstrap/P4_container_builds.md
index 6a7ced8..60a075b 100644
--- a/plans/ci-cd-bootstrap/P4_container_builds.md
+++ b/plans/ci-cd-bootstrap/P4_container_builds.md
@@ -455,6 +455,96 @@ Add Trivy or similar:
 
 ---
 
+## Step 6: Runner Observability (Logging & Metrics)
+
+### 6.1 Problem
+
+The forgejo-runner pod generates logs and metrics that should be collected for:
+- Debugging failed workflow runs
+- Monitoring runner health and capacity
+- Alerting on runner failures
+
+### 6.2 Log Collection via Alloy
+
+The forgejo-runner namespace needs to be included in Alloy's k8s log collection. Alloy is already configured to scrape logs from k8s pods; verify that the runner namespace is included.
+
+Check current Alloy config:
+```bash
+ssh indri 'grep -A20 discovery.kubernetes ~/.config/alloy/config.alloy'
+```
+
+If using namespace filtering, ensure `forgejo-runner` is included.
+
+### 6.3 Metrics Collection
+
+The forgejo-runner exposes Prometheus metrics. Add a ServiceMonitor or configure Alloy to scrape:
+
+**Option A: ServiceMonitor (if using Prometheus Operator)**
+
+Create `argocd/manifests/forgejo-runner/servicemonitor.yaml`:
+```yaml
+apiVersion: monitoring.coreos.com/v1
+kind: ServiceMonitor
+metadata:
+  name: forgejo-runner
+  namespace: forgejo-runner
+spec:
+  selector:
+    matchLabels:
+      app: forgejo-runner
+  endpoints:
+    - port: metrics
+      interval: 30s
+```
+
+**Option B: Alloy scrape config**
+
+Add to Alloy's k8s scrape config to discover the runner pod's metrics endpoint.
+
+### 6.4 Create Runner Service for Metrics
+
+Add `argocd/manifests/forgejo-runner/service.yaml`:
+```yaml
+apiVersion: v1
+kind: Service
+metadata:
+  name: forgejo-runner-metrics
+  namespace: forgejo-runner
+  labels:
+    app: forgejo-runner
+spec:
+  selector:
+    app: forgejo-runner
+  ports:
+    - name: metrics
+      port: 8080
+      targetPort: 8080
+```
+
+Update kustomization.yaml to include the service.
+
+### 6.5 Grafana Dashboard
+
+Consider creating a dashboard for:
+- Runner status (online/offline)
+- Job queue depth
+- Job execution time
+- Success/failure rates
+
+### 6.6 Verification
+
+```bash
+# Check runner logs are appearing in Loki
+# Go to Grafana → Explore → Loki
+# Query: {namespace="forgejo-runner"}
+
+# Check metrics are being scraped
+# Go to Grafana → Explore → Prometheus
+# Query: {__name__=~"forgejo_runner_.*"}
+```
+
+---
+
 ## Verification Checklist
 
 - [ ] devpi build workflow created
@@ -464,6 +554,8 @@ Add Trivy or similar:
 - [ ] Reusable container workflow created
 - [ ] (Optional) Python build workflow created
 - [ ] (Optional) Scheduled builds configured
+- [ ] Runner logs visible in Loki
+- [ ] Runner metrics scraped by Prometheus/Alloy
 
 ---
 
@@ -474,8 +566,10 @@ With this phase complete, we have:
 2. **Forgejo self-deploys** from CI on tagged releases
 3. **Container images** built automatically on push
 4. Infrastructure for Python package builds
+5. **Runner observability** with logs in Loki and metrics in Prometheus
 
 The CI/CD bootstrap is complete. Future work:
 - Add more container builds as needed
 - Add Python package publishing for internal tools
 - Consider adding a macOS runner on indri for native builds
+- Create Grafana dashboards for CI/CD monitoring