From ceba6b3c2c2d378c6a9336691360860ef4084544 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Sat, 24 Jan 2026 16:21:49 -0800
Subject: [PATCH] Remove plans, they dont seem to work

---
 plans/ci-cd-bootstrap/00_overview.md | 179 ---
 plans/ci-cd-bootstrap/P1_enable_actions.md | 322 -----
 plans/ci-cd-bootstrap/P2_mirror_and_build.md | 347 -----
 plans/ci-cd-bootstrap/P3_mirror_forgejo.md | 349 -----
 plans/ci-cd-bootstrap/P4_self_deploy.md | 409 ------
 plans/ci-cd-bootstrap/P5_container_builds.md | 505 -------
 plans/completed/k8s-migration/00_overview.md | 79 --
 .../k8s-migration/P0_foundation.complete.md | 1225 -----------------
 .../P1_k8s_infrastructure.complete.md | 657 ---------
 .../k8s-migration/P2_grafana.complete.md | 396 ------
 .../k8s-migration/P3_postgresql.complete.md | 359 -----
 .../k8s-migration/P4_miniflux.complete.md | 162 ---
 .../P5.1_docker_migration.complete.md | 208 ---
 .../k8s-migration/P5_devpi.complete.md | 102 --
 .../k8s-migration/P6_kiwix.complete.md | 1039 --------------
 plans/completed/k8s-migration/P7_forgejo.md | 394 ------
 .../completed/k8s-migration/P8_woodpecker.md | 32 -
 plans/completed/k8s-migration/P9_cleanup.md | 52 -
 18 files changed, 6816 deletions(-)
 delete mode 100644 plans/ci-cd-bootstrap/00_overview.md
 delete mode 100644 plans/ci-cd-bootstrap/P1_enable_actions.md
 delete mode 100644 plans/ci-cd-bootstrap/P2_mirror_and_build.md
 delete mode 100644 plans/ci-cd-bootstrap/P3_mirror_forgejo.md
 delete mode 100644 plans/ci-cd-bootstrap/P4_self_deploy.md
 delete mode 100644 plans/ci-cd-bootstrap/P5_container_builds.md
 delete mode 100644 plans/completed/k8s-migration/00_overview.md
 delete mode 100644 plans/completed/k8s-migration/P0_foundation.complete.md
 delete mode 100644 plans/completed/k8s-migration/P1_k8s_infrastructure.complete.md
 delete mode 100644 plans/completed/k8s-migration/P2_grafana.complete.md
 delete mode 100644 plans/completed/k8s-migration/P3_postgresql.complete.md
 delete mode 100644
plans/completed/k8s-migration/P4_miniflux.complete.md delete mode 100644 plans/completed/k8s-migration/P5.1_docker_migration.complete.md delete mode 100644 plans/completed/k8s-migration/P5_devpi.complete.md delete mode 100644 plans/completed/k8s-migration/P6_kiwix.complete.md delete mode 100644 plans/completed/k8s-migration/P7_forgejo.md delete mode 100644 plans/completed/k8s-migration/P8_woodpecker.md delete mode 100644 plans/completed/k8s-migration/P9_cleanup.md diff --git a/plans/ci-cd-bootstrap/00_overview.md b/plans/ci-cd-bootstrap/00_overview.md deleted file mode 100644 index 84199d6..0000000 --- a/plans/ci-cd-bootstrap/00_overview.md +++ /dev/null @@ -1,179 +0,0 @@ -# Forgejo Actions CI/CD Bootstrap Plan - -This plan details the setup of Forgejo Actions as the CI/CD system for blumeops, starting with the bootstrapping problem: using Forgejo to build and deploy Forgejo itself. - -## Goals - -1. **Forgejo Actions** as the primary CI system (replaces Woodpecker from original plan) -2. **Self-hosted Forgejo** built from source, deployed as mcquack LaunchAgent on indri -3. **Container builds** for ArgoCD manifests (devpi, etc.) -4. **Cron-scheduled tasks** via k8s CronJobs (not Actions) -5. **Local development** parity using `act` for workflow testing - -## Why Forgejo Actions over Woodpecker? 
- -- Native integration with Forgejo (no OAuth setup, automatic repo detection) -- GitHub Actions compatible syntax (huge ecosystem of reusable actions) -- `act` tool for local testing on gilbert -- Single system to maintain instead of two - -## Architecture Overview - -``` -┌─────────────────────────────────────────────────────────────────┐ -│ INDRI │ -│ ┌─────────────────────┐ │ -│ │ Forgejo │ ← Built from source │ -│ │ (mcquack agent) │ ← Deploys itself via CI │ -│ │ │ │ -│ │ - Web UI (3001) │ │ -│ │ - SSH (2200) │ │ -│ │ - Actions enabled │ │ -│ └─────────────────────┘ │ -└─────────────────────────────────────────────────────────────────┘ - │ - │ SSH deploy - ▼ -┌─────────────────────────────────────────────────────────────────┐ -│ KUBERNETES (minikube) │ -│ ┌─────────────────────┐ ┌─────────────────────┐ │ -│ │ Forgejo Runner │ │ Other Services │ │ -│ │ (host mode) │ │ (via ArgoCD) │ │ -│ │ │ │ │ │ -│ │ - Custom image │ │ │ │ -│ │ - Node.js + tools │ │ │ │ -│ │ - Docker builds │ │ │ │ -│ └─────────────────────┘ └─────────────────────┘ │ -└─────────────────────────────────────────────────────────────────┘ -``` - -## Phases - -| Phase | Name | Description | Status | -|-------|------|-------------|--------| -| 1 | [Enable Actions](P1_enable_actions.md) | Configure Forgejo for Actions, deploy runner in host mode | ✅ Complete | -| 2 | [Custom Runner Image](P2_mirror_and_build.md) | Build custom runner with Node.js/tools, enable standard Actions | ✅ Complete | -| 3 | [Mirror Forgejo & Build](P3_mirror_forgejo.md) | Mirror upstream Forgejo, create build workflow | Planning | -| 4 | [Self-Deploy](P4_self_deploy.md) | Forgejo deploys itself, transition to mcquack | Planning | -| 5 | [Container Builds](P5_container_builds.md) | Build custom container images (devpi, etc.) | Planning | - -## The Bootstrap Problem - -**Chicken-and-egg**: We need Forgejo Actions to build Forgejo, but Forgejo must be running first. 
- -**Additional complication**: The stock runner image lacks Node.js, so standard GitHub Actions don't work. - -**Solution**: -1. Keep current brew-based Forgejo running during setup ✅ -2. Enable Actions, deploy runner in host mode ✅ -3. **Build custom runner image** with Node.js and tools (bootstrap manually, then automate) -4. Mirror upstream Forgejo, create build workflow -5. Address cross-compilation challenge (Linux runner → macOS target) -6. First CI build creates the binary -7. CI deploys binary to indri as mcquack service -8. `brew services stop forgejo` and uninstall -9. Future builds: Forgejo builds and deploys itself - -**Cross-compilation challenge**: -The runner runs in Linux containers (k8s), but Forgejo needs to run on indri (macOS ARM64). Options: -- Cross-compile with CGO_ENABLED=1 (complex, needs OSX toolchain) -- Cross-compile with CGO_ENABLED=0 (breaks Tailscale DNS resolution) -- Build on gilbert manually, use CI only for deploy -- Run a native macOS runner on indri (outside k8s) - -This will be addressed in Phase 3. - -**Risk mitigation**: If self-deployment breaks Forgejo: -- blumeops is mirrored to GitHub -- Manual recovery: build on gilbert, scp to indri, restart service -- See Disaster Recovery section in P4 - -## Host Mode Runner - -The runner uses **host mode** (`ubuntu-latest:host`), meaning: -- Jobs run directly in the runner container (no Docker/k8s pods spawned) -- Tools must be pre-installed in the runner image -- Stock image lacks Node.js, so `actions/checkout@v4` doesn't work -- Solution: Build custom runner image with necessary tools (Phase 2) - -## Ansible Role Strategy - -The forgejo ansible role will follow the zot/alloy pattern: - -1. **Check binary exists** at expected path -2. **If missing**: Fail with message pointing to CI trigger instructions -3. 
**If present**: Deploy config, ensure LaunchAgent loaded - -Ansible does NOT: -- Build the binary (that's CI's job) -- Deploy new versions (that's CI's job) - -Ansible DOES: -- Manage app.ini configuration (via template with secrets from 1Password) -- Manage mcquack LaunchAgent plist -- Ensure service is running -- Collect logs via Alloy - -## Files Summary - -### New Files - -| Path | Purpose | -|------|---------| -| `argocd/apps/forgejo-runner.yaml` | ArgoCD Application for runner ✅ | -| `argocd/manifests/forgejo-runner/` | Runner k8s manifests ✅ | -| `argocd/manifests/forgejo-runner/Dockerfile` | Custom runner image (P2) | -| `.forgejo/workflows/build-runner.yml` | Auto-rebuild runner image (P2) | -| `.forgejo/workflows/test.yml` | Test workflow ✅ | -| (on forge) `eblume/forgejo/.forgejo/workflows/` | Build workflow in forgejo mirror (P3) | - -### Modified Files - -| Path | Change | -|------|--------| -| `ansible/roles/forgejo/` | Complete rewrite for mcquack pattern (P4) | -| `ansible/roles/alloy/defaults/main.yml` | Update forgejo log paths (P4) | -| zk cards | Update forgejo, argocd, blumeops cards | - -### Credentials Needed - -| Item | Purpose | Storage | -|------|---------|---------| -| Runner registration token | Runner auth to Forgejo | 1Password ✅ | -| SSH deploy key | Runner SSH to indri (for Forgejo deploy) | 1Password + k8s secret (P3) | - -## Related Plans - -- [P7_forgejo.md](../k8s-migration/P7_forgejo.md) - Original k8s migration plan (superseded for Forgejo itself, but SSH hostname split info still relevant) -- [P8_woodpecker.md](../k8s-migration/P8_woodpecker.md) - Original Woodpecker plan (superseded by Forgejo Actions) - -## Decision Log - -### 2026-01-23: Custom runner image as Phase 2 - -**Decision**: Move custom runner image work from P4 to P2 - -**Rationale**: -- Stock runner lacks Node.js, can't run `actions/checkout@v4` -- Need working GitHub Actions before building Forgejo -- Bootstrap manually (podman build on gilbert), then automate 
- -### 2026-01-23: Forgejo Actions over Woodpecker - -**Decision**: Use Forgejo Actions instead of Woodpecker CI - -**Rationale**: -- Native Forgejo integration (Actions is built-in) -- GitHub Actions compatible (reuse existing actions) -- `act` for local testing -- One less system to deploy and maintain - -### 2026-01-23: Keep Forgejo on indri (not k8s) - -**Decision**: Forgejo stays on indri as mcquack service, not migrated to k8s - -**Rationale**: -- Avoid circular dependency (ArgoCD needs Forgejo to deploy Forgejo) -- Simpler SSH handling (direct port, no k8s networking complexity) -- Forgejo is critical infrastructure, benefits from isolation -- Can still use Tailscale serve for external access diff --git a/plans/ci-cd-bootstrap/P1_enable_actions.md b/plans/ci-cd-bootstrap/P1_enable_actions.md deleted file mode 100644 index a988348..0000000 --- a/plans/ci-cd-bootstrap/P1_enable_actions.md +++ /dev/null @@ -1,322 +0,0 @@ -# Phase 1: Enable Forgejo Actions - -**Goal**: Configure Forgejo to support Actions workflows and deploy a runner in k8s - -**Status**: Completed (2026-01-23) - -**Prerequisites**: None (uses existing brew-based Forgejo) - ---- - -## Current State - -- Forgejo runs via `brew services` on indri -- Config at `/opt/homebrew/var/forgejo/custom/conf/app.ini` -- Actions not enabled -- No runners deployed - ---- - -## Step 1: Enable Actions in Forgejo - -### 1.1 Update app.ini - -SSH to indri and edit the Forgejo config: - -```bash -ssh indri 'vim /opt/homebrew/var/forgejo/custom/conf/app.ini' -``` - -Add the following sections: - -```ini -[actions] -ENABLED = true -DEFAULT_ACTIONS_URL = https://code.forgejo.org - -[repository] -; Allow workflows to be stored in .forgejo/workflows -DEFAULT_REPO_UNITS = repo.code,repo.issues,repo.pulls,repo.releases,repo.wiki,repo.projects,repo.packages,repo.actions -``` - -### 1.2 Restart Forgejo - -```bash -ssh indri 'brew services restart forgejo' -``` - -### 1.3 Verify Actions Enabled - -1. 
Go to https://forge.tail8d86e.ts.net -2. Navigate to any repo → Settings → Actions -3. Should see "Enable Repository Actions" option - ---- - -## Step 2: Create Runner Registration Token - -### 2.1 Generate Token in Forgejo UI - -1. Go to https://forge.tail8d86e.ts.net/admin/actions/runners -2. Click "Create new Runner" -3. Copy the registration token -4. Store in 1Password (blumeops vault) as "Forgejo Runner Token" - -### 2.2 Create k8s Secret Template - -Create `argocd/manifests/forgejo-runner/secret-token.yaml.tpl`: - -```yaml -# Template for op inject -apiVersion: v1 -kind: Secret -metadata: - name: forgejo-runner-token - namespace: forgejo-runner -type: Opaque -stringData: - token: "op://blumeops//token" -``` - ---- - -## Step 3: Deploy Runner to Kubernetes - -### 3.1 Create ArgoCD Application - -Create `argocd/apps/forgejo-runner.yaml`: - -```yaml -apiVersion: argoproj.io/v1alpha1 -kind: Application -metadata: - name: forgejo-runner - namespace: argocd -spec: - project: default - source: - repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git - targetRevision: main - path: argocd/manifests/forgejo-runner - destination: - server: https://kubernetes.default.svc - namespace: forgejo-runner - syncPolicy: - syncOptions: - - CreateNamespace=true -``` - -### 3.2 Create Runner Manifests - -Create directory `argocd/manifests/forgejo-runner/` with: - -**kustomization.yaml**: -```yaml -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -namespace: forgejo-runner -resources: - - namespace.yaml - - deployment.yaml - - serviceaccount.yaml - - secret-token.yaml -``` - -**namespace.yaml**: -```yaml -apiVersion: v1 -kind: Namespace -metadata: - name: forgejo-runner -``` - -**serviceaccount.yaml**: -```yaml -apiVersion: v1 -kind: ServiceAccount -metadata: - name: forgejo-runner - namespace: forgejo-runner -``` - -**deployment.yaml**: -```yaml -apiVersion: apps/v1 -kind: Deployment -metadata: - name: forgejo-runner - namespace: forgejo-runner 
-spec: - replicas: 1 - selector: - matchLabels: - app: forgejo-runner - template: - metadata: - labels: - app: forgejo-runner - spec: - serviceAccountName: forgejo-runner - containers: - - name: runner - image: code.forgejo.org/forgejo/runner:3.5.1 - env: - - name: FORGEJO_INSTANCE_URL - value: "https://forge.tail8d86e.ts.net" - - name: RUNNER_NAME - value: "k8s-runner-1" - - name: RUNNER_TOKEN - valueFrom: - secretKeyRef: - name: forgejo-runner-token - key: token - command: - - /bin/sh - - -c - - | - # Register runner if not already registered - if [ ! -f /data/.runner ]; then - forgejo-runner register \ - --instance "$FORGEJO_INSTANCE_URL" \ - --token "$RUNNER_TOKEN" \ - --name "$RUNNER_NAME" \ - --labels "ubuntu-latest:docker://node:20-bookworm,ubuntu-22.04:docker://ubuntu:22.04" \ - --no-interactive - fi - # Start the runner daemon - forgejo-runner daemon - volumeMounts: - - name: runner-data - mountPath: /data - - name: docker-sock - mountPath: /var/run/docker.sock - resources: - requests: - memory: "256Mi" - cpu: "100m" - limits: - memory: "1Gi" - cpu: "1000m" - volumes: - - name: runner-data - emptyDir: {} - - name: docker-sock - hostPath: - path: /var/run/docker.sock - type: Socket -``` - -**Note**: The runner needs access to Docker to run workflow jobs in containers. In minikube with docker driver, `/var/run/docker.sock` is available. 
- ---- - -## Step 4: Deploy and Verify - -### 4.1 Inject Secrets and Deploy - -```bash -# Inject secrets -op inject -i argocd/manifests/forgejo-runner/secret-token.yaml.tpl \ - -o argocd/manifests/forgejo-runner/secret-token.yaml - -# Sync apps -argocd app sync apps -argocd app sync forgejo-runner -``` - -### 4.2 Verify Runner Registration - -```bash -# Check runner pod -kubectl --context=minikube-indri -n forgejo-runner get pods - -# Check runner logs -kubectl --context=minikube-indri -n forgejo-runner logs -f deployment/forgejo-runner - -# Verify in Forgejo UI -# Go to https://forge.tail8d86e.ts.net/admin/actions/runners -# Should see "k8s-runner-1" as online -``` - ---- - -## Step 5: Test with Simple Workflow - -### 5.1 Create Test Workflow - -In the blumeops repo, create `.forgejo/workflows/test.yml`: - -```yaml -name: Test CI - -on: - push: - branches: [main] - pull_request: - workflow_dispatch: - -jobs: - test: - runs-on: ubuntu-latest - steps: - - uses: actions/checkout@v4 - - name: Hello World - run: | - echo "Hello from Forgejo Actions!" - echo "Runner: ${{ runner.name }}" - echo "Repo: ${{ github.repository }}" -``` - -### 5.2 Push and Verify - -```bash -git add .forgejo/ -git commit -m "Add test workflow for Forgejo Actions" -git push -``` - -Check https://forge.tail8d86e.ts.net/eblume/blumeops/actions for the workflow run. - ---- - -## Verification Checklist - -- [x] Actions enabled in app.ini -- [x] Forgejo restarted successfully -- [x] Runner token stored in 1Password -- [x] Runner deployment created in ArgoCD -- [x] Runner pod running in k8s -- [x] Runner shows as online in Forgejo admin -- [x] Test workflow runs successfully - ---- - -## Troubleshooting - -### Runner Can't Connect to Forgejo - -The runner needs to reach `forge.tail8d86e.ts.net` from inside k8s. This should work via Tailscale operator egress (already configured for ArgoCD). 
- -If not working: -```bash -# Test from inside k8s -kubectl --context=minikube-indri run -it --rm curl --image=curlimages/curl -- \ - curl -v https://forge.tail8d86e.ts.net/api/v1/version -``` - -### Docker Socket Permission Denied - -The runner container needs to access the Docker socket. In minikube with docker driver, this should work. If permission denied: - -```bash -# Check socket permissions -kubectl --context=minikube-indri -n forgejo-runner exec deployment/forgejo-runner -- ls -la /var/run/docker.sock -``` - -May need to run runner as root or adjust security context. - ---- - -## Next Phase - -Once runner is working, proceed to [Phase 2: Mirror & Build](P2_mirror_and_build.md). diff --git a/plans/ci-cd-bootstrap/P2_mirror_and_build.md b/plans/ci-cd-bootstrap/P2_mirror_and_build.md deleted file mode 100644 index 5981066..0000000 --- a/plans/ci-cd-bootstrap/P2_mirror_and_build.md +++ /dev/null @@ -1,347 +0,0 @@ -# Phase 2: Custom Runner Image - -**Goal**: Build a custom forgejo-runner image with necessary tools, enabling standard GitHub Actions - -**Status**: Complete (2026-01-23) - -**Prerequisites**: [Phase 1](P1_enable_actions.md) complete (Actions enabled, runner deployed in host mode) - ---- - -## Problem Statement - -The stock `code.forgejo.org/forgejo/runner:3.5.1` image lacks tools needed for standard GitHub Actions: -- **Node.js** - Required by most actions (checkout, setup-*, etc.) -- **Git** - For repository operations (present but minimal) -- **Common build tools** - make, gcc, curl, jq, etc. - -In host mode, jobs run directly in the runner container, so these tools must be pre-installed. - -### Chicken-and-Egg Problem - -We can't use `actions/checkout@v4` to build the custom runner because that action requires Node.js, which we don't have yet. Solution: Bootstrap manually, then automate. 
- ---- - -## Step 1: Create Dockerfile for Custom Runner - -Create `argocd/manifests/forgejo-runner/Dockerfile`: - -```dockerfile -FROM code.forgejo.org/forgejo/runner:3.5.1 - -# The base image is Debian-based -# Install tools needed for GitHub Actions and builds -RUN apt-get update && apt-get install -y --no-install-recommends \ - # Required for actions/checkout and other Node-based actions - nodejs \ - npm \ - # Build essentials - git \ - curl \ - wget \ - jq \ - make \ - gcc \ - g++ \ - # For container builds (if we add Docker-in-Docker later) - ca-certificates \ - && rm -rf /var/lib/apt/lists/* - -# Verify Node.js is available -RUN node --version && npm --version -``` - ---- - -## Step 2: Bootstrap - Build Image Manually - -Since we can't use CI yet, build the image manually on gilbert and push to zot. - -### 2.1 Build with Podman - -```bash -cd ~/code/personal/blumeops/argocd/manifests/forgejo-runner - -# Build for linux/arm64 (minikube on M1 Mac) -podman build --platform linux/arm64 -t registry.tail8d86e.ts.net/blumeops/forgejo-runner:latest . - -# Push to zot (no auth required) -podman push registry.tail8d86e.ts.net/blumeops/forgejo-runner:latest -``` - -### 2.2 Verify Image in Registry - -```bash -curl -s https://registry.tail8d86e.ts.net/v2/blumeops/forgejo-runner/tags/list | jq . 
-``` - ---- - -## Step 3: Update Runner Deployment - -### 3.1 Update deployment.yaml - -Change the image from stock to custom: - -```yaml -# Before -image: code.forgejo.org/forgejo/runner:3.5.1 - -# After -image: registry.tail8d86e.ts.net/blumeops/forgejo-runner:latest -``` - -### 3.2 Update kustomization.yaml - -Add Dockerfile to resources (for reference, not deployed): - -```yaml -# Note: Dockerfile is for building, not k8s deployment -# It lives here for co-location with the runner manifests -``` - -### 3.3 Sync Deployment - -```bash -argocd app sync forgejo-runner - -# Verify new image is running -kubectl --context=minikube-indri -n forgejo-runner get pods -o jsonpath='{.items[*].spec.containers[*].image}' -``` - ---- - -## Step 4: Test with Real GitHub Action - -Now that we have Node.js, test with `actions/checkout@v4`. - -### 4.1 Update Test Workflow - -Update `.forgejo/workflows/test.yml`: - -```yaml -name: Test CI - -on: - push: - branches: [main] - pull_request: - workflow_dispatch: - -jobs: - test: - runs-on: ubuntu-latest - steps: - - name: Checkout - uses: actions/checkout@v4 - - - name: Verify tools - run: | - echo "Node.js: $(node --version)" - echo "npm: $(npm --version)" - echo "Git: $(git --version)" - echo "Make: $(make --version | head -1)" - - - name: Show repo info - run: | - echo "Repository: ${{ github.repository }}" - echo "Branch: ${{ github.ref_name }}" - ls -la -``` - -### 4.2 Push and Verify - -```bash -git add .forgejo/workflows/test.yml -git commit -m "Test checkout action with custom runner" -git push -``` - -Check https://forge.tail8d86e.ts.net/eblume/blumeops/actions - should see successful run with `actions/checkout@v4`. - ---- - -## Step 5: Create Auto-Build Workflow for Runner - -Now that Actions work properly, create a workflow to rebuild the runner image automatically. 
- -### 5.1 Create Build Workflow - -Create `.forgejo/workflows/build-runner.yml`: - -```yaml -name: Build Runner Image - -on: - push: - paths: - - 'argocd/manifests/forgejo-runner/Dockerfile' - - '.forgejo/workflows/build-runner.yml' - workflow_dispatch: - -env: - REGISTRY: registry.tail8d86e.ts.net - IMAGE_NAME: blumeops/forgejo-runner - -jobs: - build: - runs-on: ubuntu-latest - steps: - - name: Checkout - uses: actions/checkout@v4 - - - name: Build image - run: | - cd argocd/manifests/forgejo-runner - # Use docker build (available in runner container) - # Note: This builds for the runner's native arch - docker build -t ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} . - docker tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \ - ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest - - - name: Push to registry - run: | - # Zot has no auth, just push - docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} - docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest - - - name: Verify push - run: | - curl -sf "https://${{ env.REGISTRY }}/v2/${{ env.IMAGE_NAME }}/tags/list" | jq . - echo "Image pushed: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}" -``` - -### 5.2 Note on Docker-in-Docker - -The runner runs in host mode, so we need Docker CLI available. Options: - -1. **Add Docker CLI to the custom image** (see Dockerfile update below) -2. **Mount Docker socket from minikube** (requires deployment change) -3. **Use Podman instead** (rootless, no socket needed) - -For now, we'll add Docker CLI to the image and mount the socket. 
- -### 5.3 Update Dockerfile for Docker Builds - -```dockerfile -FROM code.forgejo.org/forgejo/runner:3.5.1 - -RUN apt-get update && apt-get install -y --no-install-recommends \ - nodejs \ - npm \ - git \ - curl \ - wget \ - jq \ - make \ - gcc \ - g++ \ - ca-certificates \ - # Docker CLI for building container images - docker.io \ - && rm -rf /var/lib/apt/lists/* - -RUN node --version && npm --version && docker --version -``` - -### 5.4 Update Deployment for Docker Socket - -Add Docker socket mount to `deployment.yaml`: - -```yaml -volumeMounts: - - name: runner-data - mountPath: /data - - name: runner-config - mountPath: /config - - name: docker-sock - mountPath: /var/run/docker.sock -volumes: - - name: runner-data - emptyDir: {} - - name: runner-config - configMap: - name: forgejo-runner-config - - name: docker-sock - hostPath: - path: /var/run/docker.sock - type: Socket -``` - ---- - -## Step 6: Verification - -### 6.1 Manual Image Build Works - -```bash -# On gilbert -podman build --platform linux/arm64 -t registry.tail8d86e.ts.net/blumeops/forgejo-runner:test . -podman push registry.tail8d86e.ts.net/blumeops/forgejo-runner:test -``` - -### 6.2 Runner Uses Custom Image - -```bash -kubectl --context=minikube-indri -n forgejo-runner get pods -o jsonpath='{.items[*].spec.containers[*].image}' -# Should show: registry.tail8d86e.ts.net/blumeops/forgejo-runner:latest -``` - -### 6.3 GitHub Actions Work - -- `actions/checkout@v4` succeeds -- Test workflow shows Node.js, npm, git versions - -### 6.4 Auto-Build Workflow Works - -Push a change to the Dockerfile and verify: -1. Workflow triggers -2. Image builds successfully -3. 
Image pushed to zot - ---- - -## Verification Checklist - -- [x] Dockerfile created for custom runner (Alpine-based with apk) -- [x] Image built manually on gilbert (podman build) -- [x] Image pushed to zot registry -- [x] Runner deployment updated to use custom image -- [x] Runner pod running with new image -- [x] `actions/checkout@v4` works in test workflow -- [ ] Auto-build workflow created (deferred - needs Docker socket) -- [ ] Docker socket mounted (for container builds) -- [ ] Auto-build workflow successfully rebuilds runner - ---- - -## Troubleshooting - -### Image Pull Fails in Minikube - -Minikube needs to be able to pull from zot. Check registry mirror config: -```bash -ssh indri 'minikube ssh -- cat /etc/containerd/certs.d/registry.tail8d86e.ts.net/hosts.toml' -``` - -### Docker Build Fails in Workflow - -If Docker socket mount doesn't work: -1. Check socket exists in minikube: `minikube ssh -- ls -la /var/run/docker.sock` -2. Check permissions: runner may need to be in docker group -3. Alternative: Use `podman` (rootless) instead of Docker - -### Node.js Actions Still Fail - -Ensure the runner pod restarted after image update: -```bash -kubectl --context=minikube-indri -n forgejo-runner rollout restart deployment/forgejo-runner -kubectl --context=minikube-indri -n forgejo-runner logs -f deployment/forgejo-runner -``` - ---- - -## Next Phase - -Once the custom runner is working with auto-build, proceed to [Phase 3: Mirror Forgejo & Build](P3_mirror_and_build.md) to set up Forgejo source builds. 
diff --git a/plans/ci-cd-bootstrap/P3_mirror_forgejo.md b/plans/ci-cd-bootstrap/P3_mirror_forgejo.md deleted file mode 100644 index 9e1e142..0000000 --- a/plans/ci-cd-bootstrap/P3_mirror_forgejo.md +++ /dev/null @@ -1,349 +0,0 @@ -# Phase 3: Mirror Forgejo & Build from Source - -**Goal**: Mirror upstream Forgejo to forge and create a workflow that builds it for macOS ARM64 - -**Status**: Planning - -**Prerequisites**: [Phase 2](P2_mirror_and_build.md) complete (custom runner image with Node.js/tools) - ---- - -## Problem Statement - -We want to build Forgejo from source to: -1. Have full control over the binary running on indri -2. Enable self-deployment via CI -3. Ensure proper macOS DNS resolution (requires CGO_ENABLED=1) - -### The Cross-Compilation Challenge - -The runner runs in a Linux container (k8s on indri), but the target is macOS ARM64 (indri itself). - -**Options**: - -| Option | Pros | Cons | -|--------|------|------| -| A. Cross-compile CGO_ENABLED=0 | Simple, no special toolchain | Breaks Tailscale MagicDNS resolution | -| B. Cross-compile CGO_ENABLED=1 | Proper DNS | Needs OSX cross-compiler (osxcross), complex | -| C. Build on gilbert manually | Works now, simple | Not automated, manual step | -| D. Native macOS runner on indri | Full native build | Runner outside k8s, different architecture | -| E. Hybrid: build on gilbert, deploy via CI | Uses existing tools | Partial automation | - -**Recommendation**: Start with Option C/E (manual build on gilbert, CI just deploys), then consider Option D if we want full automation. - ---- - -## Step 1: Mirror Upstream Forgejo - -### 1.1 User Action: Create Mirror on Forge - -**Manual step** (hairpinning doesn't work from indri): - -1. Go to https://forge.tail8d86e.ts.net -2. Click "+" → "New Migration" -3. Select "Gitea" as clone source -4. URL: `https://codeberg.org/forgejo/forgejo.git` -5. Repository name: `forgejo` -6. Check "This repository will be a mirror" -7. 
Click "Migrate Repository" - -### 1.2 Clone Mirror Locally - -```bash -git clone ssh://forgejo@forge.tail8d86e.ts.net/eblume/forgejo.git ~/code/3rd/forgejo -cd ~/code/3rd/forgejo -``` - ---- - -## Step 2: Understand Forgejo Build Process - -### 2.1 Build Requirements - -From Forgejo's `Makefile` and docs: - -- **Go**: 1.23+ (check `go.mod` for exact version) -- **Node.js**: 20+ (for frontend) -- **Make**: GNU Make -- **Git**: For version embedding - -### 2.2 Build Commands - -```bash -# Install frontend dependencies and build -make deps-frontend -make frontend - -# Build backend (with CGO for proper DNS on macOS) -CGO_ENABLED=1 TAGS="bindata sqlite sqlite_unlock_notify" make backend - -# Or all-in-one -CGO_ENABLED=1 TAGS="bindata sqlite sqlite_unlock_notify" make build -``` - -### 2.3 Output - -Binary at `gitea` (yes, the binary is still named `gitea` for compatibility). - ---- - -## Step 3: Build on Gilbert (Manual Bootstrap) - -For the initial bootstrap, build on gilbert (macOS ARM64 native). - -### 3.1 Setup Build Environment - -```bash -cd ~/code/3rd/forgejo -mise use go@1.23 node@20 - -# Verify tools -go version -node --version -make --version -``` - -### 3.2 Build - -```bash -# Clean build -make clean - -# Build frontend -make deps-frontend -make frontend - -# Build backend with CGO (important for macOS DNS!) -CGO_ENABLED=1 TAGS="bindata sqlite sqlite_unlock_notify" make backend - -# Verify binary -./gitea --version -file gitea # Should show: Mach-O 64-bit executable arm64 -``` - -### 3.3 Deploy to Indri - -```bash -# Copy binary -scp gitea indri:~/.local/bin/forgejo-new - -# Verify on indri -ssh indri '~/.local/bin/forgejo-new --version' -``` - ---- - -## Step 4: Create Deploy Workflow (Option E) - -Since cross-compilation is complex, use a hybrid approach: -1. Build on gilbert (manual trigger or pre-built) -2. CI workflow fetches and deploys - -### 4.1 SSH Deploy Key for Runner - -The runner needs SSH access to indri to deploy the binary. 
- -**Generate key on gilbert**: -```bash -ssh-keygen -t ed25519 -C "forgejo-runner-deploy" -f ~/.ssh/forgejo-runner-deploy -N "" -``` - -**Add public key to indri's authorized_keys**: -```bash -cat ~/.ssh/forgejo-runner-deploy.pub | ssh indri 'cat >> ~/.ssh/authorized_keys' -``` - -**Store private key in 1Password** (blumeops vault) as "Forgejo Runner Deploy Key" - -### 4.2 Create k8s Secret - -Create `argocd/manifests/forgejo-runner/secret-ssh.yaml.tpl`: - -```yaml -apiVersion: v1 -kind: Secret -metadata: - name: forgejo-runner-ssh - namespace: forgejo-runner -type: Opaque -stringData: - id_ed25519: | - op://blumeops//private-key - known_hosts: | - # Get with: ssh-keyscan indri.tail8d86e.ts.net 2>/dev/null | grep ed25519 - indri.tail8d86e.ts.net ssh-ed25519 AAAAC3... -``` - -### 4.3 Update Deployment for SSH - -Add SSH secret mount to `deployment.yaml`: - -```yaml -volumeMounts: - - name: ssh-key - mountPath: /root/.ssh - readOnly: true -volumes: - - name: ssh-key - secret: - secretName: forgejo-runner-ssh - defaultMode: 0600 -``` - -### 4.4 Create Deploy-Only Workflow - -Create `.forgejo/workflows/deploy-forgejo.yml` in blumeops: - -```yaml -name: Deploy Forgejo - -on: - workflow_dispatch: - inputs: - version: - description: 'Version to deploy (tag or commit)' - required: true - default: 'v10.0.0' - -jobs: - deploy: - runs-on: ubuntu-latest - steps: - - name: Checkout - uses: actions/checkout@v4 - - - name: Deploy to indri - env: - VERSION: ${{ github.event.inputs.version }} - run: | - # SSH config - mkdir -p ~/.ssh - cp /root/.ssh/id_ed25519 ~/.ssh/ - cp /root/.ssh/known_hosts ~/.ssh/ - chmod 600 ~/.ssh/id_ed25519 - - # Deploy script - ssh erichblume@indri.tail8d86e.ts.net << 'EOF' - set -e - cd ~/.local/bin - - # Verify the new binary exists and runs - if [ ! -f forgejo-new ]; then - echo "ERROR: forgejo-new not found. 
Build on gilbert first:" - echo " cd ~/code/3rd/forgejo && git checkout $VERSION" - echo " CGO_ENABLED=1 TAGS='bindata sqlite sqlite_unlock_notify' make build" - echo " scp gitea indri:~/.local/bin/forgejo-new" - exit 1 - fi - - ./forgejo-new --version - - # Stop current service - launchctl unload ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist 2>/dev/null || true - - # Atomic swap - mv forgejo forgejo-old 2>/dev/null || true - mv forgejo-new forgejo - - # Start new service - launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist - - # Verify it's running - sleep 5 - curl -sf http://localhost:3001/api/v1/version || exit 1 - - echo "Deploy successful!" - ./forgejo --version - EOF -``` - ---- - -## Future: Full CI Build (Option D) - -If we want full automation, consider running a native macOS runner on indri: - -### Native Runner on Indri - -```bash -# Install forgejo-runner on indri via mise -ssh indri 'mise use forgejo-runner' - -# Register as a macOS runner -ssh indri 'forgejo-runner register \ - --instance https://forge.tail8d86e.ts.net \ - --token "$TOKEN" \ - --name "indri-native" \ - --labels "macos-arm64:host" \ - --no-interactive' - -# Create LaunchAgent for runner -# (similar to other mcquack services) -``` - -Then workflow uses: -```yaml -runs-on: macos-arm64 -``` - -This enables full native builds in CI. Document in a future phase if needed. 
- ---- - -## Verification Checklist - -- [ ] Forgejo mirrored to forge -- [ ] Mirror cloned to ~/code/3rd/forgejo -- [ ] Build succeeds on gilbert -- [ ] Binary is valid macOS ARM64 executable -- [ ] Binary deployed to indri ~/.local/bin/ -- [ ] SSH deploy key created and stored in 1Password -- [ ] Deploy key added to indri authorized_keys -- [ ] (Optional) k8s SSH secret created -- [ ] (Optional) Deploy workflow created - ---- - -## Troubleshooting - -### Build Fails: Node.js Version - -``` -error: engine "node" is incompatible -``` - -Update Node.js: `mise use node@20` - -### Build Fails: Go Version - -``` -go: go.mod requires go >= 1.23 -``` - -Update Go: `mise use go@1.23` - -### Binary Crashes on indri - -Check if CGO was enabled: -```bash -# If built without CGO, DNS resolution may fail -./forgejo --version # Should work -./forgejo web # May fail to resolve Tailscale hostnames -``` - -Rebuild with `CGO_ENABLED=1`. - -### SSH Deploy Fails - -Check runner has SSH access: -```bash -# Test from inside runner pod -kubectl --context=minikube-indri -n forgejo-runner exec deployment/forgejo-runner -- \ - ssh -i /root/.ssh/id_ed25519 erichblume@indri.tail8d86e.ts.net 'echo ok' -``` - ---- - -## Next Phase - -Once Forgejo is building and deploying successfully, proceed to [Phase 4: Self-Deploy](P4_self_deploy.md) for the full mcquack transition. diff --git a/plans/ci-cd-bootstrap/P4_self_deploy.md b/plans/ci-cd-bootstrap/P4_self_deploy.md deleted file mode 100644 index 8a73843..0000000 --- a/plans/ci-cd-bootstrap/P4_self_deploy.md +++ /dev/null @@ -1,409 +0,0 @@ -# Phase 4: Self-Deploy & Transition to mcquack - -**Goal**: Complete the bootstrap - Forgejo deploys itself, transition from brew to mcquack LaunchAgent - -**Status**: Planning - -**Prerequisites**: [Phase 3](P3_mirror_forgejo.md) complete (Forgejo builds and deploys to indri) - ---- - -## Overview - -This phase completes the bootstrap: -1. First successful CI deploy creates the binary -2. 
Transition from brew service to mcquack LaunchAgent -3. Update ansible role to mcquack pattern -4. Remove brew forgejo - -After this phase, Forgejo builds and deploys itself on every tagged release. - ---- - -## Step 1: Prepare indri for mcquack - -### 1.1 Create Directory Structure - -```bash -ssh indri << 'EOF' - mkdir -p ~/.local/bin - mkdir -p ~/.config/forgejo - mkdir -p ~/Library/Logs -EOF -``` - -### 1.2 Prepare Data Directory - -The existing data is at `/opt/homebrew/var/forgejo`. We'll keep it there for now (simpler), or optionally migrate to `~/forgejo`. - -**Option A: Keep existing path** (recommended for simplicity) -- Data stays at `/opt/homebrew/var/forgejo` -- Binary moves to `~/.local/bin/forgejo` - -**Option B: Full migration** -- Move data to `~/forgejo` -- Requires updating app.ini paths - -For this plan, we'll use Option A. - ---- - -## Step 2: First CI Deploy - -### 2.1 Trigger Build with Deploy - -1. Go to https://forge.tail8d86e.ts.net/eblume/forgejo/actions -2. Select "Build Forgejo" workflow -3. Click "Run workflow" -4. Set deploy=true -5. 
Monitor the run - -### 2.2 Verify Binary Deployed - -```bash -ssh indri 'ls -la ~/.local/bin/forgejo && ~/.local/bin/forgejo --version' -``` - -At this point: -- New binary is at `~/.local/bin/forgejo` -- Brew forgejo is still running -- LaunchAgent doesn't exist yet - ---- - -## Step 3: Create mcquack LaunchAgent - -### 3.1 Create Plist Manually (One-Time Bootstrap) - -```bash -ssh indri << 'EOF' -cat > ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist << 'PLIST' -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd"> -<plist version="1.0"> -<dict> - <key>Label</key> - <string>mcquack.eblume.forgejo</string> - <key>ProgramArguments</key> - <array> - <string>/Users/erichblume/.local/bin/forgejo</string> - <string>web</string> - <string>--config</string> - <string>/opt/homebrew/var/forgejo/custom/conf/app.ini</string> - <string>--work-path</string> - <string>/opt/homebrew/var/forgejo</string> - </array> - <key>RunAtLoad</key> - <true/> - <key>KeepAlive</key> - <true/> - <key>StandardOutPath</key> - <string>/Users/erichblume/Library/Logs/mcquack.forgejo.out.log</string> - <key>StandardErrorPath</key> - <string>/Users/erichblume/Library/Logs/mcquack.forgejo.err.log</string> - <key>EnvironmentVariables</key> - <dict> - <key>HOME</key> - <string>/Users/erichblume</string> - <key>USER</key> - <string>erichblume</string> - </dict> -</dict> -</plist> -PLIST -EOF -``` - ---- - -## Step 4: Cutover from Brew to mcquack - -### 4.1 Stop Brew Service - -```bash -ssh indri 'brew services stop forgejo' -``` - -### 4.2 Start mcquack Service - -```bash -ssh indri 'launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist' -``` - -### 4.3 Verify Service Running - -```bash -# Check process -ssh indri 'launchctl list | grep forgejo' - -# Check logs -ssh indri 'tail -20 ~/Library/Logs/mcquack.forgejo.err.log' - -# Check HTTP -curl -s https://forge.tail8d86e.ts.net/api/v1/version -``` - -### 4.4 Verify Git Operations - -```bash -# SSH test -ssh -T forgejo@forge.tail8d86e.ts.net - -# Clone test -git clone ssh://forgejo@forge.tail8d86e.ts.net/eblume/blumeops.git /tmp/test-clone -rm -rf /tmp/test-clone -``` - ---- - -## Step 5: Update Ansible Role - -### 5.1 Rewrite forgejo Role - -Replace `ansible/roles/forgejo/tasks/main.yml`: - -```yaml ---- -# Forgejo is built from source via CI and deployed automatically. -# This role manages the configuration and LaunchAgent only.
-# -# BINARY DEPLOYMENT: -# The binary at ~/.local/bin/forgejo is deployed by Forgejo Actions CI. -# If missing, trigger a build at: -# https://forge.tail8d86e.ts.net/eblume/forgejo/actions -# -# CONFIGURATION: -# app.ini at /opt/homebrew/var/forgejo/custom/conf/app.ini contains secrets -# and is NOT managed by ansible. It is backed up by borgmatic. - -- name: Verify forgejo binary exists - ansible.builtin.stat: - path: "{{ forgejo_binary }}" - register: forgejo_binary_stat - -- name: Fail if forgejo binary not found - ansible.builtin.fail: - msg: | - Forgejo binary not found at {{ forgejo_binary }}. - - The binary is deployed by Forgejo Actions CI. To build and deploy: - 1. Go to https://forge.tail8d86e.ts.net/eblume/forgejo/actions - 2. Select "Build Forgejo" workflow - 3. Click "Run workflow" with deploy=true - - Alternatively, build manually on gilbert and scp to indri. - when: not forgejo_binary_stat.stat.exists - -- name: Check forgejo config exists - ansible.builtin.stat: - path: "{{ forgejo_config }}" - register: forgejo_config_stat - -- name: Fail if forgejo config is missing - ansible.builtin.fail: - msg: | - Forgejo config not found at {{ forgejo_config }} - This file contains secrets and is not managed by ansible. 
- To restore from backup, run: - borgmatic --config ~/.config/borgmatic/config.yaml extract --archive latest \ - --path {{ forgejo_config }} - when: not forgejo_config_stat.stat.exists - -- name: Deploy forgejo LaunchAgent plist - ansible.builtin.template: - src: forgejo.plist.j2 - dest: ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist - mode: '0644' - notify: Restart forgejo - -- name: Check if forgejo LaunchAgent is loaded - ansible.builtin.command: launchctl list mcquack.eblume.forgejo - register: forgejo_launchctl_check - changed_when: false - failed_when: false - -- name: Load forgejo LaunchAgent if not loaded - ansible.builtin.command: launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist - when: forgejo_launchctl_check.rc != 0 - changed_when: true - failed_when: false -``` - -### 5.2 Create defaults/main.yml - -```yaml ---- -# Forgejo binary and paths -forgejo_binary: /Users/erichblume/.local/bin/forgejo -forgejo_work_path: /opt/homebrew/var/forgejo -forgejo_config: "{{ forgejo_work_path }}/custom/conf/app.ini" -forgejo_log_dir: /Users/erichblume/Library/Logs - -# HTTP and SSH ports (must match app.ini) -forgejo_http_port: 3001 -forgejo_ssh_port: 2200 -``` - -### 5.3 Create templates/forgejo.plist.j2 - -```xml - - - - - - Label - mcquack.eblume.forgejo - ProgramArguments - - {{ forgejo_binary }} - web - --config - {{ forgejo_config }} - --work-path - {{ forgejo_work_path }} - - RunAtLoad - - KeepAlive - - StandardOutPath - {{ forgejo_log_dir }}/mcquack.forgejo.out.log - StandardErrorPath - {{ forgejo_log_dir }}/mcquack.forgejo.err.log - EnvironmentVariables - - HOME - /Users/erichblume - USER - erichblume - - - -``` - -### 5.4 Update handlers/main.yml - -```yaml ---- -- name: Restart forgejo - ansible.builtin.shell: | - launchctl unload ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist 2>/dev/null || true - launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist - changed_when: true -``` - ---- - -## Step 6: Update Alloy Log 
Collection - -Update `ansible/roles/alloy/defaults/main.yml`: - -Change forgejo log paths from brew to mcquack: -```yaml -alloy_brew_logs: - # Remove forgejo from here - - path: /opt/homebrew/var/log/tailscaled.log - service: tailscale - stream: stdout - -alloy_mcquack_logs: - # ... existing entries ... - - path: /Users/erichblume/Library/Logs/mcquack.forgejo.out.log - service: forgejo - stream: stdout - - path: /Users/erichblume/Library/Logs/mcquack.forgejo.err.log - service: forgejo - stream: stderr -``` - ---- - -## Step 7: Remove Brew Forgejo - -### 7.1 Uninstall Brew Package - -```bash -ssh indri 'brew uninstall forgejo' -``` - -### 7.2 Remove Old Logs - -```bash -ssh indri 'rm -f /opt/homebrew/var/log/forgejo.log' -``` - ---- - -## Step 8: Run Ansible - -```bash -mise run provision-indri -- --tags forgejo,alloy -``` - ---- - -## Disaster Recovery - -### If CI Deploy Breaks Forgejo - -1. **Build manually on gilbert**: - ```bash - cd ~/code/3rd/forgejo - git pull - mise use go node - TAGS="bindata sqlite sqlite_unlock_notify" make build - scp gitea indri:~/.local/bin/forgejo - ``` - -2. **Restart service**: - ```bash - ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist; launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist' - ``` - -3. **Verify**: - ```bash - curl https://forge.tail8d86e.ts.net/api/v1/version - ``` - -### If Forgejo Won't Start - -1. Check logs: `ssh indri 'tail -100 ~/Library/Logs/mcquack.forgejo.err.log'` -2. Check binary: `ssh indri '~/.local/bin/forgejo --version'` -3. Check config: `ssh indri 'cat /opt/homebrew/var/forgejo/custom/conf/app.ini | head -50'` -4. 
Try running manually: `ssh indri '~/.local/bin/forgejo web --config /opt/homebrew/var/forgejo/custom/conf/app.ini --work-path /opt/homebrew/var/forgejo'` - -### Switch ArgoCD to GitHub (Nuclear Option) - -If Forgejo is down and you need to deploy fixes: - -```bash -argocd repo add https://github.com/eblume/blumeops.git --username eblume --password $GITHUB_PAT -argocd app set apps --repo https://github.com/eblume/blumeops.git -argocd app sync apps -``` - -After recovery, switch back to Forgejo. - ---- - -## Verification Checklist - -- [ ] CI deploy completed successfully -- [ ] Binary at `~/.local/bin/forgejo` -- [ ] mcquack LaunchAgent created -- [ ] Brew service stopped -- [ ] mcquack service started -- [ ] HTTP works (`curl https://forge.tail8d86e.ts.net/api/v1/version`) -- [ ] SSH works (`ssh -T forgejo@forge.tail8d86e.ts.net`) -- [ ] Git clone/push works -- [ ] Ansible role updated -- [ ] Alloy logs updated -- [ ] Brew package uninstalled -- [ ] `mise run provision-indri` succeeds - ---- - -## Next Phase - -After bootstrap is complete, proceed to [Phase 5: Container Builds](P5_container_builds.md) to set up container image building for ArgoCD. 
diff --git a/plans/ci-cd-bootstrap/P5_container_builds.md b/plans/ci-cd-bootstrap/P5_container_builds.md deleted file mode 100644 index fcae2b2..0000000 --- a/plans/ci-cd-bootstrap/P5_container_builds.md +++ /dev/null @@ -1,505 +0,0 @@ -# Phase 5: Container Image Builds - -**Goal**: Set up CI workflows to build custom container images and push to zot registry - -**Status**: Planning - -**Prerequisites**: [Phase 4](P4_self_deploy.md) complete (Forgejo self-deploying, Actions working) - ---- - -## Overview - -With Forgejo Actions operational (including custom runner from P2), we can now build container images for: -- Custom devpi with pre-installed plugins -- Any other custom images needed for k8s services -- Release artifacts for Python packages - -**Note**: The custom runner image build is covered in [Phase 2](P2_mirror_and_build.md). This phase focuses on application container builds. - ---- - -## Use Case 1: devpi Custom Image - -### Current State - -devpi runs from `registry.tail8d86e.ts.net/blumeops/devpi:latest`, built manually: -- Base image: python -- Adds: devpi-server, devpi-web -- Startup script for auto-initialization - -### Goal - -Automate builds triggered by: -- Push to devpi repo on forge -- Manual workflow dispatch -- Optionally: upstream devpi release (via schedule check) - ---- - -## Step 1: Create Workflow for devpi - -### 1.1 Ensure devpi Repo Has Dockerfile - -The Dockerfile already exists at `argocd/manifests/devpi/Dockerfile`. We'll create a workflow in the blumeops repo that builds it. 
- -### 1.2 Create Build Workflow - -Create `.forgejo/workflows/build-devpi.yml` in blumeops repo: - -```yaml -name: Build devpi Image - -on: - push: - paths: - - 'argocd/manifests/devpi/Dockerfile' - - 'argocd/manifests/devpi/start.sh' - - '.forgejo/workflows/build-devpi.yml' - workflow_dispatch: - inputs: - tag: - description: 'Image tag (default: latest)' - required: false - default: 'latest' - -env: - REGISTRY: registry.tail8d86e.ts.net - IMAGE_NAME: blumeops/devpi - -jobs: - build: - runs-on: ubuntu-latest - steps: - - name: Checkout - uses: actions/checkout@v4 - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Determine tag - id: tag - run: | - if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then - TAG="${{ github.event.inputs.tag }}" - else - TAG="latest" - fi - echo "tag=$TAG" >> "$GITHUB_OUTPUT" - - - name: Build image - uses: docker/build-push-action@v5 - with: - context: argocd/manifests/devpi - file: argocd/manifests/devpi/Dockerfile - platforms: linux/arm64 - load: true - tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }} - - - name: Push to registry - run: | - # Zot has no auth, just push - docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }} - - - name: Verify push - run: | - # Check image exists in registry - curl -sf "https://${{ env.REGISTRY }}/v2/${{ env.IMAGE_NAME }}/tags/list" | jq . -``` - -### 1.3 Runner Needs Registry Access - -The runner needs to reach `registry.tail8d86e.ts.net`. This should work via Tailscale egress (same as Forgejo access). 
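One way to confirm registry access from inside the runner pod is sketched here. It assumes `curl` is available in the runner image, which this plan does not confirm, and the function names `registry_reachable` and `probe_registry_from_runner` are illustrative only. The OCI distribution API answers `GET /v2/` with 200, or 401 when auth is enabled, so either status is treated as proof that the endpoint is reachable.

```bash
# Interpret an HTTP status code returned by the registry's /v2/ endpoint.
# 200 = API available; 401 = API available but auth required. Both mean "reachable".
registry_reachable() {
  case "$1" in
    200|401) return 0 ;;
    *) return 1 ;;
  esac
}

# Run curl inside the runner pod and report whether the registry answered.
# Assumption: the runner image ships curl.
probe_registry_from_runner() {
  local status
  status=$(kubectl --context=minikube-indri -n forgejo-runner \
    exec deployment/forgejo-runner -- \
    curl -s -o /dev/null -w '%{http_code}' \
    https://registry.tail8d86e.ts.net/v2/) || status="000"

  if registry_reachable "$status"; then
    echo "registry reachable (HTTP $status)"
  else
    echo "registry NOT reachable (HTTP $status)" >&2
    return 1
  fi
}

# Usage (left commented so sourcing this file has no side effects):
# probe_registry_from_runner
```

A non-zero return from `probe_registry_from_runner` suggests the egress path needs attention before the build workflow can push.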
- -If not, add egress for registry in `argocd/manifests/tailscale-operator/`: -```yaml -apiVersion: tailscale.com/v1alpha1 -kind: Connector -metadata: - name: egress-registry - namespace: tailscale-operator -spec: - hostname: egress-registry - subnetRouter: - advertiseRoutes: - - registry.tail8d86e.ts.net/32 -``` - ---- - -## Step 2: Test Build Workflow - -### 2.1 Push and Trigger - -```bash -# Make a small change to trigger -echo "# Build $(date)" >> argocd/manifests/devpi/Dockerfile -git add argocd/manifests/devpi/Dockerfile -git commit -m "Trigger devpi image rebuild" -git push -``` - -### 2.2 Monitor Build - -1. Go to https://forge.tail8d86e.ts.net/eblume/blumeops/actions -2. Watch "Build devpi Image" workflow -3. Verify success - -### 2.3 Verify Image in Registry - -```bash -curl -s https://registry.tail8d86e.ts.net/v2/blumeops/devpi/tags/list | jq . -``` - -### 2.4 Restart devpi to Use New Image - -```bash -kubectl --context=minikube-indri -n devpi rollout restart statefulset/devpi -``` - ---- - -## Step 3: Reusable Container Build Workflow - -### 3.1 Create Reusable Workflow - -Create `.forgejo/workflows/build-container.yml`: - -```yaml -name: Build Container Image - -on: - workflow_call: - inputs: - context: - description: 'Build context path' - required: true - type: string - dockerfile: - description: 'Dockerfile path (relative to context)' - required: false - type: string - default: 'Dockerfile' - image_name: - description: 'Image name (without registry)' - required: true - type: string - tag: - description: 'Image tag' - required: false - type: string - default: 'latest' - platforms: - description: 'Target platforms' - required: false - type: string - default: 'linux/arm64' - -env: - REGISTRY: registry.tail8d86e.ts.net - -jobs: - build: - runs-on: ubuntu-latest - steps: - - name: Checkout - uses: actions/checkout@v4 - - - name: Set up Docker Buildx - uses: docker/setup-buildx-action@v3 - - - name: Build and push - uses: docker/build-push-action@v5 - 
with: - context: ${{ inputs.context }} - file: ${{ inputs.context }}/${{ inputs.dockerfile }} - platforms: ${{ inputs.platforms }} - push: true - tags: ${{ env.REGISTRY }}/${{ inputs.image_name }}:${{ inputs.tag }} - - - name: Verify push - run: | - curl -sf "https://${{ env.REGISTRY }}/v2/${{ inputs.image_name }}/tags/list" | jq . -``` - -### 3.2 Use in devpi Workflow - -Simplify `.forgejo/workflows/build-devpi.yml`: - -```yaml -name: Build devpi Image - -on: - push: - paths: - - 'argocd/manifests/devpi/**' - workflow_dispatch: - -jobs: - build: - uses: ./.forgejo/workflows/build-container.yml - with: - context: argocd/manifests/devpi - image_name: blumeops/devpi -``` - ---- - -## Step 4: Python Package Builds (Optional) - -### 4.1 Use Case - -Build Python packages from forge repos and publish to devpi. - -Example: `mcquack` package (LaunchAgent management library) - -### 4.2 Create Python Build Workflow - -Create `.forgejo/workflows/build-python.yml`: - -```yaml -name: Build Python Package - -on: - workflow_call: - inputs: - package_path: - description: 'Path to package (contains pyproject.toml)' - required: false - type: string - default: '.' 
- python_version: - description: 'Python version' - required: false - type: string - default: '3.12' - publish: - description: 'Publish to devpi' - required: false - type: boolean - default: false - secrets: - DEVPI_PASSWORD: - required: false - -env: - DEVPI_URL: https://pypi.tail8d86e.ts.net - -jobs: - build: - runs-on: ubuntu-latest - steps: - - name: Checkout - uses: actions/checkout@v4 - - - name: Setup Python - uses: actions/setup-python@v5 - with: - python-version: ${{ inputs.python_version }} - - - name: Install uv - run: pip install uv - - - name: Build package - run: | - cd ${{ inputs.package_path }} - uv build - - - name: Upload artifact - uses: actions/upload-artifact@v4 - with: - name: dist - path: ${{ inputs.package_path }}/dist/ - - - name: Publish to devpi - if: inputs.publish - run: | - cd ${{ inputs.package_path }} - uv publish \ - --publish-url ${{ env.DEVPI_URL }}/eblume/dev/ \ - --username eblume \ - --password "${{ secrets.DEVPI_PASSWORD }}" -``` - ---- - -## Step 5: Scheduled Builds (Cron) - -### 5.1 Weekly Rebuild - -Keep images fresh with weekly rebuilds: - -```yaml -name: Weekly Image Rebuilds - -on: - schedule: - # Every Sunday at 3 AM UTC - - cron: '0 3 * * 0' - workflow_dispatch: - -jobs: - devpi: - uses: ./.forgejo/workflows/build-container.yml - with: - context: argocd/manifests/devpi - image_name: blumeops/devpi -``` - ---- - -## Future Improvements - -### Multi-Arch Builds - -For images that need both ARM64 and AMD64: - -```yaml -platforms: linux/arm64,linux/amd64 -``` - -Requires QEMU emulation setup in runner (already supported by buildx). 
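Buildx can only emulate a foreign architecture once QEMU binfmt handlers are registered on the host running the builds. A minimal sketch of that setup follows, assuming the runner is allowed to start privileged containers (not confirmed by this plan). The helper names `foreign_arches` and `register_qemu` are hypothetical; `tonistiigi/binfmt` is the image the Docker multi-platform documentation uses for this purpose.

```bash
# Print the comma-separated architectures that differ from the native platform.
# $1 = native platform (e.g. linux/arm64), $2 = comma-separated target platforms.
foreign_arches() {
  local native="$1" wanted="$2" platform result=""
  local IFS=','
  for platform in $wanted; do
    if [ "$platform" != "$native" ]; then
      # Strip the "linux/" prefix; binfmt takes bare architecture names.
      result="${result:+$result,}${platform#linux/}"
    fi
  done
  printf '%s\n' "$result"
}

# Register QEMU handlers for the given architectures (no-op when the list is empty).
# Assumption: privileged containers are permitted where this runs.
register_qemu() {
  local arches="$1"
  [ -z "$arches" ] && return 0
  docker run --privileged --rm tonistiigi/binfmt --install "$arches"
}

# Usage (left commented so sourcing this file has no side effects):
# register_qemu "$(foreign_arches linux/arm64 linux/arm64,linux/amd64)"
```

Emulated builds are typically much slower than native ones, so this is worth enabling only for images that genuinely need AMD64.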
- -### Build Caching - -Use GitHub/Forgejo cache actions: - -```yaml -- name: Cache Docker layers - uses: actions/cache@v4 - with: - path: /tmp/.buildx-cache - key: ${{ runner.os }}-buildx-${{ hashFiles('**/Dockerfile') }} -``` - -### Security Scanning - -Add Trivy or similar: - -```yaml -- name: Run Trivy vulnerability scanner - uses: aquasecurity/trivy-action@master - with: - image-ref: '${{ env.REGISTRY }}/${{ inputs.image_name }}:${{ inputs.tag }}' -``` - ---- - -## Step 6: Runner Observability (Logging & Metrics) - -### 6.1 Problem - -The forgejo-runner pod generates logs and metrics that should be collected for: -- Debugging failed workflow runs -- Monitoring runner health and capacity -- Alerting on runner failures - -### 6.2 Log Collection via Alloy - -The forgejo-runner namespace needs to be included in Alloy's k8s log collection. Alloy is already configured to scrape logs from k8s pods - verify the runner namespace is included. - -Check current Alloy config: -```bash -ssh indri 'cat ~/.config/alloy/config.alloy | grep -A20 discovery.kubernetes' -``` - -If using namespace filtering, ensure `forgejo-runner` is included. - -### 6.3 Metrics Collection - -The forgejo-runner exposes Prometheus metrics. Add a ServiceMonitor or configure Alloy to scrape: - -**Option A: ServiceMonitor (if using Prometheus Operator)** - -Create `argocd/manifests/forgejo-runner/servicemonitor.yaml`: -```yaml -apiVersion: monitoring.coreos.com/v1 -kind: ServiceMonitor -metadata: - name: forgejo-runner - namespace: forgejo-runner -spec: - selector: - matchLabels: - app: forgejo-runner - endpoints: - - port: metrics - interval: 30s -``` - -**Option B: Alloy scrape config** - -Add to Alloy's k8s scrape config to discover the runner pod's metrics endpoint. 
- -### 6.4 Create Runner Service for Metrics - -Add `argocd/manifests/forgejo-runner/service.yaml`: -```yaml -apiVersion: v1 -kind: Service -metadata: - name: forgejo-runner-metrics - namespace: forgejo-runner - labels: - app: forgejo-runner -spec: - selector: - app: forgejo-runner - ports: - - name: metrics - port: 8080 - targetPort: 8080 -``` - -Update kustomization.yaml to include the service. - -### 6.5 Grafana Dashboard - -Consider creating a dashboard for: -- Runner status (online/offline) -- Job queue depth -- Job execution time -- Success/failure rates - -### 6.6 Verification - -```bash -# Check runner logs are appearing in Loki -# Go to Grafana → Explore → Loki -# Query: {namespace="forgejo-runner"} - -# Check metrics are being scraped -# Go to Grafana → Explore → Prometheus -# Query: forgejo_runner_* -``` - ---- - -## Verification Checklist - -- [ ] devpi build workflow created -- [ ] devpi image builds successfully -- [ ] Image pushed to zot registry -- [ ] devpi pod uses new image -- [ ] Reusable container workflow created -- [ ] (Optional) Python build workflow created -- [ ] (Optional) Scheduled builds configured -- [ ] Runner logs visible in Loki -- [ ] Runner metrics scraped by Prometheus/Alloy - ---- - -## Summary - -With this phase complete, we have: -1. **Forgejo Actions** running with k8s runner -2. **Forgejo self-deploys** from CI on tagged releases -3. **Container images** built automatically on push -4. Infrastructure for Python package builds -5. **Runner observability** with logs in Loki and metrics in Prometheus - -The CI/CD bootstrap is complete. 
Future work: -- Add more container builds as needed -- Add Python package publishing for internal tools -- Consider adding a macOS runner on indri for native builds -- Create Grafana dashboards for CI/CD monitoring diff --git a/plans/completed/k8s-migration/00_overview.md b/plans/completed/k8s-migration/00_overview.md deleted file mode 100644 index 5e336c0..0000000 --- a/plans/completed/k8s-migration/00_overview.md +++ /dev/null @@ -1,79 +0,0 @@ -# Blumeops Minikube Migration Plan - -**Status**: Completed (2026-01-23) - -This plan detailed the phased migration of blumeops services from direct hosting on indri (Mac Mini M1) to a minikube cluster. The migration is now complete for all services that will be migrated. - -## Final Status - -| Phase | Name | Status | Notes | -|-------|------|--------|-------| -| 0 | [Foundation](P0_foundation.complete.md) | ✅ Complete | Container registry (zot) + minikube cluster | -| 1 | [K8s Infrastructure](P1_k8s_infrastructure.complete.md) | ✅ Complete | Tailscale operator, ArgoCD, CloudNativePG, PostgreSQL cluster | -| 2 | [Grafana](P2_grafana.complete.md) | ✅ Complete | Migrated Grafana via ArgoCD | -| 3 | [PostgreSQL](P3_postgresql.complete.md) | ✅ Complete | Data migration to k8s PostgreSQL | -| 4 | [Miniflux](P4_miniflux.complete.md) | ✅ Complete | Migrated Miniflux via ArgoCD | -| 5 | [devpi](P5_devpi.complete.md) | ✅ Complete | Migrated devpi via ArgoCD | -| 5.1 | [Docker Migration](P5.1_docker_migration.complete.md) | ✅ Complete | Switched minikube to docker driver (not QEMU2) | -| 6 | [Kiwix](P6_kiwix.complete.md) | ✅ Complete | Migrated Kiwix + Transmission via ArgoCD | -| 7 | [Forgejo](P7_forgejo.md) | ⏭️ Won't Do | Forgejo stays on indri - see [CI/CD Bootstrap](../../ci-cd-bootstrap/) | -| 8 | [Woodpecker](P8_woodpecker.md) | ⏭️ Won't Do | Replaced by Forgejo Actions - see [CI/CD Bootstrap](../../ci-cd-bootstrap/) | -| 9 | [Cleanup](P9_cleanup.md) | ⏭️ Won't Do | Observability cleanup done separately (2026-01-22) | - -## 
What Was Migrated to K8s - -| Service | Status | Notes | -|---------|--------|-------| -| Grafana | ✅ In k8s | Helm chart via ArgoCD | -| PostgreSQL | ✅ In k8s | CloudNativePG operator | -| Miniflux | ✅ In k8s | Using k8s PostgreSQL | -| devpi | ✅ In k8s | Custom container image | -| Kiwix | ✅ In k8s | NFS mount from sifaka | -| Transmission | ✅ In k8s | NFS mount from sifaka | -| Prometheus | ✅ In k8s | Migrated 2026-01-22 | -| Loki | ✅ In k8s | Migrated 2026-01-22 | -| Alloy (k8s) | ✅ In k8s | DaemonSet for pod logs | -| TeslaMate | ✅ In k8s | Added 2026-01-23 | - -## What Stays on Indri - -| Service | Reason | -|---------|--------| -| **Forgejo** | Critical infrastructure, avoids circular dependency with ArgoCD | -| **Zot Registry** | K8s needs images to start - must be outside k8s | -| **Alloy (host)** | Collects host-level metrics and logs | -| **Borgmatic** | Backup system must survive k8s failures | -| **Plex** | Uses own NAT traversal, not Tailscale | - -## Architecture Decisions Made - -### Minikube Driver: Docker (not QEMU2/Podman) -- Original plan called for QEMU2, but docker driver proved simpler -- NFS mounts work via Docker NAT through indri's LAN IP -- API server accessible via Tailscale TCP passthrough - -### Forgejo: Stays on Indri -- Original P7 planned k8s migration -- Decision changed: Forgejo is critical infrastructure -- Will be built from source via Forgejo Actions CI -- See [CI/CD Bootstrap Plan](../../ci-cd-bootstrap/) for details - -### CI/CD: Forgejo Actions (not Woodpecker) -- Original P8 planned Woodpecker deployment -- Decision changed: Use Forgejo's native Actions instead -- Simpler (one less system), GitHub Actions compatible -- See [CI/CD Bootstrap Plan](../../ci-cd-bootstrap/) for details - -### Observability: Migrated to K8s -- Original plan kept Prometheus/Loki on indri -- Changed: Migrated both to k8s (2026-01-22) -- Alloy on indri pushes to k8s endpoints -- Alloy DaemonSet in k8s collects pod logs - -## Lessons Learned - -1. 
**Docker driver is simpler than QEMU2** - Direct NFS mounts work, no VM complexity -2. **Tailscale operator works well** - Easy service exposure with automatic TLS -3. **CloudNativePG is production-ready** - Good operator, easy backups -4. **Keep critical infra outside k8s** - Forgejo and zot must survive k8s failures -5. **CGO matters on macOS** - Alloy needed CGO=1 for Tailscale DNS resolution diff --git a/plans/completed/k8s-migration/P0_foundation.complete.md b/plans/completed/k8s-migration/P0_foundation.complete.md deleted file mode 100644 index 934e83a..0000000 --- a/plans/completed/k8s-migration/P0_foundation.complete.md +++ /dev/null @@ -1,1225 +0,0 @@ -# Phase 0: Foundation (Complete) - -**Goal**: Container registry + minikube cluster without disrupting existing services - -**Status**: Complete - ---- - -## Important: Tailscale Service Creation Order - -> **WARNING**: You MUST create services in the Tailscale admin console BEFORE running `tailscale serve` commands via ansible. If you run `tailscale serve --service svc:foo` before the service exists in the admin console, the local config will be in a bad state. -> -> To fix a misconfigured service: -> ```bash -> tailscale serve --service svc:foo reset -> ``` -> Then create the service in admin console and try again. - ---- - -## Step 0.1: Update Pulumi ACLs (BEFORE Tailscale serve) - -**Files to modify:** -- `pulumi/policy.hujson` - -**Changes:** - -1. Add new tag to `tagOwners` section (around line 104, after `"tag:feed"`): -```hujson -"tag:registry": ["autogroup:admin", "tag:blumeops"], -``` - -2. 
Add test cases to `tests` section: - - Update Erich's accept list (around line 111) to include registry: - ```hujson - "accept": ["tag:grafana:443", "tag:kiwix:443", "tag:feed:443", "tag:loki:3100", "tag:pg:5432", "tag:homelab:22", "tag:registry:443"], - ``` - - Update Allison's deny list (around line 117) to deny registry: - ```hujson - "deny": ["tag:grafana:443", "tag:loki:3100", "tag:nas:445", "tag:registry:443"], - ``` - -**Note:** -- No member grant needed - admins have full access via wildcard, members don't need registry -- `tag:k8s` is added later in Phase 1 when the Tailscale Kubernetes Operator is deployed -- Zot supports htpasswd auth if we later need finer-grained control - -**Testing:** -```bash -mise run tailnet-preview # Review changes - should show new tag -mise run tailnet-up # Apply changes -``` - -**Implementation Details:** -- Also need to add `"tag:registry"` to indri's tags in `pulumi/__main__.py` (the `DeviceTags` resource), not just define it in `policy.hujson`. The policy file defines the tag ownership rules, but the device tags are managed separately in the Python code. - ---- - -## Step 0.2: Create Tailscale Services in Admin Console (MANUAL) - -> **CRITICAL**: Do this BEFORE running any ansible that calls `tailscale serve` - -1. Go to https://login.tailscale.com/admin/services -2. Create service `registry` with: - - Port: 443 (HTTPS) - - Host: indri - -**Implementation Details:** -- Tag is applied to indri via Pulumi in Step 0.1, not manually in admin console. - -**Verification:** -```bash -# Service should appear (even if not yet serving) -tailscale status | grep registry -``` - ---- - -## Step 0.3: Create Zot Registry Ansible Role - -**Note:** Zot is NOT in homebrew (no formula or tap). Clone to `~/code/3rd/` on indri and build from source (requires Go). 
- -**Prerequisites on indri (ALREADY COMPLETED):** -```bash -# Clone zot from forge mirror (use localhost:3001 - hairpinning doesn't work on indri) -ssh indri 'git clone http://localhost:3001/eblume/zot.git ~/code/3rd/zot' - -# Set up Go via mise (creates mise.toml in repo directory) -ssh indri 'cd ~/code/3rd/zot && mise use go@1.25' - -# Build (creates bin/zot-darwin-arm64, ~183MB) -ssh indri 'cd ~/code/3rd/zot && mise x -- make binary' - -# Verify binary exists -ssh indri 'ls -la ~/code/3rd/zot/bin/zot-darwin-arm64' -``` - -**Build verified:** Binary at `~/code/3rd/zot/bin/zot-darwin-arm64` (183MB, ARM64 native). - -**New files:** -``` -ansible/roles/zot/ -├── defaults/main.yml -├── tasks/main.yml -├── templates/ -│ ├── config.json.j2 -│ └── zot.plist.j2 -└── handlers/main.yml -``` - -**Key configuration (defaults/main.yml):** -```yaml -zot_repo_dir: "/Users/erichblume/code/3rd/zot" -zot_binary: "{{ zot_repo_dir }}/bin/zot-darwin-arm64" -zot_data_dir: "/Users/erichblume/zot" -zot_config_dir: "/Users/erichblume/.config/zot" -zot_port: 5000 -zot_log_dir: "/Users/erichblume/Library/Logs" - -# Pull-through cache registries (on-demand sync) -zot_sync_registries: - - name: docker.io - url: https://registry-1.docker.io - - name: ghcr.io - url: https://ghcr.io - - name: quay.io - url: https://quay.io -``` - -**Zot config.json template** (key sections): -```json -{ - "storage": { - "rootDirectory": "/Users/erichblume/zot" - }, - "http": { - "address": "0.0.0.0", - "port": "5000" - }, - "extensions": { - "sync": { - "enable": true, - "registries": [ - { - "urls": ["https://registry-1.docker.io"], - "content": [{"prefix": "**"}], - "onDemand": true, - "tlsVerify": true - }, - { - "urls": ["https://ghcr.io"], - "content": [{"prefix": "**"}], - "onDemand": true, - "tlsVerify": true - }, - { - "urls": ["https://quay.io"], - "content": [{"prefix": "**"}], - "onDemand": true, - "tlsVerify": true - } - ] - } - } -} -``` - -**Two modes of operation:** - -1. 
**Pull-through cache** (automatic): When you pull `registry.tail8d86e.ts.net/docker.io/library/nginx:latest`, Zot fetches from Docker Hub and caches locally. Subsequent pulls are local. - -2. **Private images** (manual push): Push your own images to any path NOT matching a sync prefix: - ```bash - # From gilbert (after building) - podman push myapp:v1 registry.tail8d86e.ts.net/blumeops/myapp:v1 - ``` - -**Namespace convention:** -- `registry.tail8d86e.ts.net/docker.io/*` → cached from Docker Hub -- `registry.tail8d86e.ts.net/ghcr.io/*` → cached from GHCR -- `registry.tail8d86e.ts.net/quay.io/*` → cached from Quay -- `registry.tail8d86e.ts.net/blumeops/*` → private images (built by you/Woodpecker) - -**LaunchAgent template (zot.plist.j2):** -```xml - - - - - Label - mcquack.eblume.zot - ProgramArguments - - - {{ zot_binary }} - serve - {{ zot_config_dir }}/config.json - - RunAtLoad - - KeepAlive - - StandardOutPath - {{ zot_log_dir }}/mcquack.zot.out.log - StandardErrorPath - {{ zot_log_dir }}/mcquack.zot.err.log - - -``` - -**Handlers (handlers/main.yml):** -```yaml -- name: Restart zot - ansible.builtin.shell: | - launchctl unload ~/Library/LaunchAgents/mcquack.eblume.zot.plist 2>/dev/null || true - launchctl load ~/Library/LaunchAgents/mcquack.eblume.zot.plist - changed_when: true -``` - -**Tasks should notify handler on config change:** -```yaml -- name: Deploy zot config - ansible.builtin.template: - src: config.json.j2 - dest: "{{ zot_config_dir }}/config.json" - notify: Restart zot -``` - -**Testing (after deploying role):** -```bash -# Check LaunchAgent is running -ssh indri 'launchctl list | grep zot' - -# Check zot is responding -ssh indri 'curl -s http://localhost:5000/v2/_catalog' -# Expected: {"repositories":[]} - -# Check logs for errors -ssh indri 'tail -20 ~/Library/Logs/mcquack.zot.err.log' - -# Test pull-through cache via curl (podman not installed until Step 0.8) -ssh indri 'curl -s 
http://localhost:5000/v2/docker.io/library/alpine/manifests/latest -H "Accept: application/vnd.oci.image.manifest.v1+json"' -# Should return manifest JSON (triggers cache fetch from Docker Hub) -ssh indri 'curl -s http://localhost:5000/v2/_catalog' -# Expected: {"repositories":["docker.io/library/alpine"]} -``` - -**Implementation Details:** -- Changed port from 5000 to 5050 because macOS ControlCenter (AirPlay Receiver) uses port 5000 by default. -- Fixed sync config: use `"content": [{"prefix": "**", "destination": "/{{ registry.name }}"}]` instead of `"prefix": "{{ registry.name }}/**"`. The destination rewrites the local path, while prefix `**` matches all upstream repos. - ---- - -## Step 0.4: Add Zot to Tailscale Serve - -**Files to modify:** -- `ansible/roles/tailscale_serve/defaults/main.yml` - -**Changes:** -```yaml -# Add to tailscale_serve_services list -- name: svc:registry - https: - port: 443 - upstream: http://localhost:5000 -``` - -**Testing:** -```bash -# Deploy tailscale serve config -mise run provision-indri -- --tags tailscale-serve - -# Verify from gilbert (not indri - hairpinning doesn't work) -curl -s https://registry.tail8d86e.ts.net/v2/_catalog -# Expected: {"repositories":["docker.io/library/alpine"]} (from Step 0.3 test) - -# Test private image push from gilbert -podman pull alpine:latest -podman tag alpine:latest registry.tail8d86e.ts.net/blumeops/test:v1 -podman push registry.tail8d86e.ts.net/blumeops/test:v1 -curl -s https://registry.tail8d86e.ts.net/v2/_catalog -# Expected: {"repositories":["blumeops/test","docker.io/library/alpine"]} -``` - -**Implementation Details:** -- Changed upstream port from 5000 to 5050 (see Step 0.3 implementation details). -- After running `tailscale serve`, the service must be approved in Tailscale admin console at https://login.tailscale.com/admin/services before it becomes accessible. -- Podman needed on gilbert for testing - added to Brewfile. 
Requires `podman machine init && podman machine start` after install. - ---- - -## Step 0.5: Create Zot Metrics Role - -**New files:** -``` -ansible/roles/zot_metrics/ -├── defaults/main.yml -├── tasks/main.yml -├── templates/ -│ ├── zot-metrics.sh.j2 -│ └── zot-metrics.plist.j2 -└── handlers/main.yml -``` - -**Metrics script pattern (zot-metrics.sh.j2):** -```bash -#!/bin/bash -# Collect Zot registry metrics for Prometheus textfile collector -set -euo pipefail - -METRICS_FILE="/opt/homebrew/var/node_exporter/textfile/zot.prom" -TEMP_FILE="${METRICS_FILE}.tmp" - -# Check if zot is up -if curl -sf http://localhost:5000/v2/_catalog > /dev/null 2>&1; then - echo "zot_up 1" > "$TEMP_FILE" -else - echo "zot_up 0" > "$TEMP_FILE" -fi - -mv "$TEMP_FILE" "$METRICS_FILE" -``` - -**Note:** Start with just `zot_up` for now. Additional metrics (storage usage, cache stats) can be added later after reviewing zot's metrics endpoint. - -**Testing:** -```bash -# Deploy metrics role -mise run provision-indri -- --tags zot_metrics - -# Check metrics file exists and is updated -ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/zot.prom' -# Expected: zot_up 1 - -# Verify metrics appear in Prometheus (after a scrape cycle) -curl -s "http://indri:9090/api/v1/query?query=zot_up" | jq '.data.result[0].value[1]' -# Expected: "1" -``` - ---- - -## Step 0.6: Add Zot Log Collection to Alloy - -**Files to modify:** -- `ansible/roles/alloy/defaults/main.yml` - -**Changes:** -Add to the `alloy_mcquack_logs` list: -```yaml - - path: /Users/erichblume/Library/Logs/mcquack.zot.out.log - service: zot - stream: stdout - - path: /Users/erichblume/Library/Logs/mcquack.zot.err.log - service: zot - stream: stderr -``` - -**Testing:** -```bash -# Deploy alloy config (handler restarts alloy automatically if config changed) -mise run provision-indri -- --tags alloy - -# Wait a minute, then check Loki for zot logs -# In Grafana Explore, query: {service="zot"} -``` - ---- - -## Step 0.7: Update 
indri-services-check Script - -**Files to modify:** -- `mise-tasks/indri-services-check` - -**Changes to add:** -```bash -# Add after existing service checks (around line 55) -check_service "zot" "ssh indri 'launchctl list | grep zot | grep -v \"^-\"'" -check_service "zot-metrics" "ssh indri 'launchctl list | grep zot-metrics | grep -v \"^-\"'" - -# Add to HTTP endpoints section (around line 65) -check_http "Zot Registry" "http://indri:5000/v2/_catalog" - -# Add metrics file check -check_service "Zot metrics" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/zot.prom'" -``` - -**Testing:** -```bash -# Run the health check -mise run indri-services-check - -# Expected output includes: -# zot... OK -# zot-metrics... OK -# Zot Registry... OK -# Zot metrics... OK -``` - -**Implementation Details:** -- Used Tailscale service URL (`https://registry.tail8d86e.ts.net/v2/_catalog`) instead of internal endpoint to verify full path works. - ---- - -## Step 0.8: Install and Configure Podman on Indri - -**New files:** -``` -ansible/roles/podman/ -├── tasks/main.yml -└── handlers/main.yml -``` - -**Tasks (tasks/main.yml):** -```yaml -- name: Install podman via homebrew - community.general.homebrew: - name: podman - state: present - -- name: Initialize podman machine (if not exists) - ansible.builtin.command: - cmd: podman machine init --cpus 4 --memory 8192 --disk-size 220 - register: podman_init - changed_when: podman_init.rc == 0 - failed_when: podman_init.rc not in [0, 125] # 125 = already exists - -- name: Start podman machine - ansible.builtin.command: - cmd: podman machine start - register: podman_start - changed_when: "'started successfully' in podman_start.stdout" - failed_when: podman_start.rc not in [0, 125] # 125 = already running -``` - -**Testing:** -```bash -# Deploy podman role -mise run provision-indri -- --tags podman - -# Verify podman is working -ssh indri 'podman info' -ssh indri 'podman run --rm hello-world' -``` - -**Implementation Details:** -- 
**KNOWN ISSUE**: `podman machine init` and `podman machine start` have reliability issues when run via Ansible/SSH. The machine sometimes gets stuck in "Starting" state due to a race condition (see https://github.com/containers/podman/issues/16945). Apple Hypervisor may also require GUI session context. -- **WORKAROUND**: If the machine fails to start via Ansible, manually run on indri: - ```bash - podman machine rm -f podman-machine-default - podman machine init --cpus 4 --memory 8192 --disk-size 220 - podman machine start - ``` -- LaunchAgent approach was attempted but didn't resolve the issue reliably. -- TODO: Investigate proper automation solution for reliable podman machine management. - ---- - -## Step 0.9: Install and Configure Minikube - -**New files:** -``` -ansible/roles/minikube/ -├── defaults/main.yml -├── tasks/main.yml -└── handlers/main.yml -``` - -**Defaults:** -```yaml -minikube_cpus: 4 -minikube_memory: 8192 -minikube_disk_size: "200g" -minikube_driver: podman -minikube_container_runtime: cri-o -``` - -**Note on storage:** The disk-size is for node-local storage only (container images, emptyDir, local PVs). 
Pods can also mount external storage: -- **hostPath** - indri filesystem (e.g., `~/transmission/` for kiwix ZIM files) -- **NFS** - sifaka volumes (Synology supports NFS natively, easiest for k8s) -- **SMB/CIFS** - requires csi-driver-smb; sifaka currently uses SMB for desktop mounts - -**Tasks:** -```yaml -- name: Install minikube via homebrew - community.general.homebrew: - name: minikube - state: present - -- name: Check if minikube cluster exists - ansible.builtin.command: - cmd: minikube status --format='{{.Host}}' - register: minikube_status - changed_when: false - failed_when: false - -- name: Start minikube cluster - ansible.builtin.command: - cmd: > - minikube start - --driver={{ minikube_driver }} - --container-runtime={{ minikube_container_runtime }} - --cpus={{ minikube_cpus }} - --memory={{ minikube_memory }} - --disk-size={{ minikube_disk_size }} - when: minikube_status.rc != 0 or 'Running' not in minikube_status.stdout -``` - -**Testing:** -```bash -# Deploy minikube role -mise run provision-indri -- --tags minikube - -# Verify cluster is running -ssh indri 'minikube status' -# Expected: host: Running, kubelet: Running, apiserver: Running - -# Test kubectl access from indri -ssh indri 'kubectl get nodes' -# Expected: minikube Ready control-plane ... -``` - -**Implementation Details:** -- Changed `minikube_memory` from 8192 to 7800 because podman machine reports slightly less available memory (7908MB) due to VM overhead. Minikube rejects memory requests exceeding what podman reports. -- Deployed with Kubernetes v1.34.0 and CRI-O 1.24.6. - ---- - -## Step 0.10: Configure Kubeconfig on Gilbert - -**Goal**: Enable `kubectl` and `k9s` on gilbert to connect to the minikube cluster running on indri. 
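One way to reach this goal without touching an existing kubeconfig is to keep the indri cluster in its own file and list that file in `KUBECONFIG`, which kubectl reads as a colon-separated list and merges. The sketch below assumes the separate file lives at `~/.kube/minikube-indri/config.yml` (the path used later in this plan); `add_kubeconfig` is a hypothetical helper name, not something in this repo.

```shell
# Sketch only: add_kubeconfig is a hypothetical helper, not part of this repo.
# Print a KUBECONFIG value that contains the given file exactly once.
#   $1 = kubeconfig file to add
#   $2 = current colon-separated list (defaults to $KUBECONFIG, then ~/.kube/config)
add_kubeconfig() {
  local new="$1"
  local current="${2:-${KUBECONFIG:-$HOME/.kube/config}}"
  case ":$current:" in
    *":$new:"*) printf '%s\n' "$current" ;;        # already listed, leave unchanged
    *)          printf '%s\n' "$current:$new" ;;   # append, existing files stay first
  esac
}
```

It could be used as `export KUBECONFIG="$(add_kubeconfig "$HOME/.kube/minikube-indri/config.yml")"`. Appending rather than prepending keeps the existing file first in the list, so the work contexts and current-context stay as they were.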
- -**Considerations:** -- Minikube runs inside a podman VM on indri, so the API server isn't directly exposed on indri's network interface -- Admin users have full Tailscale access to indri via `autogroup:admin → * → *` -- Be careful not to overwrite existing work kubeconfigs - -**Possible approaches:** -1. SSH tunneling to forward the API server port -2. `minikube tunnel` running on indri (exposes LoadBalancer services) -3. Configure minikube with `--apiserver-names=indri` at cluster creation time -4. Use `kubectl` via SSH wrapper: `ssh indri kubectl ...` - -**Verification:** -```bash -# From gilbert, these should work: -kubectl get nodes -kubectl get namespaces -k9s # Should show the minikube cluster -``` - -The exact approach will be determined during implementation based on what works best with the podman driver. - -**Implementation Details:** - -Chose **Option 3: Recreate cluster with `--apiserver-names`** after researching alternatives: - -1. **SSH tunneling** - Requires keeping a tunnel running or complex on-demand setup -2. **SOCKS5 proxy with kubeconfig `proxy-url`** - Kubeconfig supports `proxy-url: socks5://localhost:1080` per-context, but still requires managing the proxy -3. 
**`--apiserver-names` + `--listen-address`** - Native minikube support, cleanest solution - -**Cluster Setup:** Recreated the minikube cluster with additional flags: -```bash -minikube delete -minikube start \ - --driver=podman \ - --container-runtime=cri-o \ - --cpus=4 --memory=7800 --disk-size=200g \ - --apiserver-names=indri \ - --listen-address=0.0.0.0 -``` - -- `--apiserver-names=indri` adds "indri" to the API server certificate SAN -- `--listen-address=0.0.0.0` tells podman to expose the API port on all interfaces -- API server port is dynamic (check with `kubectl config view --minify -o jsonpath="{.clusters[0].cluster.server}"` on indri) - -**Credential Management with 1Password:** - -Rather than copying private keys between machines, credentials are stored in 1Password and fetched on-demand using kubectl's exec credential plugin. This mirrors the 1Password SSH agent pattern for biometric-protected key access. - -1. **Store credentials in 1Password** (vault: `vg6xf6vvfmoh5hqjjhlhbeoaie`, item: `3jo4f2hnzvwfmamudfsbbbec7e`): - - `client-cert` - Contents of `~/.minikube/profiles/minikube/client.crt` (text field) - - `client-key` - Contents of `~/.minikube/profiles/minikube/client.key` (text field) - - `ca-cert` - Contents of `~/.minikube/ca.crt` (text field, not secret but stored for convenience) - -2. **Created credential helper script** at `bin/kubectl-credential-1password`: - ```bash - #!/bin/bash - # Fetches client cert/key from 1Password, outputs ExecCredential JSON - # Usage: kubectl-credential-1password - ``` - Symlinked to `~/.local/bin/kubectl-credential-1password` - -3. 
**Kubeconfig setup on gilbert:** - ```bash - # Store CA cert locally (not secret - public key for server verification) - mkdir -p ~/.kube/minikube-indri - op --vault item get --fields ca-cert | sed 's/^"//; s/"$//' > ~/.kube/minikube-indri/ca.crt - - # Configure cluster - kubectl config set-cluster minikube-indri \ - --server=https://indri: \ - --certificate-authority=/Users/eblume/.kube/minikube-indri/ca.crt - - # Configure credentials with exec plugin - kubectl config set-credentials minikube-indri \ - --exec-api-version=client.authentication.k8s.io/v1beta1 \ - --exec-command=kubectl-credential-1password \ - --exec-arg= \ - --exec-arg= \ - --exec-arg=client-cert \ - --exec-arg=client-key - - # Create context - kubectl config set-context minikube-indri \ - --cluster=minikube-indri \ - --user=minikube-indri - ``` - -4. **Usage:** - ```bash - kubectl --context=minikube-indri get nodes - # or - kubectl config use-context minikube-indri - kubectl get nodes - ``` - -**Security Notes:** -- Client private key never stored on disk - fetched from 1Password on each kubectl command -- CA cert stored on disk (not secret - it's a public key for server verification) -- 1Password biometric/password prompt required for credential access -- `op` command strips quotes from text fields with `sed 's/^"//; s/"$//'` - -**References:** -- [minikube start options](https://minikube.sigs.k8s.io/docs/commands/start/) -- [Using kubectl via SSH Tunnel](https://blog.scottlowe.org/2020/06/16/using-kubectl-via-an-ssh-tunnel/) -- [SOCKS5 Proxy Access to K8s API](https://kubernetes.ltd/docs/tasks/extend-kubernetes/socks5-proxy-access-api/) -- [kubectl-tokensshtunnel](https://github.com/jordiprats/kubectl-tokensshtunnel) -- [Securing kubectl config with 1Password](https://blog.mikael.green/post/1password-kubeconfig/) - ---- - -## Step 0.11: Add Minikube to indri-services-check - -**Files to modify:** -- `mise-tasks/indri-services-check` - -**Changes:** -```bash -# Add new section for Kubernetes 
-echo "" -echo "Kubernetes cluster:" -check_service "minikube" "ssh indri 'minikube status --format={{.Host}} | grep -q Running'" -check_service "k8s-apiserver" "ssh indri 'kubectl get --raw /healthz'" -``` - -**Testing:** -```bash -mise run indri-services-check - -# Expected output includes: -# Kubernetes cluster: -# minikube... OK -# k8s-apiserver... OK -``` - -**Implementation Notes:** -- Added a third check `k8s-apiserver (remote)` that verifies kubectl access from gilbert, not just via SSH to indri. This ensures the 1Password credential flow and remote API server access are working. -- The remote check uses both `--kubeconfig` and `--context` flags explicitly since the script runs in bash (not fish) and doesn't inherit the KUBECONFIG environment variable from fish config. - ---- - -## Step 0.12: Create Zettelkasten Documentation - -**New files:** -- `~/code/personal/zk/zot.md` -- `~/code/personal/zk/minikube.md` - -**Files to update:** -- `~/code/personal/zk/1767747119-YCPO.md` (main blumeops card) - -**Updates to main blumeops card:** - -1. Add to **Device Tags** table: - | `tag:registry` | indri | Container registry access | - -2. Add to **Services** table: - | **Registry** | https://registry.tail8d86e.ts.net | OCI container registry (Zot) | [[zot]] | - | **Kubernetes** | https://indri: | Minikube cluster | [[minikube]] | - -3. Add to **Port Map (Indri)** table: - | 5050 | Zot | HTTP | localhost | Container registry | - | | K8s API | HTTPS | 0.0.0.0 | Minikube API server | - -4. Add new section **Remote Kubernetes Access**: - ```markdown - ## Remote Kubernetes Access (from Gilbert) - - The minikube cluster on indri is accessible from gilbert via direct connection. - Cluster was created with `--apiserver-names=indri --listen-address=0.0.0.0`. 
- - **Fish abbreviations** (in `~/.config/fish/config.fish`): - - `ki` → `kubectl --context=minikube-indri` - - `k9i` → `k9s --context=minikube-indri` - - `k9` → `k9s` - - ```bash - # Quick access via abbreviations - ki get nodes - k9i - - # Or explicitly set context - kubectl config use-context minikube-indri - kubectl get nodes - ``` - ``` - -**Template for zot.md:** -```markdown ---- -id: zot -aliases: - - zot - - container-registry -tags: - - blumeops ---- - -# Zot Registry Management Log - -Zot is an OCI-native container registry running on Indri, providing: -1. Pull-through cache for Docker Hub, GHCR, Quay (avoids rate limits) -2. Private image storage for custom-built containers - -## Service Details - -- URL: https://registry.tail8d86e.ts.net -- Local port: 5050 -- Data directory: ~/zot -- Config: ~/.config/zot/config.json -- Managed via: mcquack LaunchAgent - -## Namespace Convention - -| Path | Source | -|------|--------| -| `registry.../docker.io/*` | Cached from Docker Hub | -| `registry.../ghcr.io/*` | Cached from GHCR | -| `registry.../quay.io/*` | Cached from Quay | -| `registry.../blumeops/*` | Private images (yours) | - -## Useful Commands - -\`\`\`bash -# List all images -curl -s http://localhost:5050/v2/_catalog | jq - -# Pull via cache (from indri or k8s) -podman pull localhost:5050/docker.io/library/nginx:latest - -# Build and push private image (from gilbert) -podman build -t registry.tail8d86e.ts.net/blumeops/myapp:v1 . -podman push registry.tail8d86e.ts.net/blumeops/myapp:v1 - -# Check service status -launchctl list | grep zot - -# View logs -tail -f ~/Library/Logs/mcquack.zot.err.log -\`\`\` - -## Log - -### [DATE] -- Initial setup for k8s migration Phase 0 -``` - -**Template for minikube.md:** -```markdown ---- -id: minikube -aliases: - - minikube - - kubernetes - - k8s -tags: - - blumeops ---- - -# Minikube Management Log - -Minikube provides a single-node Kubernetes cluster on Indri for running containerized services. 
- -## Cluster Details - -- Driver: podman (rootless) -- Container runtime: CRI-O -- Kubernetes version: v1.34.0 -- Resources: 4 CPUs, 7800MB RAM, 200GB disk -- API server: https://indri: (accessible from gilbert via Tailscale) - -## Remote Access from Gilbert - -Cluster was created with `--apiserver-names=indri --listen-address=0.0.0.0` to allow remote kubectl access. - -\`\`\`bash -# Switch context -kubectl config use-context minikube-indri - -# Verify -kubectl get nodes -kubectl get namespaces - -# Use k9s -k9s --context minikube-indri -\`\`\` - -## Useful Commands (on indri) - -\`\`\`bash -# Cluster status -minikube status - -# Start/stop cluster -minikube start -minikube stop - -# Access dashboard -minikube dashboard - -# SSH into node -minikube ssh - -# View logs -minikube logs -\`\`\` - -## Podman Machine (prerequisite) - -Minikube uses podman as its driver (the container runtime inside the node is CRI-O). The podman machine must be running: - -\`\`\`bash -# Check podman machine -podman machine list - -# Start if needed -podman machine start -\`\`\` - -## Log - -### [DATE] -- Initial cluster setup for k8s migration Phase 0 -- Configured for remote access with --apiserver-names=indri -``` - -**Implementation Notes:** -- Created zot.md and minikube.md in ~/code/personal/zk/ -- Updated 1767747119-YCPO.md (main blumeops card) with all specified changes -- Added 1Password credential plugin reference to minikube docs -- K8s API port is 39535 (dynamically assigned by minikube, may change on cluster recreation) - ---- - -## Step 0.13: Update Main Playbook - -**Files to modify:** -- `ansible/playbooks/indri.yml` - -**Changes:** -```yaml -# Add new roles to the roles list -- role: podman - tags: podman -- role: zot - tags: zot -- role: zot_metrics - tags: zot_metrics -- role: minikube - tags: minikube -``` - -**Implementation Notes:** -- Roles were added incrementally during Steps 0.3, 0.5, 0.8, and 0.9 -- All four roles (zot, zot_metrics, podman, minikube) confirmed present in indri.yml - ---- - -## Step 
0.14: Expose K8s API as Tailscale Service (Added Post-Completion) - -> **Note**: This step was added after Phase 0 was otherwise complete, to provide a stable, named endpoint for the Kubernetes API server. - -**Goal**: Expose the minikube API server as `k8s.tail8d86e.ts.net` instead of using `indri:`. - -**Current state:** -- Minikube API server on port 39535 (dynamic, could change on cluster recreation) -- Accessed via `https://indri:39535` -- Certificate SANs include "indri" - -**Target state:** -- Stable Tailscale service at `k8s.tail8d86e.ts.net:443` -- Fixed API server port (6443, the k8s standard) -- Certificate SANs include both hostnames for compatibility - ---- - -### Step 0.14.1: Update Pulumi ACLs - -**Files to modify:** -- `pulumi/policy.hujson` -- `pulumi/__main__.py` - -**Changes to policy.hujson:** - -1. Add tag to `tagOwners`: -```hujson -"tag:k8s-api": ["autogroup:admin", "tag:blumeops"], -``` - -2. Update Erich's test case accept list to include k8s-api: -```hujson -"accept": ["tag:grafana:443", "tag:kiwix:443", "tag:feed:443", "tag:loki:3100", "tag:pg:5432", "tag:homelab:22", "tag:registry:443", "tag:k8s-api:443"], -``` - -3. Update Allison's deny list: -```hujson -"deny": ["tag:grafana:443", "tag:loki:3100", "tag:nas:445", "tag:registry:443", "tag:k8s-api:443"], -``` - -**Changes to __main__.py:** -- Add `"tag:k8s-api"` to indri's DeviceTags - -**Testing:** -```bash -mise run tailnet-preview # Review changes -mise run tailnet-up # Apply changes -``` - ---- - -### Step 0.14.2: Create Tailscale Service in Admin Console (MANUAL) - -> **CRITICAL**: Do this BEFORE running ansible that calls `tailscale serve` - -1. Go to https://login.tailscale.com/admin/services -2. Create service `k8s` with: - - Port: 443 (TCP) - - Host: indri - ---- - -### Step 0.14.3: Recreate Minikube Cluster - -The cluster needs to be recreated to: -1. Add `k8s.tail8d86e.ts.net` to the API server certificate SANs -2. 
Fix the API server port to 6443 (standard k8s port) - -**On indri:** -```bash -# Stop and delete existing cluster -minikube stop -minikube delete - -# Recreate with new settings -minikube start \ - --driver=podman \ - --container-runtime=cri-o \ - --cpus=4 --memory=7800 --disk-size=200g \ - --apiserver-names=k8s.tail8d86e.ts.net,indri \ - --apiserver-port=6443 \ - --listen-address=0.0.0.0 - -# Verify the API server address uses the fixed port (this does not show the certificate SANs) -kubectl config view --minify -o jsonpath="{.clusters[0].cluster.server}" -# Expected: https://127.0.0.1:6443 or similar - -# Verify cluster is running -minikube status -kubectl get nodes -``` - -**Update ansible role defaults** (`ansible/roles/minikube/defaults/main.yml`): -```yaml -minikube_apiserver_names: - - k8s.tail8d86e.ts.net - - indri -minikube_apiserver_port: 6443 -``` - ---- - -### Step 0.14.4: Add K8s Service to Tailscale Serve - -**Files to modify:** -- `ansible/roles/tailscale_serve/defaults/main.yml` - -**Add to services list:** -```yaml -- name: svc:k8s - tcp: - port: 443 - upstream: tcp://localhost:6443 -``` - -**Note:** Using TCP passthrough (not HTTPS termination) because k8s uses mTLS authentication. - -**Deploy:** -```bash -mise run provision-indri -- --tags tailscale-serve -``` - ---- - -### Step 0.14.5: Update 1Password Credentials - -After cluster recreation, the client certificates have changed. 
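Because gilbert reads the client certificate and key from 1Password through the exec credential plugin set up in Step 0.10, refreshing the 1Password fields should be the only client-side credential change. As a reminder of what such a plugin hands to kubectl, here is a minimal sketch of the output side. The real `bin/kubectl-credential-1password` script is not shown in this plan, so the function names `pem_to_json_string` and `exec_credential` are hypothetical; only the `ExecCredential` field names are taken from the Kubernetes client authentication API.

```shell
# Sketch only: the real bin/kubectl-credential-1password is not shown in this plan.

# pem_to_json_string: read PEM text on stdin and print it as the body of a JSON
# string, turning each line ending into the two characters backslash + n.
# PEM contains no quotes or backslashes, so no other escaping is attempted.
pem_to_json_string() {
  awk '{ printf "%s\\n", $0 }'
}

# exec_credential: print an ExecCredential document for a client cert/key pair.
#   $1 = PEM client certificate, $2 = PEM client key
exec_credential() {
  local cert_pem="$1" key_pem="$2"
  printf '{"apiVersion":"client.authentication.k8s.io/v1beta1","kind":"ExecCredential","status":{"clientCertificateData":"%s","clientKeyData":"%s"}}\n' \
    "$(printf '%s\n' "$cert_pem" | pem_to_json_string)" \
    "$(printf '%s\n' "$key_pem" | pem_to_json_string)"
}
```

In a real helper the two arguments would presumably come from `op ... item get ... --fields client-cert` and `--fields client-key`, with the surrounding quotes stripped the same way the CA cert is handled elsewhere in this plan. Nothing here writes the key to disk; it only passes through the process.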
- -**On indri, get the new credentials:** -```bash -# Display new certificates (copy to 1Password) -cat ~/.minikube/profiles/minikube/client.crt -cat ~/.minikube/profiles/minikube/client.key -cat ~/.minikube/ca.crt -``` - -**In 1Password** (vault: `vg6xf6vvfmoh5hqjjhlhbeoaie`, item: `3jo4f2hnzvwfmamudfsbbbec7e`): -- Update `client-cert` field with new certificate -- Update `client-key` field with new key -- Update `ca-cert` field with new CA certificate - ---- - -### Step 0.14.6: Update Kubeconfig on Gilbert - -**Update CA certificate:** -```bash -# Fetch new CA cert from 1Password -op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get 3jo4f2hnzvwfmamudfsbbbec7e --fields ca-cert | sed 's/^"//; s/"$//' > ~/.kube/minikube-indri/ca.crt -``` - -**Update kubeconfig** (`~/.kube/minikube-indri/config.yml`): -```yaml -clusters: -- cluster: - certificate-authority: /Users/eblume/.kube/minikube-indri/ca.crt - server: https://k8s.tail8d86e.ts.net # Changed from https://indri:39535 - name: minikube-indri -``` - -**Verification:** -```bash -# Test connection via new hostname -kubectl --context=minikube-indri get nodes - -# Test via abbreviation -ki get nodes -``` - ---- - -### Step 0.14.7: Update Documentation - -**Files to update:** -- `~/code/personal/zk/minikube.md` - Update API server URL and port info -- `~/code/personal/zk/1767747119-YCPO.md` - Update Services table and Port Map - -**Changes to blumeops card:** - -1. Update Services table: - | **Kubernetes** | https://k8s.tail8d86e.ts.net | Minikube cluster | [[minikube]] | - -2. Update Port Map: - | 6443 | K8s API | HTTPS/TCP | 0.0.0.0 | Minikube API server (via Tailscale) | - -3. 
Add `tag:k8s-api` to Device Tags table - ---- - -### Step 0.14.8: Update indri-services-check - -**Files to modify:** -- `mise-tasks/indri-services-check` - -**Changes:** -```bash -# Update remote k8s check to use new URL -check_service "k8s-apiserver (remote)" "kubectl --kubeconfig=$HOME/.kube/minikube-indri/config.yml --context=minikube-indri get --raw /healthz" -# (No change needed - uses kubeconfig which now points to k8s.tail8d86e.ts.net) -``` - ---- - -### Step 0.14 Verification - -```bash -# 1. Service health check -mise run indri-services-check -# All services should be OK - -# 2. Test k8s access via Tailscale hostname -curl -k https://k8s.tail8d86e.ts.net/healthz -# Expected: ok (or certificate error if mTLS required - that's fine) - -# 3. kubectl via Tailscale -ki get nodes -ki get namespaces - -# 4. k9s via Tailscale -k9i -``` - ---- - -## Phase 0 Verification Checklist - -Run after completing all steps: - -```bash -# 1. Full service health check -mise run indri-services-check -# All services should show OK, including new ones - -# 2. Registry functionality - pull-through cache -ssh indri 'podman pull localhost:5000/docker.io/library/alpine:latest' -curl -s https://registry.tail8d86e.ts.net/v2/_catalog -# Expected: {"repositories":["docker.io/library/alpine"]} - -# 3. Registry functionality - private image push (from gilbert) -podman pull alpine:latest -podman tag alpine:latest registry.tail8d86e.ts.net/blumeops/test:v1 -podman push registry.tail8d86e.ts.net/blumeops/test:v1 -curl -s https://registry.tail8d86e.ts.net/v2/_catalog -# Expected: {"repositories":["blumeops/test","docker.io/library/alpine"]} - -# 4. Kubernetes cluster -ssh indri 'minikube status' -ssh indri 'kubectl get nodes' -kubectl get nodes # from gilbert - -# 5. Metrics in Prometheus -curl -s "http://indri:9090/api/v1/query?query=zot_up" -# Expected: value = 1 - -# 6. Logs in Loki -# In Grafana Explore: {service="zot"} -# Should see zot log entries - -# 7. 
k9s from gilbert -k9s -# Should connect and show minikube cluster -``` - ---- - -## Phase 0 Rollback - -If something goes wrong: - -```bash -# Stop and remove minikube -ssh indri 'minikube stop && minikube delete' - -# Stop and remove zot -ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.zot.plist' -ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.zot.plist' - -# Remove podman machine -ssh indri 'podman machine stop && podman machine rm' - -# Remove from tailscale serve -ssh indri 'tailscale serve --service svc:registry reset' - -# Remove tags from Pulumi (revert policy.hujson changes) -mise run tailnet-up - -# Revert ansible playbook changes -git checkout ansible/playbooks/indri.yml -git checkout ansible/roles/tailscale_serve/defaults/main.yml -git checkout ansible/roles/alloy/templates/config.alloy.j2 - -# Remove new roles -rm -rf ansible/roles/{zot,zot_metrics,podman,minikube} - -# Remove zk cards -rm ~/code/personal/zk/{zot,minikube}.md -``` - ---- - -## Phase 0 Follow-up: Grafana Dashboards - -After Phase 0 is running and stable, create monitoring dashboards: - -**Zot Dashboard** (`ansible/roles/grafana/files/dashboards/zot.json`): -1. Check what metrics zot exposes: `ssh indri 'curl -s http://localhost:5000/metrics'` -2. Review community dashboards for inspiration (copy permitted if license allows) -3. Create dashboard with available metrics (at minimum: `zot_up`) - -**Minikube Dashboard** (`ansible/roles/grafana/files/dashboards/minikube.json`): -1. Deploy kube-state-metrics if needed for additional cluster metrics -2. Review what Prometheus can scrape from the cluster -3. Review community dashboards for inspiration (copy permitted if license allows) -4. Create dashboard with relevant panels (node usage, pod counts, etc.) 
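For the "check what metrics are exposed" steps above, a small filter can reduce a Prometheus text-format dump to a list of metric names, which is usually enough to decide which panels are worth building. This is a sketch under the assumption that the endpoint returns the standard text exposition format; `list_metric_names` is a hypothetical helper, not part of this repo.

```shell
# Sketch only: list_metric_names is a hypothetical helper.
# Read Prometheus text exposition format on stdin and print each metric name once.
list_metric_names() {
  # Drop comment lines (# HELP / # TYPE) and blank lines, keep the leading
  # metric name of each sample, then de-duplicate.
  grep -v -e '^#' -e '^[[:space:]]*$' \
    | sed -E 's/^([a-zA-Z_:][a-zA-Z0-9_:]*).*/\1/' \
    | sort -u
}
```

For example, `ssh indri 'curl -s http://localhost:5050/metrics' | list_metric_names`, using whichever port zot actually listens on (5050 after the change noted in Step 0.3). The same filter works on the node_exporter textfile at `/opt/homebrew/var/node_exporter/textfile/zot.prom`.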
- ---- - -## New Files Summary - -| File | Purpose | -|------|---------| -| `ansible/roles/zot/` | Zot registry deployment | -| `ansible/roles/zot_metrics/` | Metrics collection for Zot | -| `ansible/roles/podman/` | Podman installation and setup | -| `ansible/roles/minikube/` | Minikube cluster setup | -| `~/code/personal/zk/zot.md` | Zot management documentation | -| `~/code/personal/zk/minikube.md` | Minikube management documentation | - -## Modified Files Summary - -| File | Changes | -|------|---------| -| `pulumi/policy.hujson` | Add tag:registry | -| `ansible/playbooks/indri.yml` | Add new roles | -| `ansible/roles/tailscale_serve/defaults/main.yml` | Add svc:registry | -| `ansible/roles/alloy/templates/config.alloy.j2` | Add zot log collection | -| `mise-tasks/indri-services-check` | Add zot and k8s checks | diff --git a/plans/completed/k8s-migration/P1_k8s_infrastructure.complete.md b/plans/completed/k8s-migration/P1_k8s_infrastructure.complete.md deleted file mode 100644 index 9e02286..0000000 --- a/plans/completed/k8s-migration/P1_k8s_infrastructure.complete.md +++ /dev/null @@ -1,657 +0,0 @@ -# Phase 1: Kubernetes Infrastructure - -**Goal**: Tailscale operator, ArgoCD, CloudNativePG operator, PostgreSQL cluster - -**Status**: In Progress - -**Prerequisites**: [Phase 0](P0_foundation.complete.md) complete - ---- - -## Overview - -Phase 1 establishes the k8s control plane infrastructure: -1. **Tailscale operator** - Exposes services on the tailnet -2. **ArgoCD** - GitOps continuous delivery -3. **CloudNativePG** - PostgreSQL operator -4. 
**PostgreSQL cluster** - Database for future app migrations - -The deployment follows a bootstrap pattern: -- First two components deployed via `kubectl apply -k` (no GitOps yet) -- ArgoCD then takes over management of all components including itself -- All subsequent deployments use ArgoCD - ---- - -## Kubernetes Tags Overview - -| Tag | Purpose | Applied To | -|-----|---------|------------| -| `tag:k8s-api` | Controls access to the K8s API server | indri (Phase 0.14) | -| `tag:k8s-operator` | Identifies the Tailscale K8s Operator | OAuth client for operator | -| `tag:k8s` | Default tag for operator-managed resources | Proxies, services, ingresses created by operator | - -**Ownership chain**: `tag:k8s-operator` must own `tag:k8s` so the operator can assign that tag to devices it creates. - ---- - -## PostgreSQL Migration Strategy - -The k8s PostgreSQL cluster will eventually replace the brew PostgreSQL on indri. - -| Phase | `pg.tail8d86e.ts.net` points to | Miniflux connects to | -|-------|--------------------------------|---------------------| -| Current | brew PostgreSQL (indri) | `pg.tail8d86e.ts.net` | -| Phase 1 | brew PostgreSQL (indri) | `pg.tail8d86e.ts.net` (no change) | -| Phase 4 | brew PostgreSQL (indri) | k8s PG (internal, after miniflux migrates to k8s) | -| Post-Phase 4 | k8s PostgreSQL | k8s PG (internal) | -| Cleanup | k8s PostgreSQL | k8s PG (internal) | - -This allows zero-downtime migration - the Tailscale service switches after apps are migrated. - ---- - -## Steps - -### 1. Update Pulumi ACLs for k8s workloads ✓ - -**Status**: Complete - -Added to `pulumi/policy.hujson`: -- `tag:k8s-operator` - for the operator OAuth client -- `tag:k8s` - for operator-managed resources (owned by `tag:k8s-operator`) -- Grant for `tag:k8s` → `tag:registry` access - ---- - -### 2. 
Create Tailscale OAuth client ✓ - -**Status**: Complete - -OAuth client stored in 1Password (vault: `vg6xf6vvfmoh5hqjjhlhbeoaie`, item: `2it22lavwgbxdskoaxanej354q`) - -**Configuration used:** -- Tags: `tag:k8s-operator` -- Devices write scope tag: `tag:k8s` -- Scopes: Devices Core (R/W), Auth Keys (R/W), Services (Write) - ---- - -### 3. Deploy Tailscale Kubernetes Operator (Bootstrap) - -Deploy via `kubectl apply -k` - will be migrated to ArgoCD management in Step 5. - -**Setup manifests directory:** -```bash -# Create the manifests directory, including crds/ for the CRD downloads below -mkdir -p argocd/manifests/tailscale-operator/crds -cd argocd/manifests/tailscale-operator - -# Download static manifest from Tailscale repo -curl -sL https://raw.githubusercontent.com/tailscale/tailscale/main/cmd/k8s-operator/deploy/manifests/operator.yaml -o operator.yaml - -# Download CRDs -curl -sL https://raw.githubusercontent.com/tailscale/tailscale/main/cmd/k8s-operator/deploy/crds/tailscale.com_connectors.yaml -o crds/connectors.yaml -curl -sL https://raw.githubusercontent.com/tailscale/tailscale/main/cmd/k8s-operator/deploy/crds/tailscale.com_proxyclasses.yaml -o crds/proxyclasses.yaml -# ... 
(other CRDs as needed) -``` - -**Create kustomization.yaml:** -```yaml -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -namespace: tailscale-system -resources: - - operator.yaml -secretGenerator: - - name: operator-oauth - namespace: tailscale-system - literals: - - client_id=PLACEHOLDER - - client_secret=PLACEHOLDER -generatorOptions: - disableNameSuffixHash: true -``` - -**Deploy:** -```bash -# Get credentials from 1Password and create secret manually (kustomize secretGenerator is for reference) -CLIENT_ID=$(op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get 2it22lavwgbxdskoaxanej354q --fields client-id --reveal) -CLIENT_SECRET=$(op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get 2it22lavwgbxdskoaxanej354q --fields client-secret --reveal) - -kubectl create namespace tailscale-system -kubectl create secret generic operator-oauth \ - --namespace tailscale-system \ - --from-literal=client_id=$CLIENT_ID \ - --from-literal=client_secret=$CLIENT_SECRET - -# Apply operator manifests -kubectl apply -k argocd/manifests/tailscale-operator/ -``` - -**Verification:** -```bash -kubectl get pods -n tailscale-system -# Expected: operator pod Running - -kubectl logs -n tailscale-system -l app.kubernetes.io/name=tailscale-operator -``` - ---- - -### 4. Deploy ArgoCD - -Deploy ArgoCD and expose via Tailscale as `argocd.tail8d86e.ts.net`. 
- -**Prerequisites:** -- Add `tag:argocd` to Pulumi ACLs -- Create Tailscale service `argocd` in admin console - -**Setup manifests:** -```bash -mkdir -p argocd/manifests/argocd - -# Download ArgoCD install manifest -curl -sL https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml -o argocd/manifests/argocd/install.yaml -``` - -**Create kustomization.yaml:** -```yaml -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -namespace: argocd -resources: - - install.yaml - - service-tailscale.yaml # LoadBalancer for Tailscale exposure -``` - -**Create service-tailscale.yaml:** -```yaml -apiVersion: v1 -kind: Service -metadata: - name: argocd-server-tailscale - namespace: argocd - annotations: - tailscale.com/hostname: "argocd" -spec: - type: LoadBalancer - loadBalancerClass: tailscale - selector: - app.kubernetes.io/name: argocd-server - ports: - - name: https - port: 443 - targetPort: 8080 -``` - -**Deploy:** -```bash -kubectl create namespace argocd -kubectl apply -k argocd/manifests/argocd/ -``` - -**Get initial admin password:** -```bash -kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d -``` - -**Verification:** -- https://argocd.tail8d86e.ts.net loads -- Can login with admin / - -**Post-setup:** -1. Change admin password, store in 1Password -2. Configure git repo connection to `github.com/eblume/blumeops` (public, no auth needed) - - Note: Using GitHub mirror since ArgoCD can't easily reach forge without additional networking - ---- - -### 5. Migrate Tailscale Operator to ArgoCD - -Create ArgoCD Application to manage the Tailscale operator. 
- -**Create argocd/apps/tailscale-operator.yaml:** -```yaml -apiVersion: argoproj.io/v1alpha1 -kind: Application -metadata: - name: tailscale-operator - namespace: argocd -spec: - project: default - source: - repoURL: https://github.com/eblume/blumeops.git - targetRevision: main - path: argocd/manifests/tailscale-operator - destination: - server: https://kubernetes.default.svc - namespace: tailscale-system - syncPolicy: - automated: - prune: true - selfHeal: true -``` - -**Apply:** -```bash -kubectl apply -f argocd/apps/tailscale-operator.yaml -``` - -**Note on secrets:** The OAuth secret was created manually in Step 3. For GitOps, consider: -- Sealed Secrets -- External Secrets Operator -- SOPS - -For now, the secret remains manually managed outside of ArgoCD. - ---- - -### 6. Deploy CloudNativePG via ArgoCD - -**Setup manifests:** -```bash -mkdir -p argocd/manifests/cloudnative-pg - -# Download CNPG operator manifest -curl -sL https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.24/releases/cnpg-1.24.0.yaml -o argocd/manifests/cloudnative-pg/operator.yaml -``` - -**Create kustomization.yaml:** -```yaml -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -resources: - - operator.yaml -``` - -**Create ArgoCD Application (argocd/apps/cloudnative-pg.yaml):** -```yaml -apiVersion: argoproj.io/v1alpha1 -kind: Application -metadata: - name: cloudnative-pg - namespace: argocd -spec: - project: default - source: - repoURL: https://github.com/eblume/blumeops.git - targetRevision: main - path: argocd/manifests/cloudnative-pg - destination: - server: https://kubernetes.default.svc - namespace: cnpg-system - syncPolicy: - automated: - prune: true - selfHeal: true - syncOptions: - - CreateNamespace=true -``` - -**Apply:** -```bash -kubectl apply -f argocd/apps/cloudnative-pg.yaml -``` - -**Verification:** -```bash -kubectl get pods -n cnpg-system -# Expected: cnpg-controller-manager Running -``` - ---- - -### 7. 
Create PostgreSQL Cluster via ArgoCD - -Create the database cluster. **Not exposed via Tailscale yet** - internal only until apps migrate. - -**Create argocd/manifests/databases/blumeops-pg.yaml:** -```yaml -apiVersion: postgresql.cnpg.io/v1 -kind: Cluster -metadata: - name: blumeops-pg - namespace: databases -spec: - instances: 1 - storage: - size: 10Gi - storageClass: standard - monitoring: - enablePodMonitor: true - bootstrap: - initdb: - database: miniflux - owner: miniflux -``` - -**Create kustomization.yaml:** -```yaml -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -namespace: databases -resources: - - blumeops-pg.yaml -``` - -**Create ArgoCD Application (argocd/apps/blumeops-pg.yaml):** -```yaml -apiVersion: argoproj.io/v1alpha1 -kind: Application -metadata: - name: blumeops-pg - namespace: argocd -spec: - project: default - source: - repoURL: https://github.com/eblume/blumeops.git - targetRevision: main - path: argocd/manifests/databases - destination: - server: https://kubernetes.default.svc - namespace: databases - syncPolicy: - automated: - prune: true - selfHeal: true - syncOptions: - - CreateNamespace=true -``` - -**Apply:** -```bash -kubectl apply -f argocd/apps/blumeops-pg.yaml -``` - -**Verification:** -```bash -kubectl get cluster -n databases -# Expected: blumeops-pg with STATUS "Cluster in healthy state" - -kubectl get pods -n databases -# Expected: blumeops-pg-1 Running - -# Get connection secret -kubectl -n databases get secret blumeops-pg-app -o jsonpath='{.data.uri}' | base64 -d -``` - ---- - -### 8. Create App-of-Apps Root Application - -Once all components are deployed, create a root application to manage all apps. 
- -**Create argocd/apps/root.yaml:** -```yaml -apiVersion: argoproj.io/v1alpha1 -kind: Application -metadata: - name: root - namespace: argocd -spec: - project: default - source: - repoURL: https://github.com/eblume/blumeops.git - targetRevision: main - path: argocd/apps - destination: - server: https://kubernetes.default.svc - namespace: argocd - syncPolicy: - automated: - prune: true - selfHeal: true -``` - -**Apply:** -```bash -kubectl apply -f argocd/apps/root.yaml -``` - -Now ArgoCD manages itself and all other applications via the app-of-apps pattern. - ---- - -## New Files Summary - -``` -argocd/ - apps/ - root.yaml # App-of-apps root - tailscale-operator.yaml # Tailscale operator app - cloudnative-pg.yaml # CNPG operator app - blumeops-pg.yaml # PostgreSQL cluster app - manifests/ - tailscale-operator/ - kustomization.yaml - operator.yaml - argocd/ - kustomization.yaml - install.yaml - service-tailscale.yaml - cloudnative-pg/ - kustomization.yaml - operator.yaml - databases/ - kustomization.yaml - blumeops-pg.yaml -``` - ---- - -## Pulumi ACL Updates Required - -Add to `pulumi/policy.hujson`: -```hujson -"tag:argocd": ["autogroup:admin", "tag:blumeops"], -``` - -Add to Erich's test accept list: -```hujson -"accept": [..., "tag:argocd:443"], -``` - -Add to Allison's deny list: -```hujson -"deny": [..., "tag:argocd:443"], -``` - ---- - -## Verification Checklist - -```bash -# 1. Tailscale operator running -kubectl get pods -n tailscale-system - -# 2. ArgoCD accessible -curl -k https://argocd.tail8d86e.ts.net/healthz - -# 3. CloudNativePG operator running -kubectl get pods -n cnpg-system - -# 4. PostgreSQL cluster healthy -kubectl get cluster -n databases - -# 5. 
All ArgoCD apps synced -kubectl get applications -n argocd -# All should show STATUS: Synced, HEALTH: Healthy -``` - ---- - -## Rollback - -```bash -# Remove ArgoCD apps (will cascade delete managed resources) -kubectl delete application -n argocd root -kubectl delete application -n argocd blumeops-pg -kubectl delete application -n argocd cloudnative-pg -kubectl delete application -n argocd tailscale-operator - -# Remove ArgoCD -kubectl delete -k argocd/manifests/argocd/ -kubectl delete namespace argocd - -# Remove namespaces -kubectl delete namespace databases -kubectl delete namespace cnpg-system -kubectl delete namespace tailscale-system - -# Revert ACL changes -git checkout pulumi/policy.hujson -mise run tailnet-up -``` - ---- - -## Implementation Notes (Deviations from Plan) - -*Added during implementation for retrospective review* - -### Git Source: Forge Instead of GitHub - -**Plan**: Use GitHub mirror (`github.com/eblume/blumeops`) -**Actual**: Use internal Forgejo (`ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git`) - -**Why**: User preference to use internal infrastructure, accepting circular dependency for later. - -**Required changes**: -- Deploy key added to forge for ArgoCD SSH access -- Repository secret `repo-forge` with SSH private key from 1Password -- Discovered: `op read` requires `?ssh-format=openssh` query parameter for ArgoCD-compatible key format -- Egress proxy service to reach forge from cluster (targets `indri.tail8d86e.ts.net` not `forge.tail8d86e.ts.net` due to Tailscale Serve limitation) -- DNSConfig CRD for cluster-to-tailnet MagicDNS resolution -- ACL grant: `tag:k8s` → `tag:homelab` on ports 3001 (HTTP) and 2200 (SSH) - -### ArgoCD Exposure: Ingress Instead of LoadBalancer - -**Plan**: LoadBalancer service with `tailscale.com/hostname` annotation -**Actual**: Tailscale Ingress with Let's Encrypt TLS termination - -**Why**: Ingress provides automatic TLS certificates and is the recommended approach. 
- -**File**: `argocd/manifests/argocd/service-tailscale.yaml` uses `kind: Ingress` with `ingressClassName: tailscale` - -### Namespace: `tailscale` Instead of `tailscale-system` - -**Plan**: `tailscale-system` namespace -**Actual**: `tailscale` namespace - -**Why**: Matches upstream Tailscale operator defaults. - -### Sync Policy: Manual Instead of Automated - -**Plan**: `syncPolicy.automated` with prune and selfHeal -**Actual**: Manual sync policy for workload apps; auto-sync only for app-of-apps - -**Why**: User preference for explicit control over deployments during initial migration phase. - -**Pattern**: -- `apps.yaml` (app-of-apps): auto-sync to pick up new Application manifests -- All workload apps: manual sync requires `argocd app sync ` - -### CloudNativePG: Helm Chart Instead of Raw Manifest - -**Plan**: Download raw CNPG manifest -**Actual**: Multi-source Application using official Helm chart from `https://cloudnative-pg.github.io/charts` - -**Why**: Helm chart is the officially supported distribution method. - -**Additional fix**: Required `ServerSideApply=true` sync option due to large CRD exceeding annotation size limit. - -### App-of-Apps: Named `apps` Instead of `root` - -**Plan**: `argocd/apps/root.yaml` -**Actual**: `argocd/apps/apps.yaml` with Application named `apps` - -**Why**: Clearer naming; `apps` manages apps, `argocd` manages itself. - -### ArgoCD Self-Management Added - -**Plan**: Not explicitly planned -**Actual**: `argocd/apps/argocd.yaml` Application for ArgoCD self-management - -**Why**: Standard GitOps pattern - ArgoCD manages its own deployment after bootstrap. - -### CRI-O Registry Mirror for Zot - -**Plan**: Not in original plan -**Actual**: Configured CRI-O to use zot as pull-through cache for docker.io, ghcr.io, quay.io - -**Why**: Reduces external bandwidth, speeds up pulls, avoids rate limits. 
- -**Implementation**: Ansible `minikube` role applies `/etc/containers/registries.conf.d/zot-mirror.conf` inside minikube VM using stable hostname `host.containers.internal:5050`. - -### ProxyClass for CRI-O Image Compatibility - -**Plan**: Not mentioned -**Actual**: Required `ProxyClass` with fully-qualified image paths (`docker.io/tailscale/...`) - -**Why**: CRI-O requires fully-qualified image references; default Tailscale operator uses short names. - -### Actual File Structure - -``` -argocd/ - apps/ - apps.yaml # App-of-apps (auto-sync) - argocd.yaml # ArgoCD self-management (manual sync) - tailscale-operator.yaml # Tailscale operator (manual sync) - cloudnative-pg.yaml # CNPG operator via Helm (manual sync) - manifests/ - tailscale-operator/ - kustomization.yaml - operator.yaml - proxyclass.yaml # CRI-O compatibility - dnsconfig.yaml # Cluster-to-tailnet DNS - egress-forge.yaml # Egress proxy for forge - secret.yaml.tpl # OAuth secret template (manual) - README.md - argocd/ - kustomization.yaml # Uses remote base from upstream - service-tailscale.yaml # Ingress (not LoadBalancer) - argocd-cmd-params-cm.yaml # Disable HTTPS redirect - repo-forge-secret.yaml.tpl # SSH key template (manual) - README.md - cloudnative-pg/ - values.yaml # Helm values (currently minimal) - README.md -``` - -### Bootstrap Commands (Actual) - -```bash -# 1. Create namespaces -kubectl create namespace tailscale -kubectl create namespace argocd - -# 2. 
Apply secrets (manual, uses 1Password) -op inject -i argocd/manifests/tailscale-operator/secret.yaml.tpl | kubectl apply -f - - -PRIV_KEY=$(op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/csjncynh6htjvnh2l2da65y32q/private key?ssh-format=openssh")$'\n' && \ -kubectl create secret generic repo-forge -n argocd \ - --from-literal=type=git \ - --from-literal=url='ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git' \ - --from-literal=insecure=true \ - --from-literal=sshPrivateKey="$PRIV_KEY" && \ -kubectl label secret repo-forge -n argocd argocd.argoproj.io/secret-type=repository - -# 3. Bootstrap tailscale-operator -kubectl apply -k argocd/manifests/tailscale-operator/ - -# 4. Bootstrap ArgoCD -kubectl apply -k argocd/manifests/argocd/ - -# 5. Login and change password -argocd login argocd.tail8d86e.ts.net --username admin --grpc-web -argocd account update-password - -# 6. Apply ArgoCD Applications -kubectl apply -f argocd/apps/argocd.yaml -kubectl apply -f argocd/apps/apps.yaml - -# 7. Sync workloads -argocd app sync tailscale-operator -argocd app sync cloudnative-pg -``` diff --git a/plans/completed/k8s-migration/P2_grafana.complete.md b/plans/completed/k8s-migration/P2_grafana.complete.md deleted file mode 100644 index 7bb37f3..0000000 --- a/plans/completed/k8s-migration/P2_grafana.complete.md +++ /dev/null @@ -1,396 +0,0 @@ -# Phase 2: Grafana Migration (Pilot) - -**Goal**: Migrate Grafana as lowest-risk pilot service - -**Status**: Complete (2026-01-19) - -**Prerequisites**: [Phase 1](P1_k8s_infrastructure.complete.md) complete - ---- - -## Overview - -This phase migrates Grafana from Homebrew/Ansible on indri to Kubernetes, establishing the pattern for future service migrations. Additionally, we establish the pattern of mirroring Helm chart repositories to forge for resilience and GitOps consistency. - ---- - -## Key Decisions - -### Helm Chart Mirroring - -**Problem**: P1 uses external Helm repos which creates external dependencies. 
- -**Solution**: Mirror Helm chart Git repositories to forge, reference charts from git path. - -ArgoCD auto-detects Helm charts when a directory contains `Chart.yaml`. No build step needed. - -| Chart | Upstream Git Repo | Forge Mirror | Chart Path | -|-------|-------------------|--------------|------------| -| cloudnative-pg | `github.com/cloudnative-pg/charts` | `forge/eblume/cloudnative-pg-charts` | `charts/cloudnative-pg/` | -| grafana | `github.com/grafana/helm-charts` | `forge/eblume/grafana-helm-charts` | `charts/grafana/` | - -### Database Storage - -Use SQLite with 1Gi PVC (not k8s PostgreSQL). Grafana stores minimal persistent data and dashboards are git-provisioned. - -### Datasource URLs - -From k8s pods, use `host.containers.internal` to reach indri services: -- Prometheus: `http://host.containers.internal:9090` -- Loki: `http://host.containers.internal:3100` (requires ansible change to bind 0.0.0.0) - -### Ingress - -Tailscale Ingress with Let's Encrypt TLS (following ArgoCD pattern), with `crio-compat` proxy class. - -### Secrets Management - -Admin password stored in 1Password, injected manually via `op inject`. Future: migrate to External Secrets Operator or similar. - ---- - -## Prerequisites - -### 0.1 Mirror Helm Chart Repos to Forge - -**User action**: Create mirrors in forge: - -1. **CloudNativePG charts** (fix existing P1 app): - - Mirror: `https://github.com/cloudnative-pg/charts` - - To: `forge.tail8d86e.ts.net/eblume/cloudnative-pg-charts` - -2. **Grafana helm-charts** (new): - - Mirror: `https://github.com/grafana/helm-charts` - - To: `forge.tail8d86e.ts.net/eblume/grafana-helm-charts` - -### 0.2 Update Loki to Bind 0.0.0.0 - -**File**: `ansible/roles/loki/templates/loki-config.yaml.j2` - -Add under `server:`: -```yaml -http_listen_address: 0.0.0.0 -``` - -Deploy: `mise run provision-indri -- --tags loki` - ---- - -## Steps - -### 1. 
Fix CloudNativePG to Use Forge Mirror - -Update `argocd/apps/cloudnative-pg.yaml` to use forge-mirrored chart: - -```yaml -sources: - - repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/cloudnative-pg-charts.git - targetRevision: cloudnative-pg-0.23.0 # git tag - path: charts/cloudnative-pg - helm: - releaseName: cloudnative-pg - valueFiles: - - $values/argocd/manifests/cloudnative-pg/values.yaml - - repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git - targetRevision: main - ref: values -``` - ---- - -### 2. Create Grafana Helm Values - -**File**: `argocd/manifests/grafana/values.yaml` - -```yaml -admin: - existingSecret: grafana-admin - userKey: admin-user - passwordKey: admin-password - -persistence: - enabled: true - type: pvc - size: 1Gi - -grafana.ini: - server: - root_url: https://grafana.tail8d86e.ts.net - analytics: - check_for_updates: false - reporting_enabled: false - -datasources: - datasources.yaml: - apiVersion: 1 - datasources: - - name: Prometheus - type: prometheus - access: proxy - uid: prometheus - url: http://host.containers.internal:9090 - isDefault: true - editable: false - - name: Loki - type: loki - access: proxy - uid: loki - url: http://host.containers.internal:3100 - editable: false - -sidecar: - dashboards: - enabled: true - label: grafana_dashboard - labelValue: "1" - -service: - type: ClusterIP - port: 80 - -resources: - requests: - memory: "128Mi" - cpu: "100m" - limits: - memory: "512Mi" - cpu: "500m" -``` - ---- - -### 3. 
Create Grafana ArgoCD Application - -**File**: `argocd/apps/grafana.yaml` - -```yaml -apiVersion: argoproj.io/v1alpha1 -kind: Application -metadata: - name: grafana - namespace: argocd -spec: - project: default - sources: - - repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/grafana-helm-charts.git - targetRevision: grafana-8.8.2 - path: charts/grafana - helm: - releaseName: grafana - valueFiles: - - $values/argocd/manifests/grafana/values.yaml - - repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git - targetRevision: main - ref: values - destination: - server: https://kubernetes.default.svc - namespace: monitoring - syncPolicy: - syncOptions: - - CreateNamespace=true -``` - ---- - -### 4. Create Grafana Config Application - -**File**: `argocd/apps/grafana-config.yaml` - -Deploys Tailscale Ingress and Dashboard ConfigMaps from `argocd/manifests/grafana-config/`. - ---- - -### 5. Create Grafana Config Manifests - -**Directory**: `argocd/manifests/grafana-config/` - -Contents: -- `kustomization.yaml` -- `ingress-tailscale.yaml` - Tailscale Ingress for `grafana.tail8d86e.ts.net` -- `secret-admin.yaml.tpl` - Admin password template (1Password-backed) -- `README.md` - Notes on secrets management -- `dashboards/configmap-*.yaml` - 9 dashboard ConfigMaps - -**Ingress**: -```yaml -apiVersion: networking.k8s.io/v1 -kind: Ingress -metadata: - name: grafana-tailscale - namespace: monitoring - annotations: - tailscale.com/proxy-class: "crio-compat" -spec: - ingressClassName: tailscale - defaultBackend: - service: - name: grafana - port: - number: 80 - tls: - - hosts: - - grafana -``` - -**Secret template** (`secret-admin.yaml.tpl`): -```yaml -# Apply: op inject -i secret-admin.yaml.tpl | kubectl apply -f - -apiVersion: v1 -kind: Secret -metadata: - name: grafana-admin - namespace: monitoring -type: Opaque -stringData: - admin-user: admin - admin-password: {{ op://vg6xf6vvfmoh5hqjjhlhbeoaie/oxkcr3xtxnewy7noep2izvyr6y/password }} -``` - -**Dashboard 
ConfigMaps**: Convert each JSON from `ansible/roles/grafana/files/dashboards/` to ConfigMap with label `grafana_dashboard: "1"`. - ---- - -### 6. Deploy to Kubernetes - -```bash -# Create namespace and secret -ki create namespace monitoring -op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl | ki apply -f - - -# Push changes and sync -argocd app sync grafana -argocd app sync grafana-config -``` - ---- - -### 7. Tailscale Service Cutover - -Remove `svc:grafana` from `ansible/roles/tailscale_serve/defaults/main.yml`, then: - -```bash -mise run provision-indri -- --tags tailscale-serve -``` - ---- - -### 8. Stop Brew Grafana - -```bash -ssh indri 'brew services stop grafana' -``` - ---- - -### 9. Retire Ansible Grafana Role - -Once k8s Grafana is verified working: - -1. **Remove role from playbook** - Delete grafana role entry from `ansible/playbooks/indri.yml` - -2. **Delete the role directory** - `rm -rf ansible/roles/grafana/` - -3. **Update zk documentation** - Note in `~/code/personal/zk/1767747119-YCPO.md` that Grafana is now k8s-hosted - ---- - -## New Files - -| Path | Purpose | -|------|---------| -| `argocd/apps/grafana.yaml` | Grafana Helm chart Application | -| `argocd/apps/grafana-config.yaml` | Grafana config Application | -| `argocd/manifests/grafana/values.yaml` | Helm values | -| `argocd/manifests/grafana-config/kustomization.yaml` | Kustomize config | -| `argocd/manifests/grafana-config/ingress-tailscale.yaml` | Tailscale Ingress | -| `argocd/manifests/grafana-config/secret-admin.yaml.tpl` | Admin password template | -| `argocd/manifests/grafana-config/README.md` | Secrets management notes | -| `argocd/manifests/grafana-config/dashboards/configmap-*.yaml` | 9 dashboard ConfigMaps | - -## Modified Files - -| Path | Change | -|------|--------| -| `argocd/apps/cloudnative-pg.yaml` | Switch to forge-mirrored chart | -| `ansible/roles/loki/templates/loki-config.yaml.j2` | Add `http_listen_address: 0.0.0.0` | -| 
`ansible/roles/tailscale_serve/defaults/main.yml` | Remove `svc:grafana` | -| `ansible/playbooks/indri.yml` | Remove grafana role | - -## Deleted Files - -| Path | Reason | -|------|--------| -| `ansible/roles/grafana/` | Replaced by k8s deployment | - ---- - -## Verification - -- [x] Loki accessible from k8s pods -- [x] Prometheus accessible from k8s pods -- [x] Grafana pod running in `monitoring` namespace -- [x] Grafana Ingress active -- [x] https://grafana.tail8d86e.ts.net loads -- [x] All 9 dashboards visible -- [x] Prometheus datasource queries work -- [x] Loki datasource queries work - ---- - -## Rollback - -1. Re-add `svc:grafana` to ansible tailscale_serve -2. `mise run provision-indri -- --tags tailscale-serve,grafana` -3. `argocd app delete grafana grafana-config --cascade` - ---- - -## Implementation Notes - -*Added during implementation for retrospective review* - -### SSH Credential Management - -**Issue**: Initial plan used HTTPS URLs for forge-mirrored Helm chart repos, but ArgoCD in cluster couldn't resolve `forge.tail8d86e.ts.net` (MagicDNS not available inside cluster). - -**Solution**: Use SSH URLs for all forge repos. Created a **credential template** (`repo-creds-forge`) that matches all repos under `ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/` using URL prefix matching. This allows a single SSH key (added to Forgejo user, not as deploy key) to work for all repos. - -### SSH Host Key for ArgoCD - -**Issue**: ArgoCD's known_hosts didn't include indri's SSH host key, causing `knownhosts: key is unknown` errors. - -**Solution**: Added `argocd-ssh-known-hosts-cm.yaml` as a kustomize patch to include indri's host key alongside the upstream defaults. - -**Gotcha**: Kustomize patches must **not specify namespace** - the namespace transformation happens *after* patch matching. Our patch had `namespace: argocd` which caused "no matches for Id" errors until removed. 
- -### Tailscale Hostname Cutover - -**Issue**: After removing `svc:grafana` from ansible's tailscale_serve config, the k8s Ingress still got a numbered hostname (`grafana-1.tail8d86e.ts.net`). - -**Solution**: The old `svc:grafana` service remained registered in Tailscale admin console even after clearing its serve config. **Manual deletion in Tailscale admin console** was required to free the `grafana` hostname for the k8s Ingress to claim. After deletion, recreating the Ingress picked up the correct hostname. - -### ArgoCD Workflow Decision - -During implementation, we established the pattern for GitOps workflow: - -- **All apps target `main` branch** (not feature branches) -- Manual sync policy on workload apps = merge doesn't auto-deploy -- Workflow: feature branch → PR → merge to main → `argocd app sync ` -- For testing: temporarily set one app to feature branch via `argocd app set --revision` - -This avoids the friction of switching `targetRevision` in manifests during development. - -### Bootstrap Dependencies - -Some resources must be applied manually before ArgoCD can manage itself: - -1. **SSH known_hosts** - chicken-and-egg: ArgoCD can't sync the config that adds the host key -2. **Credential secrets** - `repo-creds-forge` must exist before ArgoCD can pull from forge - -These are documented in `argocd/manifests/argocd/README.md` as bootstrap steps. 
- -### Actual Versions Used - -- Grafana Helm chart: `grafana-8.8.2` (tag in grafana-helm-charts repo) -- CloudNativePG Helm chart: `cloudnative-pg-v0.23.0` (tag in cloudnative-pg-charts repo) -- Grafana version: 11.4.0 diff --git a/plans/completed/k8s-migration/P3_postgresql.complete.md b/plans/completed/k8s-migration/P3_postgresql.complete.md deleted file mode 100644 index e74f09d..0000000 --- a/plans/completed/k8s-migration/P3_postgresql.complete.md +++ /dev/null @@ -1,359 +0,0 @@ -# Phase 3: PostgreSQL Disaster Recovery & Backup - -**Goal**: Test disaster recovery and configure borgmatic backups for k8s-pg - -**Status**: Complete (2026-01-19) - -**Prerequisites**: [Phase 2](P2_grafana.complete.md) complete - ---- - -## Overview - -Phase 3 establishes disaster recovery capabilities for the k8s PostgreSQL cluster: -1. **Fix borgmatic backup issues** - Resolve `borg: command not found` error -2. **Test disaster recovery** - Restore miniflux data from borgmatic backup to k8s-pg -3. **Create borgmatic user** - Read-only backup user in k8s-pg via CloudNativePG -4. **Configure dual database backup** - Backup both brew PostgreSQL and k8s-pg during migration - -This phase prepares for Phase 4 (miniflux migration) by verifying we can restore data to k8s-pg. - ---- - -## Key Decisions - -### Backup Both Databases During Transition - -**Decision**: Configure borgmatic to backup both `localhost:5432/miniflux` (brew) and `k8s-pg.tail8d86e.ts.net:5432/miniflux` (k8s) until migration complete. - -**Why**: Provides redundancy during migration. After Phase 4, remove localhost entry. - -### Reuse Existing borgmatic Password - -**Decision**: Use same borgmatic password from 1Password for k8s-pg user. - -**Why**: Simpler credential management, password already proven secure. - -### CloudNativePG Managed Roles - -**Decision**: Declare borgmatic user via CloudNativePG `managed.roles` instead of SQL commands. - -**Why**: Declarative, version-controlled, matches eblume user pattern. 
- -### Disable selfHeal on apps App - -**Decision**: Remove `selfHeal: true` from `argocd/apps/apps.yaml`. - -**Why**: Allows temporarily pointing child apps to feature branches during development without ArgoCD reverting the change. - ---- - -## Steps - -### 1. Fix borgmatic borg path issue - -**Problem**: borgmatic failing with `borg: command not found` - -**Cause**: LaunchAgent doesn't have homebrew in PATH, so `borg` binary not found. - -**Solution**: Add `local_path` to borgmatic config template. - -**File**: `ansible/roles/borgmatic/templates/config.yaml.j2` -```yaml -# Path to borg binary (LaunchAgent doesn't have homebrew in PATH) -local_path: {{ borgmatic_local_path }} -``` - -**File**: `ansible/roles/borgmatic/defaults/main.yml` -```yaml -borgmatic_local_path: /opt/homebrew/bin/borg -``` - ---- - -### 2. Run manual backup to verify fix - -```bash -mise run provision-indri -- --tags borgmatic -ssh indri '/opt/homebrew/bin/borgmatic --verbosity 1' -``` - ---- - -### 3. Extract miniflux dump from borgmatic - -```bash -ssh indri 'borgmatic list --archive latest' -ssh indri 'borgmatic restore --archive latest --destination /tmp/restore' -``` - ---- - -### 4. Add ACL grant for homelab → k8s - -**Problem**: Connection from indri to k8s-pg blocked - Tailscale proxy logs showed "no rules matched" - -**Solution**: Add ACL grant in Pulumi. - -**File**: `pulumi/policy.hujson` -```hujson -// Homelab can reach k8s PostgreSQL for borgmatic backups -{ - "src": ["tag:homelab"], - "dst": ["tag:k8s"], - "ip": ["tcp:5432"], -}, -``` - -Deploy: `mise run tailnet-up` - ---- - -### 5. 
Restore data to k8s-pg - -```bash -# Using eblume superuser credentials from 1Password -ssh indri "psql 'postgres://eblume@k8s-pg.tail8d86e.ts.net:5432/miniflux' -f /tmp/restore/localhost/miniflux/miniflux" -``` - -**Verification**: -```bash -psql 'postgres://eblume@k8s-pg.tail8d86e.ts.net:5432/miniflux' -c 'SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM feeds; SELECT COUNT(*) FROM entries;' -# Result: 2 users, 2 feeds, 44 entries -``` - ---- - -### 6. Create borgmatic user in k8s-pg via CloudNativePG - -**File**: `argocd/manifests/databases/secret-borgmatic.yaml.tpl` -```yaml -# Template for borgmatic backup user password -# Apply with: op inject -i secret-borgmatic.yaml.tpl | kubectl apply -f - -apiVersion: v1 -kind: Secret -metadata: - name: blumeops-pg-borgmatic - namespace: databases -type: kubernetes.io/basic-auth -stringData: - username: borgmatic - password: {{ op://vg6xf6vvfmoh5hqjjhlhbeoaie/mw2bv5we7woicjza7hc6s44yvy/db-password }} -``` - -**File**: `argocd/manifests/databases/blumeops-pg.yaml` (add to managed roles) -```yaml -managed: - roles: - # ... existing eblume role ... - # borgmatic read-only user for backups - - name: borgmatic - login: true - connectionLimit: -1 - ensure: present - inherit: true - inRoles: - - pg_read_all_data - passwordSecret: - name: blumeops-pg-borgmatic -``` - -**Deploy**: -```bash -op inject -i argocd/manifests/databases/secret-borgmatic.yaml.tpl | kubectl apply -f - -argocd app set blumeops-pg --revision feature/p3-postgresql-borgmatic -argocd app sync blumeops-pg -``` - ---- - -### 7. 
Configure borgmatic for dual database backup - -**File**: `ansible/roles/borgmatic/defaults/main.yml` -```yaml -borgmatic_postgresql_databases: - # Brew PostgreSQL on indri (current production) - - name: miniflux - hostname: localhost - port: 5432 - username: borgmatic - # k8s PostgreSQL (CloudNativePG) - backup both during migration - - name: miniflux - hostname: k8s-pg.tail8d86e.ts.net - port: 5432 - username: borgmatic -``` - -**File**: `ansible/roles/postgresql/tasks/main.yml` (update .pgpass) -```yaml -- name: Write .pgpass file for borgmatic backups - ansible.builtin.copy: - content: | - # Managed by ansible - only read-only roles - localhost:{{ postgresql_port }}:*:borgmatic:{{ postgresql_user_passwords['borgmatic'] }} - k8s-pg.tail8d86e.ts.net:5432:*:borgmatic:{{ postgresql_user_passwords['borgmatic'] }} - dest: ~/.pgpass - mode: '0600' - no_log: true -``` - ---- - -### 8. Verify complete backup pipeline - -```bash -mise run provision-indri -- --tags borgmatic,postgresql -ssh indri '/opt/homebrew/bin/borgmatic --verbosity 1' -ssh indri 'borgmatic list --archive latest' -``` - -**Expected output**: Archive contains both dumps: -- `localhost/miniflux/miniflux` -- `k8s-pg.tail8d86e.ts.net/miniflux/miniflux` - ---- - -### 9. Fix ArgoCD drift from CNPG defaults - -**Problem**: ArgoCD showed blumeops-pg as OutOfSync due to CNPG operator adding default values. - -**Solution**: Add CNPG defaults explicitly to managed roles. - -**File**: `argocd/manifests/databases/blumeops-pg.yaml` -```yaml -managed: - roles: - - name: eblume - # ... existing fields ... - connectionLimit: -1 - ensure: present - inherit: true - - name: borgmatic - # ... existing fields ... - connectionLimit: -1 - ensure: present - inherit: true -``` - ---- - -### 10. 
Update zk documentation - -Updated: -- `~/code/personal/zk/borgmatic.md` - k8s-pg backup documentation and log entry -- `~/code/personal/zk/postgresql.md` - k8s PostgreSQL section and log entry - ---- - -## New Files - -| Path | Purpose | -|------|---------| -| `argocd/manifests/databases/secret-borgmatic.yaml.tpl` | borgmatic user password template | - -## Modified Files - -| Path | Change | -|------|--------| -| `ansible/roles/borgmatic/defaults/main.yml` | Added `borgmatic_local_path`, k8s-pg database entry | -| `ansible/roles/borgmatic/templates/config.yaml.j2` | Added `local_path` option | -| `ansible/roles/postgresql/tasks/main.yml` | Added k8s-pg to .pgpass | -| `argocd/apps/apps.yaml` | Disabled selfHeal | -| `argocd/manifests/databases/blumeops-pg.yaml` | Added borgmatic managed role, CNPG defaults | -| `pulumi/policy.hujson` | Added ACL grant homelab → k8s on tcp:5432 | - ---- - -## Verification - -- [x] borgmatic backup runs successfully -- [x] Miniflux data restored to k8s-pg (2 users, 2 feeds, 44 entries) -- [x] borgmatic user created in k8s-pg with pg_read_all_data role -- [x] Both localhost and k8s-pg databases in backup archive -- [x] ArgoCD shows blumeops-pg as Synced -- [x] zk documentation updated - ---- - -## Rollback - -Keep brew PostgreSQL running until Phase 4 verified. To revert: - -1. Remove k8s-pg entry from borgmatic databases -2. Remove k8s-pg from .pgpass -3. `mise run provision-indri -- --tags borgmatic,postgresql` - ---- - -## Implementation Notes - -*Added during implementation for retrospective review* - -### borgmatic LaunchAgent PATH Issue - -**Problem**: borgmatic LaunchAgent failed with `borg: command not found` - -**Root cause**: LaunchAgents run with minimal PATH that doesn't include `/opt/homebrew/bin` - -**Solution**: Added `local_path: /opt/homebrew/bin/borg` to borgmatic config. This was already done for `pg_dump_command` but not for borg itself. 
- -**Lesson**: Any tool invoked by borgmatic needs absolute path when running from LaunchAgent. - -### 1Password Field Name Mismatch - -**Issue**: Initial secret template used `password` field but 1Password item had `db-password`. - -**Discovery**: Error message from `op inject` indicated field not found. - -**Fix**: Updated template to use correct field name `db-password`. - -### ACL Grant Discovery - -**Problem**: Connection from indri (tag:homelab) to k8s-pg (tag:k8s) failed. - -**Diagnosis**: Checked Tailscale operator proxy logs which showed "no rules matched" - clear indication of missing ACL. - -**Solution**: Added explicit grant in `pulumi/policy.hujson` for `tag:homelab` → `tag:k8s` on `tcp:5432`. - -### ArgoCD selfHeal and Feature Branch Development - -**Problem**: When testing changes, temporarily pointed blumeops-pg app to feature branch via `argocd app set --revision`. ArgoCD's selfHeal kept reverting it back to main. - -**Discussion**: Two options considered: -- Option A: Disable selfHeal on apps app (manual sync required for new apps) -- Option B: Keep selfHeal, use different workflow - -**Decision**: Option A chosen. The apps app now only has `prune: true`, not selfHeal. This allows: -1. Temporarily testing feature branches -2. Manual control over when app manifest changes are applied - -**Trade-off**: Must manually sync apps app when adding/removing Application manifests. - -### CloudNativePG Managed Role Reconciliation - -**Issue**: After creating borgmatic secret with correct password, CNPG didn't immediately update the user. - -**Solution**: Annotated the Cluster to trigger reconciliation: -```bash -kubectl annotate cluster blumeops-pg -n databases cnpg.io/reconcile=$(date +%s) --overwrite -``` - -### ArgoCD Drift from CNPG Defaults - -**Problem**: blumeops-pg showed OutOfSync despite successful syncs. 
- -**Cause**: CNPG operator adds default values (`connectionLimit: -1`, `ensure: present`, `inherit: true`) to managed roles that weren't in our spec. - -**Solution**: Added these defaults explicitly to our spec to match what CNPG generates. - -**Comment added**: Documented in blumeops-pg.yaml that these are "CNPG defaults added to prevent ArgoCD drift". - -### Git Workflow for Phase 3 - -1. Created feature branch: `feature/p3-postgresql-borgmatic` -2. Made commits throughout implementation -3. Pointed blumeops-pg app to feature branch for testing -4. Created PR #32 for review -5. After merge, reset app to main: `argocd app set blumeops-pg --revision main` - -This workflow was enabled by disabling selfHeal (see above). diff --git a/plans/completed/k8s-migration/P4_miniflux.complete.md b/plans/completed/k8s-migration/P4_miniflux.complete.md deleted file mode 100644 index 1fc73cf..0000000 --- a/plans/completed/k8s-migration/P4_miniflux.complete.md +++ /dev/null @@ -1,162 +0,0 @@ -# Phase 4: Miniflux Migration to Kubernetes - -**Goal**: Migrate Miniflux entirely off indri and onto k8s, retire brew PostgreSQL, rename k8s-pg to pg - -**Status**: Complete (2026-01-20) - -**Prerequisites**: [Phase 3](P3_postgresql.complete.md) complete - ---- - -## Overview - -This phase completed the miniflux migration and retired brew PostgreSQL: -1. Deployed miniflux container in k8s via ArgoCD -2. Exposed via Tailscale Ingress at `feed.tail8d86e.ts.net` -3. Removed all miniflux infrastructure from indri (ansible role, brew service, Tailscale serve) -4. Retired brew PostgreSQL (no longer needed) -5. Renamed k8s-pg to pg (canonical Tailscale hostname) -6. Updated borgmatic to backup only `pg.tail8d86e.ts.net` -7. 
Updated all zk documentation - ---- - -## New Files - -| Path | Purpose | -|------|---------| -| `argocd/apps/miniflux.yaml` | ArgoCD Application definition | -| `argocd/manifests/miniflux/deployment.yaml` | Miniflux Deployment | -| `argocd/manifests/miniflux/service.yaml` | ClusterIP Service | -| `argocd/manifests/miniflux/ingress-tailscale.yaml` | Tailscale Ingress for `feed.tail8d86e.ts.net` | -| `argocd/manifests/miniflux/secret-db.yaml.tpl` | Database URL secret documentation | -| `argocd/manifests/miniflux/kustomization.yaml` | Kustomize configuration | -| `argocd/manifests/miniflux/README.md` | Setup instructions | - -## Modified Files - -| Path | Change | -|------|--------| -| `ansible/playbooks/indri.yml` | Removed miniflux and postgresql roles, simplified pre_tasks | -| `ansible/roles/tailscale_serve/defaults/main.yml` | Removed `svc:feed` and `svc:pg` entries | -| `ansible/roles/alloy/defaults/main.yml` | Removed miniflux and postgresql logs, disabled postgres metrics | -| `ansible/roles/borgmatic/defaults/main.yml` | Updated to backup only `pg.tail8d86e.ts.net` | -| `ansible/roles/borgmatic/tasks/main.yml` | Added .pgpass file management | -| `argocd/manifests/databases/service-tailscale.yaml` | Renamed hostname from k8s-pg to pg | - -## Deleted Files - -| Path | Reason | -|------|--------| -| `ansible/roles/miniflux/` | Entire role no longer needed | -| `ansible/roles/postgresql/` | Brew PostgreSQL no longer needed | - ---- - -## Verification - -- [x] Miniflux pod healthy in k8s -- [x] https://feed.tail8d86e.ts.net accessible -- [x] User `eblume` can log in -- [x] Feeds visible and entries readable -- [x] `pg.tail8d86e.ts.net` resolves to k8s PostgreSQL -- [x] Old `k8s-pg` and `feed` devices removed from Tailscale -- [x] brew miniflux and postgresql services stopped -- [x] Tailscale serve entries cleared from indri -- [x] zk documentation updated - ---- - -## Implementation Notes - -*Lessons learned and issues encountered* - -### CNPG-Generated 
Password vs 1Password - -**Problem**: Initial secret template used 1Password for miniflux database password, but CNPG auto-generates the bootstrap owner password. - -**Solution**: Reference the CNPG-generated password from `blumeops-pg-app` secret: -```bash -kubectl create secret generic miniflux-db -n miniflux \ - --from-literal=url="$(kubectl -n databases get secret blumeops-pg-app -o jsonpath='{.data.uri}' | base64 -d)" -``` - -### Table Ownership Issue After P3 Restore - -**Problem**: Miniflux pod crashed with "permission denied for table schema_version". - -**Root cause**: P3 restore was run as the `eblume` superuser, so all tables were created owned by `eblume`, not `miniflux`. - -**Solution**: Transfer ownership of all tables to miniflux: -```sql -DO $$ -DECLARE r RECORD; -BEGIN - FOR r IN (SELECT tablename FROM pg_tables WHERE schemaname = 'public') LOOP - EXECUTE 'ALTER TABLE public.' || quote_ident(r.tablename) || ' OWNER TO miniflux'; - END LOOP; -END$$; -``` - -### Tailscale Ingress Hostname Suffix - -**Behavior**: When requesting a Tailscale hostname that's already taken, the operator adds a suffix (e.g., `feed-1`). - -**Workflow**: -1. Deploy initially - gets `feed-1.tail8d86e.ts.net` -2. Clear old `svc:feed` from indri -3. Delete old `feed` device from Tailscale admin -4. Delete and recreate the Ingress - now claims `feed` - -### Renaming Tailscale Service Hostname - -**Problem**: Changing the `tailscale.com/hostname` annotation doesn't automatically update the Tailscale device. - -**Solution**: Delete the service and let ArgoCD recreate it: -```bash -kubectl -n databases delete service blumeops-pg-tailscale -argocd app sync blumeops-pg -``` - -### .pgpass Management Migration - -**Issue**: The postgresql role managed `~/.pgpass` for borgmatic. With postgresql role deleted, borgmatic couldn't authenticate. - -**Solution**: Moved .pgpass management to the borgmatic role. Password is still fetched in playbook pre_tasks as `borgmatic_db_password`. 
- -### Ansible Check Mode and Registered Variables - -**Problem**: Running `provision-indri --check --diff` failed in the podman role with "Conditional result (True) was derived from value of type 'str'" errors. - -**Root cause**: Command tasks are skipped in check mode, leaving registered variables undefined or with unexpected types when used in conditionals. - -**Solution**: Added `check_mode: false` to read-only command tasks that gather information: -```yaml -- name: Check if podman machine exists - ansible.builtin.command: - cmd: podman machine list --format json - register: podman_machine_list - changed_when: false - check_mode: false # Safe to run in check mode - read-only -``` - -**Lesson**: Any task that registers a variable used in conditionals should have `check_mode: false` if the command is read-only/safe. - -### 1Password CLI on Headless Hosts - -**Issue**: Attempted to run `op` commands on indri, but 1Password CLI requires interactive authentication (biometrics/password). - -**Solution**: All `op` commands must be in `pre_tasks` of the playbook with `delegate_to: localhost` so they run on gilbert (the workstation with GUI auth). - -### Git Workflow for Phase 4 - -1. Created feature branch: `feature/p4-miniflux` -2. Made incremental commits throughout implementation -3. Pointed `miniflux` and `blumeops-pg` apps to feature branch for testing -4. Created PR #33 for review -5. 
After merge, reset apps to main: - ```bash - argocd app set miniflux --revision main - argocd app set blumeops-pg --revision main - argocd app sync apps - ``` diff --git a/plans/completed/k8s-migration/P5.1_docker_migration.complete.md b/plans/completed/k8s-migration/P5.1_docker_migration.complete.md deleted file mode 100644 index d91d6de..0000000 --- a/plans/completed/k8s-migration/P5.1_docker_migration.complete.md +++ /dev/null @@ -1,208 +0,0 @@ -# Phase 5.1: Migrate Minikube from QEMU2 to Docker Driver - -**Goal**: Replace the qemu2 driver with docker to fix remote API access and simplify volume mounts - -**Status**: Complete (2026-01-21) - Cluster running, ArgoCD deployed, apps synced - -**Prerequisites**: [Phase 5](P5_devpi.complete.md) complete - ---- - -## Background - -### Original Problem (Podman → QEMU2) - -During Phase 6 (Kiwix/Transmission migration), we discovered that the **podman driver has fundamental limitations** that prevent mounting external volumes: - -1. **SMB CSI driver fails** with "Operation not permitted" - the rootless container lacks kernel-level mount capabilities -2. **`minikube mount` fails** - 9p mount gets "permission denied" inside the podman VM -3. **hostPath volumes** only work for paths inside the minikube container, not the macOS host - -We migrated to QEMU2 to get a full VM with kernel capabilities. 
- -### New Problem (QEMU2 → Docker) - -The QEMU2 driver introduced a **new problem**: the Kubernetes API server is inside the VM at `192.168.105.2:6443`, and Tailscale's TCP proxy cannot forward to it properly: - -- TCP connections succeed (nc -zv works) -- TLS handshake times out -- Root cause unknown, but likely related to Tailscale serve's handling of non-localhost upstreams - -Additionally, the volume mount solution with QEMU2 was complex: -- Required NFS mount from sifaka → indri -- Then `minikube mount` to pass through to VM -- Two LaunchAgents/LaunchDaemons for persistence -- macOS GUI approval required for network access - -### Why Docker? - -The **docker driver** solves both problems: - -1. **API Server on localhost**: Docker Desktop handles port forwarding from container to localhost automatically, so `tailscale serve --tcp=443 tcp://localhost:PORT` works - -2. **Simpler volume mounts**: Docker Desktop has built-in macOS file sharing. Paths shared with Docker are accessible inside containers. - -3. **Official Tailscale recommendation**: Tailscale's own [Kubernetes guide](https://tailscale.com/learn/managing-access-to-kubernetes-with-tailscale) uses minikube with the docker driver. - ---- - -## Implementation Summary - -### Infrastructure Changes - -1. **Docker Desktop installed** (manual via `brew install --cask docker`) - - Configured with 12GB memory in Docker Desktop settings - - Kubernetes option disabled (using minikube instead) - -2. **Docker minikube cluster created**: - ```bash - minikube start \ - --driver=docker \ - --container-runtime=docker \ - --cpus=6 \ - --memory=11264 \ - --disk-size=200g \ - --apiserver-names=k8s.tail8d86e.ts.net,indri \ - --apiserver-port=6443 \ - --listen-address=0.0.0.0 - ``` - -3. **Tailscale serve configured** for k8s API: - - API server on localhost (port is dynamic with docker driver) - - `tailscale serve --service=svc:k8s --tcp=443 tcp://localhost:` - -4. 
**Remote kubectl access working** from gilbert: - - Created `mise-tasks/ensure-minikube-indri-kubectl-config` script - - Fetches certs from indri and sets up `~/.kube/minikube-indri/config.yml` - -### Ansible Roles Updated - -- `ansible/roles/minikube/` - docker driver, removed qemu2/NFS/socket_vmnet -- `ansible/roles/tailscale_serve/` - removed svc:k8s (minikube role handles dynamic port) -- Containerd registry mirrors configured for zot pull-through cache - -### ArgoCD Bootstrap - -All apps deployed and synced from `feature/p5.1-qemu2-migration` branch: - -| App | Status | Notes | -|-----|--------|-------| -| tailscale-operator | Healthy | Manages Tailscale ingresses | -| argocd | Healthy | Self-managed | -| cloudnative-pg | Healthy | PostgreSQL operator | -| blumeops-pg | Progressing | PostgreSQL cluster starting | -| grafana | Progressing | Needs grafana-admin secret | -| grafana-config | Healthy | Dashboards and ingress | -| miniflux | Progressing | Needs miniflux-config secret | -| devpi | Progressing | Starting up | - -### Secrets Still Needed - -After PR merge, apply these secrets manually: - -```bash -# Grafana admin password -op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl | kubectl --context=minikube-indri apply -f - - -# Miniflux config -op inject -i argocd/manifests/miniflux/secret.yaml.tpl | kubectl --context=minikube-indri apply -f - -``` - ---- - -## Technical Notes - -### API Server Port - -With docker driver, the API server port is **dynamic** - Docker maps a random host port to 6443 inside the container. - -The minikube ansible role queries the port after cluster start and configures tailscale serve accordingly. - -### Registry Mirror Configuration - -Containerd uses `/etc/containerd/certs.d//hosts.toml` files. 
The ansible role configures mirrors for: -- `registry.tail8d86e.ts.net` (private images) -- `docker.io` -- `ghcr.io` -- `quay.io` - -### ProxyClass Renamed - -Changed from `crio-compat` to `default` - the old name was misleading since we're no longer using CRI-O. - -### Volume Mounts for P6 (Kiwix/Transmission) - -**Solution: Direct NFS from pods to sifaka** ✅ TESTED AND WORKING - -Docker NATs outbound traffic through indri's LAN IP (192.168.1.50), so sifaka's NFS exports need to allow `192.168.1.0/24`. - -Sifaka NFS exports configured: -- `192.168.1.0/24` - Docker containers via indri NAT -- `100.64.0.0/10` - Tailscale clients - -Pods can mount NFS directly: -```yaml -volumes: - - name: torrents - nfs: - server: sifaka - path: /volume1/torrents -``` - -No LaunchAgents, no `minikube mount`, no SMB CSI driver needed. - ---- - -## Verification Checklist - -- [x] Docker Desktop installed and running on indri -- [x] QEMU2 minikube deleted -- [x] Docker minikube running (6 CPUs, 11GB RAM) -- [x] API server accessible on localhost -- [x] Tailscale serve configured for svc:k8s -- [x] Remote kubectl access working from gilbert -- [x] Ansible roles updated for docker driver -- [x] socket_vmnet stopped -- [x] ArgoCD deployed and synced -- [x] All apps synced to feature branch -- [x] Apply app secrets (grafana-admin, miniflux-db, devpi-root, eblume, borgmatic) -- [x] Verify all apps healthy after secrets applied -- [x] Miniflux database restored from borgmatic backup -- [ ] Merge PR and reset apps to main branch -- [ ] `mise run indri-services-check` passes - ---- - -## Post-Merge Steps - -After PR is merged: - -```bash -# Reset all blumeops apps to main branch -argocd app set apps --revision main -argocd app set argocd --revision main -argocd app set blumeops-pg --revision main -argocd app set devpi --revision main -argocd app set grafana-config --revision main -argocd app set miniflux --revision main -argocd app set tailscale-operator --revision main - -# Sync all apps 
-argocd app sync apps -argocd app sync argocd -argocd app sync tailscale-operator -argocd app sync blumeops-pg -argocd app sync grafana-config -argocd app sync miniflux -argocd app sync devpi -``` - ---- - -## Rollback Plan - -If Docker driver doesn't work: - -1. Delete Docker minikube: `minikube delete` -2. Recreate QEMU2 cluster (restore old ansible config from git) -3. Accept the Tailscale TCP forwarding limitation and use SSH tunnel for remote kubectl diff --git a/plans/completed/k8s-migration/P5_devpi.complete.md b/plans/completed/k8s-migration/P5_devpi.complete.md deleted file mode 100644 index 78669ca..0000000 --- a/plans/completed/k8s-migration/P5_devpi.complete.md +++ /dev/null @@ -1,102 +0,0 @@ -# Phase 5: devpi Migration to Kubernetes - -**Goal**: Migrate devpi PyPI caching proxy from indri to k8s - -**Status**: Complete (2026-01-20) - -**Prerequisites**: [Phase 4](P4_miniflux.complete.md) complete - ---- - -## Summary - -Successfully migrated devpi from mcquack LaunchAgent on indri to Kubernetes: -- Custom container image with devpi-server + devpi-web + auto-init startup script -- StatefulSet with 50Gi PVC for data persistence -- Tailscale Ingress at `pypi.tail8d86e.ts.net` -- Root password from 1Password secret, auto-initialized on first run -- Verified pip caching proxy and mcquack package upload - ---- - -## Key Learnings - -### Registry Mirror Configuration -- Minikube's CRI-O can't resolve Tailscale hostnames directly -- Added registry mirror config to redirect `registry.tail8d86e.ts.net` → `host.containers.internal:5050` -- Also added direct insecure registry entry for `host.containers.internal:5050` -- Config in `ansible/roles/minikube/files/zot-mirror.conf` - -### Memory Requirements -- devpi-web's Whoosh search indexer needs significant memory during PyPI index build -- Initial 512Mi limit caused OOMKills -- Solution: High limit (2Gi) with low request (256Mi) - memory reclaimed after indexing - -### Environment Variable Conflicts -- Kubernetes 
auto-sets `DEVPI_PORT` for service discovery -- Conflicted with our port config - renamed to `DEVPI_LISTEN_PORT` - -### Tailscale Serve Cleanup -- Use `tailscale serve status --json` to see entries (non-JSON output can be empty) -- Use `tailscale serve clear svc:` to remove entries - -### ArgoCD Workflow -- Changed `apps` to manual sync (was auto-sync with prune) -- Workflow: sync apps → set revision to feature branch → sync service → test → reset to main after merge - ---- - -## Verification Checklist - -- [x] devpi pod healthy in k8s -- [x] https://pypi.tail8d86e.ts.net accessible -- [x] Web interface shows root/pypi index -- [x] `pip install ` works through proxy -- [x] mcquack v1.0.0 uploaded to eblume/dev -- [x] `pip install --index-url https://pypi.tail8d86e.ts.net/eblume/dev/+simple/ mcquack` works -- [x] Old devpi service removed from indri -- [x] zk documentation updated - ---- - -## Files Changed - -### New Files -| Path | Purpose | -|------|---------| -| `argocd/apps/devpi.yaml` | ArgoCD Application definition | -| `argocd/manifests/devpi/Dockerfile` | Container image with startup script | -| `argocd/manifests/devpi/start.sh` | Auto-init startup script | -| `argocd/manifests/devpi/statefulset.yaml` | StatefulSet with PVC | -| `argocd/manifests/devpi/service.yaml` | ClusterIP Service | -| `argocd/manifests/devpi/ingress-tailscale.yaml` | Tailscale Ingress | -| `argocd/manifests/devpi/kustomization.yaml` | Kustomize configuration | -| `argocd/manifests/devpi/secret-root.yaml.tpl` | 1Password secret template | -| `argocd/manifests/devpi/README.md` | Setup documentation | - -### Modified Files -| Path | Change | -|------|--------| -| `CLAUDE.md` | Added k8s/ArgoCD workflow documentation | -| `ansible/playbooks/indri.yml` | Removed devpi and devpi_metrics roles | -| `ansible/roles/tailscale_serve/defaults/main.yml` | Removed svc:pypi | -| `ansible/roles/alloy/defaults/main.yml` | Removed devpi log collection | -| `ansible/roles/borgmatic/defaults/main.yml` | 
Removed devpi backup paths | -| `ansible/roles/minikube/files/zot-mirror.conf` | Added registry mirror for Tailscale hostname | -| `argocd/apps/apps.yaml` | Changed to manual sync policy | - -### Roles Kept (not deleted) -- `ansible/roles/devpi/` - Kept for reference -- `ansible/roles/devpi_metrics/` - Kept for reference - ---- - -## Post-Merge Cleanup - -After PR merge, reset ArgoCD apps to main: -```fish -argocd app set apps --revision main -argocd app sync apps -argocd app set devpi --revision main -argocd app sync devpi -``` diff --git a/plans/completed/k8s-migration/P6_kiwix.complete.md b/plans/completed/k8s-migration/P6_kiwix.complete.md deleted file mode 100644 index 6e4ebea..0000000 --- a/plans/completed/k8s-migration/P6_kiwix.complete.md +++ /dev/null @@ -1,1039 +0,0 @@ -# Phase 6: Kiwix and Transmission Migration - -**Goal**: Migrate kiwix-serve and transmission torrent daemon to k8s with shared storage - -**Status**: Ready to implement - -**Prerequisites**: [Phase 5.1](P5.1_docker_migration.md) complete (minikube on docker driver) - ---- - -## Blocker: Podman Driver Volume Mount Limitations - -**First attempt branch:** `feature/p6-kiwix-transmission` - -The initial implementation was completed and tested, but **all volume mount approaches failed** due to the podman driver's rootless container limitations: - -| Approach | Result | -|----------|--------| -| NFS volume | Failed - CAP_SYS_ADMIN required for NFS mounts | -| SMB CSI driver | Failed - `mount.cifs` returns EPERM inside rootless container | -| `minikube mount` (9p) | Failed - permission denied mounting into podman VM | -| hostPath | Failed - path doesn't exist inside minikube container | - -**Root cause:** The podman driver runs minikube in a rootless container that lacks kernel capabilities for filesystem mounts. This is a [documented limitation](https://minikube.sigs.k8s.io/docs/drivers/podman/) of the experimental podman driver. 
- -**Solution:** Phase 5.1 migrates minikube from podman to QEMU2 driver, which creates an actual VM with full kernel capabilities. - -**What's preserved:** -- All k8s manifests in `feature/p6-kiwix-transmission` are complete and tested -- Prerequisites (SMB share, k8s-smb user, data rsync) are done -- Can retry P6 immediately after P5.1 completes - ---- - -## Overview - -This phase migrates two services that share storage but operate independently: -1. **Transmission** - General-purpose BitTorrent daemon (standalone service) -2. **Kiwix** - Serves ZIM archives via HTTP - -The current architecture on indri: -- Transmission downloads torrents to `~/transmission/` -- Ansible syncs a declarative torrent list to transmission -- Completed ZIMs are symlinked to kiwix's serving directory -- kiwix-serve runs as a LaunchAgent with explicit file arguments - -New architecture in k8s: -- **SMB volume** on sifaka (`/volume1/torrents`) for all torrent downloads -- **SMB CSI driver** for mounting the Synology share in k8s -- **Transmission** as a standalone service with Tailscale ingress (`torrent.tail8d86e.ts.net`) -- **Kiwix** deployment that watches for `.zim` files among all downloads -- **Declarative ZIM list** in kiwix manifest, synced to transmission automatically -- **CronJob** to detect new ZIMs and restart kiwix - -**Key design principles:** -- Transmission is a general-purpose torrent daemon, not just for kiwix -- Users can add arbitrary torrents via transmission web UI/RPC -- Kiwix declares which ZIM torrents it wants and handles syncing them to transmission -- Kiwix watches the shared download directory for any `.zim` files (regardless of how they were added) - ---- - -## Architecture Decisions - -### Storage: Direct NFS to Sifaka ✅ TESTED - -**Solution:** Direct NFS volume mounts from pods to sifaka. No SMB CSI driver or `minikube mount` needed. - -With the docker driver, minikube containers NAT outbound traffic through indri's LAN IP (192.168.1.50). 
Sifaka's NFS exports are configured to allow: -- `192.168.1.0/24` - Docker containers via indri NAT -- `100.64.0.0/10` - Tailscale clients - -**Storage path:** `/volume1/torrents/` on sifaka (NFS export) -- General-purpose torrent download directory -- Contains ZIM files, Linux ISOs, and whatever else users download -- Accessed via native k8s NFS volume (no credentials needed - IP-based access) - -**No backup needed:** -- Sifaka is RAID 5/6, already the backup target -- ZIM files are re-downloadable via torrent -- Other torrents are typically re-downloadable too -- Future offsite backups will cover all shares - -### Torrent Daemon: Transmission (Standalone Service) - -**Why stick with Transmission:** -- Proven reliability on indri -- Well-maintained container images (`linuxserver/transmission`) -- RPC API for automation -- DHT/PEX for good peer discovery -- Web UI for interactive management - -**Container image:** `lscr.io/linuxserver/transmission:latest` -- Includes web UI for monitoring and adding torrents -- Supports environment variable configuration -- Uses `/downloads` for completed files - -**Standalone service:** -- Own namespace: `torrent` -- Own Tailscale ingress: `torrent.tail8d86e.ts.net` -- Can be used for any torrents, not just ZIM archives -- Users interact with it directly via web UI - -### Declarative ZIM Torrent Management - -**Pattern:** Kiwix ConfigMap → Kiwix Sidecar → Transmission RPC - -1. **ConfigMap** (`kiwix-zim-torrents`) in kiwix namespace lists desired ZIM torrent URLs -2. **Kiwix sidecar** syncs ConfigMap to transmission (adds missing torrents) -3. Transmission downloads to shared SMB volume -4. Kiwix watches SMB volume for `.zim` files - -This allows adding new ZIM archives by: -1. Adding torrent URL to ConfigMap in kiwix's ArgoCD manifest -2. Syncing the kiwix ArgoCD app -3. Kiwix sidecar adds torrent to transmission -4. Waiting for download to complete -5. 
Kiwix restarts automatically when ZIM watcher detects the new file - -**Non-declarative torrents:** -- Users can add any torrent via `torrent.tail8d86e.ts.net` web UI -- If someone adds a ZIM torrent manually, kiwix will still pick it up -- Non-ZIM downloads coexist in the same directory - -### Kiwix Restart Orchestration - -**Challenge:** kiwix-serve doesn't hot-reload new ZIM files; requires restart. - -**Solution:** CronJob watcher -- Runs hourly (configurable) -- Lists completed `.zim` files in SMB volume (among all downloads) -- Compares with hash of last-seen list -- If changed, triggers `kubectl rollout restart deployment/kiwix` - -**Graceful handling of incomplete downloads:** -- Transmission stores incomplete files with `.part` extension -- Kiwix glob pattern `*.zim` only matches completed files -- Kiwix can start immediately with whatever ZIMs exist - ---- - -## Prerequisites (Manual Steps) - -### 1. Configure NFS Export on Sifaka - -**Status: DONE** - The `torrents` shared folder exists at `/volume1/torrents` with NFS exports allowing: -- `192.168.1.0/24` - Docker containers via indri NAT -- `100.64.0.0/10` - Tailscale clients - -### 2. Copy Existing Downloads to Sifaka - -Before migration, copy existing downloads to avoid re-downloading ~138GB: - -```bash -# From indri - mount the NFS share -sudo mount -t nfs sifaka:/volume1/torrents /Volumes/torrents - -# Then rsync (adjust mount path as needed) -rsync -avP ~/transmission/ /Volumes/torrents/ - -# Verify ZIM files -ls -la /Volumes/torrents/*.zim -``` - ---- - -## Steps - -### 1. Create Shared NFS PersistentVolume - -This PV is shared between transmission and kiwix namespaces. Uses direct NFS - no CSI driver needed. 
- -**File:** `argocd/manifests/torrent/pv-nfs.yaml` - -```yaml -apiVersion: v1 -kind: PersistentVolume -metadata: - name: torrents-nfs-pv -spec: - capacity: - storage: 1Ti - accessModes: - - ReadWriteMany - persistentVolumeReclaimPolicy: Retain - storageClassName: "" - nfs: - server: sifaka - path: /volume1/torrents -``` - -No secrets needed - NFS uses IP-based access control configured on sifaka. - ---- - -## Transmission Service (Standalone) - -### 3. Create Transmission Namespace Resources - -**File:** `argocd/manifests/torrent/pvc.yaml` - -```yaml -apiVersion: v1 -kind: PersistentVolumeClaim -metadata: - name: torrents-storage - namespace: torrent -spec: - accessModes: - - ReadWriteMany - storageClassName: "" - volumeName: torrents-nfs-pv - resources: - requests: - storage: 1Ti -``` - -**File:** `argocd/manifests/torrent/deployment.yaml` - -```yaml -apiVersion: apps/v1 -kind: Deployment -metadata: - name: transmission - namespace: torrent -spec: - replicas: 1 - selector: - matchLabels: - app: transmission - template: - metadata: - labels: - app: transmission - spec: - containers: - - name: transmission - image: lscr.io/linuxserver/transmission:latest - env: - - name: PUID - value: "1000" - - name: PGID - value: "1000" - - name: TZ - value: "America/Los_Angeles" - ports: - - containerPort: 9091 - name: web - - containerPort: 51413 - name: peer-tcp - - containerPort: 51413 - protocol: UDP - name: peer-udp - volumeMounts: - - name: downloads - mountPath: /downloads - - name: config - mountPath: /config - resources: - requests: - memory: "256Mi" - cpu: "100m" - limits: - memory: "512Mi" - livenessProbe: - httpGet: - path: /transmission/web/ - port: 9091 - initialDelaySeconds: 30 - periodSeconds: 30 - readinessProbe: - httpGet: - path: /transmission/web/ - port: 9091 - initialDelaySeconds: 10 - periodSeconds: 10 - volumes: - - name: downloads - persistentVolumeClaim: - claimName: torrents-storage - - name: config - emptyDir: {} # Config is ephemeral; torrents 
persist in SMB -``` - -**File:** `argocd/manifests/torrent/service.yaml` - -```yaml -apiVersion: v1 -kind: Service -metadata: - name: transmission - namespace: torrent -spec: - selector: - app: transmission - ports: - - name: web - port: 9091 - targetPort: 9091 -``` - -**File:** `argocd/manifests/torrent/ingress-tailscale.yaml` - -```yaml -apiVersion: networking.k8s.io/v1 -kind: Ingress -metadata: - name: transmission - namespace: torrent -spec: - ingressClassName: tailscale - rules: - - host: torrent - http: - paths: - - path: / - pathType: Prefix - backend: - service: - name: transmission - port: - number: 9091 -``` - -**File:** `argocd/manifests/torrent/kustomization.yaml` - -```yaml -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -namespace: torrent -resources: - - pv-nfs.yaml - - pvc.yaml - - deployment.yaml - - service.yaml - - ingress-tailscale.yaml -``` - -**File:** `argocd/apps/torrent.yaml` - -```yaml -apiVersion: argoproj.io/v1alpha1 -kind: Application -metadata: - name: torrent - namespace: argocd -spec: - project: default - source: - repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git - targetRevision: main - path: argocd/manifests/torrent - destination: - server: https://kubernetes.default.svc - namespace: torrent - syncPolicy: - syncOptions: - - CreateNamespace=true -``` - ---- - -## Kiwix Service - -### 2. Create Kiwix PVC (References Same PV) - -**File:** `argocd/manifests/kiwix/pvc.yaml` - -```yaml -apiVersion: v1 -kind: PersistentVolumeClaim -metadata: - name: torrents-storage - namespace: kiwix -spec: - accessModes: - - ReadWriteMany # Need write for the sync sidecar to work - storageClassName: "" - volumeName: torrents-nfs-pv - resources: - requests: - storage: 1Ti -``` - -### 4. Create Declarative ZIM Torrent List ConfigMap - -This ConfigMap lists the ZIM archives that kiwix wants. The kiwix sidecar syncs these to transmission. 
- -**File:** `argocd/manifests/kiwix/configmap-zim-torrents.yaml` - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: kiwix-zim-torrents - namespace: kiwix -data: - torrents.txt: | - # Declarative ZIM archive torrent URLs - # These are synced to transmission automatically by the kiwix sidecar - # Format: one URL per line, comments start with # - # - # Users can also add ZIM torrents manually via torrent.tail8d86e.ts.net - # and kiwix will pick them up automatically. - - # Wikipedia - Top 1M English articles (43G) - https://download.kiwix.org/zim/wikipedia/wikipedia_en_top1m_maxi_2025-09.zim.torrent - - # Project Gutenberg - Public domain books (72G) - https://download.kiwix.org/zim/gutenberg/gutenberg_en_all_2023-08.zim.torrent - - # iFixit - Repair guides (3.3G) - https://download.kiwix.org/zim/ifixit/ifixit_en_all_2025-12.zim.torrent - - # Stack Exchange - https://download.kiwix.org/zim/stack_exchange/superuser.com_en_all_2025-12.zim.torrent - https://download.kiwix.org/zim/stack_exchange/math.stackexchange.com_en_all_2025-12.zim.torrent - - # LibreTexts - Open educational resources - https://download.kiwix.org/zim/libretexts/libretexts.org_en_bio_2025-01.zim.torrent - https://download.kiwix.org/zim/libretexts/libretexts.org_en_chem_2025-01.zim.torrent - https://download.kiwix.org/zim/libretexts/libretexts.org_en_eng_2025-01.zim.torrent - https://download.kiwix.org/zim/libretexts/libretexts.org_en_math_2025-01.zim.torrent - https://download.kiwix.org/zim/libretexts/libretexts.org_en_phys_2025-01.zim.torrent - https://download.kiwix.org/zim/libretexts/libretexts.org_en_human_2025-01.zim.torrent - - # DevDocs - Programming documentation - https://download.kiwix.org/zim/devdocs/devdocs_en_bash_2026-01.zim.torrent - https://download.kiwix.org/zim/devdocs/devdocs_en_python_2026-01.zim.torrent - https://download.kiwix.org/zim/devdocs/devdocs_en_go_2026-01.zim.torrent - https://download.kiwix.org/zim/devdocs/devdocs_en_kubernetes_2026-01.zim.torrent - 
https://download.kiwix.org/zim/devdocs/devdocs_en_docker_2026-01.zim.torrent - https://download.kiwix.org/zim/devdocs/devdocs_en_git_2026-01.zim.torrent - https://download.kiwix.org/zim/devdocs/devdocs_en_postgresql_2026-01.zim.torrent - # Add more from ansible/roles/kiwix/defaults/main.yml as needed -``` - -### 5. Create Torrent Sync Script ConfigMap - -This script syncs the declarative ZIM list to transmission. - -**File:** `argocd/manifests/kiwix/configmap-sync-script.yaml` - -```yaml -apiVersion: v1 -kind: ConfigMap -metadata: - name: zim-torrent-sync-script - namespace: kiwix -data: - sync-zim-torrents.sh: | - #!/bin/bash - # Sync ZIM torrents from kiwix ConfigMap to Transmission - # Runs as a sidecar in the kiwix deployment - set -euo pipefail - - TORRENT_LIST="${TORRENT_LIST:-/config/torrents.txt}" - TRANSMISSION_HOST="${TRANSMISSION_HOST:-transmission.torrent.svc.cluster.local}" - TRANSMISSION_PORT="${TRANSMISSION_PORT:-9091}" - - echo "Syncing ZIM torrents to transmission at ${TRANSMISSION_HOST}:${TRANSMISSION_PORT}" - - # Wait for transmission to be ready - echo "Waiting for Transmission RPC..." 
- max_attempts=30 - attempt=0 - until curl -sf "http://${TRANSMISSION_HOST}:${TRANSMISSION_PORT}/transmission/rpc" >/dev/null 2>&1; do - attempt=$((attempt + 1)) - if [[ $attempt -ge $max_attempts ]]; then - echo "Transmission not ready after ${max_attempts} attempts, will retry next cycle" - exit 0 # Don't fail, just skip this sync - fi - sleep 10 - done - echo "Transmission is ready" - - # Get current torrents from transmission - # transmission-remote returns header + data + footer, extract just torrent names - current=$(transmission-remote "${TRANSMISSION_HOST}:${TRANSMISSION_PORT}" -l 2>/dev/null | \ - tail -n +2 | head -n -1 | awk '{print $NF}' || true) - - added=0 - skipped=0 - - while IFS= read -r url || [[ -n "$url" ]]; do - # Skip empty lines and comments - [[ -z "$url" || "$url" =~ ^[[:space:]]*# ]] && continue - # Trim whitespace - url=$(echo "$url" | xargs) - [[ -z "$url" ]] && continue - - # Extract base name from URL (remove .torrent extension) - basename=$(basename "$url" .torrent) - # Also try without .zim in case transmission reports it differently - basename_no_zim="${basename%.zim}" - - # Check if already in transmission - if echo "$current" | grep -qF "$basename_no_zim"; then - ((skipped++)) || true - else - if transmission-remote "${TRANSMISSION_HOST}:${TRANSMISSION_PORT}" -a "$url" 2>/dev/null; then - echo "Added: $basename" - ((added++)) || true - else - echo "Warning: Failed to add $url" >&2 - fi - fi - done < "$TORRENT_LIST" - - echo "Sync complete: $added added, $skipped already present" -``` - -### 6. 
Deploy Kiwix with Torrent Sync Sidecar - -**File:** `argocd/manifests/kiwix/deployment.yaml` - -```yaml -apiVersion: apps/v1 -kind: Deployment -metadata: - name: kiwix - namespace: kiwix - annotations: - # Track ZIM file changes for restart detection - kiwix.blumeops/zim-hash: "" -spec: - replicas: 1 - selector: - matchLabels: - app: kiwix - template: - metadata: - labels: - app: kiwix - spec: - containers: - # Main kiwix-serve container - - name: kiwix-serve - image: ghcr.io/kiwix/kiwix-serve:3.8.1 - args: - - --port=80 - - /data/*.zim # Serves ALL .zim files, regardless of how they were added - ports: - - containerPort: 80 - name: http - volumeMounts: - - name: torrents - mountPath: /data - readOnly: true - resources: - requests: - memory: "256Mi" - cpu: "100m" - limits: - memory: "1Gi" - livenessProbe: - httpGet: - path: / - port: 80 - initialDelaySeconds: 10 - periodSeconds: 30 - readinessProbe: - httpGet: - path: / - port: 80 - initialDelaySeconds: 5 - periodSeconds: 10 - - # Sidecar: Syncs declarative ZIM torrents to transmission - - name: torrent-sync - image: lscr.io/linuxserver/transmission:latest # Has transmission-remote CLI - command: ["/bin/bash", "-c"] - args: - - | - echo "Starting ZIM torrent sync sidecar" - # Initial sync - /scripts/sync-zim-torrents.sh || echo "Initial sync failed, will retry" - # Periodic sync every 30 minutes - while true; do - sleep 1800 - /scripts/sync-zim-torrents.sh || echo "Sync failed, will retry" - done - env: - - name: TRANSMISSION_HOST - value: "transmission.torrent.svc.cluster.local" - - name: TRANSMISSION_PORT - value: "9091" - - name: TORRENT_LIST - value: "/config/torrents.txt" - volumeMounts: - - name: zim-torrents-config - mountPath: /config/torrents.txt - subPath: torrents.txt - - name: sync-script - mountPath: /scripts - resources: - requests: - memory: "32Mi" - cpu: "10m" - limits: - memory: "64Mi" - - volumes: - - name: torrents - persistentVolumeClaim: - claimName: torrents-storage - - name: 
zim-torrents-config - configMap: - name: kiwix-zim-torrents - - name: sync-script - configMap: - name: zim-torrent-sync-script - defaultMode: 0755 -``` - -**File:** `argocd/manifests/kiwix/service.yaml` - -```yaml -apiVersion: v1 -kind: Service -metadata: - name: kiwix - namespace: kiwix -spec: - selector: - app: kiwix - ports: - - name: http - port: 80 - targetPort: 80 -``` - -### 7. Create Tailscale Ingress for Kiwix - -**File:** `argocd/manifests/kiwix/ingress-tailscale.yaml` - -```yaml -apiVersion: networking.k8s.io/v1 -kind: Ingress -metadata: - name: kiwix - namespace: kiwix -spec: - ingressClassName: tailscale - rules: - - host: kiwix - http: - paths: - - path: / - pathType: Prefix - backend: - service: - name: kiwix - port: - number: 80 -``` - -### 8. Create ZIM Watcher CronJob - -This CronJob runs hourly to detect new completed ZIMs (from any source) and triggers a kiwix restart. - -**File:** `argocd/manifests/kiwix/cronjob-zim-watcher.yaml` - -```yaml -apiVersion: batch/v1 -kind: CronJob -metadata: - name: zim-watcher - namespace: kiwix -spec: - schedule: "0 * * * *" # Every hour - concurrencyPolicy: Forbid - jobTemplate: - spec: - template: - spec: - serviceAccountName: zim-watcher - containers: - - name: watcher - image: bitnami/kubectl:latest - command: ["/bin/bash", "-c"] - args: - - | - set -euo pipefail - - # Get current ZIM files (among all downloads) - # This picks up ZIMs from both declarative list AND manually added torrents - current_zims=$(ls -1 /data/*.zim 2>/dev/null | sort | md5sum | cut -d' ' -f1 || echo "empty") - - # Get stored hash from deployment annotation - stored_hash=$(kubectl get deployment kiwix -n kiwix -o jsonpath='{.metadata.annotations.kiwix\.blumeops/zim-hash}' 2>/dev/null || echo "") - - echo "Current ZIMs hash: $current_zims" - echo "Stored hash: $stored_hash" - - # Also list what ZIMs we found - echo "ZIM files found:" - ls -la /data/*.zim 2>/dev/null || echo " (none)" - - if [[ "$current_zims" != "$stored_hash" && 
"$current_zims" != "empty" ]]; then - echo "ZIM files changed, restarting kiwix deployment..." - kubectl annotate deployment kiwix -n kiwix "kiwix.blumeops/zim-hash=$current_zims" --overwrite - kubectl rollout restart deployment/kiwix -n kiwix - echo "Restart triggered" - else - echo "No changes detected" - fi - volumeMounts: - - name: torrents - mountPath: /data - readOnly: true - restartPolicy: OnFailure - volumes: - - name: torrents - persistentVolumeClaim: - claimName: torrents-storage ---- -apiVersion: v1 -kind: ServiceAccount -metadata: - name: zim-watcher - namespace: kiwix ---- -apiVersion: rbac.authorization.k8s.io/v1 -kind: Role -metadata: - name: zim-watcher - namespace: kiwix -rules: - - apiGroups: ["apps"] - resources: ["deployments"] - verbs: ["get", "patch"] ---- -apiVersion: rbac.authorization.k8s.io/v1 -kind: RoleBinding -metadata: - name: zim-watcher - namespace: kiwix -subjects: - - kind: ServiceAccount - name: zim-watcher - namespace: kiwix -roleRef: - kind: Role - name: zim-watcher - apiGroup: rbac.authorization.k8s.io -``` - -### 9. Create Kiwix Kustomization - -**File:** `argocd/manifests/kiwix/kustomization.yaml` - -```yaml -apiVersion: kustomize.config.k8s.io/v1beta1 -kind: Kustomization -namespace: kiwix -resources: - - pvc.yaml - - configmap-zim-torrents.yaml - - configmap-sync-script.yaml - - deployment.yaml - - service.yaml - - ingress-tailscale.yaml - - cronjob-zim-watcher.yaml -``` - -### 10. 
Create Kiwix ArgoCD Application - -**File:** `argocd/apps/kiwix.yaml` - -```yaml -apiVersion: argoproj.io/v1alpha1 -kind: Application -metadata: - name: kiwix - namespace: argocd -spec: - project: default - source: - repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git - targetRevision: main - path: argocd/manifests/kiwix - destination: - server: https://kubernetes.default.svc - namespace: kiwix - syncPolicy: - syncOptions: - - CreateNamespace=true -``` - ---- - -## Deployment Sequence - -### Phase A: Storage Setup (Manual) - -1. **Configure SMB share on sifaka** (see Prerequisites section) -2. **Copy existing downloads:** - ```bash - ssh indri 'rsync -avP ~/transmission/ sifaka:/volume1/torrents/' - ``` -3. **Verify SMB access from indri:** - ```bash - # Test SMB mount via Finder or smbclient - smbclient -L //sifaka -U eblume - ``` - -### Phase B: Deploy Transmission to Kubernetes - -Deploy transmission first since kiwix depends on it. - -1. **Create feature branch** (if not already done) -2. **Add torrent manifests** to `argocd/manifests/torrent/` -3. **Add ArgoCD Application** to `argocd/apps/torrent.yaml` -4. **Push branch to forge** -5. **Sync ArgoCD apps:** - ```bash - argocd app sync apps - argocd app set torrent --revision feature/p6-kiwix - argocd app sync torrent - ``` -6. **Verify transmission deployment:** - ```bash - kubectl --context=minikube-indri -n torrent get pods - kubectl --context=minikube-indri -n torrent logs deployment/transmission - ``` -7. **Test transmission web UI:** - - Open https://torrent.tail8d86e.ts.net in browser - - Should see transmission web interface - -### Phase C: Deploy Kiwix to Kubernetes - -1. **Add kiwix manifests** to `argocd/manifests/kiwix/` -2. **Add ArgoCD Application** to `argocd/apps/kiwix.yaml` -3. **Push to forge** -4. **Sync ArgoCD:** - ```bash - argocd app set kiwix --revision feature/p6-kiwix - argocd app sync kiwix - ``` -5. 
**Verify kiwix deployment:** - ```bash - kubectl --context=minikube-indri -n kiwix get pods - kubectl --context=minikube-indri -n kiwix logs deployment/kiwix -c kiwix-serve - kubectl --context=minikube-indri -n kiwix logs deployment/kiwix -c torrent-sync - ``` - -### Phase D: Verification - -1. **Test kiwix access:** - ```bash - curl -s https://kiwix.tail8d86e.ts.net/ | head -20 - ``` -2. **Verify ZIM files are served:** - - Open https://kiwix.tail8d86e.ts.net in browser - - Should see library with existing ZIM archives -3. **Check transmission status via k8s:** - ```bash - kubectl --context=minikube-indri -n torrent exec deployment/transmission -- transmission-remote -l - ``` -4. **Verify torrent sync is working:** - ```bash - kubectl --context=minikube-indri -n kiwix logs deployment/kiwix -c torrent-sync - ``` -5. **Add a test torrent manually** via https://torrent.tail8d86e.ts.net to verify interactive use - -### Phase E: Cutover - -1. **Verify all services working correctly** -2. **Stop transmission on indri:** - ```bash - ssh indri 'brew services stop transmission-cli' - ``` -3. **Stop kiwix on indri:** - ```bash - ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.kiwix-serve.plist' - ``` -4. **Clear kiwix Tailscale serve entry:** - ```bash - ssh indri 'tailscale serve status --json' - ssh indri 'tailscale serve clear svc:kiwix' - ``` -5. **Delete svc:kiwix device from Tailscale admin** (if needed to free hostname) -6. **Verify k8s services claim the hostnames:** - ```bash - curl -s https://kiwix.tail8d86e.ts.net/ - curl -s https://torrent.tail8d86e.ts.net/transmission/web/ - ``` - -### Phase F: Cleanup - -1. **Remove indri transmission/kiwix from ansible:** - - Remove `transmission` and `transmission_metrics` roles from `indri.yml` - - Remove `kiwix` role from `indri.yml` - - Remove `svc:kiwix` from `tailscale_serve` - - Remove transmission/kiwix log collection from `alloy` -2. 
**Run ansible to clean up:** - ```bash - mise run provision-indri -- --tags tailscale-serve,alloy - ``` -3. **Merge PR** after all verification -4. **Reset ArgoCD to main:** - ```bash - argocd app set torrent --revision main - argocd app sync torrent - argocd app set kiwix --revision main - argocd app sync kiwix - ``` - ---- - -## Adding New ZIM Archives (Declarative) - -To add a new ZIM archive via GitOps: - -1. **Find torrent URL** on https://download.kiwix.org/zim/ -2. **Add URL to ConfigMap** in `argocd/manifests/kiwix/configmap-zim-torrents.yaml` -3. **Commit and push** to feature branch -4. **Sync ArgoCD:** - ```bash - argocd app sync kiwix - ``` -5. **Wait for download** (check transmission at https://torrent.tail8d86e.ts.net) -6. **Kiwix restarts automatically** when ZIM watcher detects the new file (hourly) - - Or manually: `kubectl rollout restart deployment/kiwix -n kiwix` - -## Adding ZIM Archives (Manual/Interactive) - -Alternatively, add a ZIM torrent manually: - -1. **Open transmission web UI** at https://torrent.tail8d86e.ts.net -2. **Add torrent** via URL or file upload -3. **Wait for download** to complete -4. **Kiwix restarts automatically** when ZIM watcher detects the new file (hourly) - - Or manually: `kubectl rollout restart deployment/kiwix -n kiwix` - -Note: Manually added ZIM torrents are NOT tracked in git. If you want them to persist across cluster rebuilds, add them to the ConfigMap. - -## Adding Non-ZIM Torrents - -The transmission service is general-purpose: - -1. **Open transmission web UI** at https://torrent.tail8d86e.ts.net -2. **Add any torrent** (Linux ISOs, etc.) -3. **Downloads go to** `/volume1/torrents/` on sifaka SMB share -4. **Access downloads** via SMB mount or sifaka's file browser - -Non-ZIM downloads don't affect kiwix - it only serves `.zim` files. - ---- - -## Rollback Plan - -If migration fails: - -1. 
**Stop k8s services:** - ```bash - argocd app delete kiwix --cascade - argocd app delete torrent --cascade - kubectl delete namespace kiwix - kubectl delete namespace torrent - kubectl delete pv torrents-smb-pv - ``` -2. **Restart indri services:** - ```bash - ssh indri 'brew services start transmission-cli' - ssh indri 'launchctl load ~/Library/LaunchAgents/mcquack.eblume.kiwix-serve.plist' - ``` -3. **Re-enable Tailscale serve:** - ```bash - mise run provision-indri -- --tags tailscale-serve - ``` -4. **Verify access:** - ```bash - curl https://kiwix.tail8d86e.ts.net/ - ``` - ---- - -## Files Summary - -### New Files - -| Path | Purpose | -|------|---------| -| **Transmission (torrent namespace)** | | -| `argocd/apps/torrent.yaml` | ArgoCD Application for transmission | -| `argocd/manifests/torrent/pv-nfs.yaml` | Shared NFS PersistentVolume | -| `argocd/manifests/torrent/pvc.yaml` | Transmission PVC | -| `argocd/manifests/torrent/deployment.yaml` | Transmission deployment | -| `argocd/manifests/torrent/service.yaml` | Transmission service | -| `argocd/manifests/torrent/ingress-tailscale.yaml` | Tailscale Ingress for torrent.tail8d86e.ts.net | -| `argocd/manifests/torrent/kustomization.yaml` | Kustomize configuration | -| **Kiwix (kiwix namespace)** | | -| `argocd/apps/kiwix.yaml` | ArgoCD Application for kiwix | -| `argocd/manifests/kiwix/pvc.yaml` | Kiwix PVC (references shared PV) | -| `argocd/manifests/kiwix/configmap-zim-torrents.yaml` | Declarative ZIM torrent URL list | -| `argocd/manifests/kiwix/configmap-sync-script.yaml` | ZIM torrent sync script | -| `argocd/manifests/kiwix/deployment.yaml` | Kiwix deployment with sync sidecar | -| `argocd/manifests/kiwix/service.yaml` | Kiwix service | -| `argocd/manifests/kiwix/ingress-tailscale.yaml` | Tailscale Ingress for kiwix.tail8d86e.ts.net | -| `argocd/manifests/kiwix/cronjob-zim-watcher.yaml` | ZIM watcher CronJob + RBAC | -| `argocd/manifests/kiwix/kustomization.yaml` | Kustomize configuration | - -### 
Modified Files - -| Path | Change | -|------|--------| -| `ansible/playbooks/indri.yml` | Remove transmission, transmission_metrics, kiwix roles | -| `ansible/roles/tailscale_serve/defaults/main.yml` | Remove svc:kiwix | -| `ansible/roles/alloy/defaults/main.yml` | Remove transmission/kiwix log collection | - -### Roles Kept (not deleted) - -- `ansible/roles/transmission/` - Kept for reference -- `ansible/roles/transmission_metrics/` - Kept for reference -- `ansible/roles/kiwix/` - Kept for reference - ---- - -## Verification Checklist - -- [x] NFS export configured on sifaka (`/volume1/torrents`) -- [x] NFS exports allow 192.168.1.0/24 and 100.64.0.0/10 -- [x] Direct NFS mount from pod tested and working -- [ ] Existing downloads copied to sifaka -- [ ] Transmission pod running in k8s (`torrent` namespace) -- [ ] https://torrent.tail8d86e.ts.net accessible (web UI) -- [ ] Can add torrents manually via web UI -- [ ] Kiwix pod running in k8s (`kiwix` namespace) -- [ ] https://kiwix.tail8d86e.ts.net accessible -- [ ] All existing ZIM archives visible in kiwix -- [ ] Kiwix torrent-sync sidecar synced ZIMs to transmission -- [ ] ZIM watcher CronJob ran successfully -- [ ] Indri transmission stopped -- [ ] Indri kiwix stopped -- [ ] Tailscale hostname cutover complete (both services) -- [ ] Ansible playbook updated -- [ ] zk documentation updated diff --git a/plans/completed/k8s-migration/P7_forgejo.md b/plans/completed/k8s-migration/P7_forgejo.md deleted file mode 100644 index f994d12..0000000 --- a/plans/completed/k8s-migration/P7_forgejo.md +++ /dev/null @@ -1,394 +0,0 @@ -# Phase 7: Forgejo Migration to Kubernetes - -**Goal**: Migrate Forgejo from indri (macOS Homebrew) to Kubernetes via ArgoCD - -**Status**: Planning (2026-01-21) - -**Prerequisites**: [Phase 6](P6_kiwix.complete.md) complete - ---- - -## Critical Risks & Mitigations - -### 1. Circular Dependency (Highest Risk) - -ArgoCD pulls manifests from Forgejo. If k8s Forgejo fails, we cannot redeploy it. 
- -**Mitigation**: blumeops is mirrored to `github.com/eblume/blumeops`. DR procedure documented to switch ArgoCD to GitHub temporarily (see Disaster Recovery section). - -### 2. Split Hostnames Required - -The Tailscale k8s operator [cannot expose both HTTPS and TCP/SSH on the same hostname](https://github.com/tailscale/tailscale/issues/15539). See also [user comment](https://github.com/tailscale/tailscale/issues/15539#issuecomment-3782368432). - -**Solution**: -- **HTTPS (web UI)**: `forge.tail8d86e.ts.net` via Tailscale Ingress -- **SSH (git operations)**: `git.tail8d86e.ts.net` via Tailscale LoadBalancer - ---- - -## Current State - -### Forgejo on indri - -| Component | Location/Details | -|-----------|------------------| -| Data directory | `/opt/homebrew/var/forgejo/` (~426MB) | -| SQLite database | `/opt/homebrew/var/forgejo/data/forgejo.db` (4.1MB) | -| Git repositories | `/opt/homebrew/var/forgejo/data/forgejo-repositories/` (~418MB) | -| Configuration | `/opt/homebrew/var/forgejo/custom/conf/app.ini` (contains secrets) | -| HTTP port | 3001 (localhost) | -| SSH port | 2200 (localhost) | -| Tailscale | `svc:forge` with tcp:22→2200 and https:443→3001 | -| Backup | borgmatic backs up to sifaka | - -### Hosted Repositories (8 total) - -- blumeops (mirrored to GitHub) -- cloudnative-pg-charts -- csi-driver-smb -- devpi -- dotfiles -- grafana-helm-charts -- mcquack -- zot - ---- - -## Architecture Decision: Helm Chart via ArgoCD - -Following established pattern from cloudnative-pg and grafana: -1. Mirror `https://code.forgejo.org/forgejo-helm/forgejo-helm` to forge -2. ArgoCD Application with multi-source (chart + values) -3. 
Values file in `argocd/manifests/forgejo/values.yaml` - ---- - -## All `forge` References Requiring Update - -### SSH URLs (change to `git.tail8d86e.ts.net:22`) - -| File | Current | After | -|------|---------|-------| -| `argocd/apps/apps.yaml` | `ssh://forgejo@indri.tail8d86e.ts.net:2200/...` | `ssh://forgejo@git.tail8d86e.ts.net/...` | -| `argocd/apps/argocd.yaml` | same | same | -| `argocd/apps/blumeops-pg.yaml` | same | same | -| `argocd/apps/cloudnative-pg.yaml` | same | same | -| `argocd/apps/devpi.yaml` | same | same | -| `argocd/apps/grafana.yaml` | same | same | -| `argocd/apps/grafana-config.yaml` | same | same | -| `argocd/apps/kiwix.yaml` | same | same | -| `argocd/apps/miniflux.yaml` | same | same | -| `argocd/apps/tailscale-operator.yaml` | same | same | -| `argocd/apps/torrent.yaml` | same | same | -| `argocd/manifests/argocd/repo-forge-secret.yaml.tpl` | `ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/` | `ssh://forgejo@git.tail8d86e.ts.net/eblume/` | -| `ansible/group_vars/all.yml` | `ssh://forgejo@forge.tail8d86e.ts.net/...` | `ssh://forgejo@git.tail8d86e.ts.net/...` | - -### SSH Known Hosts (add `git.tail8d86e.ts.net`) - -| File | Change | -|------|--------| -| `argocd/manifests/argocd/argocd-ssh-known-hosts-cm.yaml` | Add `git.tail8d86e.ts.net ssh-ed25519 AAAA...` | - -### HTTPS URLs (stay as `forge.tail8d86e.ts.net`) - -These remain unchanged: -- `CLAUDE.md:135` - Mirror location -- `mise-tasks/pr-comments:23` - Forge API base -- `mise-tasks/indri-services-check:65` - HTTP health check (update to check k8s) - -### Ansible/Indri Cleanup (remove after migration) - -| File | Action | -|------|--------| -| `ansible/playbooks/indri.yml:36-37` | Remove forgejo role | -| `ansible/roles/tailscale_serve/defaults/main.yml:6` | Remove `svc:forge` entry | -| `ansible/roles/alloy/defaults/main.yml:31-32` | Remove forgejo log collection | -| `ansible/roles/borgmatic/defaults/main.yml:17` | Update backup path | - -### Tailscale/Pulumi (update after 
hostname cutover) - -| File | Change | -|------|--------| -| `argocd/manifests/tailscale-operator/egress-forge.yaml` | Delete (no longer needed) | -| `pulumi/policy.hujson` | Update `tag:forge` ACLs for k8s source | - ---- - -## Pre-Migration Checklist - -- [ ] GitHub mirror verified current -- [ ] Full borgmatic backup completed and verified -- [ ] Manual backup of `/opt/homebrew/var/forgejo` on indri -- [ ] Document all SSH deploy keys and webhooks -- [ ] **User action**: Mirror forgejo-helm chart to forge -- [ ] Extract secrets from app.ini to 1Password: - - `INTERNAL_TOKEN` - - `SECRET_KEY` - - `JWT_SECRET` - - Any OAuth/webhook secrets - ---- - -## Steps - -### Phase A: Create k8s Manifests - -**New Files:** -``` -argocd/apps/forgejo.yaml # ArgoCD Application (multi-source Helm) -argocd/manifests/forgejo/values.yaml # Helm chart values -argocd/manifests/forgejo/kustomization.yaml # Kustomize config -argocd/manifests/forgejo/pvc.yaml # 10Gi PersistentVolumeClaim -argocd/manifests/forgejo/secret-app.yaml.tpl # Secrets from 1Password -``` - -**Key values.yaml settings:** -```yaml -service: - ssh: - type: LoadBalancer - loadBalancerClass: tailscale - port: 22 - annotations: - tailscale.com/hostname: "git-1" # Test hostname first - -ingress: - enabled: true - className: tailscale - hosts: - - host: forge-1 # Test hostname first - -gitea: - config: - server: - DOMAIN: forge-1.tail8d86e.ts.net - ROOT_URL: https://forge-1.tail8d86e.ts.net/ - SSH_DOMAIN: git-1.tail8d86e.ts.net - SSH_PORT: 22 - database: - DB_TYPE: sqlite3 - PATH: /data/forgejo.db -``` - ---- - -### Phase B: Deploy to Test Hostnames - -1. Create feature branch, push to forge -2. Sync ArgoCD apps: `argocd app sync apps` -3. Point forgejo app to feature branch: `argocd app set forgejo --revision feature/p7-forgejo` -4. Sync forgejo app: `argocd app sync forgejo` -5. Verify pods running (empty data initially) - ---- - -### Phase C: Data Migration (~10 min downtime) - -1. 
**Stop indri Forgejo** - ```bash - ssh indri 'brew services stop forgejo' - ``` - -2. **Copy data** (option A: rsync via NFS staging) - ```bash - ssh indri 'rsync -avP /opt/homebrew/var/forgejo/ sifaka:/volume1/forgejo-migration/' - ``` - -3. **Copy to PVC and fix permissions** - ```bash - kubectl exec -n forgejo deployment/forgejo -- rsync -avP /staging/ /data/ - kubectl exec -n forgejo deployment/forgejo -- chown -R 1000:1000 /data - ``` - -4. **Restart Forgejo** - ```bash - kubectl rollout restart deployment/forgejo -n forgejo - ``` - ---- - -### Phase D: Validation (Critical) - -- [ ] Web UI accessible at `forge-1.tail8d86e.ts.net` -- [ ] SSH works: `ssh -T forgejo@git-1.tail8d86e.ts.net` -- [ ] All 8 repos visible and accessible -- [ ] Git clone works -- [ ] Git push works (test on non-critical repo) -- [ ] eblume user preserved with correct permissions -- [ ] PR history intact -- [ ] Webhooks functioning -- [ ] GitHub mirror push still works - ---- - -### Phase E: Hostname Cutover - -1. **Clear indri Tailscale serve** - ```bash - ssh indri 'tailscale serve clear svc:forge' - ``` - -2. **User action**: Delete `svc:forge` and `forge-1` devices from Tailscale admin - -3. **Update manifests**: Change `forge-1` → `forge`, `git-1` → `git` - -4. **Sync ArgoCD** - -5. **Verify hostnames claimed** - ```bash - curl https://forge.tail8d86e.ts.net/api/v1/version - ssh -T forgejo@git.tail8d86e.ts.net - ``` - ---- - -### Phase F: Update ArgoCD to Use New Forgejo - -1. **Get SSH host key from k8s Forgejo** - ```bash - kubectl exec -n forgejo deployment/forgejo -- cat /data/ssh/ssh_host_ed25519_key.pub - ``` - -2. **Update known_hosts ConfigMap** with `git.tail8d86e.ts.net` key - -3. **Update repo-creds-forge secret** (manual kubectl commands) - -4. **Update all ArgoCD Application manifests** with new repoURL - -5. **Delete egress-forge.yaml** (no longer needed) - -6. 
**Sync ArgoCD** and verify all apps sync successfully - ---- - -### Phase G: Update Local Git Remotes - -```bash -cd ~/code/personal/blumeops -git remote set-url origin ssh://forgejo@git.tail8d86e.ts.net/eblume/blumeops.git -# Repeat for all 8 repos -``` - ---- - -### Phase H: Cleanup - -1. Remove forgejo role from `ansible/playbooks/indri.yml` -2. Remove `svc:forge` from `ansible/roles/tailscale_serve/defaults/main.yml` -3. Remove forgejo log collection from `ansible/roles/alloy/defaults/main.yml` -4. Delete `argocd/manifests/tailscale-operator/egress-forge.yaml` -5. Update `mise-tasks/indri-services-check` -6. Run ansible to clean up indri: `mise run provision-indri -- --tags tailscale-serve,alloy` -7. Update zk documentation (forgejo, argocd, blumeops cards) -8. Merge PR -9. Reset ArgoCD to main - ---- - -## Disaster Recovery Procedure - -**Add to [[forgejo]] zk card:** - -### When Forgejo is Unavailable - -1. **Add GitHub repository to ArgoCD** - ```bash - argocd repo add https://github.com/eblume/blumeops.git \ - --username eblume \ - --password $(op read "op:////github-pat") - ``` - -2. **Point critical apps to GitHub** - ```bash - argocd app set apps --repo https://github.com/eblume/blumeops.git - argocd app set forgejo --repo https://github.com/eblume/blumeops.git - argocd app sync forgejo - ``` - -3. **Fix Forgejo** (restore from backup, fix config, etc.) - -4. **Verify Forgejo is healthy** - ```bash - curl https://forge.tail8d86e.ts.net/api/v1/version - ssh -T forgejo@git.tail8d86e.ts.net - ``` - -5. 
**Switch back to Forgejo** - ```bash - argocd app set apps --repo ssh://forgejo@git.tail8d86e.ts.net/eblume/blumeops.git - argocd app set forgejo --repo ssh://forgejo@git.tail8d86e.ts.net/eblume/blumeops.git - argocd app sync apps - argocd repo rm https://github.com/eblume/blumeops.git - ``` - ---- - -## Files Summary - -### New Files - -| Path | Purpose | -|------|---------| -| `argocd/apps/forgejo.yaml` | ArgoCD Application (multi-source Helm) | -| `argocd/manifests/forgejo/values.yaml` | Helm chart values | -| `argocd/manifests/forgejo/kustomization.yaml` | Kustomize config | -| `argocd/manifests/forgejo/pvc.yaml` | 10Gi PersistentVolumeClaim | -| `argocd/manifests/forgejo/secret-app.yaml.tpl` | Secrets template | - -### Modified Files - -| Path | Change | -|------|--------| -| All `argocd/apps/*.yaml` | Update repoURL to `git.tail8d86e.ts.net` | -| `argocd/manifests/argocd/argocd-ssh-known-hosts-cm.yaml` | Add `git.tail8d86e.ts.net` | -| `argocd/manifests/argocd/repo-forge-secret.yaml.tpl` | Update URL | -| `ansible/playbooks/indri.yml` | Remove forgejo role | -| `ansible/roles/tailscale_serve/defaults/main.yml` | Remove `svc:forge` | -| `ansible/roles/alloy/defaults/main.yml` | Remove forgejo logs | - -### Files to Delete - -| Path | Reason | -|------|--------| -| `argocd/manifests/tailscale-operator/egress-forge.yaml` | No longer needed | - ---- - -## Rollback - -If migration fails at any point: - -1. **Delete k8s resources** - ```bash - argocd app delete forgejo --cascade - kubectl delete namespace forgejo - ``` - -2. **Restart indri Forgejo** - ```bash - ssh indri 'brew services start forgejo' - ``` - -3. **Re-enable Tailscale serve** - ```bash - mise run provision-indri -- --tags tailscale-serve - ``` - -4. 
**Revert ArgoCD apps to indri URLs** (if changed) - ---- - -## Verification Checklist - -- [ ] GitHub mirror verified current -- [ ] Helm chart mirrored to forge -- [ ] Secrets extracted to 1Password -- [ ] k8s Forgejo pod running -- [ ] All 8 repos accessible -- [ ] SSH clone/push works via `git.tail8d86e.ts.net` -- [ ] HTTPS works via `forge.tail8d86e.ts.net` -- [ ] ArgoCD syncs from new URL -- [ ] All local remotes updated -- [ ] Indri cleanup complete -- [ ] zk docs updated -- [ ] DR procedure documented in [[forgejo]] card diff --git a/plans/completed/k8s-migration/P8_woodpecker.md b/plans/completed/k8s-migration/P8_woodpecker.md deleted file mode 100644 index 904398e..0000000 --- a/plans/completed/k8s-migration/P8_woodpecker.md +++ /dev/null @@ -1,32 +0,0 @@ -# Phase 8: CI/CD (Woodpecker) - -**Goal**: Deploy Woodpecker CI integrated with Forgejo - -**Status**: Pending - -**Prerequisites**: [Phase 7](P7_forgejo.md) complete - ---- - -## Steps - -### 1. Create Forgejo OAuth application - -- Callback: https://ci.tail8d86e.ts.net/authorize -- Store in 1Password - ---- - -### 2. Deploy Woodpecker Server + Agent - ---- - -### 3. Configure Tailscale LoadBalancer - -Tag: `svc:ci` - ---- - -### 4. Test pipeline - -Create `.woodpecker.yaml` in test repo diff --git a/plans/completed/k8s-migration/P9_cleanup.md b/plans/completed/k8s-migration/P9_cleanup.md deleted file mode 100644 index 9178b01..0000000 --- a/plans/completed/k8s-migration/P9_cleanup.md +++ /dev/null @@ -1,52 +0,0 @@ -# Phase 9: Cleanup - -**Goal**: Remove deprecated services, harden system - -**Status**: Pending - -**Prerequisites**: [Phase 8](P8_woodpecker.md) complete - ---- - -## Steps - -### 1. Stop/remove unused brew services - -- postgresql@18 -- grafana -- miniflux -- forgejo - ---- - -### 2. Update ansible playbook - -- Remove migrated service roles -- Add k8s deployment references - ---- - -### 3. 
Configure Velero backups (optional) - -- Install with MinIO on sifaka -- Schedule daily cluster backups - ---- - -### 4. Update zk documentation - -- New architecture -- Runbooks -- DR procedures - ---- - -## Plan Completion - -When all phases are complete and verified: - -```bash -# Rename this folder to indicate completion -git mv plans/k8s-migration plans/k8s-migration.complete -git commit -m "Complete k8s migration plan" -```