Remove plans, they don't seem to work
Erich Blume 2026-01-24 16:21:49 -08:00
commit ceba6b3c2c
18 changed files with 0 additions and 6816 deletions

@@ -1,179 +0,0 @@
# Forgejo Actions CI/CD Bootstrap Plan
This plan details the setup of Forgejo Actions as the CI/CD system for blumeops, starting with the bootstrapping problem: using Forgejo to build and deploy Forgejo itself.
## Goals
1. **Forgejo Actions** as the primary CI system (replaces Woodpecker from original plan)
2. **Self-hosted Forgejo** built from source, deployed as mcquack LaunchAgent on indri
3. **Container builds** for ArgoCD manifests (devpi, etc.)
4. **Cron-scheduled tasks** via k8s CronJobs (not Actions)
5. **Local development** parity using `act` for workflow testing
## Why Forgejo Actions over Woodpecker?
- Native integration with Forgejo (no OAuth setup, automatic repo detection)
- GitHub Actions compatible syntax (huge ecosystem of reusable actions)
- `act` tool for local testing on gilbert
- Single system to maintain instead of two
## Architecture Overview
```
┌─────────────────────────────────────────────────────────────────┐
│                              INDRI                              │
│  ┌─────────────────────┐                                        │
│  │ Forgejo             │  ← Built from source                   │
│  │ (mcquack agent)     │  ← Deploys itself via CI               │
│  │                     │                                        │
│  │ - Web UI (3001)     │                                        │
│  │ - SSH (2200)        │                                        │
│  │ - Actions enabled   │                                        │
│  └─────────────────────┘                                        │
└─────────────────────────────────────────────────────────────────┘
              │ SSH deploy
┌─────────────────────────────────────────────────────────────────┐
│                      KUBERNETES (minikube)                      │
│  ┌─────────────────────┐  ┌─────────────────────┐               │
│  │ Forgejo Runner      │  │ Other Services      │               │
│  │ (host mode)         │  │ (via ArgoCD)        │               │
│  │                     │  │                     │               │
│  │ - Custom image      │  │                     │               │
│  │ - Node.js + tools   │  │                     │               │
│  │ - Docker builds     │  │                     │               │
│  └─────────────────────┘  └─────────────────────┘               │
└─────────────────────────────────────────────────────────────────┘
```
## Phases
| Phase | Name | Description | Status |
|-------|------|-------------|--------|
| 1 | [Enable Actions](P1_enable_actions.md) | Configure Forgejo for Actions, deploy runner in host mode | ✅ Complete |
| 2 | [Custom Runner Image](P2_mirror_and_build.md) | Build custom runner with Node.js/tools, enable standard Actions | ✅ Complete |
| 3 | [Mirror Forgejo & Build](P3_mirror_forgejo.md) | Mirror upstream Forgejo, create build workflow | Planning |
| 4 | [Self-Deploy](P4_self_deploy.md) | Forgejo deploys itself, transition to mcquack | Planning |
| 5 | [Container Builds](P5_container_builds.md) | Build custom container images (devpi, etc.) | Planning |
## The Bootstrap Problem
**Chicken-and-egg**: We need Forgejo Actions to build Forgejo, but Forgejo must be running first.
**Additional complication**: The stock runner image lacks Node.js, so standard GitHub Actions don't work.
**Solution**:
1. Keep current brew-based Forgejo running during setup ✅
2. Enable Actions, deploy runner in host mode ✅
3. **Build custom runner image** with Node.js and tools (bootstrap manually, then automate)
4. Mirror upstream Forgejo, create build workflow
5. Address cross-compilation challenge (Linux runner → macOS target)
6. First CI build creates the binary
7. CI deploys binary to indri as mcquack service
8. `brew services stop forgejo` and uninstall
9. Future builds: Forgejo builds and deploys itself
**Cross-compilation challenge**:
The runner runs in Linux containers (k8s), but Forgejo needs to run on indri (macOS ARM64). Options:
- Cross-compile with CGO_ENABLED=1 (complex, needs OSX toolchain)
- Cross-compile with CGO_ENABLED=0 (breaks Tailscale DNS resolution)
- Build on gilbert manually, use CI only for deploy
- Run a native macOS runner on indri (outside k8s)
This will be addressed in Phase 3.
**Risk mitigation**: If self-deployment breaks Forgejo:
- blumeops is mirrored to GitHub
- Manual recovery: build on gilbert, scp to indri, restart service
- See Disaster Recovery section in P4
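The manual recovery path can be partly scripted. A minimal sketch, assuming the deploy step leaves the previous binary beside the live one as `forgejo-old` (the name used by the Phase 3 deploy script); the function name `rollback_forgejo`, the `forgejo-broken` name, and the directory argument are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: restore the previous Forgejo binary after a bad deploy.
# Assumes the prior binary was kept as "forgejo-old" in the same directory.
rollback_forgejo() {
  local bin_dir="${1:-$HOME/.local/bin}"
  if [ ! -f "$bin_dir/forgejo-old" ]; then
    echo "no forgejo-old in $bin_dir; rebuild on gilbert and scp instead" >&2
    return 1
  fi
  # Keep the failing binary for inspection rather than deleting it.
  if [ -f "$bin_dir/forgejo" ]; then
    mv "$bin_dir/forgejo" "$bin_dir/forgejo-broken"
  fi
  mv "$bin_dir/forgejo-old" "$bin_dir/forgejo"
}
```

This would run on indri with the service stopped, followed by reloading the LaunchAgent.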
## Host Mode Runner
The runner uses **host mode** (`ubuntu-latest:host`), meaning:
- Jobs run directly in the runner container (no Docker/k8s pods spawned)
- Tools must be pre-installed in the runner image
- Stock image lacks Node.js, so `actions/checkout@v4` doesn't work
- Solution: Build custom runner image with necessary tools (Phase 2)
## Ansible Role Strategy
The forgejo ansible role will follow the zot/alloy pattern:
1. **Check binary exists** at expected path
2. **If missing**: Fail with message pointing to CI trigger instructions
3. **If present**: Deploy config, ensure LaunchAgent loaded
Ansible does NOT:
- Build the binary (that's CI's job)
- Deploy new versions (that's CI's job)
Ansible DOES:
- Manage app.ini configuration (via template with secrets from 1Password)
- Manage mcquack LaunchAgent plist
- Ensure service is running
- Collect logs via Alloy
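The binary check that gates the role can be expressed as a small shell function. This is only a sketch of the intended behaviour, not the role itself; the function name, default path, and message wording are assumptions:

```bash
#!/usr/bin/env bash
# Sketch of the role's first step: fail early when CI has not deployed a binary.
# The default path and the message text are assumptions for illustration.
require_forgejo_binary() {
  local bin="${1:-$HOME/.local/bin/forgejo}"
  if [ ! -x "$bin" ]; then
    echo "forgejo binary missing at $bin; trigger the CI build and deploy first" >&2
    return 1
  fi
  echo "found $bin"
}
```

Only when this check passes would the config and LaunchAgent steps run.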
## Files Summary
### New Files
| Path | Purpose |
|------|---------|
| `argocd/apps/forgejo-runner.yaml` | ArgoCD Application for runner ✅ |
| `argocd/manifests/forgejo-runner/` | Runner k8s manifests ✅ |
| `argocd/manifests/forgejo-runner/Dockerfile` | Custom runner image (P2) |
| `.forgejo/workflows/build-runner.yml` | Auto-rebuild runner image (P2) |
| `.forgejo/workflows/test.yml` | Test workflow ✅ |
| (on forge) `eblume/forgejo/.forgejo/workflows/` | Build workflow in forgejo mirror (P3) |
### Modified Files
| Path | Change |
|------|--------|
| `ansible/roles/forgejo/` | Complete rewrite for mcquack pattern (P4) |
| `ansible/roles/alloy/defaults/main.yml` | Update forgejo log paths (P4) |
| zk cards | Update forgejo, argocd, blumeops cards |
### Credentials Needed
| Item | Purpose | Storage |
|------|---------|---------|
| Runner registration token | Runner auth to Forgejo | 1Password ✅ |
| SSH deploy key | Runner SSH to indri (for Forgejo deploy) | 1Password + k8s secret (P3) |
## Related Plans
- [P7_forgejo.md](../k8s-migration/P7_forgejo.md) - Original k8s migration plan (superseded for Forgejo itself, but SSH hostname split info still relevant)
- [P8_woodpecker.md](../k8s-migration/P8_woodpecker.md) - Original Woodpecker plan (superseded by Forgejo Actions)
## Decision Log
### 2026-01-23: Custom runner image as Phase 2
**Decision**: Move custom runner image work from P4 to P2
**Rationale**:
- Stock runner lacks Node.js, can't run `actions/checkout@v4`
- Need working GitHub Actions before building Forgejo
- Bootstrap manually (podman build on gilbert), then automate
### 2026-01-23: Forgejo Actions over Woodpecker
**Decision**: Use Forgejo Actions instead of Woodpecker CI
**Rationale**:
- Native Forgejo integration (Actions is built-in)
- GitHub Actions compatible (reuse existing actions)
- `act` for local testing
- One less system to deploy and maintain
### 2026-01-23: Keep Forgejo on indri (not k8s)
**Decision**: Forgejo stays on indri as mcquack service, not migrated to k8s
**Rationale**:
- Avoid circular dependency (ArgoCD needs Forgejo to deploy Forgejo)
- Simpler SSH handling (direct port, no k8s networking complexity)
- Forgejo is critical infrastructure, benefits from isolation
- Can still use Tailscale serve for external access

@@ -1,322 +0,0 @@
# Phase 1: Enable Forgejo Actions
**Goal**: Configure Forgejo to support Actions workflows and deploy a runner in k8s
**Status**: Completed (2026-01-23)
**Prerequisites**: None (uses existing brew-based Forgejo)
---
## Current State
- Forgejo runs via `brew services` on indri
- Config at `/opt/homebrew/var/forgejo/custom/conf/app.ini`
- Actions not enabled
- No runners deployed
---
## Step 1: Enable Actions in Forgejo
### 1.1 Update app.ini
SSH to indri and edit the Forgejo config:
```bash
ssh indri 'vim /opt/homebrew/var/forgejo/custom/conf/app.ini'
```
Add the following sections:
```ini
[actions]
ENABLED = true
DEFAULT_ACTIONS_URL = https://code.forgejo.org
[repository]
; Allow workflows to be stored in .forgejo/workflows
DEFAULT_REPO_UNITS = repo.code,repo.issues,repo.pulls,repo.releases,repo.wiki,repo.projects,repo.packages,repo.actions
```
### 1.2 Restart Forgejo
```bash
ssh indri 'brew services restart forgejo'
```
### 1.3 Verify Actions Enabled
1. Go to https://forge.tail8d86e.ts.net
2. Navigate to any repo → Settings → Actions
3. Should see "Enable Repository Actions" option
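The UI check can be complemented by reading the config directly. A minimal sketch, assuming the `[actions]` section layout shown in step 1.1; the function name `actions_enabled` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical check: succeed only if ENABLED = true appears under [actions].
actions_enabled() {
  local ini="$1"
  awk '
    # Track whether the current section header is [actions].
    /^[[:space:]]*\[/ {
      in_actions = ($0 ~ /^[[:space:]]*\[actions\][[:space:]]*$/)
      next
    }
    in_actions && /^[[:space:]]*ENABLED[[:space:]]*=/ {
      sub(/^[^=]*=[[:space:]]*/, "")
      sub(/[[:space:]]+$/, "")
      value = tolower($0)
    }
    END {
      if (value == "true") exit 0
      exit 1
    }
  ' "$ini"
}
```

On indri this could be called with `/opt/homebrew/var/forgejo/custom/conf/app.ini` as the argument.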
---
## Step 2: Create Runner Registration Token
### 2.1 Generate Token in Forgejo UI
1. Go to https://forge.tail8d86e.ts.net/admin/actions/runners
2. Click "Create new Runner"
3. Copy the registration token
4. Store in 1Password (blumeops vault) as "Forgejo Runner Token"
### 2.2 Create k8s Secret Template
Create `argocd/manifests/forgejo-runner/secret-token.yaml.tpl`:
```yaml
# Template for op inject
apiVersion: v1
kind: Secret
metadata:
  name: forgejo-runner-token
  namespace: forgejo-runner
type: Opaque
stringData:
  token: "op://blumeops/<runner-token-item>/token"
```
---
## Step 3: Deploy Runner to Kubernetes
### 3.1 Create ArgoCD Application
Create `argocd/apps/forgejo-runner.yaml`:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: forgejo-runner
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git
    targetRevision: main
    path: argocd/manifests/forgejo-runner
  destination:
    server: https://kubernetes.default.svc
    namespace: forgejo-runner
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
```
### 3.2 Create Runner Manifests
Create directory `argocd/manifests/forgejo-runner/` with:
**kustomization.yaml**:
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: forgejo-runner
resources:
- namespace.yaml
- deployment.yaml
- serviceaccount.yaml
- secret-token.yaml
```
**namespace.yaml**:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: forgejo-runner
```
**serviceaccount.yaml**:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: forgejo-runner
  namespace: forgejo-runner
```
**deployment.yaml**:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: forgejo-runner
  namespace: forgejo-runner
spec:
  replicas: 1
  selector:
    matchLabels:
      app: forgejo-runner
  template:
    metadata:
      labels:
        app: forgejo-runner
    spec:
      serviceAccountName: forgejo-runner
      containers:
        - name: runner
          image: code.forgejo.org/forgejo/runner:3.5.1
          env:
            - name: FORGEJO_INSTANCE_URL
              value: "https://forge.tail8d86e.ts.net"
            - name: RUNNER_NAME
              value: "k8s-runner-1"
            - name: RUNNER_TOKEN
              valueFrom:
                secretKeyRef:
                  name: forgejo-runner-token
                  key: token
          command:
            - /bin/sh
            - -c
            - |
              # Register runner if not already registered
              if [ ! -f /data/.runner ]; then
                forgejo-runner register \
                  --instance "$FORGEJO_INSTANCE_URL" \
                  --token "$RUNNER_TOKEN" \
                  --name "$RUNNER_NAME" \
                  --labels "ubuntu-latest:docker://node:20-bookworm,ubuntu-22.04:docker://ubuntu:22.04" \
                  --no-interactive
              fi
              # Start the runner daemon
              forgejo-runner daemon
          volumeMounts:
            - name: runner-data
              mountPath: /data
            - name: docker-sock
              mountPath: /var/run/docker.sock
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
      volumes:
        - name: runner-data
          emptyDir: {}
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock
            type: Socket
```
**Note**: The runner needs access to Docker to run workflow jobs in containers. In minikube with docker driver, `/var/run/docker.sock` is available.
---
## Step 4: Deploy and Verify
### 4.1 Inject Secrets and Deploy
```bash
# Inject secrets
op inject -i argocd/manifests/forgejo-runner/secret-token.yaml.tpl \
-o argocd/manifests/forgejo-runner/secret-token.yaml
# Sync apps
argocd app sync apps
argocd app sync forgejo-runner
```
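A guard between the inject and sync steps can catch a template that was never rendered. A minimal sketch, assuming that any remaining `op://` string in the output means injection did not happen; the function name `assert_secrets_injected` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical guard: fail if a rendered secret still contains a 1Password reference.
assert_secrets_injected() {
  local file="$1"
  if grep -q 'op://' "$file"; then
    echo "unresolved op:// reference in $file" >&2
    return 1
  fi
}
```

For example, `assert_secrets_injected argocd/manifests/forgejo-runner/secret-token.yaml` could run right after `op inject`.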
### 4.2 Verify Runner Registration
```bash
# Check runner pod
kubectl --context=minikube-indri -n forgejo-runner get pods
# Check runner logs
kubectl --context=minikube-indri -n forgejo-runner logs -f deployment/forgejo-runner
# Verify in Forgejo UI
# Go to https://forge.tail8d86e.ts.net/admin/actions/runners
# Should see "k8s-runner-1" as online
```
---
## Step 5: Test with Simple Workflow
### 5.1 Create Test Workflow
In the blumeops repo, create `.forgejo/workflows/test.yml`:
```yaml
name: Test CI
on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Hello World
        run: |
          echo "Hello from Forgejo Actions!"
          echo "Runner: ${{ runner.name }}"
          echo "Repo: ${{ github.repository }}"
```
### 5.2 Push and Verify
```bash
git add .forgejo/
git commit -m "Add test workflow for Forgejo Actions"
git push
```
Check https://forge.tail8d86e.ts.net/eblume/blumeops/actions for the workflow run.
---
## Verification Checklist
- [x] Actions enabled in app.ini
- [x] Forgejo restarted successfully
- [x] Runner token stored in 1Password
- [x] Runner deployment created in ArgoCD
- [x] Runner pod running in k8s
- [x] Runner shows as online in Forgejo admin
- [x] Test workflow runs successfully
---
## Troubleshooting
### Runner Can't Connect to Forgejo
The runner needs to reach `forge.tail8d86e.ts.net` from inside k8s. This should work via Tailscale operator egress (already configured for ArgoCD).
If not working:
```bash
# Test from inside k8s
kubectl --context=minikube-indri run -it --rm curl --image=curlimages/curl -- \
curl -v https://forge.tail8d86e.ts.net/api/v1/version
```
### Docker Socket Permission Denied
The runner container needs to access the Docker socket. In minikube with docker driver, this should work. If permission denied:
```bash
# Check socket permissions
kubectl --context=minikube-indri -n forgejo-runner exec deployment/forgejo-runner -- ls -la /var/run/docker.sock
```
May need to run runner as root or adjust security context.
---
## Next Phase
Once the runner is working, proceed to [Phase 2: Custom Runner Image](P2_mirror_and_build.md).

@@ -1,347 +0,0 @@
# Phase 2: Custom Runner Image
**Goal**: Build a custom forgejo-runner image with necessary tools, enabling standard GitHub Actions
**Status**: Complete (2026-01-23)
**Prerequisites**: [Phase 1](P1_enable_actions.md) complete (Actions enabled, runner deployed in host mode)
---
## Problem Statement
The stock `code.forgejo.org/forgejo/runner:3.5.1` image lacks tools needed for standard GitHub Actions:
- **Node.js** - Required by most actions (checkout, setup-*, etc.)
- **Git** - For repository operations (present but minimal)
- **Common build tools** - make, gcc, curl, jq, etc.
In host mode, jobs run directly in the runner container, so these tools must be pre-installed.
### Chicken-and-Egg Problem
We can't use `actions/checkout@v4` to build the custom runner because that action requires Node.js, which we don't have yet. Solution: Bootstrap manually, then automate.
---
## Step 1: Create Dockerfile for Custom Runner
Create `argocd/manifests/forgejo-runner/Dockerfile`:
```dockerfile
FROM code.forgejo.org/forgejo/runner:3.5.1
# The base image is Alpine-based, so packages are installed with apk
# Install tools needed for GitHub Actions and builds
RUN apk add --no-cache \
    # Required for actions/checkout and other Node-based actions
    nodejs \
    npm \
    # Build essentials
    git \
    curl \
    wget \
    jq \
    make \
    gcc \
    g++ \
    # For container builds (if we add Docker-in-Docker later)
    ca-certificates
# Verify Node.js is available
RUN node --version && npm --version
```
---
## Step 2: Bootstrap - Build Image Manually
Since we can't use CI yet, build the image manually on gilbert and push to zot.
### 2.1 Build with Podman
```bash
cd ~/code/personal/blumeops/argocd/manifests/forgejo-runner
# Build for linux/arm64 (minikube on M1 Mac)
podman build --platform linux/arm64 -t registry.tail8d86e.ts.net/blumeops/forgejo-runner:latest .
# Push to zot (no auth required)
podman push registry.tail8d86e.ts.net/blumeops/forgejo-runner:latest
```
### 2.2 Verify Image in Registry
```bash
curl -s https://registry.tail8d86e.ts.net/v2/blumeops/forgejo-runner/tags/list | jq .
```
---
## Step 3: Update Runner Deployment
### 3.1 Update deployment.yaml
Change the image from stock to custom:
```yaml
# Before
image: code.forgejo.org/forgejo/runner:3.5.1
# After
image: registry.tail8d86e.ts.net/blumeops/forgejo-runner:latest
```
### 3.2 Update kustomization.yaml
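No change to `resources` is needed. The Dockerfile is not a Kubernetes resource, so it is kept beside the manifests for reference and must not be listed in `kustomization.yaml`: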
```yaml
# Note: Dockerfile is for building, not k8s deployment
# It lives here for co-location with the runner manifests
```
### 3.3 Sync Deployment
```bash
argocd app sync forgejo-runner
# Verify new image is running
kubectl --context=minikube-indri -n forgejo-runner get pods -o jsonpath='{.items[*].spec.containers[*].image}'
```
---
## Step 4: Test with Real GitHub Action
Now that we have Node.js, test with `actions/checkout@v4`.
### 4.1 Update Test Workflow
Update `.forgejo/workflows/test.yml`:
```yaml
name: Test CI
on:
  push:
    branches: [main]
  pull_request:
  workflow_dispatch:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Verify tools
        run: |
          echo "Node.js: $(node --version)"
          echo "npm: $(npm --version)"
          echo "Git: $(git --version)"
          echo "Make: $(make --version | head -1)"
      - name: Show repo info
        run: |
          echo "Repository: ${{ github.repository }}"
          echo "Branch: ${{ github.ref_name }}"
          ls -la
```
### 4.2 Push and Verify
```bash
git add .forgejo/workflows/test.yml
git commit -m "Test checkout action with custom runner"
git push
```
Check https://forge.tail8d86e.ts.net/eblume/blumeops/actions - should see successful run with `actions/checkout@v4`.
---
## Step 5: Create Auto-Build Workflow for Runner
Now that Actions work properly, create a workflow to rebuild the runner image automatically.
### 5.1 Create Build Workflow
Create `.forgejo/workflows/build-runner.yml`:
```yaml
name: Build Runner Image
on:
  push:
    paths:
      - 'argocd/manifests/forgejo-runner/Dockerfile'
      - '.forgejo/workflows/build-runner.yml'
  workflow_dispatch:
env:
  REGISTRY: registry.tail8d86e.ts.net
  IMAGE_NAME: blumeops/forgejo-runner
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Build image
        run: |
          cd argocd/manifests/forgejo-runner
          # Use docker build (available in runner container)
          # Note: This builds for the runner's native arch
          docker build -t ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} .
          docker tag ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
      - name: Push to registry
        run: |
          # Zot has no auth, just push
          docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
      - name: Verify push
        run: |
          curl -sf "https://${{ env.REGISTRY }}/v2/${{ env.IMAGE_NAME }}/tags/list" | jq .
          echo "Image pushed: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}"
```
### 5.2 Note on Docker-in-Docker
The runner runs in host mode, so we need Docker CLI available. Options:
1. **Add Docker CLI to the custom image** (see Dockerfile update below)
2. **Mount Docker socket from minikube** (requires deployment change)
3. **Use Podman instead** (rootless, no socket needed)
For now, we'll add Docker CLI to the image and mount the socket.
### 5.3 Update Dockerfile for Docker Builds
```dockerfile
FROM code.forgejo.org/forgejo/runner:3.5.1
# The base image is Alpine-based, so packages are installed with apk
RUN apk add --no-cache \
    nodejs \
    npm \
    git \
    curl \
    wget \
    jq \
    make \
    gcc \
    g++ \
    ca-certificates \
    # Docker CLI for building container images
    docker-cli
RUN node --version && npm --version && docker --version
```
### 5.4 Update Deployment for Docker Socket
Add Docker socket mount to `deployment.yaml`:
```yaml
volumeMounts:
  - name: runner-data
    mountPath: /data
  - name: runner-config
    mountPath: /config
  - name: docker-sock
    mountPath: /var/run/docker.sock
volumes:
  - name: runner-data
    emptyDir: {}
  - name: runner-config
    configMap:
      name: forgejo-runner-config
  - name: docker-sock
    hostPath:
      path: /var/run/docker.sock
      type: Socket
```
---
## Step 6: Verification
### 6.1 Manual Image Build Works
```bash
# On gilbert
podman build --platform linux/arm64 -t registry.tail8d86e.ts.net/blumeops/forgejo-runner:test .
podman push registry.tail8d86e.ts.net/blumeops/forgejo-runner:test
```
### 6.2 Runner Uses Custom Image
```bash
kubectl --context=minikube-indri -n forgejo-runner get pods -o jsonpath='{.items[*].spec.containers[*].image}'
# Should show: registry.tail8d86e.ts.net/blumeops/forgejo-runner:latest
```
### 6.3 GitHub Actions Work
- `actions/checkout@v4` succeeds
- Test workflow shows Node.js, npm, git versions
### 6.4 Auto-Build Workflow Works
Push a change to the Dockerfile and verify:
1. Workflow triggers
2. Image builds successfully
3. Image pushed to zot
---
## Verification Checklist
- [x] Dockerfile created for custom runner (Alpine-based with apk)
- [x] Image built manually on gilbert (podman build)
- [x] Image pushed to zot registry
- [x] Runner deployment updated to use custom image
- [x] Runner pod running with new image
- [x] `actions/checkout@v4` works in test workflow
- [ ] Auto-build workflow created (deferred - needs Docker socket)
- [ ] Docker socket mounted (for container builds)
- [ ] Auto-build workflow successfully rebuilds runner
---
## Troubleshooting
### Image Pull Fails in Minikube
Minikube needs to be able to pull from zot. Check registry mirror config:
```bash
ssh indri 'minikube ssh -- cat /etc/containerd/certs.d/registry.tail8d86e.ts.net/hosts.toml'
```
### Docker Build Fails in Workflow
If Docker socket mount doesn't work:
1. Check socket exists in minikube: `minikube ssh -- ls -la /var/run/docker.sock`
2. Check permissions: runner may need to be in docker group
3. Alternative: Use `podman` (rootless) instead of Docker
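The first two checks can be folded into one helper that reports why the socket is unusable. A minimal sketch; the function name and the status strings are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify the state of a Docker socket path.
docker_socket_status() {
  local sock="${1:-/var/run/docker.sock}"
  if [ ! -e "$sock" ]; then
    echo "missing"
  elif [ ! -S "$sock" ]; then
    echo "not-a-socket"
  elif [ ! -r "$sock" ] || [ ! -w "$sock" ]; then
    echo "no-permission"
  else
    echo "ok"
  fi
}
```

Run inside the runner pod, a `no-permission` result would point at the group or security-context issue rather than a missing mount.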
### Node.js Actions Still Fail
Ensure the runner pod restarted after image update:
```bash
kubectl --context=minikube-indri -n forgejo-runner rollout restart deployment/forgejo-runner
kubectl --context=minikube-indri -n forgejo-runner logs -f deployment/forgejo-runner
```
---
## Next Phase
Once the custom runner is working with auto-build, proceed to [Phase 3: Mirror Forgejo & Build](P3_mirror_forgejo.md) to set up Forgejo source builds.

@@ -1,349 +0,0 @@
# Phase 3: Mirror Forgejo & Build from Source
**Goal**: Mirror upstream Forgejo to forge and create a workflow that builds it for macOS ARM64
**Status**: Planning
**Prerequisites**: [Phase 2](P2_mirror_and_build.md) complete (custom runner image with Node.js/tools)
---
## Problem Statement
We want to build Forgejo from source to:
1. Have full control over the binary running on indri
2. Enable self-deployment via CI
3. Ensure proper macOS DNS resolution (requires CGO_ENABLED=1)
### The Cross-Compilation Challenge
The runner runs in a Linux container (k8s on indri), but the target is macOS ARM64 (indri itself).
**Options**:
| Option | Pros | Cons |
|--------|------|------|
| A. Cross-compile CGO_ENABLED=0 | Simple, no special toolchain | Breaks Tailscale MagicDNS resolution |
| B. Cross-compile CGO_ENABLED=1 | Proper DNS | Needs OSX cross-compiler (osxcross), complex |
| C. Build on gilbert manually | Works now, simple | Not automated, manual step |
| D. Native macOS runner on indri | Full native build | Runner outside k8s, different architecture |
| E. Hybrid: build on gilbert, deploy via CI | Uses existing tools | Partial automation |
**Recommendation**: Start with Option C/E (manual build on gilbert, CI just deploys), then consider Option D if we want full automation.
---
## Step 1: Mirror Upstream Forgejo
### 1.1 User Action: Create Mirror on Forge
**Manual step** (hairpinning doesn't work from indri):
1. Go to https://forge.tail8d86e.ts.net
2. Click "+" → "New Migration"
3. Select "Gitea" as clone source
4. URL: `https://codeberg.org/forgejo/forgejo.git`
5. Repository name: `forgejo`
6. Check "This repository will be a mirror"
7. Click "Migrate Repository"
### 1.2 Clone Mirror Locally
```bash
git clone ssh://forgejo@forge.tail8d86e.ts.net/eblume/forgejo.git ~/code/3rd/forgejo
cd ~/code/3rd/forgejo
```
---
## Step 2: Understand Forgejo Build Process
### 2.1 Build Requirements
From Forgejo's `Makefile` and docs:
- **Go**: 1.23+ (check `go.mod` for exact version)
- **Node.js**: 20+ (for frontend)
- **Make**: GNU Make
- **Git**: For version embedding
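The minimum versions above can be checked before starting a long build. A minimal sketch that compares dotted numeric versions only (no pre-release suffixes); the function name `version_ge` is hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helper: version_ge ACTUAL MINIMUM
# Succeeds when ACTUAL >= MINIMUM, comparing dot-separated numeric fields.
version_ge() {
  local have="$1" want="$2" h w
  while [ -n "$have" ] || [ -n "$want" ]; do
    # Take the leading field of each version, then drop it from the remainder.
    h=${have%%.*}
    w=${want%%.*}
    if [ "$have" = "$h" ]; then have=""; else have=${have#*.}; fi
    if [ "$want" = "$w" ]; then want=""; else want=${want#*.}; fi
    h=${h:-0}
    w=${w:-0}
    if [ "$h" -gt "$w" ]; then return 0; fi
    if [ "$h" -lt "$w" ]; then return 1; fi
  done
  return 0
}

# Example use (not executed here):
#   version_ge "$(go env GOVERSION | sed 's/^go//')" 1.23 || echo "Go too old"
#   version_ge "$(node --version | sed 's/^v//')" 20 || echo "Node.js too old"
```

Missing fields are treated as zero, so `1.23` and `1.23.0` compare equal.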
### 2.2 Build Commands
```bash
# Install frontend dependencies and build
make deps-frontend
make frontend
# Build backend (with CGO for proper DNS on macOS)
CGO_ENABLED=1 TAGS="bindata sqlite sqlite_unlock_notify" make backend
# Or all-in-one
CGO_ENABLED=1 TAGS="bindata sqlite sqlite_unlock_notify" make build
```
### 2.3 Output
Binary at `gitea` (yes, the binary is still named `gitea` for compatibility).
---
## Step 3: Build on Gilbert (Manual Bootstrap)
For the initial bootstrap, build on gilbert (macOS ARM64 native).
### 3.1 Setup Build Environment
```bash
cd ~/code/3rd/forgejo
mise use go@1.23 node@20
# Verify tools
go version
node --version
make --version
```
### 3.2 Build
```bash
# Clean build
make clean
# Build frontend
make deps-frontend
make frontend
# Build backend with CGO (important for macOS DNS!)
CGO_ENABLED=1 TAGS="bindata sqlite sqlite_unlock_notify" make backend
# Verify binary
./gitea --version
file gitea # Should show: Mach-O 64-bit executable arm64
```
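The `file` check can be automated so a Linux or x86_64 binary is never copied to indri by mistake. A minimal sketch, assuming `file -b` reports the type in the form shown above; both function names are hypothetical:

```bash
#!/usr/bin/env bash
# Hypothetical helpers: confirm a binary is a macOS ARM64 (Mach-O arm64) executable.

# Pure check on a description string as printed by file(1).
is_macho_arm64_desc() {
  case "$1" in
    *Mach-O*arm64*) return 0 ;;
    *) return 1 ;;
  esac
}

# Runs file(1) on the given path and applies the check above.
check_forgejo_binary_arch() {
  local desc
  desc=$(file -b "$1") || return 1
  if ! is_macho_arm64_desc "$desc"; then
    echo "unexpected binary type: $desc" >&2
    return 1
  fi
}
```

For example, `check_forgejo_binary_arch ./gitea && scp gitea indri:~/.local/bin/forgejo-new` would gate the copy on the architecture check.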
### 3.3 Deploy to Indri
```bash
# Copy binary
scp gitea indri:~/.local/bin/forgejo-new
# Verify on indri
ssh indri '~/.local/bin/forgejo-new --version'
```
---
## Step 4: Create Deploy Workflow (Option E)
Since cross-compilation is complex, use a hybrid approach:
1. Build on gilbert (manual trigger or pre-built)
2. CI workflow fetches and deploys
### 4.1 SSH Deploy Key for Runner
The runner needs SSH access to indri to deploy the binary.
**Generate key on gilbert**:
```bash
ssh-keygen -t ed25519 -C "forgejo-runner-deploy" -f ~/.ssh/forgejo-runner-deploy -N ""
```
**Add public key to indri's authorized_keys**:
```bash
cat ~/.ssh/forgejo-runner-deploy.pub | ssh indri 'cat >> ~/.ssh/authorized_keys'
```
**Store private key in 1Password** (blumeops vault) as "Forgejo Runner Deploy Key"
### 4.2 Create k8s Secret
Create `argocd/manifests/forgejo-runner/secret-ssh.yaml.tpl`:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: forgejo-runner-ssh
  namespace: forgejo-runner
type: Opaque
stringData:
  id_ed25519: |
    op://blumeops/<deploy-key-item>/private-key
  known_hosts: |
    # Get with: ssh-keyscan indri.tail8d86e.ts.net 2>/dev/null | grep ed25519
    indri.tail8d86e.ts.net ssh-ed25519 AAAAC3...
```
### 4.3 Update Deployment for SSH
Add SSH secret mount to `deployment.yaml`:
```yaml
volumeMounts:
  - name: ssh-key
    mountPath: /root/.ssh
    readOnly: true
volumes:
  - name: ssh-key
    secret:
      secretName: forgejo-runner-ssh
      defaultMode: 0600
```
### 4.4 Create Deploy-Only Workflow
Create `.forgejo/workflows/deploy-forgejo.yml` in blumeops:
```yaml
name: Deploy Forgejo
on:
  workflow_dispatch:
    inputs:
      version:
        description: 'Version to deploy (tag or commit)'
        required: true
        default: 'v10.0.0'
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Deploy to indri
        env:
          VERSION: ${{ github.event.inputs.version }}
        run: |
          # SSH config
          mkdir -p ~/.ssh
          cp /root/.ssh/id_ed25519 ~/.ssh/
          cp /root/.ssh/known_hosts ~/.ssh/
          chmod 600 ~/.ssh/id_ed25519
          # Deploy script. The heredoc is quoted, so VERSION is passed to the
          # remote shell explicitly; otherwise it would be unset on indri.
          ssh erichblume@indri.tail8d86e.ts.net "VERSION='$VERSION' sh -s" << 'EOF'
          set -e
          cd ~/.local/bin
          # Verify the new binary exists and runs
          if [ ! -f forgejo-new ]; then
            echo "ERROR: forgejo-new not found. Build on gilbert first:"
            echo " cd ~/code/3rd/forgejo && git checkout $VERSION"
            echo " CGO_ENABLED=1 TAGS='bindata sqlite sqlite_unlock_notify' make build"
            echo " scp gitea indri:~/.local/bin/forgejo-new"
            exit 1
          fi
          ./forgejo-new --version
          # Stop current service
          launchctl unload ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist 2>/dev/null || true
          # Atomic swap
          mv forgejo forgejo-old 2>/dev/null || true
          mv forgejo-new forgejo
          # Start new service
          launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist
          # Verify it's running
          sleep 5
          curl -sf http://localhost:3001/api/v1/version || exit 1
          echo "Deploy successful!"
          ./forgejo --version
          EOF
```
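The fixed `sleep 5` before the health check can be replaced by a bounded retry, which tolerates a slow start without waiting longer than needed. A minimal sketch; the function name `wait_for` and the attempt and delay values in the example are assumptions:

```bash
#!/usr/bin/env bash
# Hypothetical helper: wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or ATTEMPTS is exhausted.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example use (not executed here), with the endpoint from the deploy script:
#   wait_for 10 3 curl -sf http://localhost:3001/api/v1/version
```

If the final attempt fails, the function returns non-zero, so a deploy script running under `set -e` still aborts.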
---
## Future: Full CI Build (Option D)
If we want full automation, consider running a native macOS runner on indri:
### Native Runner on Indri
```bash
# Install forgejo-runner on indri via mise
ssh indri 'mise use forgejo-runner'
# Register as a macOS runner
ssh indri 'forgejo-runner register \
--instance https://forge.tail8d86e.ts.net \
--token "$TOKEN" \
--name "indri-native" \
--labels "macos-arm64:host" \
--no-interactive'
# Create LaunchAgent for runner
# (similar to other mcquack services)
```
Then workflow uses:
```yaml
runs-on: macos-arm64
```
This enables full native builds in CI. Document in a future phase if needed.
---
## Verification Checklist
- [ ] Forgejo mirrored to forge
- [ ] Mirror cloned to ~/code/3rd/forgejo
- [ ] Build succeeds on gilbert
- [ ] Binary is valid macOS ARM64 executable
- [ ] Binary deployed to indri ~/.local/bin/
- [ ] SSH deploy key created and stored in 1Password
- [ ] Deploy key added to indri authorized_keys
- [ ] (Optional) k8s SSH secret created
- [ ] (Optional) Deploy workflow created
---
## Troubleshooting
### Build Fails: Node.js Version
```
error: engine "node" is incompatible
```
Update Node.js: `mise use node@20`
### Build Fails: Go Version
```
go: go.mod requires go >= 1.23
```
Update Go: `mise use go@1.23`
### Binary Crashes on indri
Check if CGO was enabled:
```bash
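# If built without CGO, DNS resolution may fail
# Inspect the embedded build settings (needs a Go toolchain, Go 1.18+); expect CGO_ENABLED=1
go version -m ./forgejo | grep CGO_ENABLED
./forgejo --version # Should work
./forgejo web # May fail to resolve Tailscale hostnames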
```
Rebuild with `CGO_ENABLED=1`.
### SSH Deploy Fails
Check runner has SSH access:
```bash
# Test from inside runner pod
kubectl --context=minikube-indri -n forgejo-runner exec deployment/forgejo-runner -- \
ssh -i /root/.ssh/id_ed25519 erichblume@indri.tail8d86e.ts.net 'echo ok'
```
---
## Next Phase
Once Forgejo is building and deploying successfully, proceed to [Phase 4: Self-Deploy](P4_self_deploy.md) for the full mcquack transition.

@@ -1,409 +0,0 @@
# Phase 4: Self-Deploy & Transition to mcquack
**Goal**: Complete the bootstrap - Forgejo deploys itself, transition from brew to mcquack LaunchAgent
**Status**: Planning
**Prerequisites**: [Phase 3](P3_mirror_forgejo.md) complete (Forgejo builds and deploys to indri)
---
## Overview
This phase completes the bootstrap:
1. First successful CI deploy creates the binary
2. Transition from brew service to mcquack LaunchAgent
3. Update ansible role to mcquack pattern
4. Remove brew forgejo
After this phase, Forgejo builds and deploys itself on every tagged release.
---
## Step 1: Prepare indri for mcquack
### 1.1 Create Directory Structure
```bash
ssh indri << 'EOF'
mkdir -p ~/.local/bin
mkdir -p ~/.config/forgejo
mkdir -p ~/Library/Logs
EOF
```
### 1.2 Prepare Data Directory
The existing data is at `/opt/homebrew/var/forgejo`. We'll keep it there for now (simpler), or optionally migrate to `~/forgejo`.
**Option A: Keep existing path** (recommended for simplicity)
- Data stays at `/opt/homebrew/var/forgejo`
- Binary moves to `~/.local/bin/forgejo`
**Option B: Full migration**
- Move data to `~/forgejo`
- Requires updating app.ini paths
For this plan, we'll use Option A.
---
## Step 2: First CI Deploy
### 2.1 Trigger Build with Deploy
1. Go to https://forge.tail8d86e.ts.net/eblume/forgejo/actions
2. Select "Build Forgejo" workflow
3. Click "Run workflow"
4. Set deploy=true
5. Monitor the run
### 2.2 Verify Binary Deployed
```bash
ssh indri 'ls -la ~/.local/bin/forgejo && ~/.local/bin/forgejo --version'
```
At this point:
- New binary is at `~/.local/bin/forgejo`
- Brew forgejo is still running
- LaunchAgent doesn't exist yet
---
## Step 3: Create mcquack LaunchAgent
### 3.1 Create Plist Manually (One-Time Bootstrap)
```bash
ssh indri << 'EOF'
cat > ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist << 'PLIST'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>mcquack.eblume.forgejo</string>
<key>ProgramArguments</key>
<array>
<string>/Users/erichblume/.local/bin/forgejo</string>
<string>web</string>
<string>--config</string>
<string>/opt/homebrew/var/forgejo/custom/conf/app.ini</string>
<string>--work-path</string>
<string>/opt/homebrew/var/forgejo</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>/Users/erichblume/Library/Logs/mcquack.forgejo.out.log</string>
<key>StandardErrorPath</key>
<string>/Users/erichblume/Library/Logs/mcquack.forgejo.err.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>HOME</key>
<string>/Users/erichblume</string>
<key>USER</key>
<string>erichblume</string>
</dict>
</dict>
</plist>
PLIST
EOF
```
---
## Step 4: Cutover from Brew to mcquack
### 4.1 Stop Brew Service
```bash
ssh indri 'brew services stop forgejo'
```
### 4.2 Start mcquack Service
```bash
ssh indri 'launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist'
```
### 4.3 Verify Service Running
```bash
# Check process
ssh indri 'launchctl list | grep forgejo'
# Check logs
ssh indri 'tail -20 ~/Library/Logs/mcquack.forgejo.err.log'
# Check HTTP
curl -s https://forge.tail8d86e.ts.net/api/v1/version
```
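Right after `launchctl load`, Forgejo may need a few seconds before it answers HTTP, so a single `curl` can fail even though the service is healthy. A minimal retry sketch follows; the `retry` helper and its argument order are assumptions for illustration, not part of the original plan.

```bash
# retry: hypothetical helper (not part of the original plan).
# Usage: retry <attempts> <delay_seconds> <command> [args...]
# Runs the command until it succeeds or the attempts are exhausted.
retry() {
  retry_attempts=$1
  retry_delay=$2
  shift 2
  retry_i=1
  while [ "$retry_i" -le "$retry_attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the final failure.
    if [ "$retry_i" -lt "$retry_attempts" ]; then
      sleep "$retry_delay"
    fi
    retry_i=$((retry_i + 1))
  done
  return 1
}
```

With the helper sourced, the HTTP check could be run as `retry 10 3 curl -sf https://forge.tail8d86e.ts.net/api/v1/version`, which gives the service up to roughly 30 seconds to come up.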
### 4.4 Verify Git Operations
```bash
# SSH test
ssh -T forgejo@forge.tail8d86e.ts.net
# Clone test
git clone ssh://forgejo@forge.tail8d86e.ts.net/eblume/blumeops.git /tmp/test-clone
rm -rf /tmp/test-clone
```
---
## Step 5: Update Ansible Role
### 5.1 Rewrite forgejo Role
Replace `ansible/roles/forgejo/tasks/main.yml`:
```yaml
---
# Forgejo is built from source via CI and deployed automatically.
# This role manages the configuration and LaunchAgent only.
#
# BINARY DEPLOYMENT:
# The binary at ~/.local/bin/forgejo is deployed by Forgejo Actions CI.
# If missing, trigger a build at:
# https://forge.tail8d86e.ts.net/eblume/forgejo/actions
#
# CONFIGURATION:
# app.ini at /opt/homebrew/var/forgejo/custom/conf/app.ini contains secrets
# and is NOT managed by ansible. It is backed up by borgmatic.
- name: Verify forgejo binary exists
  ansible.builtin.stat:
    path: "{{ forgejo_binary }}"
  register: forgejo_binary_stat
- name: Fail if forgejo binary not found
  ansible.builtin.fail:
    msg: |
      Forgejo binary not found at {{ forgejo_binary }}.
      The binary is deployed by Forgejo Actions CI. To build and deploy:
      1. Go to https://forge.tail8d86e.ts.net/eblume/forgejo/actions
      2. Select "Build Forgejo" workflow
      3. Click "Run workflow" with deploy=true
      Alternatively, build manually on gilbert and scp to indri.
  when: not forgejo_binary_stat.stat.exists
- name: Check forgejo config exists
  ansible.builtin.stat:
    path: "{{ forgejo_config }}"
  register: forgejo_config_stat
- name: Fail if forgejo config is missing
  ansible.builtin.fail:
    msg: |
      Forgejo config not found at {{ forgejo_config }}
      This file contains secrets and is not managed by ansible.
      To restore from backup, run:
        borgmatic --config ~/.config/borgmatic/config.yaml extract --archive latest \
          --path {{ forgejo_config }}
  when: not forgejo_config_stat.stat.exists
- name: Deploy forgejo LaunchAgent plist
  ansible.builtin.template:
    src: forgejo.plist.j2
    dest: ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist
    mode: '0644'
  notify: Restart forgejo
- name: Check if forgejo LaunchAgent is loaded
  ansible.builtin.command: launchctl list mcquack.eblume.forgejo
  register: forgejo_launchctl_check
  changed_when: false
  failed_when: false
- name: Load forgejo LaunchAgent if not loaded
  ansible.builtin.command: launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist
  when: forgejo_launchctl_check.rc != 0
  changed_when: true
  failed_when: false
```
### 5.2 Create defaults/main.yml
```yaml
---
# Forgejo binary and paths
forgejo_binary: /Users/erichblume/.local/bin/forgejo
forgejo_work_path: /opt/homebrew/var/forgejo
forgejo_config: "{{ forgejo_work_path }}/custom/conf/app.ini"
forgejo_log_dir: /Users/erichblume/Library/Logs
# HTTP and SSH ports (must match app.ini)
forgejo_http_port: 3001
forgejo_ssh_port: 2200
```
### 5.3 Create templates/forgejo.plist.j2
```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- {{ ansible_managed }} -->
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>mcquack.eblume.forgejo</string>
<key>ProgramArguments</key>
<array>
<string>{{ forgejo_binary }}</string>
<string>web</string>
<string>--config</string>
<string>{{ forgejo_config }}</string>
<string>--work-path</string>
<string>{{ forgejo_work_path }}</string>
</array>
<key>RunAtLoad</key>
<true/>
<key>KeepAlive</key>
<true/>
<key>StandardOutPath</key>
<string>{{ forgejo_log_dir }}/mcquack.forgejo.out.log</string>
<key>StandardErrorPath</key>
<string>{{ forgejo_log_dir }}/mcquack.forgejo.err.log</string>
<key>EnvironmentVariables</key>
<dict>
<key>HOME</key>
<string>/Users/erichblume</string>
<key>USER</key>
<string>erichblume</string>
</dict>
</dict>
</plist>
```
### 5.4 Update handlers/main.yml
```yaml
---
- name: Restart forgejo
  ansible.builtin.shell: |
    launchctl unload ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist 2>/dev/null || true
    launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist
  changed_when: true
```
---
## Step 6: Update Alloy Log Collection
In `ansible/roles/alloy/defaults/main.yml`, move the forgejo log paths from the brew list to the mcquack list:
```yaml
alloy_brew_logs:
  # Remove forgejo from here
  - path: /opt/homebrew/var/log/tailscaled.log
    service: tailscale
    stream: stdout
alloy_mcquack_logs:
  # ... existing entries ...
  - path: /Users/erichblume/Library/Logs/mcquack.forgejo.out.log
    service: forgejo
    stream: stdout
  - path: /Users/erichblume/Library/Logs/mcquack.forgejo.err.log
    service: forgejo
    stream: stderr
```
---
## Step 7: Remove Brew Forgejo
### 7.1 Uninstall Brew Package
```bash
ssh indri 'brew uninstall forgejo'
```
### 7.2 Remove Old Logs
```bash
ssh indri 'rm -f /opt/homebrew/var/log/forgejo.log'
```
---
## Step 8: Run Ansible
```bash
mise run provision-indri -- --tags forgejo,alloy
```
---
## Disaster Recovery
### If CI Deploy Breaks Forgejo
1. **Build manually on gilbert**:
```bash
cd ~/code/3rd/forgejo
git pull
mise use go node
TAGS="bindata sqlite sqlite_unlock_notify" make build
scp gitea indri:~/.local/bin/forgejo
```
2. **Restart service**:
```bash
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist; launchctl load ~/Library/LaunchAgents/mcquack.eblume.forgejo.plist'
```
3. **Verify**:
```bash
curl https://forge.tail8d86e.ts.net/api/v1/version
```
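The manual `scp` above overwrites the deployed binary in place, which leaves nothing to fall back to if the replacement build is also broken. One possible safeguard is to keep the previous binary next to the new one. The sketch below assumes a hypothetical helper name (`install_with_backup`) and a `.prev` suffix convention, neither of which appears in the original plan.

```bash
# install_with_backup: hypothetical helper (not part of the original plan).
# Usage: install_with_backup <new_binary> <destination>
# Saves the current <destination> as <destination>.prev, then installs the
# new file via a rename so the destination is never half-written.
install_with_backup() {
  iwb_src=$1
  iwb_dest=$2
  if [ ! -f "$iwb_src" ]; then
    echo "missing source: $iwb_src" >&2
    return 1
  fi
  if [ -f "$iwb_dest" ]; then
    cp -p "$iwb_dest" "$iwb_dest.prev" || return 1
  fi
  # Stage next to the destination so the final mv stays on one filesystem.
  cp "$iwb_src" "$iwb_dest.new" || return 1
  chmod 755 "$iwb_dest.new" || return 1
  mv -f "$iwb_dest.new" "$iwb_dest"
}
```

On indri this would be called with the uploaded build as the source and `~/.local/bin/forgejo` as the destination. Rolling back is then a matter of moving `~/.local/bin/forgejo.prev` back into place and restarting the LaunchAgent.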
### If Forgejo Won't Start
1. Check logs: `ssh indri 'tail -100 ~/Library/Logs/mcquack.forgejo.err.log'`
2. Check binary: `ssh indri '~/.local/bin/forgejo --version'`
3. Check config: `ssh indri 'head -50 /opt/homebrew/var/forgejo/custom/conf/app.ini'`
4. Try running manually: `ssh indri '~/.local/bin/forgejo web --config /opt/homebrew/var/forgejo/custom/conf/app.ini --work-path /opt/homebrew/var/forgejo'`
### Switch ArgoCD to GitHub (Nuclear Option)
If Forgejo is down and you need to deploy fixes:
```bash
argocd repo add https://github.com/eblume/blumeops.git --username eblume --password "$GITHUB_PAT"
argocd app set apps --repo https://github.com/eblume/blumeops.git
argocd app sync apps
```
After recovery, switch back to Forgejo.
---
## Verification Checklist
- [ ] CI deploy completed successfully
- [ ] Binary at `~/.local/bin/forgejo`
- [ ] mcquack LaunchAgent created
- [ ] Brew service stopped
- [ ] mcquack service started
- [ ] HTTP works (`curl https://forge.tail8d86e.ts.net/api/v1/version`)
- [ ] SSH works (`ssh -T forgejo@forge.tail8d86e.ts.net`)
- [ ] Git clone/push works
- [ ] Ansible role updated
- [ ] Alloy logs updated
- [ ] Brew package uninstalled
- [ ] `mise run provision-indri` succeeds
---
## Next Phase
After bootstrap is complete, proceed to [Phase 5: Container Builds](P5_container_builds.md) to set up container image building for ArgoCD.


@ -1,505 +0,0 @@
# Phase 5: Container Image Builds
**Goal**: Set up CI workflows to build custom container images and push to zot registry
**Status**: Planning
**Prerequisites**: [Phase 4](P4_self_deploy.md) complete (Forgejo self-deploying, Actions working)
---
## Overview
With Forgejo Actions operational (including custom runner from P2), we can now build container images for:
- Custom devpi with pre-installed plugins
- Any other custom images needed for k8s services
- Release artifacts for Python packages
**Note**: The custom runner image build is covered in [Phase 2](P2_mirror_and_build.md). This phase focuses on application container builds.
---
## Use Case 1: devpi Custom Image
### Current State
devpi runs from `registry.tail8d86e.ts.net/blumeops/devpi:latest`, built manually:
- Base image: python
- Adds: devpi-server, devpi-web
- Startup script for auto-initialization
### Goal
Automate builds triggered by:
- Push to the devpi files in the blumeops repo on forge
- Manual workflow dispatch
- Optionally: upstream devpi release (via schedule check)
---
## Step 1: Create Workflow for devpi
### 1.1 Ensure devpi Repo Has Dockerfile
The Dockerfile already exists at `argocd/manifests/devpi/Dockerfile`. We'll create a workflow in the blumeops repo that builds it.
### 1.2 Create Build Workflow
Create `.forgejo/workflows/build-devpi.yml` in blumeops repo:
```yaml
name: Build devpi Image
on:
  push:
    paths:
      - 'argocd/manifests/devpi/Dockerfile'
      - 'argocd/manifests/devpi/start.sh'
      - '.forgejo/workflows/build-devpi.yml'
  workflow_dispatch:
    inputs:
      tag:
        description: 'Image tag (default: latest)'
        required: false
        default: 'latest'
env:
  REGISTRY: registry.tail8d86e.ts.net
  IMAGE_NAME: blumeops/devpi
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Determine tag
        id: tag
        run: |
          if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
            TAG="${{ github.event.inputs.tag }}"
          else
            TAG="latest"
          fi
          echo "tag=$TAG" >> "$GITHUB_OUTPUT"
      - name: Build image
        uses: docker/build-push-action@v5
        with:
          context: argocd/manifests/devpi
          file: argocd/manifests/devpi/Dockerfile
          platforms: linux/arm64
          load: true
          tags: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }}
      - name: Push to registry
        run: |
          # Zot has no auth, just push
          docker push ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ steps.tag.outputs.tag }}
      - name: Verify push
        run: |
          # Check image exists in registry
          curl -sf "https://${{ env.REGISTRY }}/v2/${{ env.IMAGE_NAME }}/tags/list" | jq .
```
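The "Determine tag" step is plain shell, so its branching can be exercised outside CI before the workflow is pushed. The sketch below restates that logic as a standalone function. The name `resolve_tag` is hypothetical, and falling back to `latest` when the dispatch input is empty is an added assumption that the workflow itself does not make.

```bash
# resolve_tag: hypothetical local stand-in for the "Determine tag" step.
# Usage: resolve_tag <event_name> <input_tag>
# Prints the tag the workflow would use for the given trigger.
resolve_tag() {
  rt_event=$1
  rt_input=$2
  if [ "$rt_event" = "workflow_dispatch" ] && [ -n "$rt_input" ]; then
    printf '%s\n' "$rt_input"
  else
    # Push events, and dispatches without a tag, use the default.
    printf '%s\n' "latest"
  fi
}
```

For example, `resolve_tag workflow_dispatch v1.2.3` prints the requested tag, while any `push` event resolves to `latest` regardless of the second argument.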
### 1.3 Runner Needs Registry Access
The runner needs to reach `registry.tail8d86e.ts.net`. This should work via Tailscale egress (same as Forgejo access).
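If not, add an egress Service for the registry in `argocd/manifests/tailscale-operator/`. The Tailscale operator proxies an `ExternalName` Service that is annotated with the target's MagicDNS name:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: egress-registry
  namespace: tailscale-operator
  annotations:
    tailscale.com/tailnet-fqdn: registry.tail8d86e.ts.net
spec:
  type: ExternalName
  externalName: placeholder  # overwritten by the operator
```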
---
## Step 2: Test Build Workflow
### 2.1 Push and Trigger
```bash
# Make a small change to trigger
echo "# Build $(date)" >> argocd/manifests/devpi/Dockerfile
git add argocd/manifests/devpi/Dockerfile
git commit -m "Trigger devpi image rebuild"
git push
```
### 2.2 Monitor Build
1. Go to https://forge.tail8d86e.ts.net/eblume/blumeops/actions
2. Watch "Build devpi Image" workflow
3. Verify success
### 2.3 Verify Image in Registry
```bash
curl -s https://registry.tail8d86e.ts.net/v2/blumeops/devpi/tags/list | jq .
```
### 2.4 Restart devpi to Use New Image
```bash
kubectl --context=minikube-indri -n devpi rollout restart statefulset/devpi
```
---
## Step 3: Reusable Container Build Workflow
### 3.1 Create Reusable Workflow
Create `.forgejo/workflows/build-container.yml`:
```yaml
name: Build Container Image
on:
  workflow_call:
    inputs:
      context:
        description: 'Build context path'
        required: true
        type: string
      dockerfile:
        description: 'Dockerfile path (relative to context)'
        required: false
        type: string
        default: 'Dockerfile'
      image_name:
        description: 'Image name (without registry)'
        required: true
        type: string
      tag:
        description: 'Image tag'
        required: false
        type: string
        default: 'latest'
      platforms:
        description: 'Target platforms'
        required: false
        type: string
        default: 'linux/arm64'
env:
  REGISTRY: registry.tail8d86e.ts.net
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3
      - name: Build and push
        uses: docker/build-push-action@v5
        with:
          context: ${{ inputs.context }}
          file: ${{ inputs.context }}/${{ inputs.dockerfile }}
          platforms: ${{ inputs.platforms }}
          push: true
          tags: ${{ env.REGISTRY }}/${{ inputs.image_name }}:${{ inputs.tag }}
      - name: Verify push
        run: |
          curl -sf "https://${{ env.REGISTRY }}/v2/${{ inputs.image_name }}/tags/list" | jq .
```
### 3.2 Use in devpi Workflow
Simplify `.forgejo/workflows/build-devpi.yml`:
```yaml
name: Build devpi Image
on:
  push:
    paths:
      - 'argocd/manifests/devpi/**'
  workflow_dispatch:
jobs:
  build:
    uses: ./.forgejo/workflows/build-container.yml
    with:
      context: argocd/manifests/devpi
      image_name: blumeops/devpi
```
---
## Step 4: Python Package Builds (Optional)
### 4.1 Use Case
Build Python packages from forge repos and publish to devpi.
Example: `mcquack` package (LaunchAgent management library)
### 4.2 Create Python Build Workflow
Create `.forgejo/workflows/build-python.yml`:
```yaml
name: Build Python Package
on:
  workflow_call:
    inputs:
      package_path:
        description: 'Path to package (contains pyproject.toml)'
        required: false
        type: string
        default: '.'
      python_version:
        description: 'Python version'
        required: false
        type: string
        default: '3.12'
      publish:
        description: 'Publish to devpi'
        required: false
        type: boolean
        default: false
    secrets:
      DEVPI_PASSWORD:
        required: false
env:
  DEVPI_URL: https://pypi.tail8d86e.ts.net
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ inputs.python_version }}
      - name: Install uv
        run: pip install uv
      - name: Build package
        run: |
          cd ${{ inputs.package_path }}
          uv build
      - name: Upload artifact
        uses: actions/upload-artifact@v4
        with:
          name: dist
          path: ${{ inputs.package_path }}/dist/
      - name: Publish to devpi
        if: inputs.publish
        run: |
          cd ${{ inputs.package_path }}
          uv publish \
            --publish-url ${{ env.DEVPI_URL }}/eblume/dev/ \
            --username eblume \
            --password "${{ secrets.DEVPI_PASSWORD }}"
```
---
## Step 5: Scheduled Builds (Cron)
### 5.1 Weekly Rebuild
Keep images fresh with weekly rebuilds:
```yaml
name: Weekly Image Rebuilds
on:
  schedule:
    # Every Sunday at 3 AM UTC
    - cron: '0 3 * * 0'
  workflow_dispatch:
jobs:
  devpi:
    uses: ./.forgejo/workflows/build-container.yml
    with:
      context: argocd/manifests/devpi
      image_name: blumeops/devpi
```
---
## Future Improvements
### Multi-Arch Builds
For images that need both ARM64 and AMD64:
```yaml
platforms: linux/arm64,linux/amd64
```
Requires QEMU emulators to be registered on the runner (for example via `docker/setup-qemu-action`); buildx can then build for the non-native platform.
### Build Caching
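Use the GitHub/Forgejo cache action. For the cache to have any effect, the build step must also pass `cache-from` and `cache-to` options (`type=local`) pointing at the same path:
```yaml
- name: Cache Docker layers
  uses: actions/cache@v4
  with:
    path: /tmp/.buildx-cache
    key: ${{ runner.os }}-buildx-${{ hashFiles('**/Dockerfile') }}
```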
### Security Scanning
Add Trivy or similar:
```yaml
- name: Run Trivy vulnerability scanner
  uses: aquasecurity/trivy-action@master
  with:
    image-ref: '${{ env.REGISTRY }}/${{ inputs.image_name }}:${{ inputs.tag }}'
```
---
## Step 6: Runner Observability (Logging & Metrics)
### 6.1 Problem
The forgejo-runner pod generates logs and metrics that should be collected for:
- Debugging failed workflow runs
- Monitoring runner health and capacity
- Alerting on runner failures
### 6.2 Log Collection via Alloy
The forgejo-runner namespace needs to be included in Alloy's k8s log collection. Alloy is already configured to scrape logs from k8s pods - verify the runner namespace is included.
Check current Alloy config:
```bash
ssh indri 'cat ~/.config/alloy/config.alloy | grep -A20 discovery.kubernetes'
```
If using namespace filtering, ensure `forgejo-runner` is included.
### 6.3 Metrics Collection
If the deployed forgejo-runner version exposes Prometheus metrics (confirm the endpoint and port first; port 8080 is assumed in this plan), add a ServiceMonitor or configure Alloy to scrape it:
**Option A: ServiceMonitor (if using Prometheus Operator)**
Create `argocd/manifests/forgejo-runner/servicemonitor.yaml`:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: forgejo-runner
  namespace: forgejo-runner
spec:
  selector:
    matchLabels:
      app: forgejo-runner
  endpoints:
    - port: metrics
      interval: 30s
```
**Option B: Alloy scrape config**
Add to Alloy's k8s scrape config to discover the runner pod's metrics endpoint.
### 6.4 Create Runner Service for Metrics
Add `argocd/manifests/forgejo-runner/service.yaml`:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: forgejo-runner-metrics
  namespace: forgejo-runner
  labels:
    app: forgejo-runner
spec:
  selector:
    app: forgejo-runner
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
```
Update kustomization.yaml to include the service.
### 6.5 Grafana Dashboard
Consider creating a dashboard for:
- Runner status (online/offline)
- Job queue depth
- Job execution time
- Success/failure rates
### 6.6 Verification
```bash
# Check runner logs are appearing in Loki
# Go to Grafana → Explore → Loki
# Query: {namespace="forgejo-runner"}
# Check metrics are being scraped
# Go to Grafana → Explore → Prometheus
# Query: {__name__=~"forgejo_runner_.*"}
```
---
## Verification Checklist
- [ ] devpi build workflow created
- [ ] devpi image builds successfully
- [ ] Image pushed to zot registry
- [ ] devpi pod uses new image
- [ ] Reusable container workflow created
- [ ] (Optional) Python build workflow created
- [ ] (Optional) Scheduled builds configured
- [ ] Runner logs visible in Loki
- [ ] Runner metrics scraped by Prometheus/Alloy
---
## Summary
With this phase complete, we have:
1. **Forgejo Actions** running with k8s runner
2. **Forgejo self-deploys** from CI on tagged releases
3. **Container images** built automatically on push
4. Infrastructure for Python package builds
5. **Runner observability** with logs in Loki and metrics in Prometheus
The CI/CD bootstrap is complete. Future work:
- Add more container builds as needed
- Add Python package publishing for internal tools
- Consider adding a macOS runner on indri for native builds
- Create Grafana dashboards for CI/CD monitoring


@ -1,79 +0,0 @@
# Blumeops Minikube Migration Plan
**Status**: Completed (2026-01-23)
This plan detailed the phased migration of blumeops services from direct hosting on indri (Mac Mini M1) to a minikube cluster. The migration is now complete for all services that will be migrated.
## Final Status
| Phase | Name | Status | Notes |
|-------|------|--------|-------|
| 0 | [Foundation](P0_foundation.complete.md) | ✅ Complete | Container registry (zot) + minikube cluster |
| 1 | [K8s Infrastructure](P1_k8s_infrastructure.complete.md) | ✅ Complete | Tailscale operator, ArgoCD, CloudNativePG, PostgreSQL cluster |
| 2 | [Grafana](P2_grafana.complete.md) | ✅ Complete | Migrated Grafana via ArgoCD |
| 3 | [PostgreSQL](P3_postgresql.complete.md) | ✅ Complete | Data migration to k8s PostgreSQL |
| 4 | [Miniflux](P4_miniflux.complete.md) | ✅ Complete | Migrated Miniflux via ArgoCD |
| 5 | [devpi](P5_devpi.complete.md) | ✅ Complete | Migrated devpi via ArgoCD |
| 5.1 | [Docker Migration](P5.1_docker_migration.complete.md) | ✅ Complete | Switched minikube to docker driver (not QEMU2) |
| 6 | [Kiwix](P6_kiwix.complete.md) | ✅ Complete | Migrated Kiwix + Transmission via ArgoCD |
| 7 | [Forgejo](P7_forgejo.md) | ⏭️ Won't Do | Forgejo stays on indri - see [CI/CD Bootstrap](../../ci-cd-bootstrap/) |
| 8 | [Woodpecker](P8_woodpecker.md) | ⏭️ Won't Do | Replaced by Forgejo Actions - see [CI/CD Bootstrap](../../ci-cd-bootstrap/) |
| 9 | [Cleanup](P9_cleanup.md) | ⏭️ Won't Do | Observability cleanup done separately (2026-01-22) |
## What Was Migrated to K8s
| Service | Status | Notes |
|---------|--------|-------|
| Grafana | ✅ In k8s | Helm chart via ArgoCD |
| PostgreSQL | ✅ In k8s | CloudNativePG operator |
| Miniflux | ✅ In k8s | Using k8s PostgreSQL |
| devpi | ✅ In k8s | Custom container image |
| Kiwix | ✅ In k8s | NFS mount from sifaka |
| Transmission | ✅ In k8s | NFS mount from sifaka |
| Prometheus | ✅ In k8s | Migrated 2026-01-22 |
| Loki | ✅ In k8s | Migrated 2026-01-22 |
| Alloy (k8s) | ✅ In k8s | DaemonSet for pod logs |
| TeslaMate | ✅ In k8s | Added 2026-01-23 |
## What Stays on Indri
| Service | Reason |
|---------|--------|
| **Forgejo** | Critical infrastructure, avoids circular dependency with ArgoCD |
| **Zot Registry** | K8s needs images to start - must be outside k8s |
| **Alloy (host)** | Collects host-level metrics and logs |
| **Borgmatic** | Backup system must survive k8s failures |
| **Plex** | Uses own NAT traversal, not Tailscale |
## Architecture Decisions Made
### Minikube Driver: Docker (not QEMU2/Podman)
- Original plan called for QEMU2, but docker driver proved simpler
- NFS mounts work via Docker NAT through indri's LAN IP
- API server accessible via Tailscale TCP passthrough
### Forgejo: Stays on Indri
- Original P7 planned k8s migration
- Decision changed: Forgejo is critical infrastructure
- Will be built from source via Forgejo Actions CI
- See [CI/CD Bootstrap Plan](../../ci-cd-bootstrap/) for details
### CI/CD: Forgejo Actions (not Woodpecker)
- Original P8 planned Woodpecker deployment
- Decision changed: Use Forgejo's native Actions instead
- Simpler (one less system), GitHub Actions compatible
- See [CI/CD Bootstrap Plan](../../ci-cd-bootstrap/) for details
### Observability: Migrated to K8s
- Original plan kept Prometheus/Loki on indri
- Changed: Migrated both to k8s (2026-01-22)
- Alloy on indri pushes to k8s endpoints
- Alloy DaemonSet in k8s collects pod logs
## Lessons Learned
1. **Docker driver is simpler than QEMU2** - Direct NFS mounts work, no VM complexity
2. **Tailscale operator works well** - Easy service exposure with automatic TLS
3. **CloudNativePG is production-ready** - Good operator, easy backups
4. **Keep critical infra outside k8s** - Forgejo and zot must survive k8s failures
5. **CGO matters on macOS** - Alloy needed `CGO_ENABLED=1` for Tailscale DNS resolution

File diff suppressed because it is too large


@ -1,657 +0,0 @@
# Phase 1: Kubernetes Infrastructure
**Goal**: Tailscale operator, ArgoCD, CloudNativePG operator, PostgreSQL cluster
**Status**: In Progress
**Prerequisites**: [Phase 0](P0_foundation.complete.md) complete
---
## Overview
Phase 1 establishes the k8s control plane infrastructure:
1. **Tailscale operator** - Exposes services on the tailnet
2. **ArgoCD** - GitOps continuous delivery
3. **CloudNativePG** - PostgreSQL operator
4. **PostgreSQL cluster** - Database for future app migrations
The deployment follows a bootstrap pattern:
- First two components deployed via `kubectl apply -k` (no GitOps yet)
- ArgoCD then takes over management of all components including itself
- All subsequent deployments use ArgoCD
---
## Kubernetes Tags Overview
| Tag | Purpose | Applied To |
|-----|---------|------------|
| `tag:k8s-api` | Controls access to the K8s API server | indri (Phase 0.14) |
| `tag:k8s-operator` | Identifies the Tailscale K8s Operator | OAuth client for operator |
| `tag:k8s` | Default tag for operator-managed resources | Proxies, services, ingresses created by operator |
**Ownership chain**: `tag:k8s-operator` must own `tag:k8s` so the operator can assign that tag to devices it creates.
---
## PostgreSQL Migration Strategy
The k8s PostgreSQL cluster will eventually replace the brew PostgreSQL on indri.
| Phase | `pg.tail8d86e.ts.net` points to | Miniflux connects to |
|-------|--------------------------------|---------------------|
| Current | brew PostgreSQL (indri) | `pg.tail8d86e.ts.net` |
| Phase 1 | brew PostgreSQL (indri) | `pg.tail8d86e.ts.net` (no change) |
| Phase 4 | brew PostgreSQL (indri) | k8s PG (internal, after miniflux migrates to k8s) |
| Post-Phase 4 | k8s PostgreSQL | k8s PG (internal) |
| Cleanup | k8s PostgreSQL | k8s PG (internal) |
This allows zero-downtime migration - the Tailscale service switches after apps are migrated.
---
## Steps
### 1. Update Pulumi ACLs for k8s workloads ✓
**Status**: Complete
Added to `pulumi/policy.hujson`:
- `tag:k8s-operator` - for the operator OAuth client
- `tag:k8s` - for operator-managed resources (owned by `tag:k8s-operator`)
- Grant for `tag:k8s` → `tag:registry` access
---
### 2. Create Tailscale OAuth client ✓
**Status**: Complete
OAuth client stored in 1Password (vault: `vg6xf6vvfmoh5hqjjhlhbeoaie`, item: `2it22lavwgbxdskoaxanej354q`)
**Configuration used:**
- Tags: `tag:k8s-operator`
- Devices write scope tag: `tag:k8s`
- Scopes: Devices Core (R/W), Auth Keys (R/W), Services (Write)
---
### 3. Deploy Tailscale Kubernetes Operator (Bootstrap)
Deploy via `kubectl apply -k` - will be migrated to ArgoCD management in Step 5.
**Setup manifests directory:**
```bash
mkdir -p argocd/manifests/tailscale-operator/crds
cd argocd/manifests/tailscale-operator
# Download static manifest from Tailscale repo
curl -sL https://raw.githubusercontent.com/tailscale/tailscale/main/cmd/k8s-operator/deploy/manifests/operator.yaml -o operator.yaml
# Download CRDs
curl -sL https://raw.githubusercontent.com/tailscale/tailscale/main/cmd/k8s-operator/deploy/crds/tailscale.com_connectors.yaml -o crds/connectors.yaml
curl -sL https://raw.githubusercontent.com/tailscale/tailscale/main/cmd/k8s-operator/deploy/crds/tailscale.com_proxyclasses.yaml -o crds/proxyclasses.yaml
# ... (other CRDs as needed)
```
**Create kustomization.yaml:**
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: tailscale-system
resources:
  - operator.yaml
secretGenerator:
  - name: operator-oauth
    namespace: tailscale-system
    literals:
      - client_id=PLACEHOLDER
      - client_secret=PLACEHOLDER
generatorOptions:
  disableNameSuffixHash: true
```
**Deploy:**
```bash
# Get credentials from 1Password and create the secret manually.
# The secretGenerator in kustomization.yaml is for reference only: remove it before
# running `kubectl apply -k`, or the apply overwrites this secret with the PLACEHOLDER values.
CLIENT_ID=$(op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get 2it22lavwgbxdskoaxanej354q --fields client-id --reveal)
CLIENT_SECRET=$(op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get 2it22lavwgbxdskoaxanej354q --fields client-secret --reveal)
kubectl create namespace tailscale-system
kubectl create secret generic operator-oauth \
  --namespace tailscale-system \
  --from-literal=client_id="$CLIENT_ID" \
  --from-literal=client_secret="$CLIENT_SECRET"
# Apply operator manifests
kubectl apply -k argocd/manifests/tailscale-operator/
```
**Verification:**
```bash
kubectl get pods -n tailscale-system
# Expected: operator pod Running
kubectl logs -n tailscale-system -l app.kubernetes.io/name=tailscale-operator
```
---
### 4. Deploy ArgoCD
Deploy ArgoCD and expose via Tailscale as `argocd.tail8d86e.ts.net`.
**Prerequisites:**
- Add `tag:argocd` to Pulumi ACLs
- Create Tailscale service `argocd` in admin console
**Setup manifests:**
```bash
mkdir -p argocd/manifests/argocd
# Download ArgoCD install manifest
curl -sL https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml -o argocd/manifests/argocd/install.yaml
```
**Create kustomization.yaml:**
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: argocd
resources:
- install.yaml
- service-tailscale.yaml # LoadBalancer for Tailscale exposure
```
**Create service-tailscale.yaml:**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: argocd-server-tailscale
  namespace: argocd
  annotations:
    tailscale.com/hostname: "argocd"
spec:
  type: LoadBalancer
  loadBalancerClass: tailscale
  selector:
    app.kubernetes.io/name: argocd-server
  ports:
    - name: https
      port: 443
      targetPort: 8080
```
**Deploy:**
```bash
kubectl create namespace argocd
kubectl apply -k argocd/manifests/argocd/
```
**Get initial admin password:**
```bash
kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath="{.data.password}" | base64 -d
```
**Verification:**
- https://argocd.tail8d86e.ts.net loads
- Can log in with `admin` / `<initial-password>`
**Post-setup:**
1. Change admin password, store in 1Password
2. Configure git repo connection to `github.com/eblume/blumeops` (public, no auth needed)
- Note: Using GitHub mirror since ArgoCD can't easily reach forge without additional networking
---
### 5. Migrate Tailscale Operator to ArgoCD
Create ArgoCD Application to manage the Tailscale operator.
**Create argocd/apps/tailscale-operator.yaml:**
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: tailscale-operator
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/eblume/blumeops.git
    targetRevision: main
    path: argocd/manifests/tailscale-operator
  destination:
    server: https://kubernetes.default.svc
    namespace: tailscale-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
**Apply:**
```bash
kubectl apply -f argocd/apps/tailscale-operator.yaml
```
**Note on secrets:** The OAuth secret was created manually in Step 3. For GitOps, consider:
- Sealed Secrets
- External Secrets Operator
- SOPS
For now, the secret remains manually managed outside of ArgoCD.
---
### 6. Deploy CloudNativePG via ArgoCD
**Setup manifests:**
```bash
mkdir -p argocd/manifests/cloudnative-pg
# Download CNPG operator manifest
curl -sL https://raw.githubusercontent.com/cloudnative-pg/cloudnative-pg/release-1.24/releases/cnpg-1.24.0.yaml -o argocd/manifests/cloudnative-pg/operator.yaml
```
**Create kustomization.yaml:**
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- operator.yaml
```
**Create ArgoCD Application (argocd/apps/cloudnative-pg.yaml):**
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: cloudnative-pg
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/eblume/blumeops.git
    targetRevision: main
    path: argocd/manifests/cloudnative-pg
  destination:
    server: https://kubernetes.default.svc
    namespace: cnpg-system
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```
**Apply:**
```bash
kubectl apply -f argocd/apps/cloudnative-pg.yaml
```
**Verification:**
```bash
kubectl get pods -n cnpg-system
# Expected: cnpg-controller-manager Running
```
---
### 7. Create PostgreSQL Cluster via ArgoCD
Create the database cluster. **Not exposed via Tailscale yet** - internal only until apps migrate.
**Create argocd/manifests/databases/blumeops-pg.yaml:**
```yaml
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: blumeops-pg
  namespace: databases
spec:
  instances: 1
  storage:
    size: 10Gi
    storageClass: standard
  monitoring:
    enablePodMonitor: true
  bootstrap:
    initdb:
      database: miniflux
      owner: miniflux
```
**Create kustomization.yaml:**
```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: databases
resources:
- blumeops-pg.yaml
```
**Create ArgoCD Application (argocd/apps/blumeops-pg.yaml):**
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: blumeops-pg
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/eblume/blumeops.git
    targetRevision: main
    path: argocd/manifests/databases
  destination:
    server: https://kubernetes.default.svc
    namespace: databases
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```
**Apply:**
```bash
kubectl apply -f argocd/apps/blumeops-pg.yaml
```
**Verification:**
```bash
kubectl get cluster -n databases
# Expected: blumeops-pg with STATUS "Cluster in healthy state"
kubectl get pods -n databases
# Expected: blumeops-pg-1 Running
# Get connection secret
kubectl -n databases get secret blumeops-pg-app -o jsonpath='{.data.uri}' | base64 -d
```
---
### 8. Create App-of-Apps Root Application
Once all components are deployed, create a root application to manage all apps.
**Create argocd/apps/root.yaml:**
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/eblume/blumeops.git
    targetRevision: main
    path: argocd/apps
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
**Apply:**
```bash
kubectl apply -f argocd/apps/root.yaml
```
Now ArgoCD manages itself and all other applications via the app-of-apps pattern.
---
## New Files Summary
```
argocd/
apps/
root.yaml # App-of-apps root
tailscale-operator.yaml # Tailscale operator app
cloudnative-pg.yaml # CNPG operator app
blumeops-pg.yaml # PostgreSQL cluster app
manifests/
tailscale-operator/
kustomization.yaml
operator.yaml
argocd/
kustomization.yaml
install.yaml
service-tailscale.yaml
cloudnative-pg/
kustomization.yaml
operator.yaml
databases/
kustomization.yaml
blumeops-pg.yaml
```
---
## Pulumi ACL Updates Required
Add to `pulumi/policy.hujson`:
```hujson
"tag:argocd": ["autogroup:admin", "tag:blumeops"],
```
Add to Erich's test accept list:
```hujson
"accept": [..., "tag:argocd:443"],
```
Add to Allison's deny list:
```hujson
"deny": [..., "tag:argocd:443"],
```
---
## Verification Checklist
```bash
# 1. Tailscale operator running
kubectl get pods -n tailscale-system
# 2. ArgoCD accessible
curl -k https://argocd.tail8d86e.ts.net/healthz
# 3. CloudNativePG operator running
kubectl get pods -n cnpg-system
# 4. PostgreSQL cluster healthy
kubectl get cluster -n databases
# 5. All ArgoCD apps synced
kubectl get applications -n argocd
# All should show STATUS: Synced, HEALTH: Healthy
```
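Step 5 of the checklist relies on reading the table by eye. A minimal sketch of an automated version follows, assuming the `sync.status` and `health.status` fields of the ArgoCD Application resource; the function names `not_ready_apps` and `check_argocd_apps` are hypothetical and not part of the repo.

```bash
# Print the names of Applications that are not both Synced and Healthy.
# Input on stdin: one line per app, "name<TAB>sync<TAB>health".
not_ready_apps() {
  awk -F '\t' '$2 != "Synced" || $3 != "Healthy" { print $1 }'
}

# Hypothetical wrapper (not called here): query the cluster, then filter.
# Requires kubectl access to the argocd namespace.
check_argocd_apps() {
  kubectl get applications -n argocd \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.sync.status}{"\t"}{.status.health.status}{"\n"}{end}' \
    | not_ready_apps
}
```

An empty result from `check_argocd_apps` would mean every app reports Synced and Healthy; any names printed are the ones to inspect.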
---
## Rollback
```bash
# Remove ArgoCD apps (will cascade delete managed resources)
kubectl delete application -n argocd root
kubectl delete application -n argocd blumeops-pg
kubectl delete application -n argocd cloudnative-pg
kubectl delete application -n argocd tailscale-operator
# Remove ArgoCD
kubectl delete -k argocd/manifests/argocd/
kubectl delete namespace argocd
# Remove namespaces
kubectl delete namespace databases
kubectl delete namespace cnpg-system
kubectl delete namespace tailscale-system
# Revert ACL changes
git checkout pulumi/policy.hujson
mise run tailnet-up
```
---
## Implementation Notes (Deviations from Plan)
*Added during implementation for retrospective review*
### Git Source: Forge Instead of GitHub
**Plan**: Use GitHub mirror (`github.com/eblume/blumeops`)
**Actual**: Use internal Forgejo (`ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git`)
**Why**: User preference to use internal infrastructure, accepting the resulting circular dependency as something to address later.
**Required changes**:
- Deploy key added to forge for ArgoCD SSH access
- Repository secret `repo-forge` with SSH private key from 1Password
- Discovered: `op read` requires `?ssh-format=openssh` query parameter for ArgoCD-compatible key format
- Egress proxy service to reach forge from cluster (targets `indri.tail8d86e.ts.net` not `forge.tail8d86e.ts.net` due to Tailscale Serve limitation)
- DNSConfig CRD for cluster-to-tailnet MagicDNS resolution
- ACL grant: `tag:k8s` → `tag:homelab` on ports 3001 (HTTP) and 2200 (SSH)
### ArgoCD Exposure: Ingress Instead of LoadBalancer
**Plan**: LoadBalancer service with `tailscale.com/hostname` annotation
**Actual**: Tailscale Ingress with Let's Encrypt TLS termination
**Why**: Ingress provides automatic TLS certificates and is the recommended approach.
**File**: `argocd/manifests/argocd/service-tailscale.yaml` uses `kind: Ingress` with `ingressClassName: tailscale`
### Namespace: `tailscale` Instead of `tailscale-system`
**Plan**: `tailscale-system` namespace
**Actual**: `tailscale` namespace
**Why**: Matches upstream Tailscale operator defaults.
### Sync Policy: Manual Instead of Automated
**Plan**: `syncPolicy.automated` with prune and selfHeal
**Actual**: Manual sync policy for workload apps; auto-sync only for app-of-apps
**Why**: User preference for explicit control over deployments during initial migration phase.
**Pattern**:
- `apps.yaml` (app-of-apps): auto-sync to pick up new Application manifests
- All workload apps: manual sync, triggered with `argocd app sync <name>`
### CloudNativePG: Helm Chart Instead of Raw Manifest
**Plan**: Download raw CNPG manifest
**Actual**: Multi-source Application using official Helm chart from `https://cloudnative-pg.github.io/charts`
**Why**: Helm chart is the officially supported distribution method.
**Additional fix**: Required `ServerSideApply=true` sync option due to large CRD exceeding annotation size limit.
### App-of-Apps: Named `apps` Instead of `root`
**Plan**: `argocd/apps/root.yaml`
**Actual**: `argocd/apps/apps.yaml` with Application named `apps`
**Why**: Clearer naming; `apps` manages apps, `argocd` manages itself.
### ArgoCD Self-Management Added
**Plan**: Not explicitly planned
**Actual**: `argocd/apps/argocd.yaml` Application for ArgoCD self-management
**Why**: Standard GitOps pattern - ArgoCD manages its own deployment after bootstrap.
### CRI-O Registry Mirror for Zot
**Plan**: Not in original plan
**Actual**: Configured CRI-O to use zot as pull-through cache for docker.io, ghcr.io, quay.io
**Why**: Reduces external bandwidth, speeds up pulls, avoids rate limits.
**Implementation**: Ansible `minikube` role applies `/etc/containers/registries.conf.d/zot-mirror.conf` inside minikube VM using stable hostname `host.containers.internal:5050`.
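For illustration, the drop-in could be generated along these lines. This is a sketch under stated assumptions, not the Ansible role's actual template: it uses the `[[registry]]` / `[[registry.mirror]]` tables of the containers-registries.conf v2 format, assumes zot is served over plain HTTP (hence `insecure = true`), and assumes the mirror needs no path prefix per upstream. The function name `zot_mirror_conf` is hypothetical.

```bash
# Emit a registries.conf drop-in that tries a mirror before each upstream.
# $1 = mirror host:port; remaining arguments = upstream registries.
zot_mirror_conf() {
  mirror=$1
  shift
  for registry in "$@"; do
    printf '[[registry]]\nprefix = "%s"\nlocation = "%s"\n\n' "$registry" "$registry"
    # Assumption: the mirror speaks plain HTTP and uses no per-upstream path prefix.
    printf '[[registry.mirror]]\nlocation = "%s"\ninsecure = true\n\n' "$mirror"
  done
}
```

For example, `zot_mirror_conf host.containers.internal:5050 docker.io ghcr.io quay.io` prints three `[[registry]]` stanzas, each with one mirror entry. If zot stores each upstream under its own path, the mirror `location` would need that path appended.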
### ProxyClass for CRI-O Image Compatibility
**Plan**: Not mentioned
**Actual**: Required `ProxyClass` with fully-qualified image paths (`docker.io/tailscale/...`)
**Why**: CRI-O requires fully-qualified image references; default Tailscale operator uses short names.
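The short-name rule can be sketched as a small helper. It follows the usual Docker convention (a first path component without a dot or colon, other than `localhost`, is not a registry host) and is only an illustration; the function name `qualify_image` is hypothetical and the ProxyClass itself simply hard-codes the qualified paths.

```bash
# Expand a short image reference to a fully-qualified one, Docker Hub style.
qualify_image() {
  image=$1
  case $image in
    */*) first=${image%%/*} ;;
    # No slash at all: an official image, which lives under library/.
    *)   printf 'docker.io/library/%s\n' "$image"; return ;;
  esac
  case $first in
    # First component looks like a registry host: already qualified.
    *.*|*:*|localhost) printf '%s\n' "$image" ;;
    *) printf 'docker.io/%s\n' "$image" ;;
  esac
}
```

With this helper, `qualify_image tailscale/tailscale` prints `docker.io/tailscale/tailscale`, while a reference that already starts with a registry host such as `ghcr.io/...` is returned unchanged.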
### Actual File Structure
```
argocd/
apps/
apps.yaml # App-of-apps (auto-sync)
argocd.yaml # ArgoCD self-management (manual sync)
tailscale-operator.yaml # Tailscale operator (manual sync)
cloudnative-pg.yaml # CNPG operator via Helm (manual sync)
manifests/
tailscale-operator/
kustomization.yaml
operator.yaml
proxyclass.yaml # CRI-O compatibility
dnsconfig.yaml # Cluster-to-tailnet DNS
egress-forge.yaml # Egress proxy for forge
secret.yaml.tpl # OAuth secret template (manual)
README.md
argocd/
kustomization.yaml # Uses remote base from upstream
service-tailscale.yaml # Ingress (not LoadBalancer)
argocd-cmd-params-cm.yaml # Disable HTTPS redirect
repo-forge-secret.yaml.tpl # SSH key template (manual)
README.md
cloudnative-pg/
values.yaml # Helm values (currently minimal)
README.md
```
### Bootstrap Commands (Actual)
```bash
# 1. Create namespaces
kubectl create namespace tailscale
kubectl create namespace argocd
# 2. Apply secrets (manual, uses 1Password)
op inject -i argocd/manifests/tailscale-operator/secret.yaml.tpl | kubectl apply -f -
PRIV_KEY=$(op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/csjncynh6htjvnh2l2da65y32q/private key?ssh-format=openssh")$'\n' && \
kubectl create secret generic repo-forge -n argocd \
--from-literal=type=git \
--from-literal=url='ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git' \
--from-literal=insecure=true \
--from-literal=sshPrivateKey="$PRIV_KEY" && \
kubectl label secret repo-forge -n argocd argocd.argoproj.io/secret-type=repository
# 3. Bootstrap tailscale-operator
kubectl apply -k argocd/manifests/tailscale-operator/
# 4. Bootstrap ArgoCD
kubectl apply -k argocd/manifests/argocd/
# 5. Login and change password
argocd login argocd.tail8d86e.ts.net --username admin --grpc-web
argocd account update-password
# 6. Apply ArgoCD Applications
kubectl apply -f argocd/apps/argocd.yaml
kubectl apply -f argocd/apps/apps.yaml
# 7. Sync workloads
argocd app sync tailscale-operator
argocd app sync cloudnative-pg
```

@ -1,396 +0,0 @@
# Phase 2: Grafana Migration (Pilot)
**Goal**: Migrate Grafana as lowest-risk pilot service
**Status**: Complete (2026-01-19)
**Prerequisites**: [Phase 1](P1_k8s_infrastructure.complete.md) complete
---
## Overview
This phase migrates Grafana from Homebrew/Ansible on indri to Kubernetes, establishing the pattern for future service migrations. Additionally, we establish the pattern of mirroring Helm chart repositories to forge for resilience and GitOps consistency.
---
## Key Decisions
### Helm Chart Mirroring
**Problem**: P1 pulls charts from external Helm repos, which makes every sync depend on upstream availability.
**Solution**: Mirror Helm chart Git repositories to forge, reference charts from git path.
ArgoCD auto-detects Helm charts when a directory contains `Chart.yaml`. No build step needed.
| Chart | Upstream Git Repo | Forge Mirror | Chart Path |
|-------|-------------------|--------------|------------|
| cloudnative-pg | `github.com/cloudnative-pg/charts` | `forge/eblume/cloudnative-pg-charts` | `charts/cloudnative-pg/` |
| grafana | `github.com/grafana/helm-charts` | `forge/eblume/grafana-helm-charts` | `charts/grafana/` |
### Database Storage
Use SQLite with 1Gi PVC (not k8s PostgreSQL). Grafana stores minimal persistent data and dashboards are git-provisioned.
### Datasource URLs
From k8s pods, use `host.containers.internal` to reach indri services:
- Prometheus: `http://host.containers.internal:9090`
- Loki: `http://host.containers.internal:3100` (requires an Ansible change so Loki binds to 0.0.0.0)
### Ingress
Tailscale Ingress with Let's Encrypt TLS (following ArgoCD pattern), with `crio-compat` proxy class.
### Secrets Management
Admin password stored in 1Password, injected manually via `op inject`. Future: migrate to External Secrets Operator or similar.
---
## Prerequisites
### 0.1 Mirror Helm Chart Repos to Forge
**User action**: Create mirrors in forge:
1. **CloudNativePG charts** (fix existing P1 app):
- Mirror: `https://github.com/cloudnative-pg/charts`
- To: `forge.tail8d86e.ts.net/eblume/cloudnative-pg-charts`
2. **Grafana helm-charts** (new):
- Mirror: `https://github.com/grafana/helm-charts`
- To: `forge.tail8d86e.ts.net/eblume/grafana-helm-charts`
### 0.2 Update Loki to Bind 0.0.0.0
**File**: `ansible/roles/loki/templates/loki-config.yaml.j2`
Add under `server:`:
```yaml
http_listen_address: 0.0.0.0
```
Deploy: `mise run provision-indri -- --tags loki`
---
## Steps
### 1. Fix CloudNativePG to Use Forge Mirror
Update `argocd/apps/cloudnative-pg.yaml` to use forge-mirrored chart:
```yaml
sources:
- repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/cloudnative-pg-charts.git
targetRevision: cloudnative-pg-0.23.0 # git tag
path: charts/cloudnative-pg
helm:
releaseName: cloudnative-pg
valueFiles:
- $values/argocd/manifests/cloudnative-pg/values.yaml
- repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git
targetRevision: main
ref: values
```
---
### 2. Create Grafana Helm Values
**File**: `argocd/manifests/grafana/values.yaml`
```yaml
admin:
existingSecret: grafana-admin
userKey: admin-user
passwordKey: admin-password
persistence:
enabled: true
type: pvc
size: 1Gi
grafana.ini:
server:
root_url: https://grafana.tail8d86e.ts.net
analytics:
check_for_updates: false
reporting_enabled: false
datasources:
datasources.yaml:
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
uid: prometheus
url: http://host.containers.internal:9090
isDefault: true
editable: false
- name: Loki
type: loki
access: proxy
uid: loki
url: http://host.containers.internal:3100
editable: false
sidecar:
dashboards:
enabled: true
label: grafana_dashboard
labelValue: "1"
service:
type: ClusterIP
port: 80
resources:
requests:
memory: "128Mi"
cpu: "100m"
limits:
memory: "512Mi"
cpu: "500m"
```
---
### 3. Create Grafana ArgoCD Application
**File**: `argocd/apps/grafana.yaml`
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: grafana
namespace: argocd
spec:
project: default
sources:
- repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/grafana-helm-charts.git
targetRevision: grafana-8.8.2
path: charts/grafana
helm:
releaseName: grafana
valueFiles:
- $values/argocd/manifests/grafana/values.yaml
- repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git
targetRevision: main
ref: values
destination:
server: https://kubernetes.default.svc
namespace: monitoring
syncPolicy:
syncOptions:
- CreateNamespace=true
```
---
### 4. Create Grafana Config Application
**File**: `argocd/apps/grafana-config.yaml`
Deploys Tailscale Ingress and Dashboard ConfigMaps from `argocd/manifests/grafana-config/`.
---
### 5. Create Grafana Config Manifests
**Directory**: `argocd/manifests/grafana-config/`
Contents:
- `kustomization.yaml`
- `ingress-tailscale.yaml` - Tailscale Ingress for `grafana.tail8d86e.ts.net`
- `secret-admin.yaml.tpl` - Admin password template (1Password-backed)
- `README.md` - Notes on secrets management
- `dashboards/configmap-*.yaml` - 9 dashboard ConfigMaps
**Ingress**:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: grafana-tailscale
namespace: monitoring
annotations:
tailscale.com/proxy-class: "crio-compat"
spec:
ingressClassName: tailscale
defaultBackend:
service:
name: grafana
port:
number: 80
tls:
- hosts:
- grafana
```
**Secret template** (`secret-admin.yaml.tpl`):
```yaml
# Apply: op inject -i secret-admin.yaml.tpl | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
name: grafana-admin
namespace: monitoring
type: Opaque
stringData:
admin-user: admin
admin-password: {{ op://vg6xf6vvfmoh5hqjjhlhbeoaie/oxkcr3xtxnewy7noep2izvyr6y/password }}
```
**Dashboard ConfigMaps**: Convert each JSON from `ansible/roles/grafana/files/dashboards/` to ConfigMap with label `grafana_dashboard: "1"`.
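One possible way to do the conversion is sketched below. It assumes `kubectl` is available for client-side rendering only (`--dry-run=client` and `label --local` do not contact the cluster); the `grafana-dashboard-` name prefix and both function names are illustrative choices, not taken from the original role.

```bash
# Derive a DNS-safe ConfigMap name from a dashboard JSON path.
dashboard_cm_name() {
  base=$(basename "$1" .json)
  printf 'grafana-dashboard-%s\n' \
    "$(printf '%s' "$base" | tr '[:upper:]' '[:lower:]' | sed 's/[_ ]/-/g')"
}

# Hypothetical driver (not called here): write one labelled ConfigMap
# manifest per dashboard JSON file into the output directory.
render_dashboard_configmaps() {
  src_dir=$1
  out_dir=$2
  for json in "$src_dir"/*.json; do
    name=$(dashboard_cm_name "$json")
    kubectl create configmap "$name" -n monitoring \
      --from-file="$json" --dry-run=client -o yaml \
      | kubectl label --local -f - grafana_dashboard=1 -o yaml \
      > "$out_dir/configmap-${name#grafana-dashboard-}.yaml"
  done
}
```

A call such as `render_dashboard_configmaps ansible/roles/grafana/files/dashboards argocd/manifests/grafana-config/dashboards` would produce the `configmap-*.yaml` files listed above; each still needs an entry in `kustomization.yaml`.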
---
### 6. Deploy to Kubernetes
```bash
# Create namespace and secret
ki create namespace monitoring
op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl | ki apply -f -
# Push changes and sync
argocd app sync grafana
argocd app sync grafana-config
```
---
### 7. Tailscale Service Cutover
Remove `svc:grafana` from `ansible/roles/tailscale_serve/defaults/main.yml`, then:
```bash
mise run provision-indri -- --tags tailscale-serve
```
---
### 8. Stop Brew Grafana
```bash
ssh indri 'brew services stop grafana'
```
---
### 9. Retire Ansible Grafana Role
Once k8s Grafana is verified working:
1. **Remove role from playbook** - Delete grafana role entry from `ansible/playbooks/indri.yml`
2. **Delete the role directory** - `rm -rf ansible/roles/grafana/`
3. **Update zk documentation** - Note in `~/code/personal/zk/1767747119-YCPO.md` that Grafana is now k8s-hosted
---
## New Files
| Path | Purpose |
|------|---------|
| `argocd/apps/grafana.yaml` | Grafana Helm chart Application |
| `argocd/apps/grafana-config.yaml` | Grafana config Application |
| `argocd/manifests/grafana/values.yaml` | Helm values |
| `argocd/manifests/grafana-config/kustomization.yaml` | Kustomize config |
| `argocd/manifests/grafana-config/ingress-tailscale.yaml` | Tailscale Ingress |
| `argocd/manifests/grafana-config/secret-admin.yaml.tpl` | Admin password template |
| `argocd/manifests/grafana-config/README.md` | Secrets management notes |
| `argocd/manifests/grafana-config/dashboards/configmap-*.yaml` | 9 dashboard ConfigMaps |
## Modified Files
| Path | Change |
|------|--------|
| `argocd/apps/cloudnative-pg.yaml` | Switch to forge-mirrored chart |
| `ansible/roles/loki/templates/loki-config.yaml.j2` | Add `http_listen_address: 0.0.0.0` |
| `ansible/roles/tailscale_serve/defaults/main.yml` | Remove `svc:grafana` |
| `ansible/playbooks/indri.yml` | Remove grafana role |
## Deleted Files
| Path | Reason |
|------|--------|
| `ansible/roles/grafana/` | Replaced by k8s deployment |
---
## Verification
- [x] Loki accessible from k8s pods
- [x] Prometheus accessible from k8s pods
- [x] Grafana pod running in `monitoring` namespace
- [x] Grafana Ingress active
- [x] https://grafana.tail8d86e.ts.net loads
- [x] All 9 dashboards visible
- [x] Prometheus datasource queries work
- [x] Loki datasource queries work
---
## Rollback
1. Re-add `svc:grafana` to ansible tailscale_serve
2. `mise run provision-indri -- --tags tailscale-serve,grafana`
3. `argocd app delete grafana grafana-config --cascade`
---
## Implementation Notes
*Added during implementation for retrospective review*
### SSH Credential Management
**Issue**: Initial plan used HTTPS URLs for forge-mirrored Helm chart repos, but ArgoCD in cluster couldn't resolve `forge.tail8d86e.ts.net` (MagicDNS not available inside cluster).
**Solution**: Use SSH URLs for all forge repos. Created a **credential template** (`repo-creds-forge`) that matches all repos under `ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/` using URL prefix matching. This allows a single SSH key (added to Forgejo user, not as deploy key) to work for all repos.
### SSH Host Key for ArgoCD
**Issue**: ArgoCD's known_hosts didn't include indri's SSH host key, causing `knownhosts: key is unknown` errors.
**Solution**: Added `argocd-ssh-known-hosts-cm.yaml` as a kustomize patch to include indri's host key alongside the upstream defaults.
**Gotcha**: Kustomize patches must **not specify namespace** - the namespace transformation happens *after* patch matching. Our patch had `namespace: argocd` which caused "no matches for Id" errors until removed.
### Tailscale Hostname Cutover
**Issue**: After removing `svc:grafana` from ansible's tailscale_serve config, the k8s Ingress still got a numbered hostname (`grafana-1.tail8d86e.ts.net`).
**Solution**: The old `svc:grafana` service remained registered in Tailscale admin console even after clearing its serve config. **Manual deletion in Tailscale admin console** was required to free the `grafana` hostname for the k8s Ingress to claim. After deletion, recreating the Ingress picked up the correct hostname.
### ArgoCD Workflow Decision
During implementation, we established the pattern for GitOps workflow:
- **All apps target `main` branch** (not feature branches)
- Manual sync policy on workload apps = merge doesn't auto-deploy
- Workflow: feature branch → PR → merge to main → `argocd app sync <name>`
- For testing: temporarily set one app to feature branch via `argocd app set --revision`
This avoids the friction of switching `targetRevision` in manifests during development.
### Bootstrap Dependencies
Some resources must be applied manually before ArgoCD can manage itself:
1. **SSH known_hosts** - chicken-and-egg: ArgoCD can't sync the config that adds the host key
2. **Credential secrets** - `repo-creds-forge` must exist before ArgoCD can pull from forge
These are documented in `argocd/manifests/argocd/README.md` as bootstrap steps.
### Actual Versions Used
- Grafana Helm chart: `grafana-8.8.2` (tag in grafana-helm-charts repo)
- CloudNativePG Helm chart: `cloudnative-pg-v0.23.0` (tag in cloudnative-pg-charts repo)
- Grafana version: 11.4.0

@ -1,359 +0,0 @@
# Phase 3: PostgreSQL Disaster Recovery & Backup
**Goal**: Test disaster recovery and configure borgmatic backups for k8s-pg
**Status**: Complete (2026-01-19)
**Prerequisites**: [Phase 2](P2_grafana.complete.md) complete
---
## Overview
Phase 3 establishes disaster recovery capabilities for the k8s PostgreSQL cluster:
1. **Fix borgmatic backup issues** - Resolve `borg: command not found` error
2. **Test disaster recovery** - Restore miniflux data from borgmatic backup to k8s-pg
3. **Create borgmatic user** - Read-only backup user in k8s-pg via CloudNativePG
4. **Configure dual database backup** - Backup both brew PostgreSQL and k8s-pg during migration
This phase prepares for Phase 4 (miniflux migration) by verifying we can restore data to k8s-pg.
---
## Key Decisions
### Backup Both Databases During Transition
**Decision**: Configure borgmatic to back up both `localhost:5432/miniflux` (brew) and `k8s-pg.tail8d86e.ts.net:5432/miniflux` (k8s) until the migration is complete.
**Why**: Provides redundancy during migration. After Phase 4, remove localhost entry.
### Reuse Existing borgmatic Password
**Decision**: Use same borgmatic password from 1Password for k8s-pg user.
**Why**: Simpler credential management, password already proven secure.
### CloudNativePG Managed Roles
**Decision**: Declare borgmatic user via CloudNativePG `managed.roles` instead of SQL commands.
**Why**: Declarative, version-controlled, matches eblume user pattern.
### Disable selfHeal on apps App
**Decision**: Remove `selfHeal: true` from `argocd/apps/apps.yaml`.
**Why**: Allows temporarily pointing child apps to feature branches during development without ArgoCD reverting the change.
---
## Steps
### 1. Fix borgmatic borg path issue
**Problem**: borgmatic failing with `borg: command not found`
**Cause**: The LaunchAgent's PATH does not include Homebrew's bin directory, so the `borg` binary is not found.
**Solution**: Add `local_path` to borgmatic config template.
**File**: `ansible/roles/borgmatic/templates/config.yaml.j2`
```yaml
# Path to borg binary (LaunchAgent doesn't have homebrew in PATH)
local_path: {{ borgmatic_local_path }}
```
**File**: `ansible/roles/borgmatic/defaults/main.yml`
```yaml
borgmatic_local_path: /opt/homebrew/bin/borg
```
---
### 2. Run manual backup to verify fix
```bash
mise run provision-indri -- --tags borgmatic
ssh indri '/opt/homebrew/bin/borgmatic --verbosity 1'
```
---
### 3. Extract miniflux dump from borgmatic
```bash
ssh indri 'borgmatic list --archive latest'
ssh indri 'borgmatic restore --archive latest --destination /tmp/restore'
```
---
### 4. Add ACL grant for homelab → k8s
**Problem**: Connection from indri to k8s-pg blocked - Tailscale proxy logs showed "no rules matched"
**Solution**: Add ACL grant in Pulumi.
**File**: `pulumi/policy.hujson`
```hujson
// Homelab can reach k8s PostgreSQL for borgmatic backups
{
"src": ["tag:homelab"],
"dst": ["tag:k8s"],
"ip": ["tcp:5432"],
},
```
Deploy: `mise run tailnet-up`
---
### 5. Restore data to k8s-pg
```bash
# Using eblume superuser credentials from 1Password
ssh indri "psql 'postgres://eblume@k8s-pg.tail8d86e.ts.net:5432/miniflux' -f /tmp/restore/localhost/miniflux/miniflux"
```
**Verification**:
```bash
psql 'postgres://eblume@k8s-pg.tail8d86e.ts.net:5432/miniflux' -c 'SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM feeds; SELECT COUNT(*) FROM entries;'
# Result: 2 users, 2 feeds, 44 entries
```
---
### 6. Create borgmatic user in k8s-pg via CloudNativePG
**File**: `argocd/manifests/databases/secret-borgmatic.yaml.tpl`
```yaml
# Template for borgmatic backup user password
# Apply with: op inject -i secret-borgmatic.yaml.tpl | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
name: blumeops-pg-borgmatic
namespace: databases
type: kubernetes.io/basic-auth
stringData:
username: borgmatic
password: {{ op://vg6xf6vvfmoh5hqjjhlhbeoaie/mw2bv5we7woicjza7hc6s44yvy/db-password }}
```
**File**: `argocd/manifests/databases/blumeops-pg.yaml` (add to managed roles)
```yaml
managed:
roles:
# ... existing eblume role ...
# borgmatic read-only user for backups
- name: borgmatic
login: true
connectionLimit: -1
ensure: present
inherit: true
inRoles:
- pg_read_all_data
passwordSecret:
name: blumeops-pg-borgmatic
```
**Deploy**:
```bash
op inject -i argocd/manifests/databases/secret-borgmatic.yaml.tpl | kubectl apply -f -
argocd app set blumeops-pg --revision feature/p3-postgresql-borgmatic
argocd app sync blumeops-pg
```
---
### 7. Configure borgmatic for dual database backup
**File**: `ansible/roles/borgmatic/defaults/main.yml`
```yaml
borgmatic_postgresql_databases:
# Brew PostgreSQL on indri (current production)
- name: miniflux
hostname: localhost
port: 5432
username: borgmatic
# k8s PostgreSQL (CloudNativePG) - backup both during migration
- name: miniflux
hostname: k8s-pg.tail8d86e.ts.net
port: 5432
username: borgmatic
```
**File**: `ansible/roles/postgresql/tasks/main.yml` (update .pgpass)
```yaml
- name: Write .pgpass file for borgmatic backups
ansible.builtin.copy:
content: |
# Managed by ansible - only read-only roles
localhost:{{ postgresql_port }}:*:borgmatic:{{ postgresql_user_passwords['borgmatic'] }}
k8s-pg.tail8d86e.ts.net:5432:*:borgmatic:{{ postgresql_user_passwords['borgmatic'] }}
dest: ~/.pgpass
mode: '0600'
no_log: true
```
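One detail the template above does not handle: in the `.pgpass` format, a literal `:` or `\` inside any field must be escaped with a backslash. A minimal sketch of building a correctly escaped line follows, in case the password ever contains such characters; the helper names `pgpass_escape` and `pgpass_line` are hypothetical.

```bash
# Escape backslashes first, then colons, as the .pgpass format requires.
pgpass_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/:/\\:/g'
}

# Build one .pgpass line: hostname:port:database:username:password
pgpass_line() {
  printf '%s:%s:%s:%s:%s\n' "$1" "$2" "$3" "$4" "$(pgpass_escape "$5")"
}
```

For instance, `pgpass_line k8s-pg.tail8d86e.ts.net 5432 '*' borgmatic "$PASSWORD"` yields a line that can be appended to `~/.pgpass` (mode `0600`), where `$PASSWORD` is a placeholder for the secret.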
---
### 8. Verify complete backup pipeline
```bash
mise run provision-indri -- --tags borgmatic,postgresql
ssh indri '/opt/homebrew/bin/borgmatic --verbosity 1'
ssh indri 'borgmatic list --archive latest'
```
**Expected output**: Archive contains both dumps:
- `localhost/miniflux/miniflux`
- `k8s-pg.tail8d86e.ts.net/miniflux/miniflux`
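The check for both dumps can be scripted rather than eyeballed. The sketch below assumes the archive listing contains each dump path as a substring of some line; the function name `missing_dumps` is hypothetical.

```bash
# Read an archive listing on stdin and print every expected path
# (given as arguments) that does not appear anywhere in it.
missing_dumps() {
  listing=$(cat)
  for path in "$@"; do
    printf '%s\n' "$listing" | grep -qF -- "$path" || printf '%s\n' "$path"
  done
}
```

Usage might look like `ssh indri 'borgmatic list --archive latest' | missing_dumps localhost/miniflux/miniflux k8s-pg.tail8d86e.ts.net/miniflux/miniflux`; empty output means both dumps are present.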
---
### 9. Fix ArgoCD drift from CNPG defaults
**Problem**: ArgoCD showed blumeops-pg as OutOfSync due to CNPG operator adding default values.
**Solution**: Add CNPG defaults explicitly to managed roles.
**File**: `argocd/manifests/databases/blumeops-pg.yaml`
```yaml
managed:
roles:
- name: eblume
# ... existing fields ...
connectionLimit: -1
ensure: present
inherit: true
- name: borgmatic
# ... existing fields ...
connectionLimit: -1
ensure: present
inherit: true
```
---
### 10. Update zk documentation
Updated:
- `~/code/personal/zk/borgmatic.md` - k8s-pg backup documentation and log entry
- `~/code/personal/zk/postgresql.md` - k8s PostgreSQL section and log entry
---
## New Files
| Path | Purpose |
|------|---------|
| `argocd/manifests/databases/secret-borgmatic.yaml.tpl` | borgmatic user password template |
## Modified Files
| Path | Change |
|------|--------|
| `ansible/roles/borgmatic/defaults/main.yml` | Added `borgmatic_local_path`, k8s-pg database entry |
| `ansible/roles/borgmatic/templates/config.yaml.j2` | Added `local_path` option |
| `ansible/roles/postgresql/tasks/main.yml` | Added k8s-pg to .pgpass |
| `argocd/apps/apps.yaml` | Disabled selfHeal |
| `argocd/manifests/databases/blumeops-pg.yaml` | Added borgmatic managed role, CNPG defaults |
| `pulumi/policy.hujson` | Added ACL grant homelab → k8s on tcp:5432 |
---
## Verification
- [x] borgmatic backup runs successfully
- [x] Miniflux data restored to k8s-pg (2 users, 2 feeds, 44 entries)
- [x] borgmatic user created in k8s-pg with pg_read_all_data role
- [x] Both localhost and k8s-pg databases in backup archive
- [x] ArgoCD shows blumeops-pg as Synced
- [x] zk documentation updated
---
## Rollback
Keep brew PostgreSQL running until Phase 4 verified. To revert:
1. Remove k8s-pg entry from borgmatic databases
2. Remove k8s-pg from .pgpass
3. `mise run provision-indri -- --tags borgmatic,postgresql`
---
## Implementation Notes
*Added during implementation for retrospective review*
### borgmatic LaunchAgent PATH Issue
**Problem**: borgmatic LaunchAgent failed with `borg: command not found`
**Root cause**: LaunchAgents run with minimal PATH that doesn't include `/opt/homebrew/bin`
**Solution**: Added `local_path: /opt/homebrew/bin/borg` to borgmatic config. This was already done for `pg_dump_command` but not for borg itself.
**Lesson**: Any tool invoked by borgmatic needs an absolute path when borgmatic runs from a LaunchAgent.
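This class of failure can be reproduced from an interactive shell by looking a command up under a stripped-down environment. The sketch below is an approximation: the PATH value is an assumption about what launchd provides by default, and the function name `resolves_in_minimal_path` is hypothetical.

```bash
# Succeed only if the command resolves under a launchd-like minimal PATH.
# Assumption: launchd's default PATH is roughly the value used here.
resolves_in_minimal_path() {
  env -i PATH=/usr/bin:/bin:/usr/sbin:/sbin \
    /bin/sh -c 'command -v "$1" >/dev/null 2>&1' sh "$1"
}
```

Running `resolves_in_minimal_path borg || echo "borg needs an absolute path"` on indri would flag the problem before the LaunchAgent hits it; the same check applies to `pg_dump` and any other helper borgmatic calls.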
### 1Password Field Name Mismatch
**Issue**: Initial secret template used `password` field but 1Password item had `db-password`.
**Discovery**: Error message from `op inject` indicated field not found.
**Fix**: Updated template to use correct field name `db-password`.
### ACL Grant Discovery
**Problem**: Connection from indri (tag:homelab) to k8s-pg (tag:k8s) failed.
**Diagnosis**: Checked Tailscale operator proxy logs which showed "no rules matched" - clear indication of missing ACL.
**Solution**: Added explicit grant in `pulumi/policy.hujson` for `tag:homelab` → `tag:k8s` on `tcp:5432`.
### ArgoCD selfHeal and Feature Branch Development
**Problem**: When testing changes, temporarily pointed blumeops-pg app to feature branch via `argocd app set --revision`. ArgoCD's selfHeal kept reverting it back to main.
**Discussion**: Two options considered:
- Option A: Disable selfHeal on apps app (manual sync required for new apps)
- Option B: Keep selfHeal, use different workflow
**Decision**: Option A chosen. The apps app now only has `prune: true`, not selfHeal. This allows:
1. Temporarily testing feature branches
2. Manual control over when app manifest changes are applied
**Trade-off**: Must manually sync apps app when adding/removing Application manifests.
### CloudNativePG Managed Role Reconciliation
**Issue**: After creating borgmatic secret with correct password, CNPG didn't immediately update the user.
**Solution**: Annotated the Cluster to trigger reconciliation:
```bash
kubectl annotate cluster blumeops-pg -n databases cnpg.io/reconcile=$(date +%s) --overwrite
```
### ArgoCD Drift from CNPG Defaults
**Problem**: blumeops-pg showed OutOfSync despite successful syncs.
**Cause**: CNPG operator adds default values (`connectionLimit: -1`, `ensure: present`, `inherit: true`) to managed roles that weren't in our spec.
**Solution**: Added these defaults explicitly to our spec to match what CNPG generates.
**Comment added**: Documented in blumeops-pg.yaml that these are "CNPG defaults added to prevent ArgoCD drift".
### Git Workflow for Phase 3
1. Created feature branch: `feature/p3-postgresql-borgmatic`
2. Made commits throughout implementation
3. Pointed blumeops-pg app to feature branch for testing
4. Created PR #32 for review
5. After merge, reset app to main: `argocd app set blumeops-pg --revision main`
This workflow was enabled by disabling selfHeal (see above).

@ -1,162 +0,0 @@
# Phase 4: Miniflux Migration to Kubernetes
**Goal**: Migrate Miniflux entirely off indri and onto k8s, retire brew PostgreSQL, rename k8s-pg to pg
**Status**: Complete (2026-01-20)
**Prerequisites**: [Phase 3](P3_postgresql.complete.md) complete
---
## Overview
This phase completed the miniflux migration and retired brew PostgreSQL:
1. Deployed miniflux container in k8s via ArgoCD
2. Exposed via Tailscale Ingress at `feed.tail8d86e.ts.net`
3. Removed all miniflux infrastructure from indri (ansible role, brew service, Tailscale serve)
4. Retired brew PostgreSQL (no longer needed)
5. Renamed k8s-pg to pg (canonical Tailscale hostname)
6. Updated borgmatic to backup only `pg.tail8d86e.ts.net`
7. Updated all zk documentation
---
## New Files
| Path | Purpose |
|------|---------|
| `argocd/apps/miniflux.yaml` | ArgoCD Application definition |
| `argocd/manifests/miniflux/deployment.yaml` | Miniflux Deployment |
| `argocd/manifests/miniflux/service.yaml` | ClusterIP Service |
| `argocd/manifests/miniflux/ingress-tailscale.yaml` | Tailscale Ingress for `feed.tail8d86e.ts.net` |
| `argocd/manifests/miniflux/secret-db.yaml.tpl` | Database URL secret documentation |
| `argocd/manifests/miniflux/kustomization.yaml` | Kustomize configuration |
| `argocd/manifests/miniflux/README.md` | Setup instructions |
## Modified Files
| Path | Change |
|------|--------|
| `ansible/playbooks/indri.yml` | Removed miniflux and postgresql roles, simplified pre_tasks |
| `ansible/roles/tailscale_serve/defaults/main.yml` | Removed `svc:feed` and `svc:pg` entries |
| `ansible/roles/alloy/defaults/main.yml` | Removed miniflux and postgresql logs, disabled postgres metrics |
| `ansible/roles/borgmatic/defaults/main.yml` | Updated to backup only `pg.tail8d86e.ts.net` |
| `ansible/roles/borgmatic/tasks/main.yml` | Added .pgpass file management |
| `argocd/manifests/databases/service-tailscale.yaml` | Renamed hostname from k8s-pg to pg |
## Deleted Files
| Path | Reason |
|------|--------|
| `ansible/roles/miniflux/` | Entire role no longer needed |
| `ansible/roles/postgresql/` | Brew PostgreSQL no longer needed |
---
## Verification
- [x] Miniflux pod healthy in k8s
- [x] https://feed.tail8d86e.ts.net accessible
- [x] User `eblume` can log in
- [x] Feeds visible and entries readable
- [x] `pg.tail8d86e.ts.net` resolves to k8s PostgreSQL
- [x] Old `k8s-pg` and `feed` devices removed from Tailscale
- [x] brew miniflux and postgresql services stopped
- [x] Tailscale serve entries cleared from indri
- [x] zk documentation updated
---
## Implementation Notes
*Lessons learned and issues encountered*
### CNPG-Generated Password vs 1Password
**Problem**: Initial secret template used 1Password for miniflux database password, but CNPG auto-generates the bootstrap owner password.
**Solution**: Reference the CNPG-generated password from `blumeops-pg-app` secret:
```bash
kubectl create secret generic miniflux-db -n miniflux \
--from-literal=url="$(kubectl -n databases get secret blumeops-pg-app -o jsonpath='{.data.uri}' | base64 -d)"
```
### Table Ownership Issue After P3 Restore
**Problem**: Miniflux pod crashed with "permission denied for table schema_version".
**Root cause**: P3 restore was run as the `eblume` superuser, so all tables were created owned by `eblume`, not `miniflux`.
**Solution**: Transfer ownership of all tables to miniflux:
```sql
DO $$
DECLARE r RECORD;
BEGIN
  FOR r IN (SELECT tablename FROM pg_tables WHERE schemaname = 'public') LOOP
    EXECUTE 'ALTER TABLE public.' || quote_ident(r.tablename) || ' OWNER TO miniflux';
  END LOOP;
END$$;
```
### Tailscale Ingress Hostname Suffix
**Behavior**: When requesting a Tailscale hostname that's already taken, the operator adds a suffix (e.g., `feed-1`).
**Workflow**:
1. Deploy initially - gets `feed-1.tail8d86e.ts.net`
2. Clear old `svc:feed` from indri
3. Delete old `feed` device from Tailscale admin
4. Delete and recreate the Ingress - now claims `feed`
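To tell whether the Ingress has claimed the intended name or is still on a suffixed one, a small check along these lines may help. This is a sketch only: `hostname_state` is a hypothetical helper, and it assumes the operator reports the assigned name in the Ingress status.

```bash
# hostname_state WANT FQDN
# Prints "claimed" when the first DNS label equals WANT, "suffixed" when it
# is WANT followed by "-<digits>" (e.g. feed-1), and "other" otherwise.
hostname_state() {
  want="$1"
  label="${2%%.*}"            # first label of the FQDN
  case "$label" in
    "$want") echo claimed ;;
    "$want"-*[!0-9]*) echo other ;;
    "$want"-?*) echo suffixed ;;
    *) echo other ;;
  esac
}

# Usage (the Ingress name is a placeholder, not taken from the manifests):
#   fqdn="$(kubectl -n miniflux get ingress <ingress-name> \
#     -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
#   hostname_state feed "$fqdn"
```

If the result is `suffixed`, steps 2 through 4 above still need to be carried out.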
### Renaming Tailscale Service Hostname
**Problem**: Changing the `tailscale.com/hostname` annotation doesn't automatically update the Tailscale device.
**Solution**: Delete the service and let ArgoCD recreate it:
```bash
kubectl -n databases delete service blumeops-pg-tailscale
argocd app sync blumeops-pg
```
### .pgpass Management Migration
**Issue**: The postgresql role managed `~/.pgpass` for borgmatic. With postgresql role deleted, borgmatic couldn't authenticate.
**Solution**: Moved .pgpass management to the borgmatic role. Password is still fetched in playbook pre_tasks as `borgmatic_db_password`.
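For reference, a `.pgpass` entry has the form `hostname:port:database:username:password`, and libpq ignores the file unless its permissions are `0600`. The sketch below shows how such a line could be built; the helper names, the `borgmatic` user, and port `5432` are assumptions for illustration, not values taken from the role.

```bash
# pgpass_escape FIELD
# Escapes backslashes and colons, as the .pgpass format requires.
pgpass_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/:/\\:/g'
}

# pgpass_line HOST PORT DATABASE USER PASSWORD
# Prints a single .pgpass entry.
pgpass_line() {
  printf '%s:%s:%s:%s:%s\n' \
    "$(pgpass_escape "$1")" "$(pgpass_escape "$2")" "$(pgpass_escape "$3")" \
    "$(pgpass_escape "$4")" "$(pgpass_escape "$5")"
}

# Usage (hypothetical user and port):
#   pgpass_line pg.tail8d86e.ts.net 5432 '*' borgmatic "$PASSWORD" > ~/.pgpass
#   chmod 600 ~/.pgpass
```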
### Ansible Check Mode and Registered Variables
**Problem**: Running `provision-indri --check --diff` failed in the podman role with "Conditional result (True) was derived from value of type 'str'" errors.
**Root cause**: Command tasks are skipped in check mode, leaving registered variables undefined or with unexpected types when used in conditionals.
**Solution**: Added `check_mode: false` to read-only command tasks that gather information:
```yaml
- name: Check if podman machine exists
  ansible.builtin.command:
    cmd: podman machine list --format json
  register: podman_machine_list
  changed_when: false
  check_mode: false  # Safe to run in check mode - read-only
```
**Lesson**: Any task that registers a variable used in conditionals should have `check_mode: false` if the command is read-only/safe.
### 1Password CLI on Headless Hosts
**Issue**: Attempted to run `op` commands on indri, but 1Password CLI requires interactive authentication (biometrics/password).
**Solution**: All `op` commands must be in `pre_tasks` of the playbook with `delegate_to: localhost` so they run on gilbert (the workstation with GUI auth).
### Git Workflow for Phase 4
1. Created feature branch: `feature/p4-miniflux`
2. Made incremental commits throughout implementation
3. Pointed `miniflux` and `blumeops-pg` apps to feature branch for testing
4. Created PR #33 for review
5. After merge, reset apps to main:
```bash
argocd app set miniflux --revision main
argocd app set blumeops-pg --revision main
argocd app sync apps
```

View file

@ -1,208 +0,0 @@
# Phase 5.1: Migrate Minikube from QEMU2 to Docker Driver
**Goal**: Replace the qemu2 driver with docker to fix remote API access and simplify volume mounts
**Status**: Complete (2026-01-21) - Cluster running, ArgoCD deployed, apps synced
**Prerequisites**: [Phase 5](P5_devpi.complete.md) complete
---
## Background
### Original Problem (Podman → QEMU2)
During Phase 6 (Kiwix/Transmission migration), we discovered that the **podman driver has fundamental limitations** that prevent mounting external volumes:
1. **SMB CSI driver fails** with "Operation not permitted" - the rootless container lacks kernel-level mount capabilities
2. **`minikube mount` fails** - 9p mount gets "permission denied" inside the podman VM
3. **hostPath volumes** only work for paths inside the minikube container, not the macOS host
We migrated to QEMU2 to get a full VM with kernel capabilities.
### New Problem (QEMU2 → Docker)
The QEMU2 driver introduced a **new problem**: the Kubernetes API server is inside the VM at `192.168.105.2:6443`, and Tailscale's TCP proxy cannot forward to it properly:
- TCP connections succeed (nc -zv works)
- TLS handshake times out
- Root cause unknown, but likely related to Tailscale serve's handling of non-localhost upstreams
Additionally, the volume mount solution with QEMU2 was complex:
- Required NFS mount from sifaka → indri
- Then `minikube mount` to pass through to VM
- Two LaunchAgents/LaunchDaemons for persistence
- macOS GUI approval required for network access
### Why Docker?
The **docker driver** solves both problems:
1. **API Server on localhost**: Docker Desktop handles port forwarding from container to localhost automatically, so `tailscale serve --tcp=443 tcp://localhost:PORT` works
2. **Simpler volume mounts**: Docker Desktop has built-in macOS file sharing. Paths shared with Docker are accessible inside containers.
3. **Official Tailscale recommendation**: Tailscale's own [Kubernetes guide](https://tailscale.com/learn/managing-access-to-kubernetes-with-tailscale) uses minikube with the docker driver.
---
## Implementation Summary
### Infrastructure Changes
1. **Docker Desktop installed** (manually, via `brew install --cask docker`)
- Configured with 12GB memory in Docker Desktop settings
- Kubernetes option disabled (using minikube instead)
2. **Docker minikube cluster created**:
```bash
minikube start \
  --driver=docker \
  --container-runtime=docker \
  --cpus=6 \
  --memory=11264 \
  --disk-size=200g \
  --apiserver-names=k8s.tail8d86e.ts.net,indri \
  --apiserver-port=6443 \
  --listen-address=0.0.0.0
```
3. **Tailscale serve configured** for k8s API:
- API server on localhost (port is dynamic with docker driver)
- `tailscale serve --service=svc:k8s --tcp=443 tcp://localhost:<PORT>`
4. **Remote kubectl access working** from gilbert:
- Created `mise-tasks/ensure-minikube-indri-kubectl-config` script
- Fetches certs from indri and sets up `~/.kube/minikube-indri/config.yml`
### Ansible Roles Updated
- `ansible/roles/minikube/` - docker driver, removed qemu2/NFS/socket_vmnet
- `ansible/roles/tailscale_serve/` - removed svc:k8s (minikube role handles dynamic port)
- Containerd registry mirrors configured for zot pull-through cache
### ArgoCD Bootstrap
All apps deployed and synced from `feature/p5.1-qemu2-migration` branch:
| App | Status | Notes |
|-----|--------|-------|
| tailscale-operator | Healthy | Manages Tailscale ingresses |
| argocd | Healthy | Self-managed |
| cloudnative-pg | Healthy | PostgreSQL operator |
| blumeops-pg | Progressing | PostgreSQL cluster starting |
| grafana | Progressing | Needs grafana-admin secret |
| grafana-config | Healthy | Dashboards and ingress |
| miniflux | Progressing | Needs miniflux-config secret |
| devpi | Progressing | Starting up |
### Secrets Still Needed
After PR merge, apply these secrets manually:
```bash
# Grafana admin password
op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl | kubectl --context=minikube-indri apply -f -
# Miniflux config
op inject -i argocd/manifests/miniflux/secret.yaml.tpl | kubectl --context=minikube-indri apply -f -
```
---
## Technical Notes
### API Server Port
With the docker driver, the API server port is **dynamic** - Docker maps a random host port to 6443 inside the container.
The minikube ansible role queries the port after cluster start and configures tailscale serve accordingly.
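The role's actual lookup is not shown in this plan. As a rough sketch of one way to do it, `docker port` prints the host binding for a container port, and the port number can be taken from the end of that string. `port_from_binding` is a hypothetical helper, and the container name `minikube` assumes the default profile.

```bash
# port_from_binding BINDING
# Strips everything up to the last colon from a "host:port" binding,
# e.g. "0.0.0.0:32771" or "[::]:32771".
port_from_binding() {
  printf '%s\n' "${1##*:}"
}

# Usage on indri (assumes the container is named "minikube"):
#   binding="$(docker port minikube 6443 | head -n1)"
#   port="$(port_from_binding "$binding")"
#   tailscale serve --service=svc:k8s --tcp=443 "tcp://localhost:${port}"
```

Because the mapping can change when the container is recreated, the lookup would need to run again after each cluster start.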
### Registry Mirror Configuration
Containerd uses `/etc/containerd/certs.d/<registry>/hosts.toml` files. The ansible role configures mirrors for:
- `registry.tail8d86e.ts.net` (private images)
- `docker.io`
- `ghcr.io`
- `quay.io`
### ProxyClass Renamed
Changed from `crio-compat` to `default` - the old name was misleading since we're no longer using CRI-O.
### Volume Mounts for P6 (Kiwix/Transmission)
**Solution: Direct NFS from pods to sifaka** ✅ TESTED AND WORKING
Docker NATs outbound traffic through indri's LAN IP (192.168.1.50), so sifaka's NFS exports need to allow `192.168.1.0/24`.
Sifaka NFS exports configured:
- `192.168.1.0/24` - Docker containers via indri NAT
- `100.64.0.0/10` - Tailscale clients
Pods can mount NFS directly:
```yaml
volumes:
  - name: torrents
    nfs:
      server: sifaka
      path: /volume1/torrents
```
No LaunchAgents, no `minikube mount`, no SMB CSI driver needed.
---
## Verification Checklist
- [x] Docker Desktop installed and running on indri
- [x] QEMU2 minikube deleted
- [x] Docker minikube running (6 CPUs, 11GB RAM)
- [x] API server accessible on localhost
- [x] Tailscale serve configured for svc:k8s
- [x] Remote kubectl access working from gilbert
- [x] Ansible roles updated for docker driver
- [x] socket_vmnet stopped
- [x] ArgoCD deployed and synced
- [x] All apps synced to feature branch
- [x] Apply app secrets (grafana-admin, miniflux-db, devpi-root, eblume, borgmatic)
- [x] Verify all apps healthy after secrets applied
- [x] Miniflux database restored from borgmatic backup
- [ ] Merge PR and reset apps to main branch
- [ ] `mise run indri-services-check` passes
---
## Post-Merge Steps
After PR is merged:
```bash
# Reset all blumeops apps to main branch
argocd app set apps --revision main
argocd app set argocd --revision main
argocd app set blumeops-pg --revision main
argocd app set devpi --revision main
argocd app set grafana-config --revision main
argocd app set miniflux --revision main
argocd app set tailscale-operator --revision main
# Sync all apps
argocd app sync apps
argocd app sync argocd
argocd app sync tailscale-operator
argocd app sync blumeops-pg
argocd app sync grafana-config
argocd app sync miniflux
argocd app sync devpi
```
---
## Rollback Plan
If the Docker driver doesn't work:
1. Delete Docker minikube: `minikube delete`
2. Recreate QEMU2 cluster (restore old ansible config from git)
3. Accept the Tailscale TCP forwarding limitation and use SSH tunnel for remote kubectl

View file

@ -1,102 +0,0 @@
# Phase 5: devpi Migration to Kubernetes
**Goal**: Migrate devpi PyPI caching proxy from indri to k8s
**Status**: Complete (2026-01-20)
**Prerequisites**: [Phase 4](P4_miniflux.complete.md) complete
---
## Summary
Successfully migrated devpi from mcquack LaunchAgent on indri to Kubernetes:
- Custom container image with devpi-server + devpi-web + auto-init startup script
- StatefulSet with 50Gi PVC for data persistence
- Tailscale Ingress at `pypi.tail8d86e.ts.net`
- Root password from 1Password secret, auto-initialized on first run
- Verified pip caching proxy and mcquack package upload
---
## Key Learnings
### Registry Mirror Configuration
- Minikube's CRI-O can't resolve Tailscale hostnames directly
- Added registry mirror config to redirect `registry.tail8d86e.ts.net` → `host.containers.internal:5050`
- Also added direct insecure registry entry for `host.containers.internal:5050`
- Config in `ansible/roles/minikube/files/zot-mirror.conf`
### Memory Requirements
- devpi-web's Whoosh search indexer needs significant memory during PyPI index build
- Initial 512Mi limit caused OOMKills
- Solution: High limit (2Gi) with low request (256Mi) - memory reclaimed after indexing
### Environment Variable Conflicts
- Kubernetes auto-sets `DEVPI_PORT` for service discovery
- Conflicted with our port config - renamed to `DEVPI_LISTEN_PORT`
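The conflict arises because Kubernetes injects Docker-link style variables for each Service, so a Service named `devpi` (assumed here) produces `DEVPI_PORT=tcp://<cluster-ip>:<port>` rather than a bare port number. A minimal sketch of how a startup script might read the renamed variable defensively follows; `listen_port` is a hypothetical helper, and 3141 is devpi-server's default port.

```bash
# listen_port
# Prints the port to listen on: DEVPI_LISTEN_PORT if set, else 3141.
# Rejects anything that is not purely numeric, such as the
# "tcp://10.96.0.12:3141" value Kubernetes places in DEVPI_PORT.
listen_port() {
  port="${DEVPI_LISTEN_PORT:-3141}"
  case "$port" in
    ''|*[!0-9]*)
      echo "invalid port: $port" >&2
      return 1
      ;;
  esac
  printf '%s\n' "$port"
}
```

Validating the value makes a recurrence of the same collision fail loudly at startup instead of producing a confusing bind error.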
### Tailscale Serve Cleanup
- Use `tailscale serve status --json` to see entries (non-JSON output can be empty)
- Use `tailscale serve clear svc:<name>` to remove entries
### ArgoCD Workflow
- Changed `apps` to manual sync (was auto-sync with prune)
- Workflow: sync apps → set revision to feature branch → sync service → test → reset to main after merge
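The workflow above could be wrapped in a small helper, roughly as follows. This is a sketch: `argocd_point_at` and the `ARGOCD` override are hypothetical and not part of the repository, though the `argocd app set --revision` and `argocd app sync` calls are the same ones used elsewhere in these plans.

```bash
# argocd_point_at APP REVISION
# Points an app at a git revision and syncs it.
# ARGOCD is a hypothetical override; ARGOCD=echo prints the commands instead.
argocd_point_at() {
  "${ARGOCD:-argocd}" app set "$1" --revision "$2" &&
    "${ARGOCD:-argocd}" app sync "$1"
}

# Typical sequence for one service:
#   argocd app sync apps                        # pick up new Application manifests
#   argocd_point_at devpi <feature-branch>      # test from the feature branch
#   argocd_point_at devpi main                  # after the PR merges
```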
---
## Verification Checklist
- [x] devpi pod healthy in k8s
- [x] https://pypi.tail8d86e.ts.net accessible
- [x] Web interface shows root/pypi index
- [x] `pip install <package>` works through proxy
- [x] mcquack v1.0.0 uploaded to eblume/dev
- [x] `pip install --index-url https://pypi.tail8d86e.ts.net/eblume/dev/+simple/ mcquack` works
- [x] Old devpi service removed from indri
- [x] zk documentation updated
---
## Files Changed
### New Files
| Path | Purpose |
|------|---------|
| `argocd/apps/devpi.yaml` | ArgoCD Application definition |
| `argocd/manifests/devpi/Dockerfile` | Container image with startup script |
| `argocd/manifests/devpi/start.sh` | Auto-init startup script |
| `argocd/manifests/devpi/statefulset.yaml` | StatefulSet with PVC |
| `argocd/manifests/devpi/service.yaml` | ClusterIP Service |
| `argocd/manifests/devpi/ingress-tailscale.yaml` | Tailscale Ingress |
| `argocd/manifests/devpi/kustomization.yaml` | Kustomize configuration |
| `argocd/manifests/devpi/secret-root.yaml.tpl` | 1Password secret template |
| `argocd/manifests/devpi/README.md` | Setup documentation |
### Modified Files
| Path | Change |
|------|--------|
| `CLAUDE.md` | Added k8s/ArgoCD workflow documentation |
| `ansible/playbooks/indri.yml` | Removed devpi and devpi_metrics roles |
| `ansible/roles/tailscale_serve/defaults/main.yml` | Removed svc:pypi |
| `ansible/roles/alloy/defaults/main.yml` | Removed devpi log collection |
| `ansible/roles/borgmatic/defaults/main.yml` | Removed devpi backup paths |
| `ansible/roles/minikube/files/zot-mirror.conf` | Added registry mirror for Tailscale hostname |
| `argocd/apps/apps.yaml` | Changed to manual sync policy |
### Roles Kept (not deleted)
- `ansible/roles/devpi/` - Kept for reference
- `ansible/roles/devpi_metrics/` - Kept for reference
---
## Post-Merge Cleanup
After PR merge, reset ArgoCD apps to main:
```fish
argocd app set apps --revision main
argocd app sync apps
argocd app set devpi --revision main
argocd app sync devpi
```

File diff suppressed because it is too large

View file

@ -1,394 +0,0 @@
# Phase 7: Forgejo Migration to Kubernetes
**Goal**: Migrate Forgejo from indri (macOS Homebrew) to Kubernetes via ArgoCD
**Status**: Planning (2026-01-21)
**Prerequisites**: [Phase 6](P6_kiwix.complete.md) complete
---
## Critical Risks & Mitigations
### 1. Circular Dependency (Highest Risk)
ArgoCD pulls manifests from Forgejo. If k8s Forgejo fails, we cannot redeploy it.
**Mitigation**: blumeops is mirrored to `github.com/eblume/blumeops`. DR procedure documented to switch ArgoCD to GitHub temporarily (see Disaster Recovery section).
### 2. Split Hostnames Required
The Tailscale k8s operator [cannot expose both HTTPS and TCP/SSH on the same hostname](https://github.com/tailscale/tailscale/issues/15539). See also [user comment](https://github.com/tailscale/tailscale/issues/15539#issuecomment-3782368432).
**Solution**:
- **HTTPS (web UI)**: `forge.tail8d86e.ts.net` via Tailscale Ingress
- **SSH (git operations)**: `git.tail8d86e.ts.net` via Tailscale LoadBalancer
---
## Current State
### Forgejo on indri
| Component | Location/Details |
|-----------|------------------|
| Data directory | `/opt/homebrew/var/forgejo/` (~426MB) |
| SQLite database | `/opt/homebrew/var/forgejo/data/forgejo.db` (4.1MB) |
| Git repositories | `/opt/homebrew/var/forgejo/data/forgejo-repositories/` (~418MB) |
| Configuration | `/opt/homebrew/var/forgejo/custom/conf/app.ini` (contains secrets) |
| HTTP port | 3001 (localhost) |
| SSH port | 2200 (localhost) |
| Tailscale | `svc:forge` with tcp:22→2200 and https:443→3001 |
| Backup | borgmatic backs up to sifaka |
### Hosted Repositories (8 total)
- blumeops (mirrored to GitHub)
- cloudnative-pg-charts
- csi-driver-smb
- devpi
- dotfiles
- grafana-helm-charts
- mcquack
- zot
---
## Architecture Decision: Helm Chart via ArgoCD
Following the established pattern from cloudnative-pg and grafana:
1. Mirror `https://code.forgejo.org/forgejo-helm/forgejo-helm` to forge
2. ArgoCD Application with multi-source (chart + values)
3. Values file in `argocd/manifests/forgejo/values.yaml`
---
## All `forge` References Requiring Update
### SSH URLs (change to `git.tail8d86e.ts.net:22`)
| File | Current | After |
|------|---------|-------|
| `argocd/apps/apps.yaml` | `ssh://forgejo@indri.tail8d86e.ts.net:2200/...` | `ssh://forgejo@git.tail8d86e.ts.net/...` |
| `argocd/apps/argocd.yaml` | same | same |
| `argocd/apps/blumeops-pg.yaml` | same | same |
| `argocd/apps/cloudnative-pg.yaml` | same | same |
| `argocd/apps/devpi.yaml` | same | same |
| `argocd/apps/grafana.yaml` | same | same |
| `argocd/apps/grafana-config.yaml` | same | same |
| `argocd/apps/kiwix.yaml` | same | same |
| `argocd/apps/miniflux.yaml` | same | same |
| `argocd/apps/tailscale-operator.yaml` | same | same |
| `argocd/apps/torrent.yaml` | same | same |
| `argocd/manifests/argocd/repo-forge-secret.yaml.tpl` | `ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/` | `ssh://forgejo@git.tail8d86e.ts.net/eblume/` |
| `ansible/group_vars/all.yml` | `ssh://forgejo@forge.tail8d86e.ts.net/...` | `ssh://forgejo@git.tail8d86e.ts.net/...` |
### SSH Known Hosts (add `git.tail8d86e.ts.net`)
| File | Change |
|------|--------|
| `argocd/manifests/argocd/argocd-ssh-known-hosts-cm.yaml` | Add `git.tail8d86e.ts.net ssh-ed25519 AAAA...` |
### HTTPS URLs (stay as `forge.tail8d86e.ts.net`)
These remain unchanged:
- `CLAUDE.md:135` - Mirror location
- `mise-tasks/pr-comments:23` - Forge API base
- `mise-tasks/indri-services-check:65` - HTTP health check (update to check k8s)
### Ansible/Indri Cleanup (remove after migration)
| File | Action |
|------|--------|
| `ansible/playbooks/indri.yml:36-37` | Remove forgejo role |
| `ansible/roles/tailscale_serve/defaults/main.yml:6` | Remove `svc:forge` entry |
| `ansible/roles/alloy/defaults/main.yml:31-32` | Remove forgejo log collection |
| `ansible/roles/borgmatic/defaults/main.yml:17` | Update backup path |
### Tailscale/Pulumi (update after hostname cutover)
| File | Change |
|------|--------|
| `argocd/manifests/tailscale-operator/egress-forge.yaml` | Delete (no longer needed) |
| `pulumi/policy.hujson` | Update `tag:forge` ACLs for k8s source |
---
## Pre-Migration Checklist
- [ ] GitHub mirror verified current
- [ ] Full borgmatic backup completed and verified
- [ ] Manual backup of `/opt/homebrew/var/forgejo` on indri
- [ ] Document all SSH deploy keys and webhooks
- [ ] **User action**: Mirror forgejo-helm chart to forge
- [ ] Extract secrets from app.ini to 1Password:
- `INTERNAL_TOKEN`
- `SECRET_KEY`
- `JWT_SECRET`
- Any OAuth/webhook secrets
---
## Steps
### Phase A: Create k8s Manifests
**New Files:**
```
argocd/apps/forgejo.yaml # ArgoCD Application (multi-source Helm)
argocd/manifests/forgejo/values.yaml # Helm chart values
argocd/manifests/forgejo/kustomization.yaml # Kustomize config
argocd/manifests/forgejo/pvc.yaml # 10Gi PersistentVolumeClaim
argocd/manifests/forgejo/secret-app.yaml.tpl # Secrets from 1Password
```
**Key values.yaml settings:**
```yaml
service:
  ssh:
    type: LoadBalancer
    loadBalancerClass: tailscale
    port: 22
    annotations:
      tailscale.com/hostname: "git-1"  # Test hostname first
ingress:
  enabled: true
  className: tailscale
  hosts:
    - host: forge-1  # Test hostname first
gitea:
  config:
    server:
      DOMAIN: forge-1.tail8d86e.ts.net
      ROOT_URL: https://forge-1.tail8d86e.ts.net/
      SSH_DOMAIN: git-1.tail8d86e.ts.net
      SSH_PORT: 22
    database:
      DB_TYPE: sqlite3
      PATH: /data/forgejo.db
```
---
### Phase B: Deploy to Test Hostnames
1. Create feature branch, push to forge
2. Sync ArgoCD apps: `argocd app sync apps`
3. Point forgejo app to feature branch: `argocd app set forgejo --revision feature/p7-forgejo`
4. Sync forgejo app: `argocd app sync forgejo`
5. Verify pods running (empty data initially)
---
### Phase C: Data Migration (~10 min downtime)
1. **Stop indri Forgejo**
```bash
ssh indri 'brew services stop forgejo'
```
2. **Copy data** (option A: rsync via NFS staging)
```bash
ssh indri 'rsync -avP /opt/homebrew/var/forgejo/ sifaka:/volume1/forgejo-migration/'
```
3. **Copy to PVC and fix permissions**
```bash
kubectl exec -n forgejo deployment/forgejo -- rsync -avP /staging/ /data/
kubectl exec -n forgejo deployment/forgejo -- chown -R 1000:1000 /data
```
4. **Restart Forgejo**
```bash
kubectl rollout restart deployment/forgejo -n forgejo
```
---
### Phase D: Validation (Critical)
- [ ] Web UI accessible at `forge-1.tail8d86e.ts.net`
- [ ] SSH works: `ssh -T forgejo@git-1.tail8d86e.ts.net`
- [ ] All 8 repos visible and accessible
- [ ] Git clone works
- [ ] Git push works (test on non-critical repo)
- [ ] eblume user preserved with correct permissions
- [ ] PR history intact
- [ ] Webhooks functioning
- [ ] GitHub mirror push still works
---
### Phase E: Hostname Cutover
1. **Clear indri Tailscale serve**
```bash
ssh indri 'tailscale serve clear svc:forge'
```
2. **User action**: Delete `svc:forge` and `forge-1` devices from Tailscale admin
3. **Update manifests**: Change `forge-1` → `forge`, `git-1` → `git`
4. **Sync ArgoCD**
5. **Verify hostnames claimed**
```bash
curl https://forge.tail8d86e.ts.net/api/v1/version
ssh -T forgejo@git.tail8d86e.ts.net
```
---
### Phase F: Update ArgoCD to Use New Forgejo
1. **Get SSH host key from k8s Forgejo**
```bash
kubectl exec -n forgejo deployment/forgejo -- cat /data/ssh/ssh_host_ed25519_key.pub
```
2. **Update known_hosts ConfigMap** with `git.tail8d86e.ts.net` key
3. **Update repo-creds-forge secret** (manual kubectl commands)
4. **Update all ArgoCD Application manifests** with new repoURL
5. **Delete egress-forge.yaml** (no longer needed)
6. **Sync ArgoCD** and verify all apps sync successfully
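For step 2, the public key file holds `<type> <base64> [comment]`, while a known_hosts entry needs `<host> <type> <base64>`. A sketch of that conversion, where `known_hosts_entry` is a hypothetical helper:

```bash
# known_hosts_entry HOST PUBKEY_LINE
# Turns the contents of an ssh_host_*_key.pub file into a known_hosts line,
# dropping the trailing comment field if there is one.
known_hosts_entry() {
  printf '%s\n' "$2" | awk -v host="$1" 'NF >= 2 { print host, $1, $2 }'
}

# Usage, reusing the command from step 1:
#   key="$(kubectl exec -n forgejo deployment/forgejo -- \
#     cat /data/ssh/ssh_host_ed25519_key.pub)"
#   known_hosts_entry git.tail8d86e.ts.net "$key"
```

The printed line can then be pasted into `argocd-ssh-known-hosts-cm.yaml`.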
---
### Phase G: Update Local Git Remotes
```bash
cd ~/code/personal/blumeops
git remote set-url origin ssh://forgejo@git.tail8d86e.ts.net/eblume/blumeops.git
# Repeat for all 8 repos
```
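Repeating this for every repository could be scripted roughly as below. The sketch assumes all eight repositories live under the `eblume/` owner and are cloned side by side in one directory; both are assumptions, and `forge_ssh_url` and `update_remotes` are hypothetical helpers.

```bash
# forge_ssh_url REPO
# Prints the new SSH remote URL for a repository (assumes the eblume/ owner).
forge_ssh_url() {
  printf 'ssh://forgejo@git.tail8d86e.ts.net/eblume/%s.git\n' "$1"
}

# update_remotes [CODE_DIR]
# Repoints "origin" in each local clone found under CODE_DIR.
update_remotes() {
  code_dir="${1:-$HOME/code/personal}"
  for repo in blumeops cloudnative-pg-charts csi-driver-smb devpi \
              dotfiles grafana-helm-charts mcquack zot; do
    if [ ! -d "$code_dir/$repo/.git" ]; then
      echo "skip $repo (no clone in $code_dir)" >&2
      continue
    fi
    git -C "$code_dir/$repo" remote set-url origin "$(forge_ssh_url "$repo")"
  done
}
```

Clones that use a remote name other than `origin`, or that live elsewhere, would still need a manual update.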
---
### Phase H: Cleanup
1. Remove forgejo role from `ansible/playbooks/indri.yml`
2. Remove `svc:forge` from `ansible/roles/tailscale_serve/defaults/main.yml`
3. Remove forgejo log collection from `ansible/roles/alloy/defaults/main.yml`
4. Delete `argocd/manifests/tailscale-operator/egress-forge.yaml`
5. Update `mise-tasks/indri-services-check`
6. Run ansible to clean up indri: `mise run provision-indri -- --tags tailscale-serve,alloy`
7. Update zk documentation (forgejo, argocd, blumeops cards)
8. Merge PR
9. Reset ArgoCD to main
---
## Disaster Recovery Procedure
**Add to [[forgejo]] zk card:**
### When Forgejo is Unavailable
1. **Add GitHub repository to ArgoCD**
```bash
argocd repo add https://github.com/eblume/blumeops.git \
  --username eblume \
  --password "$(op read "op://<vault>/<item>/github-pat")"
```
2. **Point critical apps to GitHub**
```bash
argocd app set apps --repo https://github.com/eblume/blumeops.git
argocd app set forgejo --repo https://github.com/eblume/blumeops.git
argocd app sync forgejo
```
3. **Fix Forgejo** (restore from backup, fix config, etc.)
4. **Verify Forgejo is healthy**
```bash
curl https://forge.tail8d86e.ts.net/api/v1/version
ssh -T forgejo@git.tail8d86e.ts.net
```
5. **Switch back to Forgejo**
```bash
argocd app set apps --repo ssh://forgejo@git.tail8d86e.ts.net/eblume/blumeops.git
argocd app set forgejo --repo ssh://forgejo@git.tail8d86e.ts.net/eblume/blumeops.git
argocd app sync apps
argocd repo rm https://github.com/eblume/blumeops.git
```
---
## Files Summary
### New Files
| Path | Purpose |
|------|---------|
| `argocd/apps/forgejo.yaml` | ArgoCD Application (multi-source Helm) |
| `argocd/manifests/forgejo/values.yaml` | Helm chart values |
| `argocd/manifests/forgejo/kustomization.yaml` | Kustomize config |
| `argocd/manifests/forgejo/pvc.yaml` | 10Gi PersistentVolumeClaim |
| `argocd/manifests/forgejo/secret-app.yaml.tpl` | Secrets template |
### Modified Files
| Path | Change |
|------|--------|
| All `argocd/apps/*.yaml` | Update repoURL to `git.tail8d86e.ts.net` |
| `argocd/manifests/argocd/argocd-ssh-known-hosts-cm.yaml` | Add `git.tail8d86e.ts.net` |
| `argocd/manifests/argocd/repo-forge-secret.yaml.tpl` | Update URL |
| `ansible/playbooks/indri.yml` | Remove forgejo role |
| `ansible/roles/tailscale_serve/defaults/main.yml` | Remove `svc:forge` |
| `ansible/roles/alloy/defaults/main.yml` | Remove forgejo logs |
### Files to Delete
| Path | Reason |
|------|--------|
| `argocd/manifests/tailscale-operator/egress-forge.yaml` | No longer needed |
---
## Rollback
If migration fails at any point:
1. **Delete k8s resources**
```bash
argocd app delete forgejo --cascade
kubectl delete namespace forgejo
```
2. **Restart indri Forgejo**
```bash
ssh indri 'brew services start forgejo'
```
3. **Re-enable Tailscale serve**
```bash
mise run provision-indri -- --tags tailscale-serve
```
4. **Revert ArgoCD apps to indri URLs** (if changed)
---
## Verification Checklist
- [ ] GitHub mirror verified current
- [ ] Helm chart mirrored to forge
- [ ] Secrets extracted to 1Password
- [ ] k8s Forgejo pod running
- [ ] All 8 repos accessible
- [ ] SSH clone/push works via `git.tail8d86e.ts.net`
- [ ] HTTPS works via `forge.tail8d86e.ts.net`
- [ ] ArgoCD syncs from new URL
- [ ] All local remotes updated
- [ ] Indri cleanup complete
- [ ] zk docs updated
- [ ] DR procedure documented in [[forgejo]] card

View file

@ -1,32 +0,0 @@
# Phase 8: CI/CD (Woodpecker)
**Goal**: Deploy Woodpecker CI integrated with Forgejo
**Status**: Pending
**Prerequisites**: [Phase 7](P7_forgejo.md) complete
---
## Steps
### 1. Create Forgejo OAuth application
- Callback: https://ci.tail8d86e.ts.net/authorize
- Store in 1Password
---
### 2. Deploy Woodpecker Server + Agent
---
### 3. Configure Tailscale LoadBalancer
Tag: `svc:ci`
---
### 4. Test pipeline
Create `.woodpecker.yaml` in test repo

View file

@ -1,52 +0,0 @@
# Phase 9: Cleanup
**Goal**: Remove deprecated services, harden system
**Status**: Pending
**Prerequisites**: [Phase 8](P8_woodpecker.md) complete
---
## Steps
### 1. Stop/remove unused brew services
- postgresql@18
- grafana
- miniflux
- forgejo
---
### 2. Update ansible playbook
- Remove migrated service roles
- Add k8s deployment references
---
### 3. Configure Velero backups (optional)
- Install with MinIO on sifaka
- Schedule daily cluster backups
---
### 4. Update zk documentation
- New architecture
- Runbooks
- DR procedures
---
## Plan Completion
When all phases are complete and verified:
```bash
# Rename this folder to indicate completion
git mv plans/k8s-migration plans/k8s-migration.complete
git commit -m "Complete k8s migration plan"
```