diff --git a/docs/k8s-migration.md b/docs/k8s-migration.md
index 140fb60..90e7ad9 100644
--- a/docs/k8s-migration.md
+++ b/docs/k8s-migration.md
@@ -66,92 +66,699 @@ This applies to all mcquack LaunchAgents (zot, devpi, kiwix, borgmatic, metrics
 
 **Goal**: Container registry + minikube cluster without disrupting existing services
 
-### Steps
+### Important: Tailscale Service Creation Order
 
-1. **Install Podman on indri**
-   ```bash
-   # Add to Brewfile
-   brew "podman"
-   ```
-   - Create ansible role `podman` for machine setup
+> **WARNING**: You MUST create services in the Tailscale admin console BEFORE running `tailscale serve` commands via ansible. If you run `tailscale serve --service svc:foo` before the service exists in the admin console, the local config will be left in a bad state.
+>
+> To fix a misconfigured service:
+> ```bash
+> tailscale serve --service svc:foo reset
+> ```
+> Then create the service in the admin console and try again.
 
-2. **Install and configure Zot registry**
-   - Create ansible role `zot`
-   - Deploy as mcquack LaunchAgent (like devpi pattern)
-   - Bind to `localhost:5000`
-   - Configure pull-through for Docker Hub + GHCR
-   - Add Tailscale serve: `svc:registry`
+---
 
-3. **Install minikube**
-   ```bash
-   # Add to Brewfile
-   brew "minikube"
+### Step 0.1: Update Brewfile and Install Dependencies
 
-   # Start with podman driver
-   minikube start --driver=podman --container-runtime=containerd \
-     --cpus=4 --memory=8192 --disk-size=100g
-   ```
-   - Create ansible role `minikube` for initial setup
+**Files to modify:**
+- `Brewfile`
 
-4. **Update Pulumi ACLs**
-   - Add `tag:registry` for registry service
-   - Add `tag:k8s` for cluster services
+**Changes:**
+```ruby
+# Add to Brewfile
+brew "podman"
+brew "minikube"
+brew "zot"  # Check if available, otherwise install binary manually
+```
 
-5. **Configure kubeconfig on gilbert**
-   - Add minikube context to `~/.kube/config`
-   - Keep work EKS config separate (already isolated)
-   - K9s will auto-discover contexts
-
-6. **Observability for new services** (follow existing patterns)
-
-   **Zot Registry:**
-   - Create zk card `~/code/personal/zk/zot.md` (like devpi.md, forgejo.md)
-   - Add log collection to Alloy config (stdout/stderr from LaunchAgent)
-   - Create `zot_metrics` role with periodic script writing to textfile collector
-   - Create Grafana dashboard: cache hit rates, storage usage, pull/push counts
-
-   **Minikube:**
-   - Create zk card `~/code/personal/zk/minikube.md`
-   - Metrics via kube-state-metrics (deployed in cluster)
-   - Node metrics already collected by Alloy
-   - Create Grafana dashboard: cluster health, resource usage
-
-   **Note:** Backups not needed for these services:
-   - Zot cache is re-fetchable from upstream registries
-   - Minikube state is recreatable from ansible/k8s manifests
-
-### New Files
-- `ansible/roles/zot/` - Registry role
-- `ansible/roles/zot_metrics/` - Metrics collection
-- `ansible/roles/podman/` - Podman setup
-- `ansible/roles/minikube/` - Cluster setup
-- `~/code/personal/zk/zot.md` - Registry management log
-- `~/code/personal/zk/minikube.md` - Cluster management log
-
-### Verification
+**Testing:**
 ```bash
-# Registry working
-curl http://localhost:5000/v2/_catalog
+# On gilbert
+brew bundle --file=Brewfile
 
-# Minikube running
-minikube status
-kubectl get nodes
+# On indri (via ansible or manual)
+ssh indri 'brew install podman minikube'
 
-# Metrics flowing
+# Verify installations
+ssh indri 'podman --version'
+ssh indri 'minikube version'
+```
+
+---
+
+### Step 0.2: Update Pulumi ACLs (BEFORE Tailscale serve)
+
+**Files to modify:**
+- `pulumi/policy.hujson`
+
+**Changes:**
+Add new tags and ACL rules:
+```hujson
+// In tagOwners section
+"tag:registry": ["autogroup:admin"],
+"tag:k8s": ["autogroup:admin"],
+
+// In acls section - add registry access
+{
+  "action": "accept",
+  "src": ["autogroup:member"],
+  "dst": ["tag:registry:443"],
+},
+```
+
+**Testing:**
+```bash
+mise run tailnet-preview  # Review changes
+mise run tailnet-up       # Apply changes
+```
+
+---
+
+### Step 0.3: Create Tailscale Services in Admin Console (MANUAL)
+
+> **CRITICAL**: Do this BEFORE running any ansible that calls `tailscale serve`
+
+1. Go to https://login.tailscale.com/admin/services
+2. Create service `registry` with:
+   - Port: 443 (HTTPS)
+   - Host: indri
+3. Apply tag `tag:registry` to indri if not already tagged
+
+**Verification:**
+```bash
+# Service should appear (even if not yet serving)
+tailscale status | grep registry
+```
+
+---
+
+### Step 0.4: Create Zot Registry Ansible Role
+
+**New files:**
+```
+ansible/roles/zot/
+├── defaults/main.yml
+├── tasks/main.yml
+├── templates/
+│   ├── config.json.j2
+│   └── zot.plist.j2
+└── handlers/main.yml
+```
+
+**Key configuration (defaults/main.yml):**
+```yaml
+zot_version: "2.1.0"
+zot_data_dir: "/Users/erichblume/zot"
+zot_config_dir: "/Users/erichblume/.config/zot"
+zot_port: 5000
+zot_log_dir: "/Users/erichblume/Library/Logs"
+
+# Pull-through cache configuration
+zot_registries:
+  - name: docker.io
+    url: https://registry-1.docker.io
+  - name: ghcr.io
+    url: https://ghcr.io
+  - name: quay.io
+    url: https://quay.io
+```
+
+**LaunchAgent template (zot.plist.j2):**
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
+<plist version="1.0">
+<dict>
+    <key>Label</key>
+    <string>mcquack.eblume.zot</string>
+    <key>ProgramArguments</key>
+    <array>
+        <string>/opt/homebrew/bin/zot</string>
+        <string>serve</string>
+        <string>{{ zot_config_dir }}/config.json</string>
+    </array>
+    <key>RunAtLoad</key>
+    <true/>
+    <key>KeepAlive</key>
+    <true/>
+    <key>StandardOutPath</key>
+    <string>{{ zot_log_dir }}/mcquack.zot.out.log</string>
+    <key>StandardErrorPath</key>
+    <string>{{ zot_log_dir }}/mcquack.zot.err.log</string>
+</dict>
+</plist>
+```
+
+**Testing (after deploying role):**
+```bash
+# Check LaunchAgent is running
+ssh indri 'launchctl list | grep zot'
+
+# Check zot is responding
+ssh indri 'curl -s http://localhost:5000/v2/_catalog'
+# Expected: {"repositories":[]}
+
+# Check logs for errors
+ssh indri 'tail -20 ~/Library/Logs/mcquack.zot.err.log'
+```
+
+---
+
+### Step 0.5: Add Zot to Tailscale Serve
+
+**Files to modify:**
+- `ansible/roles/tailscale_serve/defaults/main.yml`
+
+**Changes:**
+```yaml
+# Add to tailscale_serve_services list
+- name: svc:registry
+  https:
+    port: 443
+    upstream: http://localhost:5000
+```
+
+**Testing:**
+```bash
+# Deploy tailscale serve config
+mise run provision-indri -- --tags tailscale-serve
+
+# Verify from gilbert (not indri - hairpinning doesn't work)
+curl -s https://registry.tail8d86e.ts.net/v2/_catalog
+# Expected: {"repositories":[]}
+```
+
+---
+
+### Step 0.6: Create Zot Metrics Role
+
+**New files:**
+```
+ansible/roles/zot_metrics/
+├── defaults/main.yml
+├── tasks/main.yml
+├── templates/
+│   ├── zot-metrics.sh.j2
+│   └── zot-metrics.plist.j2
+└── handlers/main.yml
+```
+
+**Metrics script pattern (zot-metrics.sh.j2):**
+```bash
+#!/bin/bash
+# Collect Zot registry metrics for Prometheus textfile collector
+set -euo pipefail
+
+METRICS_FILE="/opt/homebrew/var/node_exporter/textfile/zot.prom"
+TEMP_FILE="${METRICS_FILE}.tmp"
+
+# Check if zot is up
+if curl -sf http://localhost:5000/v2/_catalog > /dev/null 2>&1; then
+    echo "zot_up 1" > "$TEMP_FILE"
+else
+    echo "zot_up 0" > "$TEMP_FILE"
+    mv "$TEMP_FILE" "$METRICS_FILE"
+    exit 0
+fi
+
+# Get metrics from zot's metrics endpoint (if enabled)
+# Add storage metrics, cache hits, etc.
+# ...
+
+mv "$TEMP_FILE" "$METRICS_FILE"
+```
+
+**Testing:**
+```bash
+# Deploy metrics role
+mise run provision-indri -- --tags zot_metrics
+
+# Check metrics file exists and is updated
 ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/zot.prom'
+# Expected: zot_up 1
 
-# Logs in Loki
-# Query: {service="zot"}
+# Verify metrics appear in Prometheus (after a scrape cycle)
+curl -s "http://indri:9090/api/v1/query?query=zot_up" | jq '.data.result[0].value[1]'
+# Expected: "1"
 ```
 
-### Rollback
+---
+
+### Step 0.7: Add Zot Log Collection to Alloy
+
+**Files to modify:**
+- `ansible/roles/alloy/templates/config.alloy.j2`
+
+**Changes:**
+Add to the mcquack services log collection section:
+```alloy
+// Zot registry logs
+local.file_match "zot_logs" {
+  path_targets = [
+    {__path__ = "/Users/erichblume/Library/Logs/mcquack.zot.out.log", service = "zot", stream = "stdout"},
+    {__path__ = "/Users/erichblume/Library/Logs/mcquack.zot.err.log", service = "zot", stream = "stderr"},
+  ]
+}
+
+loki.source.file "zot_logs" {
+  targets    = local.file_match.zot_logs.targets
+  forward_to = [loki.write.local.receiver]
+}
+```
+
+**Testing:**
 ```bash
-minikube stop && minikube delete
-launchctl unload ~/Library/LaunchAgents/mcquack.eblume.zot.plist
+# Deploy alloy config
+mise run provision-indri -- --tags alloy
+
+# Restart alloy to pick up changes
+ssh indri 'brew services restart grafana-alloy'
+
+# Wait a minute, then check Loki for zot logs
+# In Grafana Explore, query: {service="zot"}
 ```
 
 ---
 
+### Step 0.8: Update indri-services-check Script
+
+**Files to modify:**
+- `mise-tasks/indri-services-check`
+
+**Changes to add:**
+```bash
+# Add after existing service checks (around line 55)
+check_service "zot" "ssh indri 'launchctl list | grep zot | grep -v \"^-\"'"
+check_service "zot-metrics" "ssh indri 'launchctl list | grep zot-metrics | grep -v \"^-\"'"
+
+# Add to HTTP endpoints section (around line 65)
+check_http "Zot Registry" "http://indri:5000/v2/_catalog"
+
+# Add metrics file check
+check_service "Zot metrics" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/zot.prom'"
+```
+
+**Testing:**
+```bash
+# Run the health check
+mise run indri-services-check
+
+# Expected output includes:
+# zot... OK
+# zot-metrics... OK
+# Zot Registry... OK
+# Zot metrics... OK
+```
+
+---
+
+### Step 0.9: Install and Configure Podman on Indri
+
+**New files:**
+```
+ansible/roles/podman/
+├── tasks/main.yml
+└── handlers/main.yml
+```
+
+**Tasks (tasks/main.yml):**
+```yaml
+- name: Install podman via homebrew
+  community.general.homebrew:
+    name: podman
+    state: present
+
+- name: Initialize podman machine (if not exists)
+  ansible.builtin.command:
+    cmd: podman machine init --cpus 4 --memory 8192 --disk-size 100
+  register: podman_init
+  changed_when: podman_init.rc == 0
+  failed_when: false  # May already exist
+
+- name: Start podman machine
+  ansible.builtin.command:
+    cmd: podman machine start
+  register: podman_start
+  changed_when: "'started' in podman_start.stdout"
+  failed_when: false  # May already be running
+```
+
+**Testing:**
+```bash
+# Deploy podman role
+mise run provision-indri -- --tags podman
+
+# Verify podman is working
+ssh indri 'podman info'
+ssh indri 'podman run --rm hello-world'
+```
+
+---
+
+### Step 0.10: Install and Configure Minikube
+
+**New files:**
+```
+ansible/roles/minikube/
+├── defaults/main.yml
+├── tasks/main.yml
+└── handlers/main.yml
+```
+
+**Defaults:**
+```yaml
+minikube_cpus: 4
+minikube_memory: 8192
+minikube_disk_size: "100g"
+minikube_driver: podman
+minikube_container_runtime: containerd
+```
+
+**Tasks:**
+```yaml
+- name: Install minikube via homebrew
+  community.general.homebrew:
+    name: minikube
+    state: present
+
+# The Go template braces are wrapped in raw tags so Jinja2 does not try to render them
+- name: Check if minikube cluster exists
+  ansible.builtin.command:
+    cmd: minikube status --format='{% raw %}{{.Host}}{% endraw %}'
+  register: minikube_status
+  changed_when: false
+  failed_when: false
+
+- name: Start minikube cluster
+  ansible.builtin.command:
+    cmd: >
+      minikube start
+      --driver={{ minikube_driver }}
+      --container-runtime={{ minikube_container_runtime }}
+      --cpus={{ minikube_cpus }}
+      --memory={{ minikube_memory }}
+      --disk-size={{ minikube_disk_size }}
+  when: minikube_status.rc != 0 or 'Running' not in minikube_status.stdout
+```
+
+**Testing:**
+```bash
+# Deploy minikube role
+mise run provision-indri -- --tags minikube
+
+# Verify cluster is running
+ssh indri 'minikube status'
+# Expected: host: Running, kubelet: Running, apiserver: Running
+
+# Test kubectl access from indri
+ssh indri 'kubectl get nodes'
+# Expected: minikube Ready control-plane ...
+```
+
+---
+
+### Step 0.11: Configure Kubeconfig on Gilbert
+
+**Manual steps** (kubeconfig management is complex with work configs):
+
+```bash
+# Export the minikube kubeconfig from indri. --flatten embeds the certificates,
+# so it has to run on indri, where the referenced cert files exist.
+ssh indri 'kubectl config view --flatten' > ~/.kube/minikube.yaml
+
+# Keep it separate from the work configs (careful not to overwrite them!)
+# and include both files via the KUBECONFIG env var
+export KUBECONFIG=~/.kube/config:~/.kube/minikube.yaml
+
+# Set minikube context
+kubectl config use-context minikube
+
+# Verify connection from gilbert
+kubectl get nodes
+```
+
+**Testing:**
+```bash
+# From gilbert, verify k8s access
+kubectl cluster-info
+kubectl get namespaces
+
+# Verify k9s can connect
+k9s
+# Should show the minikube cluster
+```
+
+---
+
+### Step 0.12: Add Minikube to indri-services-check
+
+**Files to modify:**
+- `mise-tasks/indri-services-check`
+
+**Changes:**
+```bash
+# Add new section for Kubernetes
+echo ""
+echo "Kubernetes cluster:"
+check_service "minikube" "ssh indri 'minikube status --format={{.Host}} | grep -q Running'"
+check_service "k8s-apiserver" "ssh indri 'kubectl get --raw /healthz'"
+```
+
+**Testing:**
+```bash
+mise run indri-services-check
+
+# Expected output includes:
+# Kubernetes cluster:
+# minikube... OK
+# k8s-apiserver... OK
+```
+
+---
+
+### Step 0.13: Create Zot Grafana Dashboard
+
+**New files:**
+- `ansible/roles/grafana/files/dashboards/zot.json`
+
+**Dashboard panels:**
+- `zot_up` - Service availability
+- Storage usage (if zot exposes this metric)
+- Cache hit/miss rates
+- Pull/push request counts
+
+**Testing:**
+```bash
+# Deploy dashboard
+mise run provision-indri -- --tags grafana
+
+# Verify in Grafana UI
+# Navigate to Dashboards > Zot Registry
+```
+
+---
+
+### Step 0.14: Create Minikube Grafana Dashboard
+
+**New files:**
+- `ansible/roles/grafana/files/dashboards/minikube.json`
+
+**Dashboard panels:**
+- Node CPU/Memory usage
+- Pod count by namespace
+- Container restart counts
+- API server request latency
+
+**Note:** This may require deploying kube-state-metrics in the cluster first:
+```bash
+ssh indri 'kubectl apply -f https://raw.githubusercontent.com/kubernetes/kube-state-metrics/main/examples/standard/cluster-role.yaml'
+# ... additional kube-state-metrics manifests
+```
+
+---
+
+### Step 0.15: Create Zettelkasten Documentation
+
+**New files:**
+- `~/code/personal/zk/zot.md`
+- `~/code/personal/zk/minikube.md`
+
+**Template for zot.md:**
+```markdown
+---
+id: zot
+aliases:
+  - zot
+  - container-registry
+tags:
+  - blumeops
+---
+
+# Zot Registry Management Log
+
+Zot is an OCI-native container registry running on Indri, providing local caching and pull-through proxy for Docker Hub, GHCR, and Quay.
+
+## Service Details
+
+- URL: https://registry.tail8d86e.ts.net
+- Local port: 5000
+- Data directory: ~/zot
+- Config: ~/.config/zot/config.json
+- Managed via: mcquack LaunchAgent
+
+## Pull-Through Cache
+
+Configured to proxy:
+- docker.io (Docker Hub)
+- ghcr.io (GitHub Container Registry)
+- quay.io (Red Hat Quay)
+
+## Useful Commands
+
+\`\`\`bash
+# List cached images
+curl -s http://localhost:5000/v2/_catalog | jq
+
+# Check service status
+launchctl list | grep zot
+
+# View logs
+tail -f ~/Library/Logs/mcquack.zot.err.log
+\`\`\`
+
+## Log
+
+### [DATE]
+- Initial setup for k8s migration Phase 0
+```
+
+---
+
+### Step 0.16: Update Main Playbook
+
+**Files to modify:**
+- `ansible/playbooks/indri.yml`
+
+**Changes:**
+```yaml
+# Add new roles to the roles list
+- role: podman
+  tags: podman
+- role: zot
+  tags: zot
+- role: zot_metrics
+  tags: zot_metrics
+- role: minikube
+  tags: minikube
+```
+
+---
+
+### Phase 0 Verification Checklist
+
+Run after completing all steps:
+
+```bash
+# 1. Full service health check
+mise run indri-services-check
+# All services should show OK, including new ones
+
+# 2. Registry functionality
+curl -s https://registry.tail8d86e.ts.net/v2/_catalog
+# Expected: {"repositories":[]}
+
+# 3. Pull through registry (test caching)
+ssh indri 'podman pull localhost:5000/library/alpine:latest'
+curl -s https://registry.tail8d86e.ts.net/v2/_catalog
+# Expected: {"repositories":["library/alpine"]}
+
+# 4. Kubernetes cluster
+ssh indri 'minikube status'
+ssh indri 'kubectl get nodes'
+kubectl get nodes  # from gilbert
+
+# 5. Metrics in Prometheus
+curl -s "http://indri:9090/api/v1/query?query=zot_up"
+# Expected: value = 1
+
+# 6. Logs in Loki
+# In Grafana Explore: {service="zot"}
+# Should see zot log entries
+
+# 7. Dashboards in Grafana
+# Navigate to Zot Registry dashboard - panels should have data
+# Navigate to Minikube dashboard - panels should have data
+
+# 8. k9s from gilbert
+k9s
+# Should connect and show minikube cluster
+```
+
+---
+
+### Phase 0 Rollback
+
+If something goes wrong:
+
+```bash
+# Stop and remove minikube
+ssh indri 'minikube stop && minikube delete'
+
+# Stop and remove zot
+ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.zot.plist'
+ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.zot.plist'
+
+# Remove podman machine
+ssh indri 'podman machine stop && podman machine rm'
+
+# Remove from tailscale serve
+ssh indri 'tailscale serve --service svc:registry reset'
+
+# Remove tags from Pulumi (revert policy.hujson changes, then re-apply)
+git checkout pulumi/policy.hujson
+mise run tailnet-up
+
+# Revert ansible playbook changes
+git checkout ansible/playbooks/indri.yml
+git checkout ansible/roles/tailscale_serve/defaults/main.yml
+git checkout ansible/roles/alloy/templates/config.alloy.j2
+
+# Remove new roles
+rm -rf ansible/roles/{zot,zot_metrics,podman,minikube}
+
+# Remove zk cards
+rm ~/code/personal/zk/{zot,minikube}.md
+```
+
+---
+
+### New Files Summary
+
+| File | Purpose |
+|------|---------|
+| `ansible/roles/zot/` | Zot registry deployment |
+| `ansible/roles/zot_metrics/` | Metrics collection for Zot |
+| `ansible/roles/podman/` | Podman installation and setup |
+| `ansible/roles/minikube/` | Minikube cluster setup |
+| `ansible/roles/grafana/files/dashboards/zot.json` | Zot monitoring dashboard |
+| `ansible/roles/grafana/files/dashboards/minikube.json` | K8s monitoring dashboard |
+| `~/code/personal/zk/zot.md` | Zot management documentation |
+| `~/code/personal/zk/minikube.md` | Minikube management documentation |
+
+### Modified Files Summary
+
+| File | Changes |
+|------|---------|
+| `Brewfile` | Add podman, minikube, zot (if available) |
+| `pulumi/policy.hujson` | Add tag:registry, tag:k8s, ACL rules |
+| `ansible/playbooks/indri.yml` | Add new roles |
+| `ansible/roles/tailscale_serve/defaults/main.yml` | Add svc:registry |
+| `ansible/roles/alloy/templates/config.alloy.j2` | Add zot log collection |
+| `mise-tasks/indri-services-check` | Add zot and k8s checks |
+
+---
+
 ## Phase 1: Kubernetes Infrastructure
 
 **Goal**: Tailscale operator + CloudNativePG operator