P5.1: Migrate minikube from podman to QEMU2 driver #38

Merged
eblume merged 16 commits from feature/p5.1-qemu2-migration into main 2026-01-21 16:03:38 -08:00
2 changed files with 228 additions and 235 deletions
Showing only changes of commit 5724b61fb4

save some work

Erich Blume 2026-01-21 13:27:27 -08:00


@@ -1,31 +0,0 @@
#!/bin/bash
# kubectl exec credential plugin for 1Password
# Usage: kubectl-credential-1password <vault-id> <item-id> <cert-field> <key-field>
#
# Fetches client certificate and key from 1Password and outputs
# ExecCredential JSON for kubectl authentication.
set -euo pipefail
VAULT_ID="$1"
ITEM_ID="$2"
CERT_FIELD="$3"
KEY_FIELD="$4"
# Fetch credentials from 1Password (strips surrounding quotes from text fields)
CLIENT_CERT=$(op --vault "$VAULT_ID" item get "$ITEM_ID" --fields "$CERT_FIELD" | sed 's/^"//; s/"$//')
CLIENT_KEY=$(op --vault "$VAULT_ID" item get "$ITEM_ID" --fields "$KEY_FIELD" | sed 's/^"//; s/"$//')
# Output ExecCredential JSON
# Note: jq is used to properly escape the PEM data for JSON
jq -n \
  --arg cert "$CLIENT_CERT" \
  --arg key "$CLIENT_KEY" \
  '{
    "apiVersion": "client.authentication.k8s.io/v1beta1",
    "kind": "ExecCredential",
    "status": {
      "clientCertificateData": $cert,
      "clientKeyData": $key
    }
  }'


@@ -1,8 +1,8 @@
# Phase 5.1: Migrate Minikube from Podman to QEMU2 Driver # Phase 5.1: Migrate Minikube from QEMU2 to Docker Driver
**Goal**: Replace the podman driver with qemu2 to enable proper volume mounts (hostPath, NFS, SMB CSI) **Goal**: Replace the qemu2 driver with docker to fix remote API access and simplify volume mounts
**Status**: Complete (2026-01-21) **Status**: In Progress (2026-01-21)
**Prerequisites**: [Phase 5](P5_devpi.complete.md) complete **Prerequisites**: [Phase 5](P5_devpi.complete.md) complete
@@ -10,307 +10,331 @@
## Background ## Background
### Original Problem (Podman → QEMU2)
During Phase 6 (Kiwix/Transmission migration), we discovered that the **podman driver has fundamental limitations** that prevent mounting external volumes: During Phase 6 (Kiwix/Transmission migration), we discovered that the **podman driver has fundamental limitations** that prevent mounting external volumes:
1. **SMB CSI driver fails** with "Operation not permitted" - the rootless container lacks kernel-level mount capabilities 1. **SMB CSI driver fails** with "Operation not permitted" - the rootless container lacks kernel-level mount capabilities
2. **`minikube mount` fails** - 9p mount gets "permission denied" inside the podman VM 2. **`minikube mount` fails** - 9p mount gets "permission denied" inside the podman VM
3. **hostPath volumes** only work for paths inside the minikube container, not the macOS host 3. **hostPath volumes** only work for paths inside the minikube container, not the macOS host
These are documented limitations of the podman driver, which is labeled "experimental" in the [minikube documentation](https://minikube.sigs.k8s.io/docs/drivers/podman/). We migrated to QEMU2 to get a full VM with kernel capabilities.
### Failed P6 Attempt ### New Problem (QEMU2 → Docker)
Branch `feature/p6-kiwix-transmission` contains the P6 implementation that was blocked by these issues. The manifests are complete and tested, but couldn't mount the torrents volume. The QEMU2 driver introduced a **new problem**: the Kubernetes API server is inside the VM at `192.168.105.2:6443`, and Tailscale's TCP proxy cannot forward to it properly:
**What was tried:** - TCP connections succeed (nc -zv works)
- NFS volume mounts - failed due to missing CAP_SYS_ADMIN in podman container - TLS handshake times out
- SMB CSI driver (v1.17.0) - mount fails with EPERM (same root cause) - Root cause unknown, but likely related to Tailscale serve's handling of non-localhost upstreams
- `minikube mount /Volumes/torrents:/Volumes/torrents` - 9p mount permission denied
- hostPath PV pointing to `/Volumes/torrents` - path doesn't exist inside minikube container
- Installing cifs-utils in minikube VM - still fails at kernel level
All of these failures trace back to the same root cause: the podman driver runs minikube in a rootless container that lacks the kernel capabilities required for filesystem mounts. Additionally, the volume mount solution with QEMU2 was complex:
- Required NFS mount from sifaka → indri
- Then `minikube mount` to pass through to VM
- Two LaunchAgents/LaunchDaemons for persistence
- macOS GUI approval required for network access
### Why QEMU2? ### Why Docker?
Multiple sources recommend QEMU2 as the best driver for Apple Silicon Macs: The **docker driver** solves both problems:
> "Qemu emulator is the best option to run a Kubernetes Cluster using minikube on MAC arm64-based systems without any issues." 1. **API Server on localhost**: Docker Desktop handles port forwarding from container to localhost automatically, so `tailscale serve --tcp=443 tcp://localhost:PORT` will work (like podman did)
> — [DevOpsCube](https://devopscube.com/minikube-mac/)
QEMU2 creates an actual VM (not a container), which has: 2. **Simpler volume mounts**: Docker Desktop has built-in macOS file sharing. Paths shared with Docker are accessible inside containers, and minikube (running in Docker) can use those paths via hostPath.
- Full kernel capabilities for mounts
- Proper 9p/virtio filesystem support 3. **Official Tailscale recommendation**: Tailscale's own [Kubernetes guide](https://tailscale.com/learn/managing-access-to-kubernetes-with-tailscale) uses minikube with the docker driver.
- Native NFS client support
--- ---
## Prerequisites (Manual Steps) ## Prerequisites
### Create Synology User for Kubernetes Storage Access ### 1. Install Docker Desktop (Manual - Before Ansible)
Create a dedicated Synology user for k8s NFS/SMB access (do not use personal account): Docker Desktop requires GUI setup, so install manually first:
On Synology DSM (Control Panel → User & Group): ```bash
1. Create new user: `k8s-storage` # On indri:
- Set a strong password brew install --cask docker-desktop
- No admin privileges needed
- Deny access to all applications (only needs file services) # Then launch Docker Desktop from /Applications
2. Set permissions on the `torrents` share: # Complete the setup wizard (accept license, skip tutorial)
- Give `k8s-storage` user Read/Write access # Wait for Docker to be "Running" (green icon in menu bar)
3. Store credentials in 1Password:
- Vault: `vg6xf6vvfmoh5hqjjhlhbeoaie` (blumeops vault) # Verify:
- Item name: `synology-k8s-storage` docker version
- Fields: `username` (k8s-storage), `password` docker run hello-world
```
**File Sharing Configuration** (in Docker Desktop → Settings → Resources → File sharing):
- Ensure `/Volumes` is shared (for future NFS mounts from sifaka)
- Or add specific paths as needed for P6
### 2. Stop Current QEMU2 Minikube
```bash
# On indri:
minikube stop
minikube delete
# Verify QEMU resources are cleaned up
pgrep -fl qemu || echo "no qemu processes found"
```
--- ---
## Plan ## Plan
### 1. Export Current State ### 1. Update Ansible Role for Docker Driver
Before destroying the cluster, capture the current state: **Changes to `ansible/roles/minikube/defaults/main.yml`:**
```bash ```yaml
# List all ArgoCD apps and their sync status # Change from:
argocd app list minikube_driver: qemu2
minikube_network: socket_vmnet
minikube_container_runtime: containerd
# Backup any runtime state that matters (should be minimal - everything is in git) # To:
kubectl --context=minikube-indri get all --all-namespaces -o yaml > /tmp/k8s-backup.yaml minikube_driver: docker
minikube_container_runtime: docker # or containerd, both work
``` ```
### 2. Stop and Delete Podman Minikube **Remove from defaults:**
- `minikube_network` (not needed for docker driver)
```bash **Changes to `ansible/roles/minikube/tasks/main.yml`:**
# Stop the cluster - Remove qemu installation
minikube stop - Remove socket_vmnet installation and service management
- Remove NFS mount point creation
- Remove NFS LaunchDaemon installation
- Remove minikube mount LaunchAgent installation
- Keep containerd registry mirror config (adapting for docker if needed)
# Delete the cluster and all data **Remove files from `ansible/roles/minikube/files/`:**
minikube delete - `com.blumeops.nfs-torrents.plist`
- `com.blumeops.minikube-mount.plist`
# Verify podman VM is cleaned up **Changes to `ansible/roles/minikube/handlers/main.yml`:**
podman machine list - Remove `Load NFS mount LaunchDaemon`
- Remove `Load minikube mount LaunchAgent`
**Add to Brewfile:**
```ruby
cask "docker-desktop" # Docker Desktop
``` ```
### 3. Update Ansible Roles for QEMU2 ### 2. Update Tailscale Serve Configuration
The installation must be orchestrated via ansible, following the existing patterns for `podman` and `minikube` roles. **Changes to `ansible/roles/tailscale_serve/defaults/main.yml`:**
**Changes needed:** ```yaml
# Change svc:k8s upstream from VM IP back to localhost:
- name: svc:k8s
  tcp:
    port: 443
    upstream: tcp://localhost:PORT # PORT will be dynamic, see below
```
1. **Update `ansible/roles/minikube/` role:** **Note on API server port**: With the docker driver, the API server port is dynamic (assigned by minikube). We need to either:
- Change driver from `podman` to `qemu2` - Use `--apiserver-port=6443` to fix it
- Add QEMU as a dependency (via Brewfile or role) - Or query and update the config after cluster creation
- Optionally add socket_vmnet for full networking support
- Update any driver-specific configuration
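If the port cannot be pinned and has to be queried after cluster creation, the lookup could be sketched roughly as follows. This is only an illustration: `apiserver_port` is a hypothetical helper name, and it assumes the server URL has the `https://HOST:PORT` shape that `kubectl config view` reports.

```bash
# Hypothetical helper: print the port from an API server URL such as
# https://127.0.0.1:32771. Returns non-zero if the URL carries no port.
apiserver_port() {
  url="$1"
  hostport="${url#*://}"      # drop the scheme, if any
  hostport="${hostport%%/*}"  # drop any trailing path
  case "$hostport" in
    *:*) printf '%s\n' "${hostport##*:}" ;;
    *)   return 1 ;;
  esac
}

# Example with a literal URL:
apiserver_port "https://127.0.0.1:32771"   # prints 32771

# Possible use on indri (left commented out because it needs a live cluster):
#   server=$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')
#   tailscale serve --service="svc:k8s" --tcp=443 "tcp://localhost:$(apiserver_port "$server")"
```

Pinning the port with `--apiserver-port=6443` keeps the Tailscale serve config static; the lookup is only a fallback if the published port turns out to differ.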
2. **Update `Brewfile`:** ### 3. Create Docker Minikube Cluster
```ruby
brew "qemu"
# Optional: brew "socket_vmnet"
```
3. **Update minikube start command in role:**
```bash
minikube start \
  --driver=qemu2 \
  --cpus=4 \
  --memory=8192 \
  --disk-size=50g \
  --container-runtime=containerd \
  --kubernetes-version=stable
```
4. **Remove or update podman role** (may still be useful for container builds)
### 4. Run Ansible to Create QEMU2 Cluster
```bash ```bash
# Run the updated minikube role # On indri (after Docker Desktop is running):
mise run provision-indri -- --tags minikube minikube start \
  --driver=docker \
  --cpus=6 \
  --memory=12288 \
  --disk-size=200g \
  --apiserver-names=k8s.tail8d86e.ts.net,indri \
  --apiserver-port=6443 \
  --listen-address=0.0.0.0
# Verify cluster is running # Verify cluster
minikube status minikube status
kubectl get nodes kubectl get nodes
``` ```
### 5. Configure Host Path Access ### 4. Verify API Server is on Localhost
With QEMU2, we need to either:
**Option A: Use `minikube mount` (9p)**
```bash
# Start persistent mount (run in background or via launchd)
minikube mount /Volumes/torrents:/Volumes/torrents &
```
**Option B: Use NFS export from macOS**
```bash
# Add NFS export on macOS
echo "/Volumes/torrents -alldirs -mapall=$(id -u):$(id -g) -network 192.168.0.0 -mask 255.255.0.0" | sudo tee -a /etc/exports
sudo nfsd restart
# In k8s, use NFS volume type directly
```
### 6. Test Volume Mount with Test Pod
Create a test pod that mounts the torrents volume:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: volume-test
  namespace: default
spec:
  containers:
    - name: test
      image: busybox
      command: ["sh", "-c", "ls -la /data && sleep 3600"]
      volumeMounts:
        - name: torrents
          mountPath: /data
  volumes:
    - name: torrents
      hostPath:
        path: /Volumes/torrents
        type: Directory
```
Verify:
```bash
kubectl apply -f volume-test.yaml
kubectl logs volume-test
kubectl exec volume-test -- ls -la /data
```
### 7. Redeploy ArgoCD and Existing Apps
```bash ```bash
# Re-add ArgoCD # Check what port the API server is on
kubectl config view --minify -o jsonpath="{.clusters[0].cluster.server}"
# Should show https://127.0.0.1:PORT or similar
# Verify local access works
curl -k https://localhost:6443/healthz
# Should return "ok"
```
### 5. Update 1Password Credentials
After cluster recreation, update the credentials in 1Password:
```bash
# On indri, get the new certificates:
cat ~/.minikube/profiles/minikube/client.crt
cat ~/.minikube/profiles/minikube/client.key
cat ~/.minikube/ca.crt
```
Update in 1Password (vault: `vg6xf6vvfmoh5hqjjhlhbeoaie`, item: `3jo4f2hnzvwfmamudfsbbbec7e`).
### 6. Update Kubeconfig on Gilbert
```bash
# Fetch new CA cert from 1Password
op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get 3jo4f2hnzvwfmamudfsbbbec7e --fields ca-cert | sed 's/^"//; s/"$//' > ~/.kube/minikube-indri/ca.crt
```
### 7. Configure Tailscale Serve for K8s
```bash
# On indri:
tailscale serve --service="svc:k8s" --tcp=443 tcp://localhost:6443
```
### 8. Verify Remote Access
```bash
# From gilbert:
curl -k --connect-timeout 5 https://k8s.tail8d86e.ts.net/healthz
# Should return "ok"
kubectl --context=minikube-indri get nodes
# Should show the minikube node
```
### 9. Redeploy ArgoCD and Apps
Since this is a cluster recreation, we need to re-bootstrap:
```bash
# On indri - apply secrets first
op inject -i argocd/manifests/tailscale-operator/secret.yaml.tpl | kubectl apply -f -
# Create repo secret for ArgoCD
PRIV_KEY=$(op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/csjncynh6htjvnh2l2da65y32q/private key?ssh-format=openssh")$'\n'
kubectl create namespace argocd kubectl create namespace argocd
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml kubectl create secret generic repo-forge -n argocd \
  --from-literal=type=git \
  --from-literal=url='ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git' \
  --from-literal=insecure=true \
  --from-literal=sshPrivateKey="$PRIV_KEY"
kubectl label secret repo-forge -n argocd argocd.argoproj.io/secret-type=repository
# Wait for ArgoCD to be ready # Bootstrap operators
kubectl create namespace tailscale
kubectl apply -k argocd/manifests/tailscale-operator/
kubectl apply -k argocd/manifests/argocd/
# Wait for ArgoCD
kubectl wait --for=condition=available deployment/argocd-server -n argocd --timeout=300s kubectl wait --for=condition=available deployment/argocd-server -n argocd --timeout=300s
# Re-configure ArgoCD (repo credentials, etc.) # Login and sync apps
# ... follow P1 setup steps ... argocd login argocd.tail8d86e.ts.net --username admin --grpc-web
# Sync all apps
argocd app sync apps argocd app sync apps
argocd app sync tailscale-operator
argocd app sync cloudnative-pg
argocd app sync blumeops-pg
argocd app sync grafana
argocd app sync grafana-config
argocd app sync miniflux
argocd app sync devpi
``` ```
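The per-app sync calls could also be collapsed into a loop that stops at the first failure. The sketch below is not part of the plan itself: `sync_apps` is a hypothetical helper, and keeping the listed order assumes later apps depend on earlier ones (for example `blumeops-pg` on `cloudnative-pg`).

```bash
# App names copied from the sync list in this plan, in the same order.
APPS="apps tailscale-operator cloudnative-pg blumeops-pg grafana grafana-config miniflux devpi"

# sync_apps CMD...: run "CMD... <app>" for each app, stopping at the first failure.
sync_apps() {
  for app in $APPS; do
    "$@" "$app" || { echo "sync failed for $app" >&2; return 1; }
  done
}

# Dry run that only prints the commands it would execute:
sync_apps echo argocd app sync

# Real run (needs a logged-in argocd CLI, so it is left commented out):
#   sync_apps argocd app sync
```

Passing the command as arguments keeps the loop easy to dry-run before touching the cluster.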
### 8. Verify All Services ### 10. Verify All Services
```bash ```bash
# Run health check
mise run indri-services-check mise run indri-services-check
# Verify each k8s service
argocd app list argocd app list
kubectl get pods --all-namespaces kubectl get pods --all-namespaces
``` ```
### 9. Clean Up Test Pod ---
```bash ## Volume Mounts for P6 (Kiwix/Transmission)
kubectl delete pod volume-test
``` With the docker driver, volume mounts work differently than QEMU2:
**Option A: Docker Desktop File Sharing + hostPath**
1. Mount sifaka NFS share on indri: `/Volumes/torrents`
2. Add `/Volumes/torrents` to Docker Desktop file sharing
3. Pods use hostPath pointing to that path
**Option B: NFS directly from pods**
- Docker containers can make NFS mounts (unlike podman's rootless containers)
- May need to test if sifaka allows connections from the Docker network
This will be fully tested in Phase 6.
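For Option A, a smoke test along the following lines might be enough to confirm the path is visible from a pod. It is a sketch only: `render_volume_test_pod` is a hypothetical helper, the pod name `volume-test` and the `busybox` image are borrowed from the earlier QEMU2 test pod, and whether `/Volumes/torrents` actually appears inside the minikube node under the docker driver is exactly what the test would need to establish.

```bash
# Hypothetical helper: print a test pod manifest that mounts a host path
# read-only at /data. Defaults to the torrents path discussed above.
render_volume_test_pod() {
  host_path="${1:-/Volumes/torrents}"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: volume-test
  namespace: default
spec:
  restartPolicy: Never
  containers:
    - name: test
      image: busybox
      command: ["sh", "-c", "ls -la /data && sleep 3600"]
      volumeMounts:
        - name: torrents
          mountPath: /data
          readOnly: true
  volumes:
    - name: torrents
      hostPath:
        path: ${host_path}
        type: Directory
EOF
}

# Print the manifest for inspection:
render_volume_test_pod /Volumes/torrents

# Possible use against the cluster (commented out because it needs kubectl access):
#   render_volume_test_pod | kubectl apply -f -
#   kubectl logs volume-test
#   kubectl delete pod volume-test
```

With `type: Directory`, the pod fails to start if the path is missing on the node, which makes a broken mount obvious rather than silently showing an empty directory.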
---
## Cleanup
After successful migration:
1. **Remove QEMU2 artifacts:**
```bash
brew uninstall qemu socket_vmnet
```
2. **Remove podman if no longer needed:**
```bash
podman machine stop
podman machine rm
brew uninstall podman
```
--- ---
## Verification Checklist ## Verification Checklist
- [ ] Podman minikube deleted - [ ] Docker Desktop installed and running on indri
- [ ] QEMU2 minikube running - [ ] QEMU2 minikube deleted
- [ ] `minikube mount` or NFS working - [ ] Docker minikube running
- [ ] Test pod can read `/Volumes/torrents` - [ ] API server accessible on localhost:6443
- [ ] Tailscale serve configured for svc:k8s → localhost:6443
- [ ] Remote kubectl access working from gilbert
- [ ] ArgoCD redeployed and synced - [ ] ArgoCD redeployed and synced
- [ ] All existing apps healthy (grafana, miniflux, devpi, etc.) - [ ] All existing apps healthy (grafana, miniflux, devpi, etc.)
- [ ] PostgreSQL cluster healthy - [ ] PostgreSQL cluster healthy
- [ ] Test pod deleted - [ ] `mise run indri-services-check` passes
- [ ] `mise run indri-services-check` passes (except intentionally offline services)
--- ---
## Rollback Plan ## Rollback Plan
If QEMU2 doesn't work: If Docker driver doesn't work:
1. Delete QEMU2 cluster: `minikube delete` 1. Delete Docker minikube: `minikube delete`
2. Recreate podman cluster following P0/P1 steps 2. Recreate QEMU2 cluster (restore old ansible config from git)
3. Redeploy apps from git 3. Accept the Tailscale TCP forwarding limitation and use SSH tunnel for remote kubectl
All state is in git, so cluster recreation is straightforward.
--- ---
## Notes ## Notes
- The QEMU2 VM will use more resources than podman (actual VM vs container) - Docker Desktop has resource overhead but provides better macOS integration
- First boot may be slower due to VM initialization - The docker driver is more widely used and tested than qemu2
- socket_vmnet provides better networking but requires sudo setup - File sharing permissions may need adjustment in Docker Desktop settings
- Consider creating a LaunchAgent for `minikube mount` if using that approach - First cluster start may be slow as Docker pulls the minikube base image
## Implementation Notes (2026-01-21) ## Implementation Notes (2026-01-21)
### What Actually Worked ### QEMU2 Cleanup Done
**Volume mounting solution**: NFS mount on indri (host) + `minikube mount` to pass through to VM Removed from indri:
- `/Library/LaunchDaemons/com.blumeops.nfs-torrents.plist` - NFS mount daemon
- `~/Library/LaunchAgents/com.blumeops.minikube-mount.plist` - minikube mount agent
- Unmounted `/Volumes/torrents-nfs` NFS mount
- Removed `/Volumes/torrents-nfs` mount point
1. Mount sifaka's torrents share on indri via NFS: `sudo mount -t nfs sifaka:/volume1/torrents /Volumes/torrents-nfs` ### Previous QEMU2 Issues
2. Run `minikube mount /Volumes/torrents-nfs:/mnt/torrents` from indri console (GUI session required due to macOS security)
3. Pods can access `/mnt/torrents` via hostPath
**Why NFS from inside VM didn't work**: Despite allowing 192.168.105.0/24 in Synology NFS settings, the VM got "access denied". Root cause unknown - may be Synology NFS quirk. The QEMU2 migration partially worked but had a critical issue:
- Volume mounts worked via NFS → indri → minikube mount chain
**Why SMB didn't work**: The minikube containerd kernel doesn't include the CIFS module. - But Tailscale TCP proxy to VM IP (192.168.105.2:6443) failed with TLS timeout
- Root cause unknown - TCP connected but TLS handshake never completed
### Zot Registry Mirror (Implemented)
The ansible role now configures containerd to redirect `registry.tail8d86e.ts.net` to `host.minikube.internal:5050`:
- Adds hosts file entry in VM
- Creates containerd registry mirror config at `/etc/containerd/certs.d/registry.tail8d86e.ts.net/hosts.toml`
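For illustration, the mirror file might look roughly like what the sketch below generates. This is not the role's actual template: `write_registry_mirror` is a hypothetical helper, the `http://` scheme for the Zot upstream and the `pull`/`resolve` capabilities are assumptions, and containerd only reads the file if its registry `config_path` points at `/etc/containerd/certs.d`.

```bash
# Hypothetical helper: write a containerd hosts.toml that redirects pulls for
# REGISTRY to MIRROR_URL, under CERTS_DIR/REGISTRY/hosts.toml.
write_registry_mirror() {
  certs_dir="$1"; registry="$2"; mirror="$3"
  mkdir -p "$certs_dir/$registry"
  cat > "$certs_dir/$registry/hosts.toml" <<EOF
server = "https://$registry"

[host."$mirror"]
  capabilities = ["pull", "resolve"]
EOF
}

# Demo against a throwaway directory rather than /etc/containerd/certs.d:
demo_dir=$(mktemp -d)
write_registry_mirror "$demo_dir" registry.tail8d86e.ts.net http://host.minikube.internal:5050
cat "$demo_dir/registry.tail8d86e.ts.net/hosts.toml"
```

Writing to a temporary directory first makes it easy to review the generated file before copying it into the minikube node.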
### Passwordless Sudo on Indri
Configured passwordless sudo for `erichblume` user to allow ansible `become: true` tasks to run without `-K` flag:
```bash
echo "erichblume ALL=(ALL) NOPASSWD: ALL" | sudo tee /etc/sudoers.d/erichblume
```
This is acceptable given the security model - tailnet access is the trust boundary.
### macOS Network Permission
The first time `minikube mount` runs, macOS will show a GUI popup asking to allow network access. This must be approved from the indri console (not SSH). Once approved, subsequent runs won't prompt.
### Manual Steps Still Required
These steps cannot be fully automated via ansible and must be done manually:
1. **socket_vmnet service (once per reboot)**:
```bash
# On indri console:
sudo brew services start socket_vmnet
```
2. **NFS mount on indri (once per reboot)**:
```bash
# On indri console:
sudo mount -t nfs sifaka:/volume1/torrents /Volumes/torrents-nfs
```
3. **minikube mount (must run in GUI session)**:
```bash
# On indri console (not SSH - requires GUI session for macOS security):
minikube mount /Volumes/torrents-nfs:/mnt/torrents
# Keep this terminal open - the mount dies if process exits
```
### TODO: LaunchAgent for Persistent Mount
Create a LaunchAgent to run `minikube mount` at login. Challenge: must run in GUI session context for macOS security model.