Erich Blume a8f4d00294 K8s Migration Phase 1: Infrastructure Setup (#29 )

## Summary
- Split k8s migration plan into phases folder for easier navigation
- Added `tag:k8s` to Pulumi ACLs for Kubernetes workloads
- Phase 1 work in progress

## Phase 1 Goals
- Tailscale Kubernetes Operator
- CloudNativePG Operator
- PostgreSQL cluster for future app migrations

## Deployment and Testing
- [ ] Review Phase 1 plan
- [ ] `mise run tailnet-preview` to verify ACL changes
- [ ] `mise run tailnet-up` to apply ACL changes
- [ ] Create Tailscale OAuth client (manual)
- [ ] Deploy operators and PostgreSQL cluster

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/29

2026-01-19 09:49:52 -08:00

35 KiB


Phase 0: Foundation (Complete)

Goal: Container registry + minikube cluster without disrupting existing services

Status: Complete

Important: Tailscale Service Creation Order

WARNING: You MUST create services in the Tailscale admin console BEFORE running tailscale serve commands via ansible. If you run tailscale serve --service svc:foo before the service exists in the admin console, the local config will be in a bad state.

To fix a misconfigured service:
tailscale serve --service svc:foo reset
Then create the service in admin console and try again.

Step 0.1: Update Pulumi ACLs (BEFORE Tailscale serve)

Files to modify:

pulumi/policy.hujson

Changes:

Add new tag to tagOwners section (around line 104, after "tag:feed"):

"tag:registry": ["autogroup:admin", "tag:blumeops"],

Add test cases to tests section:

Update Erich's accept list (around line 111) to include registry:

"accept": ["tag:grafana:443", "tag:kiwix:443", "tag:feed:443", "tag:loki:3100", "tag:pg:5432", "tag:homelab:22", "tag:registry:443"],

Update Allison's deny list (around line 117) to deny registry:

"deny": ["tag:grafana:443", "tag:loki:3100", "tag:nas:445", "tag:registry:443"],

Note:

No member grant needed - admins have full access via wildcard, members don't need registry
tag:k8s is added later in Phase 1 when the Tailscale Kubernetes Operator is deployed
Zot supports htpasswd auth if we later need finer-grained control

Testing:

mise run tailnet-preview   # Review changes - should show new tag
mise run tailnet-up        # Apply changes

Implementation Details:

Also need to add "tag:registry" to indri's tags in pulumi/__main__.py (the DeviceTags resource), not just define it in policy.hujson. The policy file defines the tag ownership rules, but the device tags are managed separately in the Python code.

Step 0.2: Create Tailscale Services in Admin Console (MANUAL)

CRITICAL: Do this BEFORE running any ansible that calls tailscale serve

Go to https://login.tailscale.com/admin/services
Create service registry with:
- Port: 443 (HTTPS)
- Host: indri

Implementation Details:

Tag is applied to indri via Pulumi in Step 0.1, not manually in admin console.

Verification:

# Service should appear (even if not yet serving)
tailscale status | grep registry

Step 0.3: Create Zot Registry Ansible Role

Note: Zot is NOT in homebrew (no formula or tap). Clone to ~/code/3rd/ on indri and build from source (requires Go).

Prerequisites on indri (ALREADY COMPLETED):

# Clone zot from forge mirror (use localhost:3001 - hairpinning doesn't work on indri)
ssh indri 'git clone http://localhost:3001/eblume/zot.git ~/code/3rd/zot'

# Set up Go via mise (creates mise.toml in repo directory)
ssh indri 'cd ~/code/3rd/zot && mise use go@1.25'

# Build (creates bin/zot-darwin-arm64, ~183MB)
ssh indri 'cd ~/code/3rd/zot && mise x -- make binary'

# Verify binary exists
ssh indri 'ls -la ~/code/3rd/zot/bin/zot-darwin-arm64'

Build verified: Binary at ~/code/3rd/zot/bin/zot-darwin-arm64 (183MB, ARM64 native).

New files:

ansible/roles/zot/
├── defaults/main.yml
├── tasks/main.yml
├── templates/
│   ├── config.json.j2
│   └── zot.plist.j2
└── handlers/main.yml

Key configuration (defaults/main.yml):

zot_repo_dir: "/Users/erichblume/code/3rd/zot"
zot_binary: "{{ zot_repo_dir }}/bin/zot-darwin-arm64"
zot_data_dir: "/Users/erichblume/zot"
zot_config_dir: "/Users/erichblume/.config/zot"
zot_port: 5000
zot_log_dir: "/Users/erichblume/Library/Logs"

# Pull-through cache registries (on-demand sync)
zot_sync_registries:
  - name: docker.io
    url: https://registry-1.docker.io
  - name: ghcr.io
    url: https://ghcr.io
  - name: quay.io
    url: https://quay.io

Zot config.json template (key sections):

{
  "storage": {
    "rootDirectory": "/Users/erichblume/zot"
  },
  "http": {
    "address": "0.0.0.0",
    "port": "5000"
  },
  "extensions": {
    "sync": {
      "enable": true,
      "registries": [
        {
          "urls": ["https://registry-1.docker.io"],
          "content": [{"prefix": "**"}],
          "onDemand": true,
          "tlsVerify": true
        },
        {
          "urls": ["https://ghcr.io"],
          "content": [{"prefix": "**"}],
          "onDemand": true,
          "tlsVerify": true
        },
        {
          "urls": ["https://quay.io"],
          "content": [{"prefix": "**"}],
          "onDemand": true,
          "tlsVerify": true
        }
      ]
    }
  }
}

Two modes of operation:

Pull-through cache (automatic): When you pull registry.tail8d86e.ts.net/docker.io/library/nginx:latest, Zot fetches from Docker Hub and caches locally. Subsequent pulls are local.

Private images (manual push): Push your own images to any path NOT matching a sync prefix:

# From gilbert (after building)
podman push myapp:v1 registry.tail8d86e.ts.net/blumeops/myapp:v1

Namespace convention:

registry.tail8d86e.ts.net/docker.io/* → cached from Docker Hub
registry.tail8d86e.ts.net/ghcr.io/* → cached from GHCR
registry.tail8d86e.ts.net/quay.io/* → cached from Quay
registry.tail8d86e.ts.net/blumeops/* → private images (built by you/Woodpecker)
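The convention above means an upstream image name has to be rewritten before it is pulled through the cache. The sketch below shows one way to do that rewrite. It assumes the standard Docker reference rules: a first path component with a `.` or `:`, or the value `localhost`, is treated as a registry domain; anything else defaults to Docker Hub, and single-name Hub images get `library/`. `zot_ref` and `ZOT_REGISTRY` are hypothetical names. Domains other than the three configured upstreams will produce paths that Zot will not sync.

```bash
# Hypothetical helper: map an upstream image reference to its Zot cache path.
ZOT_REGISTRY=registry.tail8d86e.ts.net

zot_ref() {
  local ref=$1 first domain rest last
  first=${ref%%/*}
  # A leading component that looks like a host is the registry domain.
  if [[ $ref == */* && ( $first == *.* || $first == *:* || $first == localhost ) ]]; then
    domain=$first
    rest=${ref#*/}
  else
    domain=docker.io
    rest=$ref
  fi
  # Docker Hub "official" images live under library/.
  if [[ $domain == docker.io && $rest != */* ]]; then
    rest=library/$rest
  fi
  # Default to :latest when neither a tag nor a digest is given.
  last=${rest##*/}
  if [[ $last != *:* && $last != *@* ]]; then
    rest=$rest:latest
  fi
  printf '%s/%s/%s\n' "$ZOT_REGISTRY" "$domain" "$rest"
}
```

For example, `zot_ref nginx` yields the same path used in the pull-through example above. `zot_ref ghcr.io/foo/bar:v1` keeps the GHCR domain as the first path segment.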

LaunchAgent template (zot.plist.j2):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "...">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>mcquack.eblume.zot</string>
    <key>ProgramArguments</key>
    <array>
        <!-- ABSOLUTE PATH to built binary in ~/code/3rd/zot -->
        <string>{{ zot_binary }}</string>
        <string>serve</string>
        <string>{{ zot_config_dir }}/config.json</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>{{ zot_log_dir }}/mcquack.zot.out.log</string>
    <key>StandardErrorPath</key>
    <string>{{ zot_log_dir }}/mcquack.zot.err.log</string>
</dict>
</plist>

Handlers (handlers/main.yml):

- name: Restart zot
  ansible.builtin.shell: |
    launchctl unload ~/Library/LaunchAgents/mcquack.eblume.zot.plist 2>/dev/null || true
    launchctl load ~/Library/LaunchAgents/mcquack.eblume.zot.plist
  changed_when: true

Tasks should notify handler on config change:

- name: Deploy zot config
  ansible.builtin.template:
    src: config.json.j2
    dest: "{{ zot_config_dir }}/config.json"
  notify: Restart zot

Testing (after deploying role):

# Check LaunchAgent is running
ssh indri 'launchctl list | grep zot'

# Check zot is responding
ssh indri 'curl -s http://localhost:5000/v2/_catalog'
# Expected: {"repositories":[]}

# Check logs for errors
ssh indri 'tail -20 ~/Library/Logs/mcquack.zot.err.log'

# Test pull-through cache via curl (podman not installed until Step 0.8)
ssh indri 'curl -s http://localhost:5000/v2/docker.io/library/alpine/manifests/latest -H "Accept: application/vnd.oci.image.manifest.v1+json"'
# Should return manifest JSON (triggers cache fetch from Docker Hub)
ssh indri 'curl -s http://localhost:5000/v2/_catalog'
# Expected: {"repositories":["docker.io/library/alpine"]}

Implementation Details:

Changed port from 5000 to 5050 because macOS ControlCenter (AirPlay Receiver) uses port 5000 by default.
Fixed sync config: use "content": [{"prefix": "**", "destination": "/{{ registry.name }}"}] instead of "prefix": "{{ registry.name }}/**". The destination rewrites the local path, while prefix ** matches all upstream repos.
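The corrected sync entry shape is easier to see when it is rendered out. The real file comes from config.json.j2 via Ansible. The bash sketch below is only an illustration under that assumption, not the role's template. It turns `name=url` pairs, taken from `zot_sync_registries`, into the `registries` array, using `"prefix": "**"` plus a per-registry `"destination": "/<name>"` as described in the note above. `render_sync_registries` is a hypothetical name.

```bash
# Render Zot's extensions.sync.registries array from name=url pairs.
# Each upstream matches every repo ("**") and is stored under /<name>.
render_sync_registries() {
  local pair name url sep=""
  printf '['
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first "="
    url=${pair#*=}     # text after the first "="
    printf '%s{"urls":["%s"],"content":[{"prefix":"**","destination":"/%s"}],"onDemand":true,"tlsVerify":true}' \
      "$sep" "$url" "$name"
    sep=","
  done
  printf ']\n'
}

# Same three upstreams as zot_sync_registries in defaults/main.yml.
render_sync_registries \
  docker.io=https://registry-1.docker.io \
  ghcr.io=https://ghcr.io \
  quay.io=https://quay.io
```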

Step 0.4: Add Zot to Tailscale Serve

Files to modify:

ansible/roles/tailscale_serve/defaults/main.yml

Changes:

# Add to tailscale_serve_services list
- name: svc:registry
  https:
    port: 443
    upstream: http://localhost:5000

Testing:

# Deploy tailscale serve config
mise run provision-indri -- --tags tailscale-serve

# Verify from gilbert (not indri - hairpinning doesn't work)
curl -s https://registry.tail8d86e.ts.net/v2/_catalog
# Expected: {"repositories":["docker.io/library/alpine"]} (from Step 0.3 test)

# Test private image push from gilbert
podman pull alpine:latest
podman tag alpine:latest registry.tail8d86e.ts.net/blumeops/test:v1
podman push registry.tail8d86e.ts.net/blumeops/test:v1
curl -s https://registry.tail8d86e.ts.net/v2/_catalog
# Expected: {"repositories":["blumeops/test","docker.io/library/alpine"]}

Implementation Details:

Changed upstream port from 5000 to 5050 (see Step 0.3 implementation details).
After running tailscale serve, the service must be approved in Tailscale admin console at https://login.tailscale.com/admin/services before it becomes accessible.
Podman needed on gilbert for testing - added to Brewfile. Requires podman machine init && podman machine start after install.

Step 0.5: Create Zot Metrics Role

New files:

ansible/roles/zot_metrics/
├── defaults/main.yml
├── tasks/main.yml
├── templates/
│   ├── zot-metrics.sh.j2
│   └── zot-metrics.plist.j2
└── handlers/main.yml

Metrics script pattern (zot-metrics.sh.j2):

#!/bin/bash
# Collect Zot registry metrics for Prometheus textfile collector
set -euo pipefail

METRICS_FILE="/opt/homebrew/var/node_exporter/textfile/zot.prom"
TEMP_FILE="${METRICS_FILE}.tmp"

# Check if zot is up
if curl -sf http://localhost:5000/v2/_catalog > /dev/null 2>&1; then
    echo "zot_up 1" > "$TEMP_FILE"
else
    echo "zot_up 0" > "$TEMP_FILE"
fi

mv "$TEMP_FILE" "$METRICS_FILE"

Note: Start with just zot_up for now. Additional metrics (storage usage, cache stats) can be added later after reviewing zot's metrics endpoint.
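When more metrics are added, the temp-file-then-`mv` pattern is worth keeping. The sketch below factors it into a hypothetical `write_gauge` helper that also emits `# HELP` / `# TYPE` lines in the Prometheus text format. node_exporter's textfile collector only reads `*.prom` files, so the `.tmp` file is ignored while it is being written. The rename within one directory then swaps it in atomically.

```bash
# Hypothetical helper: write_gauge <file> <name> <help> <value>
# Writes one gauge in Prometheus text exposition format, atomically.
write_gauge() {
  local file=$1 name=$2 help=$3 value=$4
  local tmp="${file}.tmp"
  {
    printf '# HELP %s %s\n' "$name" "$help"
    printf '# TYPE %s gauge\n' "$name"
    printf '%s %s\n' "$name" "$value"
  } > "$tmp"
  # Same-directory rename: readers see either the old file or the new one.
  mv "$tmp" "$file"
}
```

In zot-metrics.sh.j2, the two curl branches could then call `write_gauge "$METRICS_FILE" zot_up "Whether Zot answers /v2/_catalog" 1` (or `0`). The probe URL should match the port Zot actually listens on (see the port note in Step 0.3).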

Testing:

# Deploy metrics role
mise run provision-indri -- --tags zot_metrics

# Check metrics file exists and is updated
ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/zot.prom'
# Expected: zot_up 1

# Verify metrics appear in Prometheus (after a scrape cycle)
curl -s "http://indri:9090/api/v1/query?query=zot_up" | jq '.data.result[0].value[1]'
# Expected: "1"

Step 0.6: Add Zot Log Collection to Alloy

Files to modify:

ansible/roles/alloy/defaults/main.yml

Changes: Add to the alloy_mcquack_logs list:

  - path: /Users/erichblume/Library/Logs/mcquack.zot.out.log
    service: zot
    stream: stdout
  - path: /Users/erichblume/Library/Logs/mcquack.zot.err.log
    service: zot
    stream: stderr

Testing:

# Deploy alloy config (handler restarts alloy automatically if config changed)
mise run provision-indri -- --tags alloy

# Wait a minute, then check Loki for zot logs
# In Grafana Explore, query: {service="zot"}

Step 0.7: Update indri-services-check Script

Files to modify:

mise-tasks/indri-services-check

Changes to add:

# Add after existing service checks (around line 55)
check_service "zot" "ssh indri 'launchctl list | grep zot | grep -v \"^-\"'"
check_service "zot-metrics" "ssh indri 'launchctl list | grep zot-metrics | grep -v \"^-\"'"

# Add to HTTP endpoints section (around line 65)
check_http "Zot Registry" "http://indri:5000/v2/_catalog"

# Add metrics file check
check_service "Zot metrics" "ssh indri 'test -f /opt/homebrew/var/node_exporter/textfile/zot.prom'"

Testing:

# Run the health check
mise run indri-services-check

# Expected output includes:
# zot...               OK
# zot-metrics...       OK
# Zot Registry...      OK
# Zot metrics...       OK
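The checks above call the script's existing `check_service` helper, whose body is not shown in this plan. The sketch below is a hypothetical minimal version that reproduces the expected output layout: the name plus `...`, padded to 21 columns, then OK or FAIL. It runs the command string through `bash -c`, which is why the inner quotes in the check strings are escaped. It counts failures instead of exiting, so one broken service does not hide the rest.

```bash
# Hypothetical check_service: run a command string, print a padded status line.
FAILURES=0

check_service() {
  local name=$1 cmd=$2
  printf '%-21s' "${name}..."
  if bash -c "$cmd" > /dev/null 2>&1; then
    echo "OK"
  else
    echo "FAIL"
    FAILURES=$((FAILURES + 1))
  fi
}
```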

Implementation Details:

Used Tailscale service URL (https://registry.tail8d86e.ts.net/v2/_catalog) instead of internal endpoint to verify full path works.

Step 0.8: Install and Configure Podman on Indri

New files:

ansible/roles/podman/
├── tasks/main.yml
└── handlers/main.yml

Tasks (tasks/main.yml):

- name: Install podman via homebrew
  community.general.homebrew:
    name: podman
    state: present

- name: Initialize podman machine (if not exists)
  ansible.builtin.command:
    cmd: podman machine init --cpus 4 --memory 8192 --disk-size 220
  register: podman_init
  changed_when: podman_init.rc == 0
  failed_when: podman_init.rc not in [0, 125]  # 125 = already exists

- name: Start podman machine
  ansible.builtin.command:
    cmd: podman machine start
  register: podman_start
  changed_when: "'started successfully' in podman_start.stdout"
  failed_when: podman_start.rc not in [0, 125]  # 125 = already running

Testing:

# Deploy podman role
mise run provision-indri -- --tags podman

# Verify podman is working
ssh indri 'podman info'
ssh indri 'podman run --rm hello-world'

Implementation Details:

KNOWN ISSUE: podman machine init and podman machine start have reliability issues when run via Ansible/SSH. The machine sometimes gets stuck in "Starting" state due to a race condition (see https://github.com/containers/podman/issues/16945). Apple Hypervisor may also require GUI session context.

WORKAROUND: If the machine fails to start via Ansible, manually run on indri:

podman machine rm -f podman-machine-default
podman machine init --cpus 4 --memory 8192 --disk-size 220
podman machine start

LaunchAgent approach was attempted but didn't resolve the issue reliably.
TODO: Investigate proper automation solution for reliable podman machine management.
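Until that TODO is resolved, one stopgap is to bound the wait after `podman machine start` rather than assume the machine is ready. This does not fix the race itself. The sketch below is a generic polling helper; `retry_until` is a hypothetical name.

```bash
# Poll a command until it succeeds or the attempt budget runs out.
# Usage: retry_until <attempts> <delay-seconds> <command> [args...]
retry_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # No sleep after the final failed attempt.
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, `retry_until 30 5 podman info > /dev/null` waits up to roughly two and a half minutes for the machine's API to answer, since `podman info` fails while it cannot reach the machine. If it still fails, fall back to the manual rm/init/start workaround above.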

Step 0.9: Install and Configure Minikube

New files:

ansible/roles/minikube/
├── defaults/main.yml
├── tasks/main.yml
└── handlers/main.yml

Defaults:

minikube_cpus: 4
minikube_memory: 8192
minikube_disk_size: "200g"
minikube_driver: podman
minikube_container_runtime: cri-o

Note on storage: The disk-size is for node-local storage only (container images, emptyDir, local PVs). Pods can also mount external storage:

hostPath - indri filesystem (e.g., ~/transmission/ for kiwix ZIM files)
NFS - sifaka volumes (Synology supports NFS natively, easiest for k8s)
SMB/CIFS - requires csi-driver-smb; sifaka currently uses SMB for desktop mounts

Tasks:

- name: Install minikube via homebrew
  community.general.homebrew:
    name: minikube
    state: present

- name: Check if minikube cluster exists
  ansible.builtin.command:
    cmd: "minikube status --format='{% raw %}{{.Host}}{% endraw %}'"  # raw: keep Ansible/Jinja from templating the Go template
  register: minikube_status
  changed_when: false
  failed_when: false

- name: Start minikube cluster
  ansible.builtin.command:
    cmd: >
      minikube start
      --driver={{ minikube_driver }}
      --container-runtime={{ minikube_container_runtime }}
      --cpus={{ minikube_cpus }}
      --memory={{ minikube_memory }}
      --disk-size={{ minikube_disk_size }}
  when: minikube_status.rc != 0 or 'Running' not in minikube_status.stdout

Testing:

# Deploy minikube role
mise run provision-indri -- --tags minikube

# Verify cluster is running
ssh indri 'minikube status'
# Expected: host: Running, kubelet: Running, apiserver: Running

# Test kubectl access from indri
ssh indri 'kubectl get nodes'
# Expected: minikube   Ready    control-plane   ...

Implementation Details:

Changed minikube_memory from 8192 to 7800 because podman machine reports slightly less available memory (7908MB) due to VM overhead. Minikube rejects memory requests exceeding what podman reports.
Deployed with Kubernetes v1.34.0 and CRI-O 1.24.6.

Step 0.10: Configure Kubeconfig on Gilbert

Goal: Enable kubectl and k9s on gilbert to connect to the minikube cluster running on indri.

Considerations:

Minikube runs inside a podman VM on indri, so the API server isn't directly exposed on indri's network interface
Admin users have full Tailscale access to indri via autogroup:admin → * → *
Be careful not to overwrite existing work kubeconfigs

Possible approaches:

SSH tunneling to forward the API server port
minikube tunnel running on indri (exposes LoadBalancer services)
Configure minikube with --apiserver-names=indri at cluster creation time
Use kubectl via SSH wrapper: ssh indri kubectl ...

Verification:

# From gilbert, these should work:
kubectl get nodes
kubectl get namespaces
k9s  # Should show the minikube cluster

The exact approach will be determined during implementation based on what works best with the podman driver.

Implementation Details:

Chose Option 3: Recreate cluster with --apiserver-names after researching alternatives:

SSH tunneling - Requires keeping a tunnel running or complex on-demand setup
SOCKS5 proxy with kubeconfig proxy-url - Kubeconfig supports proxy-url: socks5://localhost:1080 per-context, but still requires managing the proxy
--apiserver-names + --listen-address - Native minikube support, cleanest solution

Cluster Setup: Recreated the minikube cluster with additional flags:

minikube delete
minikube start \
  --driver=podman \
  --container-runtime=cri-o \
  --cpus=4 --memory=7800 --disk-size=200g \
  --apiserver-names=indri \
  --listen-address=0.0.0.0

--apiserver-names=indri adds "indri" to the API server certificate SAN
--listen-address=0.0.0.0 tells podman to expose the API port on all interfaces
API server port is dynamic (check with kubectl config view --minify -o jsonpath="{.clusters[0].cluster.server}" on indri)
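Because the port changes whenever the cluster is recreated, it can help to pull it out of the kubeconfig server URL instead of copying it by hand. The sketch below is a hypothetical `server_port` helper. It strips the scheme and any path, then prints the port, defaulting to 443 when the URL has none. IPv6 literal hosts are not handled.

```bash
# Print the port of a kubeconfig server URL (443 if none is given).
server_port() {
  local hostport=${1#*://}   # drop the scheme
  hostport=${hostport%%/*}   # drop any path
  if [[ $hostport == *:* ]]; then
    printf '%s\n' "${hostport##*:}"
  else
    printf '443\n'
  fi
}
```

On indri this could be fed the same query as above: `server_port "$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')"`.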

Credential Management with 1Password:

Rather than copying private keys between machines, credentials are stored in 1Password and fetched on-demand using kubectl's exec credential plugin. This mirrors the 1Password SSH agent pattern for biometric-protected key access.

Store credentials in 1Password (vault: vg6xf6vvfmoh5hqjjhlhbeoaie, item: 3jo4f2hnzvwfmamudfsbbbec7e):
- client-cert - Contents of ~/.minikube/profiles/minikube/client.crt (text field)
- client-key - Contents of ~/.minikube/profiles/minikube/client.key (text field)
- ca-cert - Contents of ~/.minikube/ca.crt (text field, not secret but stored for convenience)

Created credential helper script at bin/kubectl-credential-1password:

#!/bin/bash
# Fetches client cert/key from 1Password, outputs ExecCredential JSON
# Usage: kubectl-credential-1password <vault-id> <item-id> <cert-field> <key-field>

Symlinked to ~/.local/bin/kubectl-credential-1password
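Only the helper's header is recorded here. The sketch below is a hypothetical body that satisfies that usage line; it is not the actual script. It assumes the 1Password CLI form `op item get <item> --vault <vault> --fields <field>` and reuses the quote-stripping sed from this step. It prints an ExecCredential whose `apiVersion` matches the `--exec-api-version` set in the kubeconfig below. `clientCertificateData` and `clientKeyData` carry the PEM text.

```bash
#!/usr/bin/env bash
# Sketch: kubectl-credential-1password <vault-id> <item-id> <cert-field> <key-field>
set -euo pipefail

# Read one text field; strip the quotes op wraps around the value.
op_field() {
  op item get "$2" --vault "$1" --fields "$3" | sed 's/^"//; s/"$//'
}

# Escape backslashes, double quotes and newlines for a JSON string value.
json_escape() {
  local s=$1 bs='\'
  s=${s//"$bs"/"$bs$bs"}
  s=${s//'"'/"$bs"'"'}
  s=${s//$'\n'/"$bs"n}
  printf '%s' "$s"
}

# Emit an ExecCredential carrying a PEM client cert/key pair.
exec_credential() {
  printf '{"apiVersion":"client.authentication.k8s.io/v1beta1","kind":"ExecCredential","status":{"clientCertificateData":"%s","clientKeyData":"%s"}}\n' \
    "$(json_escape "$1")" "$(json_escape "$2")"
}

main() {
  if (( $# != 4 )); then
    echo "usage: kubectl-credential-1password <vault-id> <item-id> <cert-field> <key-field>" >&2
    return 1
  fi
  local cert key
  cert=$(op_field "$1" "$2" "$3")
  key=$(op_field "$1" "$2" "$4")
  exec_credential "$cert" "$key"
}

# kubectl always passes the four --exec-arg values; with no arguments the
# file only defines the functions above.
if (( $# > 0 )); then
  main "$@"
fi
```

Nothing is written to disk. The key only exists in the pipe between `op` and kubectl.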

Kubeconfig setup on gilbert:

# Store CA cert locally (not secret - public key for server verification)
mkdir -p ~/.kube/minikube-indri
op --vault <vault> item get <item> --fields ca-cert | sed 's/^"//; s/"$//' > ~/.kube/minikube-indri/ca.crt

# Configure cluster
kubectl config set-cluster minikube-indri \
  --server=https://indri:<port> \
  --certificate-authority=/Users/eblume/.kube/minikube-indri/ca.crt

# Configure credentials with exec plugin
kubectl config set-credentials minikube-indri \
  --exec-api-version=client.authentication.k8s.io/v1beta1 \
  --exec-command=kubectl-credential-1password \
  --exec-arg=<vault-id> \
  --exec-arg=<item-id> \
  --exec-arg=client-cert \
  --exec-arg=client-key

# Create context
kubectl config set-context minikube-indri \
  --cluster=minikube-indri \
  --user=minikube-indri

Usage:

kubectl --context=minikube-indri get nodes
# or
kubectl config use-context minikube-indri
kubectl get nodes

Security Notes:

Client private key never stored on disk - fetched from 1Password on each kubectl command
CA cert stored on disk (not secret - it's a public key for server verification)
1Password biometric/password prompt required for credential access
op wraps these text field values in double quotes; sed 's/^"//; s/"$//' strips them

Step 0.11: Add Minikube to indri-services-check

Files to modify:

mise-tasks/indri-services-check

Changes:

# Add new section for Kubernetes
echo ""
echo "Kubernetes cluster:"
check_service "minikube" "ssh indri 'minikube status --format={{.Host}} | grep -q Running'"
check_service "k8s-apiserver" "ssh indri 'kubectl get --raw /healthz'"

Testing:

mise run indri-services-check

# Expected output includes:
# Kubernetes cluster:
# minikube...          OK
# k8s-apiserver...     OK

Implementation Notes:

Added a third check k8s-apiserver (remote) that verifies kubectl access from gilbert, not just via SSH to indri. This ensures the 1Password credential flow and remote API server access are working.
The remote check uses both --kubeconfig and --context flags explicitly since the script runs in bash (not fish) and doesn't inherit the KUBECONFIG environment variable from fish config.

Step 0.12: Create Zettelkasten Documentation

New files:

~/code/personal/zk/zot.md
~/code/personal/zk/minikube.md

Files to update:

~/code/personal/zk/1767747119-YCPO.md (main blumeops card)

Updates to main blumeops card:

Add to Device Tags table:
| tag:registry | indri | Container registry access |

Add to Services table:
| Registry | https://registry.tail8d86e.ts.net | OCI container registry (Zot) | zot |
| Kubernetes | https://indri:<port> | Minikube cluster | minikube |

Add to Port Map (Indri) table:
| 5050 | Zot | HTTP | localhost | Container registry |
| <port> | K8s API | HTTPS | 0.0.0.0 | Minikube API server |

Add new section Remote Kubernetes Access:

## Remote Kubernetes Access (from Gilbert)

The minikube cluster on indri is accessible from gilbert via direct connection.
Cluster was created with `--apiserver-names=indri --listen-address=0.0.0.0`.

**Fish abbreviations** (in `~/.config/fish/config.fish`):
- `ki` → `kubectl --context=minikube-indri`
- `k9i` → `k9s --context=minikube-indri`
- `k9` → `k9s`

```bash
# Quick access via abbreviations
ki get nodes
k9i

# Or explicitly set context
kubectl config use-context minikube-indri
kubectl get nodes
```

Template for zot.md:

---
id: zot
aliases:
  - zot
  - container-registry
tags:
  - blumeops
---

# Zot Registry Management Log

Zot is an OCI-native container registry running on Indri, providing:
1. Pull-through cache for Docker Hub, GHCR, Quay (avoids rate limits)
2. Private image storage for custom-built containers

## Service Details

- URL: https://registry.tail8d86e.ts.net
- Local port: 5050
- Data directory: ~/zot
- Config: ~/.config/zot/config.json
- Managed via: mcquack LaunchAgent

## Namespace Convention

| Path | Source |
|------|--------|
| `registry.../docker.io/*` | Cached from Docker Hub |
| `registry.../ghcr.io/*` | Cached from GHCR |
| `registry.../quay.io/*` | Cached from Quay |
| `registry.../blumeops/*` | Private images (yours) |

## Useful Commands

```bash
# List all images
curl -s http://localhost:5050/v2/_catalog | jq

# Pull via cache (from indri or k8s)
podman pull localhost:5050/docker.io/library/nginx:latest

# Build and push private image (from gilbert)
podman build -t registry.tail8d86e.ts.net/blumeops/myapp:v1 .
podman push registry.tail8d86e.ts.net/blumeops/myapp:v1

# Check service status
launchctl list | grep zot

# View logs
tail -f ~/Library/Logs/mcquack.zot.err.log
```

## Log

### [DATE]
- Initial setup for k8s migration Phase 0

Template for minikube.md:

---
id: minikube
aliases:
  - minikube
  - kubernetes
  - k8s
tags:
  - blumeops
---

# Minikube Management Log

Minikube provides a single-node Kubernetes cluster on Indri for running containerized services.

## Cluster Details

- Driver: podman (rootless)
- Container runtime: CRI-O
- Kubernetes version: v1.34.0
- Resources: 4 CPUs, 7800MB RAM, 200GB disk
- API server: https://indri:<port> (accessible from gilbert via Tailscale)

## Remote Access from Gilbert

Cluster was created with `--apiserver-names=indri --listen-address=0.0.0.0` to allow remote kubectl access.

```bash
# Switch context
kubectl config use-context minikube-indri

# Verify
kubectl get nodes
kubectl get namespaces

# Use k9s
k9s --context minikube-indri
```

## Useful Commands (on indri)

```bash
# Cluster status
minikube status

# Start/stop cluster
minikube start
minikube stop

# Access dashboard
minikube dashboard

# SSH into node
minikube ssh

# View logs
minikube logs
```

## Podman Machine (prerequisite)

Minikube uses podman as its driver (the container runtime inside the node is CRI-O). The podman machine must be running:

```bash
# Check podman machine
podman machine list

# Start if needed
podman machine start
```

## Log

### [DATE]
- Initial cluster setup for k8s migration Phase 0
- Configured for remote access with --apiserver-names=indri

Implementation Notes:

Created zot.md and minikube.md in ~/code/personal/zk/
Updated 1767747119-YCPO.md (main blumeops card) with all specified changes
Added 1Password credential plugin reference to minikube docs
K8s API port is 39535 (dynamically assigned by minikube, may change on cluster recreation)

Step 0.13: Update Main Playbook

Files to modify:

ansible/playbooks/indri.yml

Changes:

# Add new roles to the roles list
- role: podman
  tags: podman
- role: zot
  tags: zot
- role: zot_metrics
  tags: zot_metrics
- role: minikube
  tags: minikube

Implementation Notes:

Roles were added incrementally during Steps 0.3, 0.5, 0.8, and 0.9
All four roles (zot, zot_metrics, podman, minikube) confirmed present in indri.yml

Step 0.14: Expose K8s API as Tailscale Service (Added Post-Completion)

Note: This step was added after Phase 0 was otherwise complete, to provide a stable, named endpoint for the Kubernetes API server.

Goal: Expose the minikube API server as k8s.tail8d86e.ts.net instead of using indri:<dynamic-port>.

Current state:

Minikube API server on port 39535 (dynamic, could change on cluster recreation)
Accessed via https://indri:39535
Certificate SANs include "indri"

Target state:

Stable Tailscale service at k8s.tail8d86e.ts.net:443
Fixed API server port (6443, the k8s standard)
Certificate SANs include both hostnames for compatibility

Step 0.14.1: Update Pulumi ACLs

Files to modify:

pulumi/policy.hujson
pulumi/__main__.py

Changes to policy.hujson:

Add tag to tagOwners:

"tag:k8s-api": ["autogroup:admin", "tag:blumeops"],

Update Erich's test case accept list to include k8s-api:

"accept": ["tag:grafana:443", "tag:kiwix:443", "tag:feed:443", "tag:loki:3100", "tag:pg:5432", "tag:homelab:22", "tag:registry:443", "tag:k8s-api:443"],

Update Allison's deny list:

"deny": ["tag:grafana:443", "tag:loki:3100", "tag:nas:445", "tag:registry:443", "tag:k8s-api:443"],

Changes to main.py:

Add "tag:k8s-api" to indri's DeviceTags

Testing:

mise run tailnet-preview   # Review changes
mise run tailnet-up        # Apply changes

Step 0.14.2: Create Tailscale Service in Admin Console (MANUAL)

CRITICAL: Do this BEFORE running ansible that calls tailscale serve

Go to https://login.tailscale.com/admin/services
Create service k8s with:
- Port: 443 (TCP)
- Host: indri

Step 0.14.3: Recreate Minikube Cluster

The cluster needs to be recreated to:

Add k8s.tail8d86e.ts.net to the API server certificate SANs
Fix the API server port to 6443 (standard k8s port)

On indri:

# Stop and delete existing cluster
minikube stop
minikube delete

# Recreate with new settings
minikube start \
  --driver=podman \
  --container-runtime=cri-o \
  --cpus=4 --memory=7800 --disk-size=200g \
  --apiserver-names=k8s.tail8d86e.ts.net,indri \
  --apiserver-port=6443 \
  --listen-address=0.0.0.0

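# Verify the API server URL uses the fixed port
kubectl config view --minify -o jsonpath="{.clusters[0].cluster.server}"
# Expected: https://127.0.0.1:6443 or similar

# Verify certificate SANs include both names
openssl x509 -in ~/.minikube/profiles/minikube/apiserver.crt -noout -text | grep -A1 "Subject Alternative Name"
# Expected: list includes DNS:k8s.tail8d86e.ts.net and DNS:indri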

# Verify cluster is running
minikube status
kubectl get nodes

Update ansible role defaults (ansible/roles/minikube/defaults/main.yml):

minikube_apiserver_names:
  - k8s.tail8d86e.ts.net
  - indri
minikube_apiserver_port: 6443

Step 0.14.4: Add K8s Service to Tailscale Serve

Files to modify:

ansible/roles/tailscale_serve/defaults/main.yml

Add to services list:

- name: svc:k8s
  tcp:
    port: 443
    upstream: tcp://localhost:6443

Note: Using TCP passthrough (not HTTPS termination) because k8s uses mTLS authentication.

Deploy:

mise run provision-indri -- --tags tailscale-serve

Step 0.14.5: Update 1Password Credentials

After cluster recreation, the client certificates have changed.

On indri, get the new credentials:

# Display new certificates (copy to 1Password)
cat ~/.minikube/profiles/minikube/client.crt
cat ~/.minikube/profiles/minikube/client.key
cat ~/.minikube/ca.crt

In 1Password (vault: vg6xf6vvfmoh5hqjjhlhbeoaie, item: 3jo4f2hnzvwfmamudfsbbbec7e):

Update client-cert field with new certificate
Update client-key field with new key
Update ca-cert field with new CA certificate

Step 0.14.6: Update Kubeconfig on Gilbert

Update CA certificate:

# Fetch new CA cert from 1Password
op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get 3jo4f2hnzvwfmamudfsbbbec7e --fields ca-cert | sed 's/^"//; s/"$//' > ~/.kube/minikube-indri/ca.crt

Update kubeconfig (~/.kube/minikube-indri/config.yml):

clusters:
- cluster:
    certificate-authority: /Users/eblume/.kube/minikube-indri/ca.crt
    server: https://k8s.tail8d86e.ts.net  # Changed from https://indri:39535
  name: minikube-indri

Verification:

# Test connection via new hostname
kubectl --context=minikube-indri get nodes

# Test via abbreviation
ki get nodes

Step 0.14.7: Update Documentation

Files to update:

~/code/personal/zk/minikube.md - Update API server URL and port info
~/code/personal/zk/1767747119-YCPO.md - Update Services table and Port Map

Changes to blumeops card:

Update Services table: | Kubernetes | https://k8s.tail8d86e.ts.net | Minikube cluster | minikube |
Update Port Map: | 6443 | K8s API | HTTPS/TCP | 0.0.0.0 | Minikube API server (via Tailscale) |
Add tag:k8s-api to Device Tags table

Step 0.14.8: Update indri-services-check

Files to modify:

mise-tasks/indri-services-check

Changes:

# Remote k8s check: no change needed, because it reads the kubeconfig,
# which now points to https://k8s.tail8d86e.ts.net
check_service "k8s-apiserver (remote)" "kubectl --kubeconfig=$HOME/.kube/minikube-indri/config.yml --context=minikube-indri get --raw /healthz"

Step 0.14 Verification

# 1. Service health check
mise run indri-services-check
# All services should be OK

# 2. Test k8s access via Tailscale hostname
curl -k https://k8s.tail8d86e.ts.net/healthz
# Expected: ok (-k skips server certificate verification; a 401/403 means anonymous requests are rejected, which still proves the path works)

# 3. kubectl via Tailscale
ki get nodes
ki get namespaces

# 4. k9s via Tailscale
k9i

Phase 0 Verification Checklist

Run after completing all steps:

# 1. Full service health check
mise run indri-services-check
# All services should show OK, including new ones

# 2. Registry functionality - pull-through cache
ssh indri 'podman pull localhost:5050/docker.io/library/alpine:latest'
curl -s https://registry.tail8d86e.ts.net/v2/_catalog
# Expected: {"repositories":["docker.io/library/alpine"]}

# 3. Registry functionality - private image push (from gilbert)
podman pull alpine:latest
podman tag alpine:latest registry.tail8d86e.ts.net/blumeops/test:v1
podman push registry.tail8d86e.ts.net/blumeops/test:v1
curl -s https://registry.tail8d86e.ts.net/v2/_catalog
# Expected: {"repositories":["blumeops/test","docker.io/library/alpine"]}

# 4. Kubernetes cluster
ssh indri 'minikube status'
ssh indri 'kubectl get nodes'
kubectl --context=minikube-indri get nodes  # from gilbert

# 5. Metrics in Prometheus
curl -s "http://indri:9090/api/v1/query?query=zot_up"
# Expected: value = 1

# 6. Logs in Loki
# In Grafana Explore: {service="zot"}
# Should see zot log entries

# 7. k9s from gilbert
k9s --context=minikube-indri
# Should connect and show minikube cluster

Phase 0 Rollback

If something goes wrong:

# Stop and remove minikube
ssh indri 'minikube stop && minikube delete'

# Stop and remove zot
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.zot.plist'
ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.zot.plist'

# Remove podman machine
ssh indri 'podman machine stop && podman machine rm -f'

# Remove from tailscale serve
ssh indri 'tailscale serve --service svc:registry reset'
ssh indri 'tailscale serve --service svc:k8s reset'

# Remove tags from Pulumi (revert policy.hujson and __main__.py changes)
mise run tailnet-up

# Revert ansible playbook changes
git checkout ansible/playbooks/indri.yml
git checkout ansible/roles/tailscale_serve/defaults/main.yml
git checkout ansible/roles/alloy/defaults/main.yml

# Remove new roles
rm -rf ansible/roles/{zot,zot_metrics,podman,minikube}

# Remove zk cards
rm ~/code/personal/zk/{zot,minikube}.md

Phase 0 Follow-up: Grafana Dashboards

After Phase 0 is running and stable, create monitoring dashboards:

Zot Dashboard (ansible/roles/grafana/files/dashboards/zot.json):

Check what metrics zot exposes: ssh indri 'curl -s http://localhost:5050/metrics'
Review community dashboards for inspiration (copy permitted if license allows)
Create dashboard with available metrics (at minimum: zot_up)

Minikube Dashboard (ansible/roles/grafana/files/dashboards/minikube.json):

Deploy kube-state-metrics if needed for additional cluster metrics
Review what Prometheus can scrape from the cluster
Review community dashboards for inspiration (copy permitted if license allows)
Create dashboard with relevant panels (node usage, pod counts, etc.)

New Files Summary

| File | Purpose |
|------|---------|
| `ansible/roles/zot/` | Zot registry deployment |
| `ansible/roles/zot_metrics/` | Metrics collection for Zot |
| `ansible/roles/podman/` | Podman installation and setup |
| `ansible/roles/minikube/` | Minikube cluster setup |
| `bin/kubectl-credential-1password` | kubectl exec credential plugin backed by 1Password |
| `~/code/personal/zk/zot.md` | Zot management documentation |
| `~/code/personal/zk/minikube.md` | Minikube management documentation |

Modified Files Summary

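| File | Changes |
|------|---------|
| `pulumi/policy.hujson` | Add tag:registry and tag:k8s-api |
| `pulumi/__main__.py` | Add tag:registry and tag:k8s-api to indri's DeviceTags |
| `ansible/playbooks/indri.yml` | Add new roles |
| `ansible/roles/tailscale_serve/defaults/main.yml` | Add svc:registry and svc:k8s |
| `ansible/roles/alloy/defaults/main.yml` | Add zot log collection |
| `mise-tasks/indri-services-check` | Add zot and k8s checks |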
