From 41dfae1f80eb1be7d0b40605e51756cf6e20e773 Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Tue, 10 Feb 2026 07:24:42 -0800
Subject: [PATCH] Add CNI conflict troubleshooting to restart-indri how-to
 (#139)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

## Summary
- Documents a troubleshooting procedure for broken pod networking after unclean shutdown
- During minikube recovery, a stale `1-k8s.conflist` CNI config can override kindnet's `10-kindnet.conflist`, causing new pods to use bridge+firewall networking instead of kindnet's ptp — breaking pod-to-pod communication
- Covers symptoms (DNS failures, liveness probe timeouts), diagnosis steps, and the fix
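
The precedence described above can be checked without touching the cluster. This is a minimal sketch, assuming the container runtime picks the first config in byte-wise lexicographic order; the file names are the ones mentioned in this PR, and the `first` variable is only for illustration.

```shell
# Illustration only: these are plain strings, not files on the minikube node.
# LC_ALL=C forces byte-wise ordering, where '-' (0x2D) sorts before '0' (0x30),
# so "1-k8s.conflist" comes ahead of "10-kindnet.conflist".
first=$(printf '%s\n' 10-kindnet.conflist 1-k8s.conflist 200-loopback.conf \
  | LC_ALL=C sort | head -n 1)
echo "$first"
```

Under that assumption, the name printed is the config that new pods would be wired up with.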

## Context
Encountered this during the 2026-02-10 power outage. Immich, kiwix, and transmission were all crash-looping for ~8 hours due to the CNI conflict. The minikube Ansible role's clean boot detection has been improved (#137), so this may not recur, but the troubleshooting guide is valuable if it does.

## Test plan
- [x] Documentation only — no code changes
- [x] Pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/139
---
 .../doc-cni-conflict-troubleshooting.doc.md   |  1 +
 docs/how-to/restart-indri.md                  | 60 +++++++++++++++++++
 2 files changed, 61 insertions(+)
 create mode 100644 docs/changelog.d/doc-cni-conflict-troubleshooting.doc.md

diff --git a/docs/changelog.d/doc-cni-conflict-troubleshooting.doc.md b/docs/changelog.d/doc-cni-conflict-troubleshooting.doc.md
new file mode 100644
index 0000000..cee815a
--- /dev/null
+++ b/docs/changelog.d/doc-cni-conflict-troubleshooting.doc.md
@@ -0,0 +1 @@
+Add troubleshooting guide for CNI conflict after unclean shutdown to restart-indri how-to.
diff --git a/docs/how-to/restart-indri.md b/docs/how-to/restart-indri.md
index 881b135..6e9f522 100644
--- a/docs/how-to/restart-indri.md
+++ b/docs/how-to/restart-indri.md
@@ -122,6 +122,66 @@ mise run services-check
 
 All checks should pass. If any fail, see [[troubleshooting]].
 
+## Troubleshooting: CNI Conflict After Unclean Shutdown
+
+After a power loss or unclean reboot, minikube may come up with broken pod networking. The symptom is that **new pods cannot reach CoreDNS** — services crash-loop with DNS errors (`EAI_AGAIN`, `connection timed out; no servers could be reached`) or fail liveness probes because their event loops hang on blocked network calls.
+
+Existing pods whose containers were restarted in place (not recreated) may appear healthy because they keep the network namespace that was set up when the pod was first created.
+
+### Cause
+
+During minikube recovery from a bad state, the CRI-O / Docker networking bootstrap can regenerate a default CNI config file (`1-k8s.conflist`) that conflicts with kindnet's config (`10-kindnet.conflist`). Config files are picked up in lexicographic order, and `1-k8s.conflist` sorts before `10-kindnet.conflist` (`-` precedes `0`), so the stale bridge+firewall config takes precedence and new pods get attached to a different network topology than existing pods.
+
+### Diagnosis
+
+**1. Check if new pods can resolve DNS:**
+
+```bash
+kubectl --context=minikube-indri run dns-test --image=alpine:3.21 --restart=Never \
+  --command -- sh -c 'nslookup kubernetes.default.svc.cluster.local'
+sleep 10
+kubectl --context=minikube-indri logs dns-test
+kubectl --context=minikube-indri delete pod dns-test
+```
+
+If this shows `connection timed out; no servers could be reached`, pod networking is broken.
+
+**2. Check for conflicting CNI configs:**
+
+```bash
+ssh indri 'minikube ssh "ls -la /etc/cni/net.d/"'
+```
+
+The **only** active network config should be `10-kindnet.conflist` (alongside `200-loopback.conf` and any files disabled with a `.mk_disabled` suffix). If `1-k8s.conflist` or any other active config exists alongside `10-kindnet.conflist`, that's the conflict.
+
+**3. Confirm the conflict by inspecting the stale config:**
+
+```bash
+ssh indri 'minikube ssh "cat /etc/cni/net.d/1-k8s.conflist"'
+```
+
+If it uses a `bridge` plugin with a `firewall` plugin (instead of kindnet's `ptp` plugin), it's the culprit.
+
+### Fix
+
+**1. Remove the stale CNI config:**
+
+```bash
+ssh indri 'minikube ssh "sudo rm /etc/cni/net.d/1-k8s.conflist"'
+```
+
+**2. Recreate all pods that were created while the bad config was active.** The simplest approach is to restart all Deployments:
+
+```bash
+kubectl --context=minikube-indri get deployments -A --no-headers | \
+  awk '{print "-n " $1 " " $2}' | \
+  xargs -L1 kubectl --context=minikube-indri rollout restart deployment
+```
+
+StatefulSets managed by operators (CNPG, Tailscale) generally survive because the kubelet restarts their containers in place rather than creating new pods, so the command above does not touch them.
+
+**3. Verify with the DNS test above**, then run `mise run services-check`.
+
 ## Related
 
 - [[indri]] - Server specifications