From 947e4310c306c36e1096f98f5431cf910554d823 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Wed, 13 May 2026 16:46:17 -0700
Subject: [PATCH 01/52] C2: migrate immich from minikube to ringtail (mikado
chain) (#356)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
C2 Mikado chain to move the entire Immich stack (server, ML, valkey,
postgres) off `minikube-indri` and onto `k3s-ringtail`. Immich is the
largest single tenant on minikube (~1.5 GiB resident) and minikube is
currently memory-saturated (97% RAM, swapping). This is the first
concrete chain in the broader indri-k8s decommission effort.
This PR contains the planning layer only — 7 cards (1 goal + 6
prerequisites). Implementation cycles follow per the Mikado Branch
Invariant.
## Goal end-state
- Immich `server`, `machine-learning`, `valkey` on ringtail.
- ML pod uses ringtail's RTX 4080 (performance win — currently
CPU-only).
- CNPG `immich-pg` (PG17 + VectorChord) runs on ringtail.
- Library still on sifaka NFS — ringtail mounts the same path.
- `photos.ops.eblu.me` reroutes through Caddy → ringtail ingress.
- Minikube `immich` and `immich-pg` are removed.
## Cards
| Card | Depends on |
|---|---|
| `migrate-immich-to-ringtail` (goal) | all six below |
| `cnpg-on-ringtail` | — |
| `immich-pg-on-ringtail` | cnpg-on-ringtail |
| `immich-pg-data-migration` | immich-pg-on-ringtail |
| `sifaka-nfs-from-ringtail` | — |
| `immich-app-on-ringtail` | immich-pg-on-ringtail, sifaka-nfs-from-ringtail |
| `immich-cutover-and-decommission` | immich-pg-data-migration, immich-app-on-ringtail |
## Key constraints
- **No data loss.** Downtime is acceptable; data loss is not. Two
surfaces matter: postgres (ML embeddings, face data — slow to
re-derive) and the library files (these stay in place, but NFS access
from ringtail must be verified).
- **Migration method:** Option A is a CNPG `externalCluster`
basebackup → promote. Option B is `pg_dump`/`pg_restore` as a
documented fallback. Either way, dry-run against a scratch
cluster first.
- **Why pg moves too** (not cross-cluster): keeping pg on minikube
would block the whole decommission, and Immich is chatty with pg
so tailnet round-trips would hurt.
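A minimal sketch of what Option A could look like as a CNPG `Cluster`
on ringtail during the replica phase. This is an illustration, not the
manifest used: the external cluster name `immich-pg-minikube`, the
`streaming_replica` user, and the `immich-pg-minikube-replication`
Secret are hypothetical, and reaching the source at its tailnet
hostname is an assumption. Image, storage, and preload settings mirror
the `immich-pg` manifest.

```yaml
# Sketch only: replica-phase manifest for Option A (assumptions marked).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: immich-pg
  namespace: databases
spec:
  instances: 1
  imageName: ghcr.io/tensorchord/cloudnative-vectorchord:17-0.5.0
  storage:
    size: 10Gi
    storageClass: local-path
  postgresql:
    shared_preload_libraries:
      - "vchord.so"
  bootstrap:
    pg_basebackup:
      # Hypothetical name; must match an entry in externalClusters.
      source: immich-pg-minikube
  replica:
    # Keep streaming from the source until cutover.
    enabled: true
    source: immich-pg-minikube
  externalClusters:
    - name: immich-pg-minikube
      connectionParameters:
        # Assumption: the source is reachable at its tailnet hostname.
        host: immich-pg.tail8d86e.ts.net
        # Assumption: a role with REPLICATION privileges on the source.
        user: streaming_replica
        dbname: postgres
        sslmode: require
      password:
        # Hypothetical Secret holding the replication password.
        name: immich-pg-minikube-replication
        key: password
```

Promotion would then amount to setting `replica.enabled: false` once
the replica has caught up, after which the `bootstrap`, `replica`, and
`externalClusters` blocks can be pruned.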
## Test plan
- [ ] Plan review — does the dependency graph make sense?
- [ ] `mise run docs-mikado migrate-immich-to-ringtail` shows the
chain correctly.
- [ ] Per-card implementation cycles land separately (commit
convention enforced by hook).
Reviewed-on: https://forge.eblu.me/eblume/blumeops/pulls/356
---
argocd/apps/cloudnative-pg-ringtail.yaml | 27 ++++
argocd/apps/databases-ringtail.yaml | 26 ++++
argocd/apps/immich-ringtail.yaml | 31 ++++
argocd/apps/immich.yaml | 30 ----
.../external-secret-immich-borgmatic.yaml | 15 +-
.../databases-ringtail/immich-pg.yaml | 53 +++++++
.../databases-ringtail/kustomization.yaml | 9 ++
.../service-immich-pg-tailscale.yaml | 8 +-
argocd/manifests/databases/immich-pg.yaml | 69 ---------
argocd/manifests/databases/kustomization.yaml | 3 -
.../deployment-ml.yaml | 6 +
.../deployment-server.yaml | 0
.../deployment-valkey.yaml | 0
.../ingress-tailscale.yaml | 15 +-
.../kustomization.yaml | 13 +-
argocd/manifests/immich-ringtail/pv-nfs.yaml | 29 ++++
.../pvc-ml-cache.yaml | 0
.../{immich => immich-ringtail}/pvc.yaml | 6 +-
.../service-ml.yaml | 0
.../service-valkey.yaml | 0
.../{immich => immich-ringtail}/service.yaml | 0
argocd/manifests/immich/README.md | 115 ---------------
argocd/manifests/immich/pv-nfs.yaml | 22 ---
.../time-slicing-config.yaml | 2 +-
.../migrate-immich-to-ringtail.infra.md | 13 ++
docs/how-to/immich/cnpg-on-ringtail.md | 52 +++++++
docs/how-to/immich/immich-app-on-ringtail.md | 91 ++++++++++++
.../immich/immich-cutover-and-decommission.md | 103 ++++++++++++++
.../how-to/immich/immich-pg-data-migration.md | 79 +++++++++++
docs/how-to/immich/immich-pg-on-ringtail.md | 69 +++++++++
.../immich/migrate-immich-to-ringtail.md | 132 ++++++++++++++++++
.../how-to/immich/sifaka-nfs-from-ringtail.md | 67 +++++++++
32 files changed, 820 insertions(+), 265 deletions(-)
create mode 100644 argocd/apps/cloudnative-pg-ringtail.yaml
create mode 100644 argocd/apps/databases-ringtail.yaml
create mode 100644 argocd/apps/immich-ringtail.yaml
delete mode 100644 argocd/apps/immich.yaml
rename argocd/manifests/{databases => databases-ringtail}/external-secret-immich-borgmatic.yaml (65%)
create mode 100644 argocd/manifests/databases-ringtail/immich-pg.yaml
create mode 100644 argocd/manifests/databases-ringtail/kustomization.yaml
rename argocd/manifests/{databases => databases-ringtail}/service-immich-pg-tailscale.yaml (57%)
delete mode 100644 argocd/manifests/databases/immich-pg.yaml
rename argocd/manifests/{immich => immich-ringtail}/deployment-ml.yaml (83%)
rename argocd/manifests/{immich => immich-ringtail}/deployment-server.yaml (100%)
rename argocd/manifests/{immich => immich-ringtail}/deployment-valkey.yaml (100%)
rename argocd/manifests/{immich => immich-ringtail}/ingress-tailscale.yaml (62%)
rename argocd/manifests/{immich => immich-ringtail}/kustomization.yaml (61%)
create mode 100644 argocd/manifests/immich-ringtail/pv-nfs.yaml
rename argocd/manifests/{immich => immich-ringtail}/pvc-ml-cache.yaml (100%)
rename argocd/manifests/{immich => immich-ringtail}/pvc.yaml (54%)
rename argocd/manifests/{immich => immich-ringtail}/service-ml.yaml (100%)
rename argocd/manifests/{immich => immich-ringtail}/service-valkey.yaml (100%)
rename argocd/manifests/{immich => immich-ringtail}/service.yaml (100%)
delete mode 100644 argocd/manifests/immich/README.md
delete mode 100644 argocd/manifests/immich/pv-nfs.yaml
create mode 100644 docs/changelog.d/migrate-immich-to-ringtail.infra.md
create mode 100644 docs/how-to/immich/cnpg-on-ringtail.md
create mode 100644 docs/how-to/immich/immich-app-on-ringtail.md
create mode 100644 docs/how-to/immich/immich-cutover-and-decommission.md
create mode 100644 docs/how-to/immich/immich-pg-data-migration.md
create mode 100644 docs/how-to/immich/immich-pg-on-ringtail.md
create mode 100644 docs/how-to/immich/migrate-immich-to-ringtail.md
create mode 100644 docs/how-to/immich/sifaka-nfs-from-ringtail.md
diff --git a/argocd/apps/cloudnative-pg-ringtail.yaml b/argocd/apps/cloudnative-pg-ringtail.yaml
new file mode 100644
index 0000000..fa7bba0
--- /dev/null
+++ b/argocd/apps/cloudnative-pg-ringtail.yaml
@@ -0,0 +1,27 @@
+# CloudNativePG Operator for ringtail k3s cluster
+# Deploys the operator only; PostgreSQL clusters are created separately
+#
+# Sibling of cloudnative-pg.yaml (minikube). Same mirror, same release,
+# different destination. Both apps will coexist during the immich
+# migration; the minikube one is removed at the end of the broader
+# indri-k8s decommission.
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+ name: cloudnative-pg-ringtail
+ namespace: argocd
+spec:
+ project: default
+ source:
+ repoURL: ssh://forgejo@forge.ops.eblu.me:2222/mirrors/cloudnative-pg.git
+ targetRevision: v1.27.1
+ path: releases
+ directory:
+ include: 'cnpg-1.27.1.yaml'
+ destination:
+ server: https://ringtail.tail8d86e.ts.net:6443
+ namespace: cnpg-system
+ syncPolicy:
+ syncOptions:
+ - CreateNamespace=true
+ - ServerSideApply=true # Required for large CRDs that exceed annotation size limit
diff --git a/argocd/apps/databases-ringtail.yaml b/argocd/apps/databases-ringtail.yaml
new file mode 100644
index 0000000..00de4e3
--- /dev/null
+++ b/argocd/apps/databases-ringtail.yaml
@@ -0,0 +1,26 @@
+# Databases on ringtail k3s.
+#
+# Today: only immich-pg (CNPG Cluster) + its borgmatic ExternalSecret.
+# More databases may move here as the indri-k8s decommission proceeds.
+#
+# Prerequisites:
+# - cloudnative-pg-ringtail (operator must exist before the Cluster CR)
+# - external-secrets-ringtail + 1password-connect-ringtail (for the
+# immich-pg-borgmatic ExternalSecret to sync)
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+ name: databases-ringtail
+ namespace: argocd
+spec:
+ project: default
+ source:
+ repoURL: ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git
+ targetRevision: main
+ path: argocd/manifests/databases-ringtail
+ destination:
+ server: https://ringtail.tail8d86e.ts.net:6443
+ namespace: databases
+ syncPolicy:
+ syncOptions:
+ - CreateNamespace=true
diff --git a/argocd/apps/immich-ringtail.yaml b/argocd/apps/immich-ringtail.yaml
new file mode 100644
index 0000000..c93cbee
--- /dev/null
+++ b/argocd/apps/immich-ringtail.yaml
@@ -0,0 +1,31 @@
+# Immich on ringtail k3s.
+#
+# Formerly the staging deployment; the minikube `immich` app was
+# removed at cutover. See [[immich-cutover-and-decommission]] for the
+# routing flip + minikube cleanup.
+#
+# Prerequisites:
+# - cloudnative-pg-ringtail + databases-ringtail (postgres)
+# - 1password-connect-ringtail + external-secrets-ringtail (not used
+# by this app today — immich-db Secret is created manually,
+# matching the minikube pattern)
+# - The immich-db Secret in the immich namespace, holding the
+# password for the `immich` postgres role (copied from the source
+# immich-pg-app Secret at migration time).
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+ name: immich-ringtail
+ namespace: argocd
+spec:
+ project: default
+ source:
+ repoURL: ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git
+ targetRevision: main
+ path: argocd/manifests/immich-ringtail
+ destination:
+ server: https://ringtail.tail8d86e.ts.net:6443
+ namespace: immich
+ syncPolicy:
+ syncOptions:
+ - CreateNamespace=true
diff --git a/argocd/apps/immich.yaml b/argocd/apps/immich.yaml
deleted file mode 100644
index 7efd263..0000000
--- a/argocd/apps/immich.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-# Immich - Self-hosted photo and video management
-# High-performance Google Photos/iCloud alternative with AI features
-#
-# Kustomize manifests in argocd/manifests/immich/
-# Components: server, machine-learning, valkey (Redis)
-#
-# Prerequisites:
-# 1. Create immich namespace and secrets:
-# kubectl create namespace immich
-# kubectl --context=minikube-indri create secret generic immich-db -n immich \
-# --from-literal=password="$(kubectl --context=minikube-indri -n databases get secret immich-pg-app -o jsonpath='{.data.password}' | base64 -d)"
-# 2. Create immich-pg database and user (see immich-pg app)
-# 3. NFS share on sifaka at /volume1/photos with read/write for indri
-apiVersion: argoproj.io/v1alpha1
-kind: Application
-metadata:
- name: immich
- namespace: argocd
-spec:
- project: default
- source:
- repoURL: ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git
- targetRevision: main
- path: argocd/manifests/immich
- destination:
- server: https://kubernetes.default.svc
- namespace: immich
- syncPolicy:
- syncOptions:
- - CreateNamespace=true
diff --git a/argocd/manifests/databases/external-secret-immich-borgmatic.yaml b/argocd/manifests/databases-ringtail/external-secret-immich-borgmatic.yaml
similarity index 65%
rename from argocd/manifests/databases/external-secret-immich-borgmatic.yaml
rename to argocd/manifests/databases-ringtail/external-secret-immich-borgmatic.yaml
index 8801c1a..3d1fc14 100644
--- a/argocd/manifests/databases/external-secret-immich-borgmatic.yaml
+++ b/argocd/manifests/databases-ringtail/external-secret-immich-borgmatic.yaml
@@ -1,9 +1,12 @@
# ExternalSecret for borgmatic backup user password on immich-pg cluster
+# (ringtail k3s).
+#
+# Moved here from argocd/manifests/databases/ (the minikube copy is removed).
+# The onepassword-blumeops ClusterSecretStore exists on ringtail via the
+# external-secrets-ringtail app.
#
-# Reuses the same 1Password item as blumeops-pg-borgmatic.
# 1Password item: "borgmatic" in blumeops vault
# Field: "db-password"
-#
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
@@ -23,7 +26,7 @@ spec:
username: borgmatic
password: "{{ .password }}"
data:
- - secretKey: password
- remoteRef:
- key: borgmatic
- property: db-password
+ - secretKey: password
+ remoteRef:
+ key: borgmatic
+ property: db-password
diff --git a/argocd/manifests/databases-ringtail/immich-pg.yaml b/argocd/manifests/databases-ringtail/immich-pg.yaml
new file mode 100644
index 0000000..982bc43
--- /dev/null
+++ b/argocd/manifests/databases-ringtail/immich-pg.yaml
@@ -0,0 +1,53 @@
+# PostgreSQL Cluster for Immich on ringtail k3s.
+#
+# Initially bootstrapped via CNPG pg_basebackup from the minikube
+# immich-pg cluster on 2026-05-13, then promoted to primary. The
+# externalClusters + bootstrap.pg_basebackup blocks have been pruned
+# from this manifest now that the migration is complete — leaving
+# them around is a footgun (re-enabling replica.enabled=true would
+# try to demote this cluster against a stale source). See
+# [[immich-pg-data-migration]] for the procedure used.
+apiVersion: postgresql.cnpg.io/v1
+kind: Cluster
+metadata:
+ name: immich-pg
+ namespace: databases
+spec:
+ instances: 1
+ imageName: ghcr.io/tensorchord/cloudnative-vectorchord:17-0.5.0
+
+ storage:
+ size: 10Gi
+ storageClass: local-path
+
+ # Managed roles
+ managed:
+ roles:
+ - name: borgmatic
+ login: true
+ connectionLimit: -1
+ ensure: present
+ inherit: true
+ inRoles:
+ - pg_read_all_data
+ passwordSecret:
+ name: immich-pg-borgmatic
+
+ resources:
+ requests:
+ memory: "256Mi"
+ cpu: "100m"
+ limits:
+ memory: "1Gi"
+ cpu: "500m"
+
+ postgresql:
+ shared_preload_libraries:
+ - "vchord.so"
+ parameters:
+ max_connections: "50"
+ shared_buffers: "128MB"
+ password_encryption: "scram-sha-256"
+ pg_hba:
+ - host all all 0.0.0.0/0 scram-sha-256
+ - host all all ::/0 scram-sha-256
diff --git a/argocd/manifests/databases-ringtail/kustomization.yaml b/argocd/manifests/databases-ringtail/kustomization.yaml
new file mode 100644
index 0000000..971e2d4
--- /dev/null
+++ b/argocd/manifests/databases-ringtail/kustomization.yaml
@@ -0,0 +1,9 @@
+apiVersion: kustomize.config.k8s.io/v1beta1
+kind: Kustomization
+
+namespace: databases
+
+resources:
+ - immich-pg.yaml
+ - external-secret-immich-borgmatic.yaml
+ - service-immich-pg-tailscale.yaml
diff --git a/argocd/manifests/databases/service-immich-pg-tailscale.yaml b/argocd/manifests/databases-ringtail/service-immich-pg-tailscale.yaml
similarity index 57%
rename from argocd/manifests/databases/service-immich-pg-tailscale.yaml
rename to argocd/manifests/databases-ringtail/service-immich-pg-tailscale.yaml
index 78891dd..92deb14 100644
--- a/argocd/manifests/databases/service-immich-pg-tailscale.yaml
+++ b/argocd/manifests/databases-ringtail/service-immich-pg-tailscale.yaml
@@ -1,6 +1,8 @@
-# Tailscale LoadBalancer for immich-pg PostgreSQL access
-# Canonical hostname: immich-pg.tail8d86e.ts.net
-# Caddy L4 proxies pg.ops.eblu.me:5433 → this service for borgmatic backups
+# Tailscale LoadBalancer for immich-pg PostgreSQL access on ringtail.
+# Canonical hostname: immich-pg.tail8d86e.ts.net (claimed from the
+# minikube side after the minikube service was removed during the
+# immich-to-ringtail migration). Borgmatic on indri uses this
+# hostname for nightly backups.
apiVersion: v1
kind: Service
metadata:
diff --git a/argocd/manifests/databases/immich-pg.yaml b/argocd/manifests/databases/immich-pg.yaml
deleted file mode 100644
index 74c6f4e..0000000
--- a/argocd/manifests/databases/immich-pg.yaml
+++ /dev/null
@@ -1,69 +0,0 @@
-# PostgreSQL Cluster for Immich
-# Uses VectorChord (successor to pgvecto.rs) for AI-powered vector search
-# See: https://github.com/immich-app/immich/discussions/9060
-# Managed by CloudNativePG operator
-apiVersion: postgresql.cnpg.io/v1
-kind: Cluster
-metadata:
- name: immich-pg
- namespace: databases
-spec:
- instances: 1
- # VectorChord image for PostgreSQL 17 with VectorChord 0.5.0
- # Immich v2.4.1 requires VectorChord >=0.3 <0.6
- # See: https://github.com/tensorchord/VectorChord
- imageName: ghcr.io/tensorchord/cloudnative-vectorchord:17-0.5.0
-
- storage:
- size: 10Gi
- storageClass: standard
-
- # Bootstrap creates initial database and owner
- bootstrap:
- initdb:
- database: immich
- owner: immich
- postInitSQL:
- # Extensions required by Immich
- - CREATE EXTENSION IF NOT EXISTS vector;
- - CREATE EXTENSION IF NOT EXISTS vchord CASCADE;
- - CREATE EXTENSION IF NOT EXISTS cube CASCADE;
- - CREATE EXTENSION IF NOT EXISTS earthdistance CASCADE;
-
- # Managed roles
- # Note: connectionLimit, ensure, inherit are CNPG defaults added to prevent ArgoCD drift
- managed:
- roles:
- # borgmatic read-only user for backups
- - name: borgmatic
- login: true
- connectionLimit: -1
- ensure: present
- inherit: true
- inRoles:
- - pg_read_all_data
- passwordSecret:
- name: immich-pg-borgmatic
-
- # Resource limits for minikube environment
- resources:
- requests:
- memory: "256Mi"
- cpu: "100m"
- limits:
- memory: "1Gi"
- cpu: "500m"
-
- # PostgreSQL configuration
- postgresql:
- # VectorChord requires vchord.so in shared_preload_libraries
- shared_preload_libraries:
- - "vchord.so"
- parameters:
- max_connections: "50"
- shared_buffers: "128MB"
- password_encryption: "scram-sha-256"
- pg_hba:
- # Allow connections from k8s pods
- - host all all 0.0.0.0/0 scram-sha-256
- - host all all ::/0 scram-sha-256
diff --git a/argocd/manifests/databases/kustomization.yaml b/argocd/manifests/databases/kustomization.yaml
index b25e09e..692285a 100644
--- a/argocd/manifests/databases/kustomization.yaml
+++ b/argocd/manifests/databases/kustomization.yaml
@@ -5,13 +5,10 @@ namespace: databases
resources:
- blumeops-pg.yaml
- - immich-pg.yaml
- service-tailscale.yaml
- - service-immich-pg-tailscale.yaml
- service-metrics-tailscale.yaml
- external-secret-eblume.yaml
- external-secret-borgmatic.yaml
- - external-secret-immich-borgmatic.yaml
- external-secret-teslamate.yaml
- external-secret-authentik.yaml
- external-secret-paperless.yaml
diff --git a/argocd/manifests/immich/deployment-ml.yaml b/argocd/manifests/immich-ringtail/deployment-ml.yaml
similarity index 83%
rename from argocd/manifests/immich/deployment-ml.yaml
rename to argocd/manifests/immich-ringtail/deployment-ml.yaml
index 57c4242..5ea8035 100644
--- a/argocd/manifests/immich/deployment-ml.yaml
+++ b/argocd/manifests/immich-ringtail/deployment-ml.yaml
@@ -16,11 +16,16 @@ spec:
app: immich
component: machine-learning
spec:
+ runtimeClassName: nvidia
securityContext:
seccompProfile:
type: RuntimeDefault
containers:
- name: machine-learning
+ # ringtail uses the -cuda tag (set in kustomization.yaml)
+ # to take advantage of the RTX 4080 via the nvidia
+ # device plugin. Time-slicing is configured for 4 replicas
+ # so frigate + ollama + this pod can share.
image: ghcr.io/immich-app/immich-machine-learning:kustomized
ports:
- name: http
@@ -57,6 +62,7 @@ spec:
cpu: "100m"
limits:
memory: "4Gi"
+ nvidia.com/gpu: "1"
volumes:
- name: cache
persistentVolumeClaim:
diff --git a/argocd/manifests/immich/deployment-server.yaml b/argocd/manifests/immich-ringtail/deployment-server.yaml
similarity index 100%
rename from argocd/manifests/immich/deployment-server.yaml
rename to argocd/manifests/immich-ringtail/deployment-server.yaml
diff --git a/argocd/manifests/immich/deployment-valkey.yaml b/argocd/manifests/immich-ringtail/deployment-valkey.yaml
similarity index 100%
rename from argocd/manifests/immich/deployment-valkey.yaml
rename to argocd/manifests/immich-ringtail/deployment-valkey.yaml
diff --git a/argocd/manifests/immich/ingress-tailscale.yaml b/argocd/manifests/immich-ringtail/ingress-tailscale.yaml
similarity index 62%
rename from argocd/manifests/immich/ingress-tailscale.yaml
rename to argocd/manifests/immich-ringtail/ingress-tailscale.yaml
index 59a4c05..f0b5fe1 100644
--- a/argocd/manifests/immich/ingress-tailscale.yaml
+++ b/argocd/manifests/immich-ringtail/ingress-tailscale.yaml
@@ -1,6 +1,9 @@
-# Tailscale Ingress for Immich
-# Exposes Immich at photos.tail8d86e.ts.net
-# Caddy will proxy photos.ops.eblu.me to this endpoint
+# Tailscale ProxyGroup Ingress for Immich on ringtail.
+#
+# Production hostname: photos.tail8d86e.ts.net
+# (during the cutover window this was photos-ringtail; the minikube
+# ingress was torn down before this was renamed to photos to avoid
+# the Tailscale device-name collision.)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
@@ -16,12 +19,6 @@ metadata:
gethomepage.dev/description: "Photo management"
gethomepage.dev/href: "https://photos.ops.eblu.me"
gethomepage.dev/pod-selector: "app=immich,component=server"
- # TODO: Add Immich widget - requires API key from Account Settings > API Keys
- # See: https://gethomepage.dev/widgets/services/immich/
- # gethomepage.dev/widget.type: "immich"
- # gethomepage.dev/widget.url: "https://photos.ops.eblu.me"
- # gethomepage.dev/widget.key: "{{HOMEPAGE_VAR_IMMICH_API_KEY}}"
- # gethomepage.dev/widget.version: "2"
spec:
ingressClassName: tailscale
rules:
diff --git a/argocd/manifests/immich/kustomization.yaml b/argocd/manifests/immich-ringtail/kustomization.yaml
similarity index 61%
rename from argocd/manifests/immich/kustomization.yaml
rename to argocd/manifests/immich-ringtail/kustomization.yaml
index 5f8d02b..c1f639e 100644
--- a/argocd/manifests/immich/kustomization.yaml
+++ b/argocd/manifests/immich-ringtail/kustomization.yaml
@@ -1,7 +1,8 @@
----
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
+
namespace: immich
+
resources:
- deployment-server.yaml
- deployment-ml.yaml
@@ -13,11 +14,15 @@ resources:
- pv-nfs.yaml
- pvc.yaml
- ingress-tailscale.yaml
+
images:
- name: ghcr.io/immich-app/immich-server
newTag: v2.6.3
- name: ghcr.io/immich-app/immich-machine-learning
- newTag: v2.6.3
+ # CUDA variant of the same release — ringtail has an RTX 4080
+ newTag: v2.6.3-cuda
+ # Using upstream multi-arch valkey image directly; the
+ # registry.ops.eblu.me/blumeops/valkey mirror is arm64-only (built
+ # on indri) and would crashloop on ringtail.
- name: docker.io/valkey/valkey
- newName: registry.ops.eblu.me/blumeops/valkey
- newTag: v8.1.6-r0-fabca04
+ newTag: "8.1.6"
diff --git a/argocd/manifests/immich-ringtail/pv-nfs.yaml b/argocd/manifests/immich-ringtail/pv-nfs.yaml
new file mode 100644
index 0000000..3d5a682
--- /dev/null
+++ b/argocd/manifests/immich-ringtail/pv-nfs.yaml
@@ -0,0 +1,29 @@
+# NFS PersistentVolume for Immich photo library on ringtail k3s.
+#
+# Replaces argocd/manifests/immich/pv-nfs.yaml (minikube), but with
+# a distinct name (minikube and ringtail are separate clusters, so PV
+# names don't collide cluster-side, but using the same name in two
+# manifests is confusing).
+#
+# The sifaka NFS export for /volume1/photos already permits
+# 192.168.1.0/24 + 100.64.0.0/10. Ringtail's wired IP (192.168.1.21)
+# falls in the first CIDR, so no DSM rule changes are needed.
+#
+# Verified 2026-05-13: ringtail pod can read existing dirs, write
+# new files, and delete them. DNS resolves sifaka to 192.168.1.203
+# (LAN), so NFS traffic stays off the tailnet — avoids the known
+# sifaka-tailscale-userspace bite.
+apiVersion: v1
+kind: PersistentVolume
+metadata:
+ name: immich-library-nfs-pv-ringtail
+spec:
+ capacity:
+ storage: 2Ti
+ accessModes:
+ - ReadWriteMany
+ persistentVolumeReclaimPolicy: Retain
+ storageClassName: ""
+ nfs:
+ server: sifaka
+ path: /volume1/photos
diff --git a/argocd/manifests/immich/pvc-ml-cache.yaml b/argocd/manifests/immich-ringtail/pvc-ml-cache.yaml
similarity index 100%
rename from argocd/manifests/immich/pvc-ml-cache.yaml
rename to argocd/manifests/immich-ringtail/pvc-ml-cache.yaml
diff --git a/argocd/manifests/immich/pvc.yaml b/argocd/manifests/immich-ringtail/pvc.yaml
similarity index 54%
rename from argocd/manifests/immich/pvc.yaml
rename to argocd/manifests/immich-ringtail/pvc.yaml
index c764636..5bfc052 100644
--- a/argocd/manifests/immich/pvc.yaml
+++ b/argocd/manifests/immich-ringtail/pvc.yaml
@@ -1,5 +1,5 @@
-# PersistentVolumeClaim for Immich photo library
-# Binds to the NFS PV for sifaka:/volume1/photos
+# PersistentVolumeClaim for Immich photo library on ringtail.
+# Binds to immich-library-nfs-pv-ringtail (sifaka:/volume1/photos).
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
@@ -9,7 +9,7 @@ spec:
accessModes:
- ReadWriteMany
storageClassName: ""
- volumeName: immich-library-nfs-pv
+ volumeName: immich-library-nfs-pv-ringtail
resources:
requests:
storage: 2Ti
diff --git a/argocd/manifests/immich/service-ml.yaml b/argocd/manifests/immich-ringtail/service-ml.yaml
similarity index 100%
rename from argocd/manifests/immich/service-ml.yaml
rename to argocd/manifests/immich-ringtail/service-ml.yaml
diff --git a/argocd/manifests/immich/service-valkey.yaml b/argocd/manifests/immich-ringtail/service-valkey.yaml
similarity index 100%
rename from argocd/manifests/immich/service-valkey.yaml
rename to argocd/manifests/immich-ringtail/service-valkey.yaml
diff --git a/argocd/manifests/immich/service.yaml b/argocd/manifests/immich-ringtail/service.yaml
similarity index 100%
rename from argocd/manifests/immich/service.yaml
rename to argocd/manifests/immich-ringtail/service.yaml
diff --git a/argocd/manifests/immich/README.md b/argocd/manifests/immich/README.md
deleted file mode 100644
index a82a856..0000000
--- a/argocd/manifests/immich/README.md
+++ /dev/null
@@ -1,115 +0,0 @@
-# Immich
-
-Self-hosted photo and video management solution with AI-powered search and face recognition.
-
-## Prerequisites
-
-1. **NFS Share**: Create `/volume1/photos` on sifaka with NFS permissions for indri
-2. **PostgreSQL**: The `immich-pg` cluster (with pgvecto.rs) must be healthy
-3. **Secrets**: Create the database password secret
-
-## Deployment Order
-
-1. Sync `blumeops-pg` (to get CloudNativePG operator if not already running)
-2. Wait for `immich-pg` cluster to be healthy
-3. Create secrets (see below)
-4. Sync `immich` (deploys all resources: storage, services, deployments)
-5. Run `mise run provision-indri -- --tags caddy` to update Caddy config
-
-## Components
-
-| Component | Deployment | Service | Port |
-|-----------|------------|---------|------|
-| Server (web/API) | `immich-server` | `immich-server` | 2283 |
-| Machine Learning | `immich-machine-learning` | `immich-machine-learning` | 3003 |
-| Valkey (Redis) | `immich-valkey` | `immich-valkey` | 6379 |
-
-## Secret Setup
-
-The `immich-db` secret contains the database password, which is auto-generated by CloudNativePG
-in the `immich-pg-app` secret. To create or regenerate the secret:
-
-```bash
-# Create namespace if needed
-kubectl --context=minikube-indri create namespace immich
-
-# Copy password from CNPG secret to immich namespace
-kubectl --context=minikube-indri create secret generic immich-db -n immich \
- --from-literal=password="$(kubectl --context=minikube-indri -n databases get secret immich-pg-app -o jsonpath='{.data.password}' | base64 -d)"
-```
-
-Note: This secret is not managed by ExternalSecrets since the source of truth is the CNPG-generated secret.
-
-## Access
-
-- **URL**: https://photos.ops.eblu.me (after Caddy is updated)
-- **Tailscale**: https://photos.tail8d86e.ts.net (direct)
-
-## First-Time Setup
-
-1. Navigate to https://photos.ops.eblu.me
-2. Create an admin account
-3. Configure external library (optional - for importing existing photos)
-
-## External Library (iCloud Photos)
-
-To import existing photos from iCloud sync on indri:
-
-1. In Immich Admin > External Libraries, create a new library
-2. Set the import path to the location where iCloud photos sync
-3. Configure scan schedule or trigger manual scan
-
-## Architecture
-
-```
-┌─────────────────┐ ┌─────────────────┐
-│ immich-server │────▶│ immich-pg │
-│ (web/api) │ │ (PostgreSQL │
-└────────┬────────┘ │ + pgvecto.rs) │
- │ └─────────────────┘
- │
-┌────────▼────────┐ ┌─────────────────┐
-│ immich-ml │ │ valkey │
-│ (ML inference) │ │ (Redis cache) │
-└─────────────────┘ └─────────────────┘
- │
-┌────────▼────────┐
-│ sifaka NFS │
-│ /volume1/photos│
-└─────────────────┘
-```
-
-## Version Management
-
-Image versions are controlled via `kustomization.yaml`:
-
-```yaml
-images:
- - name: ghcr.io/immich-app/immich-server
- newTag: v2.6.3
- - name: ghcr.io/immich-app/immich-machine-learning
- newTag: v2.6.3
- - name: docker.io/valkey/valkey
- newTag: "8.1-alpine"
-```
-
-To upgrade, update `newTag` values and sync via ArgoCD.
-
-## Troubleshooting
-
-```bash
-# Check pods
-kubectl --context=minikube-indri -n immich get pods
-
-# Check immich-pg cluster
-kubectl --context=minikube-indri -n databases get cluster immich-pg
-
-# View server logs
-kubectl --context=minikube-indri -n immich logs -l app=immich,component=server
-
-# View ML logs
-kubectl --context=minikube-indri -n immich logs -l app=immich,component=machine-learning
-
-# Check PVC binding
-kubectl --context=minikube-indri -n immich get pvc
-```
diff --git a/argocd/manifests/immich/pv-nfs.yaml b/argocd/manifests/immich/pv-nfs.yaml
deleted file mode 100644
index 0bd6ee2..0000000
--- a/argocd/manifests/immich/pv-nfs.yaml
+++ /dev/null
@@ -1,22 +0,0 @@
-# NFS PersistentVolume for Immich photo library
-# Requires: NFS share on sifaka at /volume1/photos with NFS permissions for indri
-#
-# To create on Synology:
-# 1. Control Panel > Shared Folder > Create
-# 2. Name: photos, Location: Volume 1
-# 3. Control Panel > File Services > NFS > NFS Rules
-# 4. Add rule for "photos" share: Hostname=indri, Privilege=Read/Write, Squash=No mapping
-apiVersion: v1
-kind: PersistentVolume
-metadata:
- name: immich-library-nfs-pv
-spec:
- capacity:
- storage: 2Ti
- accessModes:
- - ReadWriteMany
- persistentVolumeReclaimPolicy: Retain
- storageClassName: ""
- nfs:
- server: sifaka
- path: /volume1/photos
diff --git a/argocd/manifests/nvidia-device-plugin/time-slicing-config.yaml b/argocd/manifests/nvidia-device-plugin/time-slicing-config.yaml
index dee2fd7..100e7a9 100644
--- a/argocd/manifests/nvidia-device-plugin/time-slicing-config.yaml
+++ b/argocd/manifests/nvidia-device-plugin/time-slicing-config.yaml
@@ -11,4 +11,4 @@ data:
timeSlicing:
resources:
- name: nvidia.com/gpu
- replicas: 2
+ replicas: 4
diff --git a/docs/changelog.d/migrate-immich-to-ringtail.infra.md b/docs/changelog.d/migrate-immich-to-ringtail.infra.md
new file mode 100644
index 0000000..b47742f
--- /dev/null
+++ b/docs/changelog.d/migrate-immich-to-ringtail.infra.md
@@ -0,0 +1,13 @@
+Move the entire Immich stack — server, machine-learning, valkey,
+and the PostgreSQL+VectorChord cluster — off `minikube-indri` and
+onto `k3s-ringtail`. Postgres data migrated zero-loss via CNPG
+`pg_basebackup` (replica catch-up then promote); row counts on
+`asset`, `user`, `album`, `smart_search`, `activity`, `asset_face`
+verified equal between source and replica before cutover. The ML
+pod now uses ringtail's RTX 4080 via the nvidia-device-plugin
+(time-slicing bumped 2 → 4 to share with frigate + ollama). Caddy
+routing at `photos.ops.eblu.me` is unchanged (still
+`photos.tail8d86e.ts.net`, the device just lives on ringtail now).
+Borgmatic backups continue against the same `immich-pg` tailnet
+hostname. First concrete chain in the broader indri-k8s
+decommission effort.
diff --git a/docs/how-to/immich/cnpg-on-ringtail.md b/docs/how-to/immich/cnpg-on-ringtail.md
new file mode 100644
index 0000000..153e674
--- /dev/null
+++ b/docs/how-to/immich/cnpg-on-ringtail.md
@@ -0,0 +1,52 @@
+---
+title: CNPG Operator on Ringtail
+modified: 2026-05-13
+last-reviewed: 2026-05-13
+tags:
+ - how-to
+ - operations
+ - postgres
+ - ringtail
+---
+
+# CNPG Operator on Ringtail
+
+Bring up the `cloudnative-pg` operator on `k3s-ringtail`. Today the
+operator only exists on `minikube-indri` (see
+`argocd/apps/cloudnative-pg.yaml`, destination `kubernetes.default.svc`).
+
+Prerequisite of [[migrate-immich-to-ringtail]]; consumed by
+[[immich-pg-on-ringtail]].
+
+## What to do
+
+- Add a sibling `argocd/apps/cloudnative-pg-ringtail.yaml` pointing
+ at the same mirror (`mirrors/cloudnative-pg`, tag `v1.27.1`),
+ destination `https://ringtail.tail8d86e.ts.net:6443`,
+ namespace `cnpg-system`.
+- Mirror the `ServerSideApply=true` and `CreateNamespace=true` sync
+ options (the CRDs exceed the annotation size limit).
+- Sync `apps` then `cloudnative-pg-ringtail`. Verify the operator
+ pod is running on ringtail.
+
+## Verification
+
+```fish
+kubectl --context=k3s-ringtail -n cnpg-system get pods
+kubectl --context=k3s-ringtail get crd clusters.postgresql.cnpg.io
+```
+
+## Why a separate app
+
+Each ArgoCD app targets a single cluster via `destination.server`.
+We could parameterize with ApplicationSets, but blumeops' convention
+is to duplicate the manifest with a `-ringtail` suffix (see
+`alloy-ringtail`, `external-secrets-ringtail`, etc.). Keep the
+convention.
+
+## Out of scope
+
+- Postgres clusters themselves (`immich-pg`, etc.) — those come from
+ [[immich-pg-on-ringtail]].
+- Removing the minikube cnpg operator. That happens at the very end
+ of the indri-k8s decommission, not in this chain.
diff --git a/docs/how-to/immich/immich-app-on-ringtail.md b/docs/how-to/immich/immich-app-on-ringtail.md
new file mode 100644
index 0000000..51b619d
--- /dev/null
+++ b/docs/how-to/immich/immich-app-on-ringtail.md
@@ -0,0 +1,91 @@
+---
+title: Immich App on Ringtail
+modified: 2026-05-13
+last-reviewed: 2026-05-13
+tags:
+ - how-to
+ - operations
+ - immich
+---
+
+# Immich App on Ringtail
+
+Bring up `immich-server`, `immich-machine-learning`, and
+`immich-valkey` on ringtail. This card stands the stack up against
+the *new* pg cluster — it does not move user traffic. Cutover lives
+in [[immich-cutover-and-decommission]].
+
+## What to do
+
+- New manifest dir `argocd/manifests/immich-ringtail/` (the suffix
+ matches the `-ringtail` convention used by other apps). Port from
+ `argocd/manifests/immich/`:
+ - `deployment-server.yaml` — point `DB_HOSTNAME` at the ringtail
+ pg service.
+ - `deployment-ml.yaml` — use `runtimeClassName: nvidia` + a
+ `resources.limits` for `nvidia.com/gpu: 1`. Use the `-cuda` tag
+ of the immich-ml image (set in kustomization). Ringtail is
+ single-node, so no node selector needed. See
+ `argocd/manifests/frigate/` for the existing GPU pod pattern.
+
+ **GPU contention discovery:** ringtail's `nvidia-device-plugin`
+ is configured with `timeSlicing.replicas: 2`. Frigate + Ollama
+ already consume both virtual slices. Adding immich-ml requires
+ bumping the count to >= 3. Edit
+ `argocd/manifests/nvidia-device-plugin/configmap.yaml` (or
+ wherever the device-plugin config lives) and re-sync the
+ `nvidia-device-plugin` ArgoCD app. The plugin pod restarts and
+ the new advertised count appears as the node's
+ `nvidia.com/gpu` allocatable.
+ - `deployment-valkey.yaml` — straight port, BUT use the upstream
+ multi-arch `docker.io/valkey/valkey:` image — do NOT
+ use the `registry.ops.eblu.me/blumeops/valkey` rewrite in the
+ kustomization. That mirror was built on indri (arm64) and is
+ single-arch; pulling it on ringtail (amd64) gets `exec format
+ error` in CrashLoopBackOff. The mirror should eventually carry
+ a multi-arch tag, at which point the rewrite can return.
+ - `service*.yaml` — straight port.
+ - `pvc-ml-cache.yaml` — straight port (empty `local-path` PVC).
+ - `pv-nfs.yaml` + `pvc.yaml` — already covered by
+ [[sifaka-nfs-from-ringtail]] (may live in this dir or theirs).
+ - `ingress-tailscale.yaml` — ProxyGroup ingress, **must not** set
+ an explicit `host:` (or use `host: *`) per the lesson on
+ ProxyGroup VIP routing.
+ **Hostname collision warning:** the minikube ingress claims the
+ Tailscale device name `photos` (`tls.hosts: [photos]`). Two
+ devices on the tailnet cannot share that name. While the
+ ringtail deployment is being staged it must use a *different*
+ `tls.hosts` value (e.g. `photos-ringtail`) so it can coexist
+ with the running minikube one. The flip to `photos` happens at
+ cutover time, *after* the minikube ingress has been removed.
+ See [[immich-cutover-and-decommission#Cutover sequence]].
+ - `kustomization.yaml` — same `images:` block (server, ML, valkey).
+- New ArgoCD app `argocd/apps/immich-ringtail.yaml` targeting
+ ringtail, namespace `immich`. **Manual sync only** until the
+ cutover.
+- Existing `argocd/apps/immich.yaml` (minikube) stays untouched
+ during this card — both apps exist briefly.
+
+## Bring it up against a copy of the DB
+
+Use the throwaway/test path from [[immich-pg-data-migration#Dry run
+before real cutover]]: point the ringtail immich at the *test* pg
+cluster first, verify the pod boots, the web UI loads (via
+`kubectl port-forward`), assets list, ML embeddings query. Then
+tear it down.
+
+## Verification
+
+- All three pods Ready.
+- ML pod has a GPU attached: `nvidia-smi` inside the container shows
+ the 4080.
+- `immich-server` connects to pg and valkey (no `ECONNREFUSED` in
+ logs).
+- A `kubectl port-forward` to the server service shows the Immich
+ web UI.
+
+## Out of scope
+
+- Public/tailnet routing flip. Caddy still points at the minikube
+ Tailscale ingress until [[immich-cutover-and-decommission]].
+- Removing the minikube immich. Same.
diff --git a/docs/how-to/immich/immich-cutover-and-decommission.md b/docs/how-to/immich/immich-cutover-and-decommission.md
new file mode 100644
index 0000000..b44fddd
--- /dev/null
+++ b/docs/how-to/immich/immich-cutover-and-decommission.md
@@ -0,0 +1,103 @@
+---
+title: Immich Cutover and Decommission
+modified: 2026-05-13
+last-reviewed: 2026-05-13
+tags:
+ - how-to
+ - operations
+ - immich
+ - migration
+---
+
+# Immich Cutover and Decommission
+
+The user-visible flip. By the time this card opens, the ringtail
+stack has been proven against a copy of the data. This card does the
+real cutover.
+
+## Pre-cutover checklist
+
+- [[immich-pg-data-migration]] dry-run succeeded; method is chosen.
+- Ringtail immich stack has been brought up against the test pg,
+ pods healthy, UI loaded ([[immich-app-on-ringtail#Verification]]).
+- Borgmatic just ran successfully (a fresh nightly archive is a
+ belt-and-suspenders fallback, on top of the live source pg).
+- User has been told to stop uploading from the iOS app for the
+ cutover window.
+
+## Cutover sequence
+
+1. **Quiesce source.** `kubectl --context=minikube-indri -n immich
+ scale deploy/immich-server --replicas=0` and same for ML. Leave
+ valkey + pg running. Confirm no client traffic on the source pg
+ via `pg_stat_activity`.
+2. **Tear down the minikube Tailscale ingress.** The `photos`
+ Tailscale device name must be freed before ringtail's ingress can
+ claim it (Tailscale enforces uniqueness across the tailnet).
+ `kubectl --context=minikube-indri -n immich delete ingress
+ immich-tailscale` and wait for the corresponding `tailscale`-LB
+ StatefulSet pod to terminate. Verify the `photos` device is gone:
+ `tailscale status | grep -i photos` from any tailnet host.
+3. **Final sync.** Per chosen method in
+ [[immich-pg-data-migration]]:
+ - Option A: promote the ringtail replica.
+ - Option B: take final `pg_dump`, restore to ringtail
+ `immich-pg`.
+4. **Verify.** Run the row-count and schema-diff checks from
+ [[immich-pg-data-migration#Verification on the real run]].
+5. **Flip the ringtail ingress to `photos`.** Update
+ `argocd/manifests/immich-ringtail/ingress-tailscale.yaml`:
+ `tls.hosts: [photos]` (was `[photos-ringtail]` during staging per
+ [[immich-app-on-ringtail]]). Commit, `argocd app sync
+ immich-ringtail`. Wait for the `photos` device to register on the
+ tailnet again.
+6. **Bring up ringtail immich** against the now-promoted pg
+ (`argocd app sync immich-ringtail`). Wait for Ready.
+7. **Flip routing.** Update Caddy on indri
+ (`ansible/roles/caddy/defaults/main.yml`): `photos.ops.eblu.me`
+ upstream changes to the ringtail Tailscale ingress hostname
+ (`photos` — same MagicDNS name, now pointing to the ringtail
+ proxy). `mise run provision-indri -- --tags caddy`.
+8. **Smoke test.** Open `photos.ops.eblu.me` in a browser. Sign in.
+ Scroll the timeline. Open an album. Trigger an ML search.
+9. **Update borgmatic.** If the Tailscale hostname for pg changed,
+ update `borgmatic.cfg` on indri to point at the ringtail
+ `immich-pg-tailscale` service. Run a manual backup to verify.
+
+## After cutover
+
+- `argocd app set immich --revision ` is no longer relevant;
+ the minikube `immich` app gets deleted entirely.
+- Delete `argocd/apps/immich.yaml`, `argocd/manifests/immich/`, and
+ the minikube `argocd/manifests/databases/immich-pg.yaml` +
+ `external-secret-immich-borgmatic.yaml` +
+ `service-immich-pg-tailscale.yaml`.
+- Rename `immich-ringtail` back to `immich` (the `-ringtail` suffix
+ was scaffolding for the dual-cluster window; once minikube is
+ empty of immich, the unsuffixed name is clean).
+- Confirm the minikube `immich-pg` PVC is no longer used, then
+ delete it (the PV with `Retain` policy will persist — clean that
+ up too).
+
+## Verification (definition of done)
+
+- `photos.ops.eblu.me` works for a real session, including ML search.
+- Source minikube has no `immich` pods, no `immich-pg`, no PVCs.
+- Memory pressure on minikube has dropped (≥1.5 GiB reclaimed). Check
+ `docker stats minikube` on indri.
+- Nightly borgmatic run after the cutover completes successfully,
+ with the immich-pg archive showing the new source.
+
+## Rollback (within the cutover window)
+
+If the smoke test fails: scale ringtail immich to 0, return the
+`photos` name to the minikube ingress (deleted in step 2), flip
+Caddy back, and scale source immich back up. Source pg was never
+destroyed. File a plan reset on the relevant prerequisite card, then retry.
+
+## Out of scope
+
+- Decommissioning all of minikube. This chain just removes immich.
+ Other tenants migrate in their own chains as part of the broader
+ indri-k8s decommission. See [[migrate-immich-to-ringtail]] for
+ context.
diff --git a/docs/how-to/immich/immich-pg-data-migration.md b/docs/how-to/immich/immich-pg-data-migration.md
new file mode 100644
index 0000000..fb87783
--- /dev/null
+++ b/docs/how-to/immich/immich-pg-data-migration.md
@@ -0,0 +1,79 @@
+---
+title: Immich Postgres Data Migration
+modified: 2026-05-13
+last-reviewed: 2026-05-13
+tags:
+ - how-to
+ - operations
+ - postgres
+ - immich
+ - critical
+---
+
+# Immich Postgres Data Migration
+
+**This is the data-loss surface of the migration.** Pick a method,
+prove it on a throwaway copy first, then run the real cutover.
+
+## Decision: pick one
+
+### Option A — CNPG `externalCluster` bootstrap (preferred)
+
+Stand the ringtail cluster up as a streaming replica of the minikube
+cluster via `bootstrap.pg_basebackup.source`. Replica catches up
+online; when ready, promote it and point Immich at it. This is
+CNPG's documented PG-to-PG migration path and gives near-zero data
+loss (the WAL position at promote == the position at app stop).
+
+Requires: network path from ringtail to minikube's pg over the
+tailnet (the existing `immich-pg-tailscale` Service works), and a
+superuser secret minikube-side exposed to ringtail's basebackup.
+
+Pitfall to plan around: the ringtail Cluster CR will need its
+`bootstrap` block rewritten *after* promotion (CNPG doesn't
+gracefully drop the externalCluster reference). Account for this in
+[[immich-pg-on-ringtail]] — it may force a reset of that card.
+
+### Option B — pg_dump / pg_restore
+
+Stop immich, `pg_dump -Fc` from minikube, scp to ringtail, restore.
+Simpler but full downtime for the whole dump+restore window
+(measure on a copy first — VectorChord indexes are slow to rebuild).
+Smaller blast radius; no streaming-replication moving parts.
+
+Use this if Option A hits any blocker. Data loss should still be
+zero if the source is stopped first.
+
+### Option C — leave pg on minikube
+
+Rejected. See goal card [[migrate-immich-to-ringtail#Why postgres on
+ringtail (not cross-cluster)]].
+
+## Dry run before real cutover
+
+Whichever option wins:
+
+1. Snapshot the minikube `immich-pg` PVC or take a fresh `pg_dump`
+ into a scratch location.
+2. Restore into a *separate* ringtail CNPG cluster (different name,
+ e.g. `immich-pg-test`) and point a scratch immich-server pod at
+ it.
+3. Verify: pod boots, can list assets, ML embeddings query without
+ error, face thumbnails render. VectorChord-backed queries should
+ not error.
+4. Tear the scratch cluster down before doing the real one.
+
+## Verification on the real run
+
+- Row counts match for `asset`, `album`, `user`, `activity`,
+ `asset_face`, `smart_search` (the embedding table) — script this.
+- `pg_dump --schema-only --no-owner` diff between source and dest
+ should be empty modulo CNPG-managed roles.
+- Immich `/api/server-info/version` and `/api/server-info/statistics`
+ return sane numbers.
+
+## Rollback
+
+If the cutover fails verification: stop the ringtail immich, repoint
+ArgoCD `immich.destination` back to minikube, re-sync. Source pg was
+never deleted. Document what failed and reset the chain.
diff --git a/docs/how-to/immich/immich-pg-on-ringtail.md b/docs/how-to/immich/immich-pg-on-ringtail.md
new file mode 100644
index 0000000..10c7072
--- /dev/null
+++ b/docs/how-to/immich/immich-pg-on-ringtail.md
@@ -0,0 +1,69 @@
+---
+title: Immich Postgres Cluster on Ringtail
+modified: 2026-05-13
+last-reviewed: 2026-05-13
+tags:
+ - how-to
+ - operations
+ - postgres
+ - immich
+---
+
+# Immich Postgres Cluster on Ringtail
+
+Stand up a fresh `immich-pg` CNPG Cluster on ringtail, ready to receive
+data. **No data import yet** — that's [[immich-pg-data-migration]].
+
+## What to do
+
+- Create `argocd/manifests/databases-ringtail/` (or pick another
+ namespace name — verify what other ringtail pg clusters will use;
+ if none yet, `databases` is fine).
+- Port these from the minikube side:
+ - `immich-pg.yaml` — CNPG Cluster CR. Same image
+ (`ghcr.io/tensorchord/cloudnative-vectorchord:17-0.5.0`), same
+ extensions, same managed `borgmatic` role. Bump `storage.size` if
+ the minikube 10 GiB looks tight (check actual usage first).
+ `storageClass: local-path` on ringtail (default).
+ - `external-secret-immich-borgmatic.yaml` — same 1Password item,
+ same field, but referencing the ringtail `ClusterSecretStore`
+ (`onepassword-blumeops` already exists per the
+ `external-secrets-ringtail` app).
+ - Service for in-cluster access (the operator creates `immich-pg-rw`
+ etc. automatically; verify the app deployment uses those names).
+ - A Tailscale Service if we want backups to keep working via the
+ same hostname during the transition — see "Borgmatic" below.
+- New ArgoCD app `argocd/apps/databases-ringtail.yaml` pointing at
+ the new path, destination ringtail.
+
+## Verification
+
+- Cluster reaches `Ready`.
+- `borgmatic` role exists, `rolcanlogin=t`, and is a member of
+ `pg_read_all_data` (via `managed.roles[].inRoles`).
+- ExternalSecret `immich-pg-borgmatic` syncs from 1Password
+ (`Ready: True`) and the rendered Secret has `username=borgmatic`.
+- The `vchord`, `vector`, `cube`, `earthdistance` extensions show
+ installed in the `postgres` database (`\dx` from
+ `psql -U postgres`). They are NOT installed in the `immich`
+ database at this point — `postInitSQL` in CNPG's `initdb` block
+ runs against the `postgres` superuser database. The Immich app
+ itself creates the extensions in its own `immich` database at
+ startup; do not be alarmed by their absence pre-immich-deploy.
+ The `vchord.so` library is preloaded via
+ `shared_preload_libraries` regardless, so `CREATE EXTENSION` at
+ app startup just registers it in the right database.
+
+## Borgmatic implications
+
+`borgmatic.cfg` on indri targets `immich-pg-tailscale` over the
+tailnet. During migration both clusters will exist briefly. Decide
+upfront: backup the *source* pg until cutover, then flip borgmatic
+to the ringtail Tailscale service. Document the flip in
+[[immich-cutover-and-decommission]].
+
+## Out of scope
+
+- Importing data. That is [[immich-pg-data-migration]], which may
+ drive a reset on this card if the migration approach (e.g. CNPG
+ `externalCluster` bootstrap) requires changes to this Cluster CR.
diff --git a/docs/how-to/immich/migrate-immich-to-ringtail.md b/docs/how-to/immich/migrate-immich-to-ringtail.md
new file mode 100644
index 0000000..cd23384
--- /dev/null
+++ b/docs/how-to/immich/migrate-immich-to-ringtail.md
@@ -0,0 +1,132 @@
+---
+title: Migrate Immich to Ringtail
+modified: 2026-05-13
+last-reviewed: 2026-05-13
+tags:
+ - how-to
+ - operations
+ - immich
+ - migration
+---
+
+# Migrate Immich to Ringtail
+
+Move the entire Immich stack (server, ML, valkey, postgres) off
+`minikube-indri` and onto `k3s-ringtail`. This is the first concrete
+chain in the broader indri-k8s decommission: minikube is
+memory-saturated (97% RAM, swapping), and Immich is the single
+largest tenant (~1.5 GiB resident).
+
+## End state
+
+- Immich `server`, `machine-learning`, and `valkey` Deployments run on
+ ringtail k3s in the `immich` namespace.
+- The `immich-machine-learning` pod uses ringtail's RTX 4080 via the
+ `nvidia-device-plugin` (performance win — currently CPU-only on
+ minikube).
+- A CNPG `immich-pg` Cluster (PostgreSQL 17 + VectorChord) runs in a
+ `databases` namespace on ringtail, owned by the `cnpg-system`
+ operator on ringtail.
+- The photo library still lives on [[sifaka]] at `/volume1/photos`,
+ mounted via NFS from ringtail pods (RWX).
+- Routing: `photos.ops.eblu.me` (Caddy on indri) proxies to a
+ Tailscale ProxyGroup ingress on ringtail. No public surface today.
+- The ArgoCD `immich` app's `destination.server` points at
+ `https://ringtail.tail8d86e.ts.net:6443`. The old minikube
+ manifests are removed.
+
+## Non-goals
+
+- Public exposure via Fly. Immich stays tailnet-only.
+- Changing the immich version or runtime configuration. This is a
+ lift-and-shift; bumps come later.
+- Backing up to a different target. [[borgmatic]] keeps running on
+ indri (it pulls via Tailscale and uses sifaka SMB for the library).
+
+## Critical constraint: no data loss
+
+Downtime is acceptable (Immich is a single-user system; we can take
+it offline for the cutover). **Data loss is not.** Two surfaces matter:
+
+1. **Postgres** — face data, ML embeddings (vectors), album state,
+ sharing, etc. Re-derivable in theory; weeks of recompute in
+ practice. See [[immich-pg-data-migration]].
+2. **Library files** — `/volume1/photos`. Not moving, but the NFS
+ path must be verified accessible from ringtail before cutover.
+ See [[sifaka-nfs-from-ringtail]].
+
+[[borgmatic]] backs both up to sifaka + BorgBase nightly; restore is
+possible but slow. Treat it as a fallback, not a plan.
+
+## Why postgres on ringtail (not cross-cluster)
+
+`immich-pg` already has a Tailscale Service we could point ringtail
+at, leaving the DB on minikube. We're not doing that because:
+
+- The whole goal is to retire minikube — keeping pg there blocks it.
+- Immich is chatty against pg; tailnet round-trips would hurt.
+- CNPG is the same operator on both sides — a Cluster CR on ringtail
+ is mechanically equivalent.
+
+## Approach
+
+This is a C2 Mikado chain. The prerequisite cards each represent a
+distinct surface that has to work before cutover. See
+[[agent-change-process#C2 — Mikado Chain]] for the discipline.
+
+## Workflow note: registering new ArgoCD apps during the chain
+
+This chain adds three new ArgoCD `Application` definitions in
+`argocd/apps/`: `cloudnative-pg-ringtail`, `databases-ringtail`,
+and (later) `immich-ringtail`. The usual C1/C2 pattern of
+`argocd app set --revision && argocd app sync `
+does NOT work for the app-of-apps `apps` Application itself, because
+`apps` self-manages: it re-reads `apps.yaml` (which declares
+`targetRevision: main`) on every sync and reverts the override. As a
+result, new app definitions added on a feature branch are never
+visible to the cluster via `apps`.
+
+**Use `kubectl apply` to register each new Application directly:**
+
+```fish
+kubectl --context=minikube-indri apply -f argocd/apps/.yaml
+```
+
+This creates the Application resource out-of-band, bypassing `apps`.
+
+For apps whose source lives in **this** repo (e.g.
+`databases-ringtail`, `immich-ringtail` — manifest paths exist only
+on the branch until merge), follow the apply with a branch override:
+
+```fish
+argocd app set --revision mikado/migrate-immich-to-ringtail
+argocd app sync
+```
+
+For apps whose source is an **external** repo at a pinned tag (e.g.
+`cloudnative-pg-ringtail` → `mirrors/cloudnative-pg` `v1.27.1`), no
+override is needed — the source revision is independent of this PR.
+
+After PR merge:
+
+```fish
+argocd app set --revision main
+argocd app sync
+```
+
+`apps` itself, on its next sync from `main`, will discover the new
+Application definitions in `argocd/apps/` and adopt the already-running
+resources without disruption — provided their in-cluster spec matches
+the on-disk definitions (which it does because we applied the same
+file).
+
+## Related
+
+- [[shower-on-ringtail]] — a previous migration to ringtail (simpler:
+ no upstream cluster, SQLite, no GPU)
+- [[connect-to-postgres]] — getting a psql session against CNPG
+- [[ringtail]] — the target cluster
+- [[cnpg-on-ringtail]], [[immich-pg-on-ringtail]],
+ [[immich-pg-data-migration]], [[sifaka-nfs-from-ringtail]],
+ [[immich-app-on-ringtail]], [[immich-cutover-and-decommission]] —
+ the prerequisite cards
diff --git a/docs/how-to/immich/sifaka-nfs-from-ringtail.md b/docs/how-to/immich/sifaka-nfs-from-ringtail.md
new file mode 100644
index 0000000..2c490c1
--- /dev/null
+++ b/docs/how-to/immich/sifaka-nfs-from-ringtail.md
@@ -0,0 +1,67 @@
+---
+title: Sifaka NFS Photos from Ringtail
+modified: 2026-05-13
+last-reviewed: 2026-05-13
+tags:
+ - how-to
+ - operations
+ - storage
+ - nfs
+ - sifaka
+---
+
+# Sifaka NFS Photos from Ringtail
+
+The Immich library lives at `sifaka:/volume1/photos` and is mounted
+into the pod via an NFS PV (see `argocd/manifests/immich/pv-nfs.yaml`).
+That PV is currently scoped to indri. We need ringtail to mount the
+same path with the same RWX semantics, without breaking the existing
+indri mount during the transition.
+
+## What to verify / do
+
+- Check `sifaka` DSM NFS rules for the `photos` share. Per
+ [[shower-on-ringtail#NFS + SMB share on sifaka]] convention, rules
+ use `192.168.1.0/24` + `100.64.0.0/10` with
+ `all_squash`/`Map all users to admin`. The existing rule may
+ already cover ringtail (it's on `192.168.1.21` per the recent
+ static-IP pin). If so this card is a verification card.
+- If the rule is locked to indri's IP: add an entry for ringtail
+ (192.168.1.21) or widen to the subnet pattern above.
+- Test mount from a ringtail debug pod (busybox or alpine with
+ nfs-utils) against the `photos` share. Read a file. Write a temp
+ file. Delete it.
+- Watch for the known sifaka NFS-over-Tailscale gotcha: sifaka's
+ Tailscale must be in TUN mode (not userspace) for NFS to work
+ reliably over the tailnet. The NFS path here goes over the LAN
+ (not tailnet), so this shouldn't bite, but worth confirming the
+ NFS traffic is on `192.168.1.x` not `100.x`.
+
+## PV + PVC on ringtail
+
+- New `pv-nfs.yaml` mirroring the minikube one. PVs are defined per
+ cluster, so the name cannot collide with minikube's; duplicate
+ the manifest as-is. Same `server: sifaka`, same path, same
+ `accessModes: [ReadWriteMany]`, `persistentVolumeReclaimPolicy:
+ Retain`.
+- New `pvc.yaml` in the ringtail `immich` namespace bound to it.
+- The minikube PVC stays bound and active until cutover — both
+ clusters can have the share NFS-mounted simultaneously (NFS RWX
+ permits this). Immich itself must not be running on both sides
+ at once.
+
+## Verification
+
+- A pod on ringtail can `ls /mnt/photos/` and see the same files
+ as the indri pod.
+- File written from ringtail pod is visible from indri pod and
+ vice versa (proves there's no caching surprise).
+
+## Out of scope
+
+- Migrating photo files. Nothing moves; this is just adding a second
+ NFS client.
+- The `pvc-ml-cache.yaml` PVC (a separate ML model cache). That's
+ not on NFS — it's a regular PVC. Recreated empty on ringtail in
+ [[immich-app-on-ringtail]]; the first ML pod boot will repopulate
+ it.
From dc69b8c68be6d158f15178a08f9f09603de50381 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Wed, 13 May 2026 18:55:50 -0700
Subject: [PATCH 02/52] C1: fix borgmatic shower SQLite dump (ssh to ringtail)
(#357)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
Nightly borgmatic backups have been failing for 2 days. Root cause: the
shower SQLite dump `before_backup` hook (added in PR #349) referenced
`kubectl --context=k3s-ringtail`, but indri's kubeconfig deliberately
doesn't carry the ringtail credentials. The hook's failure aborted the
entire run, taking out *both* the local sifaka repo and the BorgBase
offsite. Verified the last good archive was `indri-2026-05-11T02:00`.
## Approach
ssh into ringtail and run `k3s kubectl` there — no indri-side
kubeconfig needed. `/etc/rancher/k3s/k3s.yaml` is mode 644 so no sudo
required, and the existing ssh access from indri to ringtail works.
Inline-shell quoting got hairy fast (fish on ringtail rejected `POD=...`
bash syntax; the nix shower image lacks `tar` so `kubectl cp` fails).
Pulled the dump logic into `~/bin/borgmatic-k8s-sqlite-dump`, deployed
by the ansible role. Each dump entry now declares a `target`:
- `local:` — local kubectl with explicit context (mealie)
- `ssh:` — ssh + `k3s kubectl` on the cluster host (shower)
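A minimal sketch of how such a `target` string can be split into its mode and reference with bash parameter expansion, mirroring the two expansions the helper script uses. The `parse_target` function name is hypothetical and exists only for this illustration; the example values are the two targets configured in the role defaults:

```bash
# Hypothetical helper for illustration only: split "mode:ref" into two words.
parse_target() {
  local target=$1
  local mode=${target%%:*}   # drop the longest suffix starting at a colon
  local ref=${target#*:}     # drop the shortest prefix ending at a colon
  printf '%s %s\n' "$mode" "$ref"
}

parse_target local:minikube
parse_target ssh:eblume@ringtail
```

Because only the first colon acts as the separator, a reference that itself contains a colon stays intact in `ref`.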
Bytes come back via `kubectl exec ... -- cat` instead of `kubectl cp`
since `cp` needs `tar` in the pod (nix-built containers don't bundle it).
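Streaming the dump this way only works if stdout carries nothing but the database bytes, so diagnostics have to go to stderr. A small sketch of that separation with no cluster involved; `emit_dump` and its payload are hypothetical stand-ins for the remote commands, not part of the helper:

```bash
# Hypothetical stand-in for the remote side: logs on stderr, payload on stdout.
emit_dump() {
  echo "creating backup copy" 1>&2   # diagnostic, must not reach the dump file
  printf 'payload-bytes'             # the only thing written to stdout
}

out=$(mktemp)
emit_dump > "$out" 2>/dev/null       # redirecting stdout captures just the payload
cat "$out"
rm -f "$out"
```

The same idea is why the helper's non-payload commands are redirected with `1>&2` in the ssh branch.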
## Test plan
- [x] `mise run provision-indri -- --tags borgmatic --check --diff` shows expected diff
- [x] Apply, helper script deployed at `~/bin/borgmatic-k8s-sqlite-dump`
- [x] Helper invoked directly with `ssh:eblume@ringtail` produces a valid 288 KB SQLite file
- [x] Full `borgmatic create` completes without errors — both mealie.db (1.7 MB) and shower.db (288 KB) appear in `~/.local/share/borgmatic/k8s-dumps/`, archive `indri-2026-05-13T17:31:02` written to sifaka borg repo
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.eblu.me/eblume/blumeops/pulls/357
---
ansible/roles/borgmatic/defaults/main.yml | 8 ++-
ansible/roles/borgmatic/tasks/main.yml | 14 ++++
.../roles/borgmatic/templates/config.yaml.j2 | 14 +++-
.../borgmatic/templates/k8s-sqlite-dump.sh.j2 | 71 +++++++++++++++++++
.../fix-borgmatic-shower-via-ssh.bugfix.md | 14 ++++
5 files changed, 116 insertions(+), 5 deletions(-)
create mode 100644 ansible/roles/borgmatic/templates/k8s-sqlite-dump.sh.j2
create mode 100644 docs/changelog.d/fix-borgmatic-shower-via-ssh.bugfix.md
diff --git a/ansible/roles/borgmatic/defaults/main.yml b/ansible/roles/borgmatic/defaults/main.yml
index 123cb0f..3a89a09 100644
--- a/ansible/roles/borgmatic/defaults/main.yml
+++ b/ansible/roles/borgmatic/defaults/main.yml
@@ -56,12 +56,16 @@ borgmatic_k8s_sqlite_dumps:
namespace: mealie
label_selector: app=mealie
db_path: /app/data/mealie.db
- context: minikube
+ # local kubectl, --context=minikube (indri's only configured ctx)
+ target: local:minikube
- name: shower
namespace: shower
label_selector: app=shower
db_path: /app/data/db.sqlite3
- context: k3s-ringtail
+ # ssh to ringtail and run k3s kubectl there — avoids needing a
+ # ringtail kubeconfig on indri. k3s.yaml on ringtail is
+ # world-readable (mode 644), so no sudo required.
+ target: ssh:eblume@ringtail
# Exclude patterns
borgmatic_exclude_patterns: []
diff --git a/ansible/roles/borgmatic/tasks/main.yml b/ansible/roles/borgmatic/tasks/main.yml
index eacefa5..4ac242c 100644
--- a/ansible/roles/borgmatic/tasks/main.yml
+++ b/ansible/roles/borgmatic/tasks/main.yml
@@ -49,6 +49,20 @@
mode: '0700'
when: borgmatic_k8s_sqlite_dumps | length > 0
+- name: Ensure ~/bin exists
+ ansible.builtin.file:
+ path: "{{ ansible_env.HOME }}/bin"
+ state: directory
+ mode: '0755'
+ when: borgmatic_k8s_sqlite_dumps | length > 0
+
+- name: Deploy k8s SQLite dump helper script
+ ansible.builtin.template:
+ src: k8s-sqlite-dump.sh.j2
+ dest: "{{ ansible_env.HOME }}/bin/borgmatic-k8s-sqlite-dump"
+ mode: '0755'
+ when: borgmatic_k8s_sqlite_dumps | length > 0
+
- name: Deploy borgmatic configuration
ansible.builtin.template:
src: config.yaml.j2
diff --git a/ansible/roles/borgmatic/templates/config.yaml.j2 b/ansible/roles/borgmatic/templates/config.yaml.j2
index 85804b7..0893dbc 100644
--- a/ansible/roles/borgmatic/templates/config.yaml.j2
+++ b/ansible/roles/borgmatic/templates/config.yaml.j2
@@ -32,12 +32,20 @@ exclude_patterns:
encryption_passcommand: {{ borgmatic_encryption_passcommand }}
{% if borgmatic_k8s_sqlite_dumps %}
-# Pre-backup: dump SQLite databases from k8s pods
-# Uses sqlite3 .backup for a safe, consistent copy (no corruption from concurrent writes)
+# Pre-backup: dump SQLite databases from k8s pods.
+# Uses sqlite3.backup() for a safe, consistent copy.
+#
+# Quoting/escaping is delegated to ~/bin/borgmatic-k8s-sqlite-dump
+# (deployed by the borgmatic ansible role). Each entry's `target`
+# is either:
+# - local: -> local kubectl with --context (mealie etc.)
+# - ssh: -> ssh + k3s kubectl on the cluster host,
+# used for ringtail since indri's kubeconfig
+# deliberately doesn't carry that context.
before_backup:
- mkdir -p {{ borgmatic_k8s_dump_dir }}
{% for db in borgmatic_k8s_sqlite_dumps %}
- - /opt/homebrew/bin/kubectl --context={{ db.context }} exec -n {{ db.namespace }} deploy/{{ db.name }} -- python3 -c "import sqlite3; sqlite3.connect('{{ db.db_path }}').backup(sqlite3.connect('/tmp/{{ db.name }}-backup.db'))" && /opt/homebrew/bin/kubectl --context={{ db.context }} cp {{ db.namespace }}/$(/opt/homebrew/bin/kubectl --context={{ db.context }} get pod -n {{ db.namespace }} -l {{ db.label_selector }} -o jsonpath='{.items[0].metadata.name}'):/tmp/{{ db.name }}-backup.db {{ borgmatic_k8s_dump_dir }}/{{ db.name }}.db
+ - {{ ansible_env.HOME }}/bin/borgmatic-k8s-sqlite-dump {{ db.target }} {{ db.namespace }} {{ db.label_selector }} {{ db.db_path }} {{ db.name }} {{ borgmatic_k8s_dump_dir }}/{{ db.name }}.db
{% endfor %}
{% endif %}
diff --git a/ansible/roles/borgmatic/templates/k8s-sqlite-dump.sh.j2 b/ansible/roles/borgmatic/templates/k8s-sqlite-dump.sh.j2
new file mode 100644
index 0000000..323e717
--- /dev/null
+++ b/ansible/roles/borgmatic/templates/k8s-sqlite-dump.sh.j2
@@ -0,0 +1,71 @@
+#!/usr/bin/env bash
+# {{ ansible_managed }}
+#
+# Helper script invoked by borgmatic's before_backup hook to capture a
+# k8s pod's SQLite database. Keeps the borgmatic config readable by
+# pulling all the quoting out of YAML.
+#
+# Usage:
+# borgmatic-k8s-sqlite-dump \
+#
+#
+# is one of:
+# local: - run local kubectl with --context=
+# ssh: - ssh to host and run k3s kubectl there
+# (no indri-side kubeconfig needed)
+#
+# - k8s namespace of the pod
+# - label selector to find the pod (e.g. app=shower)
+# - absolute path inside the pod to the SQLite DB
+# - short name used for temp filenames
+# - file on this host to receive the dump
+set -euo pipefail
+
+target=${1:?missing target}
+namespace=${2:?missing namespace}
+selector=${3:?missing selector}
+db_path=${4:?missing db path}
+name=${5:?missing name}
+dump_target=${6:?missing dump target}
+
+pod_tmp="/tmp/${name}-backup.db"
+
+python_backup='import sqlite3; sqlite3.connect("'"$db_path"'").backup(sqlite3.connect("'"$pod_tmp"'"))'
+
+mode=${target%%:*}
+ref=${target#*:}
+
+case "$mode" in
+ local)
+ # Pulls dump bytes out via "kubectl exec -- cat" rather than
+ # "kubectl cp", which would otherwise need tar inside the pod
+ # (nix-built images like shower don't bundle tar).
+ context=$ref
+ kubectl=(/opt/homebrew/bin/kubectl --context="$context" -n "$namespace")
+ pod=$("${kubectl[@]}" get pod -l "$selector" \
+ -o jsonpath='{.items[0].metadata.name}')
+ "${kubectl[@]}" exec "$pod" -- python3 -c "$python_backup"
+ "${kubectl[@]}" exec "$pod" -- cat "$pod_tmp" > "$dump_target"
+ "${kubectl[@]}" exec "$pod" -- rm -f "$pod_tmp"
+ ;;
+ ssh)
+ host=$ref
+ # Force bash on the remote (user's login shell on ringtail is
+ # fish). Pipe the script via stdin to dodge nested quoting.
+ # The dump bytes come back over the ssh stdout stream — no
+ # intermediate scp, no tar requirement in the pod.
+ ssh "$host" bash <<EOF > "$dump_target"
+set -euo pipefail
+export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
+pod=\$(k3s kubectl -n "$namespace" get pod -l "$selector" -o jsonpath='{.items[0].metadata.name}')
+k3s kubectl -n "$namespace" exec "\$pod" -- python3 -c '$python_backup' 1>&2
+k3s kubectl -n "$namespace" exec "\$pod" -- cat "$pod_tmp"
+k3s kubectl -n "$namespace" exec "\$pod" -- rm -f "$pod_tmp" 1>&2
+EOF
+ ;;
+ *)
+ echo "borgmatic-k8s-sqlite-dump: unknown target mode: $mode" >&2
+ echo " expected local: or ssh:" >&2
+ exit 1
+ ;;
+esac
diff --git a/docs/changelog.d/fix-borgmatic-shower-via-ssh.bugfix.md b/docs/changelog.d/fix-borgmatic-shower-via-ssh.bugfix.md
new file mode 100644
index 0000000..e18272c
--- /dev/null
+++ b/docs/changelog.d/fix-borgmatic-shower-via-ssh.bugfix.md
@@ -0,0 +1,14 @@
+Fix nightly borgmatic backups failing for 2 days. The shower SQLite
+dump hook referenced `kubectl --context=k3s-ringtail`, but indri's
+kubeconfig deliberately doesn't carry the ringtail credentials. The
+`before_backup` hook's failure aborted the entire run, taking out
+*both* the local sifaka repo and the BorgBase offsite. Replaced
+the inline-shell dump with a `~/bin/borgmatic-k8s-sqlite-dump`
+helper deployed by the ansible role. Each dump entry now declares a
+`target` of either `local:<context>` (mealie: kubectl uses indri's
+kubeconfig) or `ssh:<host>` (shower: ssh into ringtail and
+run `k3s kubectl` there, no indri-side kubeconfig needed; k3s.yaml
+on ringtail is mode 644 so no sudo required). Bytes stream back via
+`kubectl exec ... -- cat` rather than `kubectl cp`, since `kubectl
+cp` requires `tar` inside the pod and nix-built images like shower
+don't bundle it.
From 6e90c4c3631ec593b0b59d97ecad9bc5b92aea15 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Wed, 13 May 2026 20:12:00 -0700
Subject: [PATCH 03/52] C0: bump shower to v1.1.1 (probe FOD hash)
Co-Authored-By: Claude Opus 4.7 (1M context)
---
containers/shower/default.nix | 8 ++++----
docs/changelog.d/+shower-1.1.1.infra.md | 1 +
2 files changed, 5 insertions(+), 4 deletions(-)
create mode 100644 docs/changelog.d/+shower-1.1.1.infra.md
diff --git a/containers/shower/default.nix b/containers/shower/default.nix
index e2d369d..242d873 100644
--- a/containers/shower/default.nix
+++ b/containers/shower/default.nix
@@ -25,7 +25,7 @@
{ pkgs ? import <nixpkgs> { } }:
let
- version = "1.1.0";
+ version = "1.1.1";
python = pkgs.python314;
@@ -43,7 +43,7 @@ let
showerSdist = pkgs.fetchurl {
name = "adelaide_baby_shower_app-${version}.tar.gz";
url = "https://forge.ops.eblu.me/api/packages/eblume/pypi/files/adelaide-baby-shower-app/${version}/adelaide_baby_shower_app-${version}.tar.gz";
- hash = "sha256-5dp+0u4metOIC6s6/nPlT4cdpFBCV6S3+Z/3RO0sX5U=";
+ hash = "sha256-muvjkcKnLrrQTb8HZ4cH9SD0pab05JSFSgwheqb0AyM=";
};
# Wheel pulled from forge.ops.eblu.me (tailnet) for the same reason the
@@ -53,7 +53,7 @@ let
showerWheel = pkgs.fetchurl {
name = "adelaide_baby_shower_app-${version}-py3-none-any.whl";
url = "https://forge.ops.eblu.me/api/packages/eblume/pypi/files/adelaide-baby-shower-app/${version}/adelaide_baby_shower_app-${version}-py3-none-any.whl";
- hash = "sha256-7orFbycON9dQxEIb6q45Xx2rFlEZ8xXSrC2tnrO5uug=";
+ hash = "sha256-dorrwHhZhOn9Qq6Wk3Su24HckgaWtWbkMY7RtAvomv4=";
};
staticAssets = pkgs.runCommand "shower-static-assets-${version}" { } ''
@@ -148,7 +148,7 @@ let
outputHashAlgo = "sha256";
# Pinned dep closure — reproducible until version bumps. To recompute,
# set to pkgs.lib.fakeHash and read the failure.
- outputHash = "sha256-kTNOswobtkgyQmmqbQM8XO4vvaGg57nCuuZGbNXb0NM=";
+ outputHash = pkgs.lib.fakeHash;
dontFixup = true;
};
diff --git a/docs/changelog.d/+shower-1.1.1.infra.md b/docs/changelog.d/+shower-1.1.1.infra.md
new file mode 100644
index 0000000..eb9476c
--- /dev/null
+++ b/docs/changelog.d/+shower-1.1.1.infra.md
@@ -0,0 +1 @@
+Bump shower container to v1.1.1 (probe FOD hash).
From 4e117dc921f4106e7c243e8eed86953bb1f025b4 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Wed, 13 May 2026 20:40:22 -0700
Subject: [PATCH 04/52] C0: pin shower v1.1.1 FOD outputHash (probed on
ringtail)
Co-Authored-By: Claude Opus 4.7 (1M context)
---
containers/shower/default.nix | 2 +-
docs/changelog.d/+shower-1.1.1-fod-pin.infra.md | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+shower-1.1.1-fod-pin.infra.md
diff --git a/containers/shower/default.nix b/containers/shower/default.nix
index 242d873..4f807ed 100644
--- a/containers/shower/default.nix
+++ b/containers/shower/default.nix
@@ -148,7 +148,7 @@ let
outputHashAlgo = "sha256";
# Pinned dep closure — reproducible until version bumps. To recompute,
# set to pkgs.lib.fakeHash and read the failure.
- outputHash = pkgs.lib.fakeHash;
+ outputHash = "sha256-HTTmAldIijG03pYZNyO72LBNPCrjmyJQKgW+gU9NplI=";
dontFixup = true;
};
diff --git a/docs/changelog.d/+shower-1.1.1-fod-pin.infra.md b/docs/changelog.d/+shower-1.1.1-fod-pin.infra.md
new file mode 100644
index 0000000..a19b578
--- /dev/null
+++ b/docs/changelog.d/+shower-1.1.1-fod-pin.infra.md
@@ -0,0 +1 @@
+Pin shower v1.1.1 FOD outputHash (probed locally on ringtail).
From 4d2bc9975fc8c0ab18294d71cd5be790bfb8b926 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Wed, 13 May 2026 20:51:10 -0700
Subject: [PATCH 05/52] C0: deploy shower v1.1.1 (kustomize newTag bump)
Co-Authored-By: Claude Opus 4.7 (1M context)
---
argocd/manifests/shower/kustomization.yaml | 2 +-
docs/changelog.d/+shower-1.1.1-deploy.infra.md | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+shower-1.1.1-deploy.infra.md
diff --git a/argocd/manifests/shower/kustomization.yaml b/argocd/manifests/shower/kustomization.yaml
index b6de844..c0cf4c8 100644
--- a/argocd/manifests/shower/kustomization.yaml
+++ b/argocd/manifests/shower/kustomization.yaml
@@ -14,4 +14,4 @@ resources:
images:
- name: registry.ops.eblu.me/blumeops/shower
- newTag: v1.1.0-3c7967e-nix
+ newTag: v1.1.1-4e117dc-nix
diff --git a/docs/changelog.d/+shower-1.1.1-deploy.infra.md b/docs/changelog.d/+shower-1.1.1-deploy.infra.md
new file mode 100644
index 0000000..61244ac
--- /dev/null
+++ b/docs/changelog.d/+shower-1.1.1-deploy.infra.md
@@ -0,0 +1 @@
+Deploy shower v1.1.1 to ringtail (kustomize newTag bump).
From 12314857d8b9fdc17c5dd97b1b92a36d8463c386 Mon Sep 17 00:00:00 2001
From: Erich Blume <725328+eblume@users.noreply.github.com>
Date: Fri, 15 May 2026 06:27:43 -0700
Subject: [PATCH 06/52] C0: add GE-Proton to ringtail Steam extraCompatPackages
Lets Subnautica 2 (and any other game) opt into the GE-Proton
build via Steam's per-game compatibility tool override, as a
workaround for the Proton Experimental + DXVK D3D12 Mercuna hang.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
docs/changelog.d/+ringtail-proton-ge.infra.md | 4 ++++
nixos/ringtail/gaming.nix | 1 +
2 files changed, 5 insertions(+)
create mode 100644 docs/changelog.d/+ringtail-proton-ge.infra.md
diff --git a/docs/changelog.d/+ringtail-proton-ge.infra.md b/docs/changelog.d/+ringtail-proton-ge.infra.md
new file mode 100644
index 0000000..0d8bc04
--- /dev/null
+++ b/docs/changelog.d/+ringtail-proton-ge.infra.md
@@ -0,0 +1,4 @@
+Add GE-Proton (`pkgs.proton-ge-bin`) to `programs.steam.extraCompatPackages`
+on ringtail. Subnautica 2 hangs at Mercuna plugin init under Proton
+Experimental + DXVK D3D12; GE-Proton is available as a Steam per-game
+compatibility option to work around it.
diff --git a/nixos/ringtail/gaming.nix b/nixos/ringtail/gaming.nix
index d84ef9b..c526857 100644
--- a/nixos/ringtail/gaming.nix
+++ b/nixos/ringtail/gaming.nix
@@ -5,6 +5,7 @@
programs.steam = {
enable = true;
dedicatedServer.openFirewall = true;
+ extraCompatPackages = [ pkgs.proton-ge-bin ];
};
# Proton Experimental ships an accessibility bridge (xalia) that hangs during
From a33fa47b8063f7ae47ada6f10feb8030f2c69426 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Fri, 15 May 2026 06:50:46 -0700
Subject: [PATCH 07/52] C1: deploy shower v1.1.2 (#358)
## Summary
Deploys `adelaide-baby-shower-app` **v1.1.2** to ringtail k3s.
- Bumps `containers/shower/default.nix` `version` to 1.1.2.
- Refreshes sdist + wheel `fetchurl` hashes against the forge PyPI artifacts.
- Re-probed FOD `outputHash` on the nix-container-builder runner (ringtail) and pinned the new closure hash.
- Bumps kustomize `newTag` to `v1.1.2-b8c7783-nix` (built from this branch's tip).
- Bumps `service-versions.yaml` entry for shower to `1.1.2` / `last-reviewed: 2026-05-15`.
## Build provenance
Built by Forgejo Actions run #553 on `nix-container-builder` (ringtail) at commit `b8c7783`. After merge a C0 follow-on will rebuild from main and retag so future provenance points at main history.
## Test plan
- [ ] `argocd app set shower --revision shower-v1.1.2 && argocd app sync shower` deploys cleanly
- [ ] Pod migrates the SQLite PV and serves at `shower.ops.eblu.me` / `shower.eblu.me`
- [ ] No new errors in pod logs after `collectstatic` + gunicorn boot
Reviewed-on: https://forge.eblu.me/eblume/blumeops/pulls/358
---
argocd/manifests/shower/kustomization.yaml | 2 +-
containers/shower/default.nix | 8 ++++----
docs/changelog.d/shower-v1.1.2.infra.md | 1 +
service-versions.yaml | 4 ++--
4 files changed, 8 insertions(+), 7 deletions(-)
create mode 100644 docs/changelog.d/shower-v1.1.2.infra.md
diff --git a/argocd/manifests/shower/kustomization.yaml b/argocd/manifests/shower/kustomization.yaml
index c0cf4c8..2c4dadb 100644
--- a/argocd/manifests/shower/kustomization.yaml
+++ b/argocd/manifests/shower/kustomization.yaml
@@ -14,4 +14,4 @@ resources:
images:
- name: registry.ops.eblu.me/blumeops/shower
- newTag: v1.1.1-4e117dc-nix
+ newTag: v1.1.2-b8c7783-nix
diff --git a/containers/shower/default.nix b/containers/shower/default.nix
index 4f807ed..f7115bc 100644
--- a/containers/shower/default.nix
+++ b/containers/shower/default.nix
@@ -25,7 +25,7 @@
{ pkgs ? import <nixpkgs> { } }:
let
- version = "1.1.1";
+ version = "1.1.2";
python = pkgs.python314;
@@ -43,7 +43,7 @@ let
showerSdist = pkgs.fetchurl {
name = "adelaide_baby_shower_app-${version}.tar.gz";
url = "https://forge.ops.eblu.me/api/packages/eblume/pypi/files/adelaide-baby-shower-app/${version}/adelaide_baby_shower_app-${version}.tar.gz";
- hash = "sha256-muvjkcKnLrrQTb8HZ4cH9SD0pab05JSFSgwheqb0AyM=";
+ hash = "sha256-U00259dlvHSo0c9I/W0kSThyhNKUT8ukG6X+vzj0k9c=";
};
# Wheel pulled from forge.ops.eblu.me (tailnet) for the same reason the
@@ -53,7 +53,7 @@ let
showerWheel = pkgs.fetchurl {
name = "adelaide_baby_shower_app-${version}-py3-none-any.whl";
url = "https://forge.ops.eblu.me/api/packages/eblume/pypi/files/adelaide-baby-shower-app/${version}/adelaide_baby_shower_app-${version}-py3-none-any.whl";
- hash = "sha256-dorrwHhZhOn9Qq6Wk3Su24HckgaWtWbkMY7RtAvomv4=";
+ hash = "sha256-lF79G9SiCuxG9LcyDJkTeTeJL72qTJTDVE196At1Ods=";
};
staticAssets = pkgs.runCommand "shower-static-assets-${version}" { } ''
@@ -148,7 +148,7 @@ let
outputHashAlgo = "sha256";
# Pinned dep closure — reproducible until version bumps. To recompute,
# set to pkgs.lib.fakeHash and read the failure.
- outputHash = "sha256-HTTmAldIijG03pYZNyO72LBNPCrjmyJQKgW+gU9NplI=";
+ outputHash = "sha256-B5INpydOP3DmlgHfgpzKf+2mv0y9Wr2YNK7/5kh0hOc=";
dontFixup = true;
};
diff --git a/docs/changelog.d/shower-v1.1.2.infra.md b/docs/changelog.d/shower-v1.1.2.infra.md
new file mode 100644
index 0000000..aa2db0d
--- /dev/null
+++ b/docs/changelog.d/shower-v1.1.2.infra.md
@@ -0,0 +1 @@
+Deploy shower v1.1.2 — bump container build to new app release.
diff --git a/service-versions.yaml b/service-versions.yaml
index 63bc5df..02f2979 100644
--- a/service-versions.yaml
+++ b/service-versions.yaml
@@ -46,8 +46,8 @@ services:
- name: shower
type: argocd
- last-reviewed: 2026-05-11
- current-version: "1.1.0"
+ last-reviewed: 2026-05-15
+ current-version: "1.1.2"
upstream-source: https://forge.eblu.me/eblume/adelaide-baby-shower-app
notes: |
Django app for Adelaide / Heidi / Addie's baby shower. Wheel
From 815a0cc6e6d2dc7579633853fd8d06b94afddb26 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Fri, 15 May 2026 06:57:24 -0700
Subject: [PATCH 08/52] =?UTF-8?q?C0:=20shower=20=E2=80=94=20rebuild=20from?=
=?UTF-8?q?=20main=20SHA=20(post-merge=20retag)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
PR #358 was squash-merged so the branch commit b8c7783 baked into the
prior image tag isn't reachable from main's history. Rebuild from main
HEAD (a33fa47) and retag. Image content is byte-identical (FOD is
content-addressed, inputs unchanged); only the SHA in the tag changes
so future provenance tracing stays on main.
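
The tag convention used here (`v<version>-<short-sha>-nix`) can be taken apart mechanically when tracing provenance. A minimal sketch, assuming that convention holds for every tag; the function names are hypothetical and not part of the repo:

```shell
# Hypothetical helpers (not in the repo): split an image tag of the
# form v<version>-<short-sha>-nix into its parts.

tag_short_sha() {
    local rest=${1%-nix}          # drop the trailing "-nix" suffix
    printf '%s\n' "${rest##*-}"   # keep what follows the last "-"
}

tag_version() {
    local rest=${1%-nix}          # drop the trailing "-nix" suffix
    rest=${rest%-*}               # drop "-<short-sha>"
    printf '%s\n' "${rest#v}"     # drop the leading "v"
}
```

With the short SHA extracted, `git merge-base --is-ancestor "$(tag_short_sha "$tag")" main` exits 0 only when that commit is reachable from main. For the old tag's b8c7783 it would be expected to exit non-zero after the squash merge.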
Co-Authored-By: Claude Opus 4.7 (1M context)
---
argocd/manifests/shower/kustomization.yaml | 2 +-
docs/changelog.d/+shower-v1.1.2-rebuild-from-main-sha.misc.md | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+shower-v1.1.2-rebuild-from-main-sha.misc.md
diff --git a/argocd/manifests/shower/kustomization.yaml b/argocd/manifests/shower/kustomization.yaml
index 2c4dadb..6d4628c 100644
--- a/argocd/manifests/shower/kustomization.yaml
+++ b/argocd/manifests/shower/kustomization.yaml
@@ -14,4 +14,4 @@ resources:
images:
- name: registry.ops.eblu.me/blumeops/shower
- newTag: v1.1.2-b8c7783-nix
+ newTag: v1.1.2-a33fa47-nix
diff --git a/docs/changelog.d/+shower-v1.1.2-rebuild-from-main-sha.misc.md b/docs/changelog.d/+shower-v1.1.2-rebuild-from-main-sha.misc.md
new file mode 100644
index 0000000..9355a54
--- /dev/null
+++ b/docs/changelog.d/+shower-v1.1.2-rebuild-from-main-sha.misc.md
@@ -0,0 +1 @@
+Rebuild shower v1.1.2 from main HEAD (a33fa47) and retag — PR #358 was squash-merged so the branch SHA baked into the prior image tag isn't reachable from main. FOD is content-addressed, so image bytes are identical; only provenance changes.
From 96dbbb3cbe7d8a9f695c3bc0bf7006367d1181a4 Mon Sep 17 00:00:00 2001
From: Erich Blume <725328+eblume@users.noreply.github.com>
Date: Fri, 15 May 2026 12:11:54 -0700
Subject: [PATCH 09/52] C0: add sn2-prelaunch wrapper to clear SN2 stale
lockfiles
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
UE5 writes Saved/running.dat as a "session in progress" marker. If
the previous session exited uncleanly (SIGKILL, crash), it lingers,
and SN2 pops up an invisible 0×0 Error dialog at next launch that
the GameThread blocks on forever — visible only as a black screen
with a spinning loader. Wrap the Steam command to clear the marker
files before each launch.
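
The same idea generalizes to any game that leaves marker files behind. A minimal sketch, assuming the directory and marker names are passed in rather than hard-coded; `clear_stale_markers` is a hypothetical name, not the wrapper added in this commit:

```shell
# Hypothetical generalisation: remove named marker files from a
# directory, leaving everything else in it untouched.
# $1 = directory, remaining arguments = marker file names.
clear_stale_markers() {
    local dir=$1
    shift
    local marker
    for marker in "$@"; do
        rm -f -- "$dir/$marker"   # -f: a missing marker is not an error
    done
}
```

A launch wrapper would call this and then `exec "$@"`, so the game replaces the wrapper process and Steam tracks the game itself rather than a lingering shell.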
Co-Authored-By: Claude Opus 4.7 (1M context)
---
.../+ringtail-sn2-prelaunch.infra.md | 6 ++++++
nixos/ringtail/gaming.nix | 17 +++++++++++++++++
2 files changed, 23 insertions(+)
create mode 100644 docs/changelog.d/+ringtail-sn2-prelaunch.infra.md
diff --git a/docs/changelog.d/+ringtail-sn2-prelaunch.infra.md b/docs/changelog.d/+ringtail-sn2-prelaunch.infra.md
new file mode 100644
index 0000000..f9c68e2
--- /dev/null
+++ b/docs/changelog.d/+ringtail-sn2-prelaunch.infra.md
@@ -0,0 +1,6 @@
+Add `sn2-prelaunch` Steam launch wrapper on ringtail that removes
+Subnautica 2's stale `Saved/running.dat` and `Saved/beforelobby.dat`
+lockfiles before each launch. SN2 pops up an invisible (0×0-sized)
+Error dialog when it detects an unclean exit, blocking GameThread
+forever; this is observable only as a black screen with a spinning
+loader. Use via Steam launch option: `sn2-prelaunch %command%`.
diff --git a/nixos/ringtail/gaming.nix b/nixos/ringtail/gaming.nix
index c526857..7c00378 100644
--- a/nixos/ringtail/gaming.nix
+++ b/nixos/ringtail/gaming.nix
@@ -13,6 +13,23 @@
# so disable xalia globally to avoid wedging iscriptevaluator.exe.
environment.sessionVariables.PROTON_USE_XALIA = "0";
+ # Subnautica 2 pre-launch wrapper. SN2 (UE5) writes Saved/running.dat as a
+ # "currently running" lockfile. If the prior session exited uncleanly (SIGKILL
+ # via Steam's Stop button, crash, etc.), the file persists and on next launch
+ # SN2 pops up an invisible (0x0-sized) Error dialog ("Your game might not have
+ # exited correctly last time...") that the GameThread blocks on forever —
+ # observable only as a black screen with a spinning loader. This wrapper
+ # removes the stale lockfiles before exec'ing the actual game command.
+ # Use as Steam launch option for Subnautica 2:
+ # sn2-prelaunch %command%
+ environment.systemPackages = [
+ (pkgs.writeShellScriptBin "sn2-prelaunch" ''
+ saved="/mnt/games/SteamLibrary/steamapps/compatdata/1962700/pfx/drive_c/users/steamuser/AppData/Local/Subnautica2/Saved"
+ rm -f "$saved/running.dat" "$saved/beforelobby.dat"
+ exec "$@"
+ '')
+ ];
+
# Gamescope — micro-compositor for game fullscreen/resolution management.
# Use as Steam launch option: gamescope -W 2560 -H 1440 -f -- %command%
programs.gamescope = {
From 3645098bf1d64afb46ab562faae1a8aabeee1501 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Fri, 15 May 2026 19:56:08 -0700
Subject: [PATCH 10/52] C0: bump shower to v1.1.3
Wheel/sdist + FOD hashes probed on ringtail. Full nix-build verified
end-to-end before commit.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
containers/shower/default.nix | 8 ++++----
docs/changelog.d/+shower-1.1.3.infra.md | 1 +
2 files changed, 5 insertions(+), 4 deletions(-)
create mode 100644 docs/changelog.d/+shower-1.1.3.infra.md
diff --git a/containers/shower/default.nix b/containers/shower/default.nix
index f7115bc..c5bd41e 100644
--- a/containers/shower/default.nix
+++ b/containers/shower/default.nix
@@ -25,7 +25,7 @@
{ pkgs ? import <nixpkgs> { } }:
let
- version = "1.1.2";
+ version = "1.1.3";
python = pkgs.python314;
@@ -43,7 +43,7 @@ let
showerSdist = pkgs.fetchurl {
name = "adelaide_baby_shower_app-${version}.tar.gz";
url = "https://forge.ops.eblu.me/api/packages/eblume/pypi/files/adelaide-baby-shower-app/${version}/adelaide_baby_shower_app-${version}.tar.gz";
- hash = "sha256-U00259dlvHSo0c9I/W0kSThyhNKUT8ukG6X+vzj0k9c=";
+ hash = "sha256-a3rCwEdOB+rnYXqsWDifyltpyKUgkOj0ikWB+WGQYKE=";
};
# Wheel pulled from forge.ops.eblu.me (tailnet) for the same reason the
@@ -53,7 +53,7 @@ let
showerWheel = pkgs.fetchurl {
name = "adelaide_baby_shower_app-${version}-py3-none-any.whl";
url = "https://forge.ops.eblu.me/api/packages/eblume/pypi/files/adelaide-baby-shower-app/${version}/adelaide_baby_shower_app-${version}-py3-none-any.whl";
- hash = "sha256-lF79G9SiCuxG9LcyDJkTeTeJL72qTJTDVE196At1Ods=";
+ hash = "sha256-a6j91gBigG4IzE2DVTBntnZ46Yrx9b5PgHn+Uro98Tk=";
};
staticAssets = pkgs.runCommand "shower-static-assets-${version}" { } ''
@@ -148,7 +148,7 @@ let
outputHashAlgo = "sha256";
# Pinned dep closure — reproducible until version bumps. To recompute,
# set to pkgs.lib.fakeHash and read the failure.
- outputHash = "sha256-B5INpydOP3DmlgHfgpzKf+2mv0y9Wr2YNK7/5kh0hOc=";
+ outputHash = "sha256-1xx2qWAIwherklHIPXo6IOKkKHML1KUrUx6pbkMxffc=";
dontFixup = true;
};
diff --git a/docs/changelog.d/+shower-1.1.3.infra.md b/docs/changelog.d/+shower-1.1.3.infra.md
new file mode 100644
index 0000000..33ee49d
--- /dev/null
+++ b/docs/changelog.d/+shower-1.1.3.infra.md
@@ -0,0 +1 @@
+Bumped shower app to v1.1.3 (wheel/sdist + FOD hashes probed on ringtail).
From e222d47d455d07d18d1cf66d2a8984aa85d32586 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Fri, 15 May 2026 20:09:54 -0700
Subject: [PATCH 11/52] C0: deploy shower v1.1.3 (kustomize newTag bump)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Image v1.1.3-3645098-nix was built directly on ringtail and pushed via
skopeo, bypassing the Forgejo runner: indri was severely overloaded
(load avg 24.92, minikube VM at 344% CPU) and the workflow-dispatch
endpoint timed out. The image content is identical to what the runner
would have produced — same default.nix at commit 3645098 (on main),
same NIX_PATH (current nixpkgs flake), same skopeo invocation. Tag
short-sha matches the commit that defines the recipe so we aren't
pinning to a ghost.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
argocd/manifests/shower/kustomization.yaml | 2 +-
docs/changelog.d/+shower-1.1.3-deploy.infra.md | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+shower-1.1.3-deploy.infra.md
diff --git a/argocd/manifests/shower/kustomization.yaml b/argocd/manifests/shower/kustomization.yaml
index 6d4628c..1c29224 100644
--- a/argocd/manifests/shower/kustomization.yaml
+++ b/argocd/manifests/shower/kustomization.yaml
@@ -14,4 +14,4 @@ resources:
images:
- name: registry.ops.eblu.me/blumeops/shower
- newTag: v1.1.2-a33fa47-nix
+ newTag: v1.1.3-3645098-nix
diff --git a/docs/changelog.d/+shower-1.1.3-deploy.infra.md b/docs/changelog.d/+shower-1.1.3-deploy.infra.md
new file mode 100644
index 0000000..833fac6
--- /dev/null
+++ b/docs/changelog.d/+shower-1.1.3-deploy.infra.md
@@ -0,0 +1 @@
+Deployed shower v1.1.3 to ringtail (image built and pushed from ringtail; runner bypassed due to indri overload).
From 1897eb1c5bf4ef1f6d3dfe3601f875b49b8ba2a4 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Sun, 17 May 2026 08:46:22 -0700
Subject: [PATCH 12/52] C0: move immich blackbox probe to ringtail alloy
Immich migrated to ringtail's k3s cluster but the probe still targeted
the in-cluster service DNS on indri's minikube, firing ServiceProbeFailure
indefinitely. Moved the target into alloy-ringtail's config so the probe
runs in the cluster where immich actually lives.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
argocd/manifests/alloy-k8s/config.alloy | 6 ------
argocd/manifests/alloy-ringtail/config.alloy | 20 +++++++++++++++++++
.../+immich-probe-ringtail.infra.md | 1 +
3 files changed, 21 insertions(+), 6 deletions(-)
create mode 100644 docs/changelog.d/+immich-probe-ringtail.infra.md
diff --git a/argocd/manifests/alloy-k8s/config.alloy b/argocd/manifests/alloy-k8s/config.alloy
index 56a2e13..5a0a8f9 100644
--- a/argocd/manifests/alloy-k8s/config.alloy
+++ b/argocd/manifests/alloy-k8s/config.alloy
@@ -196,12 +196,6 @@ prometheus.exporter.blackbox "services" {
module = "http_2xx"
}
- target {
- name = "immich"
- address = "http://immich-server.immich.svc.cluster.local:2283/api/server/ping"
- module = "http_2xx"
- }
-
target {
name = "navidrome"
address = "http://navidrome.navidrome.svc.cluster.local:4533/"
diff --git a/argocd/manifests/alloy-ringtail/config.alloy b/argocd/manifests/alloy-ringtail/config.alloy
index e92ab0f..e5cc045 100644
--- a/argocd/manifests/alloy-ringtail/config.alloy
+++ b/argocd/manifests/alloy-ringtail/config.alloy
@@ -45,6 +45,26 @@ prometheus.scrape "kube_state_metrics" {
forward_to = [prometheus.remote_write.prometheus.receiver]
}
+// ============== SERVICE HEALTH PROBES ==============
+
+// Blackbox-style HTTP probes for in-cluster services on ringtail
+prometheus.exporter.blackbox "services" {
+ config = "{ modules: { http_2xx: { prober: http, timeout: 5s } } }"
+
+ target {
+ name = "immich"
+ address = "http://immich-server.immich.svc.cluster.local:2283/api/server/ping"
+ module = "http_2xx"
+ }
+}
+
+// Scrape blackbox probe results
+prometheus.scrape "blackbox" {
+ targets = prometheus.exporter.blackbox.services.targets
+ scrape_interval = "30s"
+ forward_to = [prometheus.remote_write.prometheus.receiver]
+}
+
// Push metrics to indri Prometheus
prometheus.remote_write "prometheus" {
external_labels = { cluster = "ringtail" }
diff --git a/docs/changelog.d/+immich-probe-ringtail.infra.md b/docs/changelog.d/+immich-probe-ringtail.infra.md
new file mode 100644
index 0000000..f2d3dee
--- /dev/null
+++ b/docs/changelog.d/+immich-probe-ringtail.infra.md
@@ -0,0 +1 @@
+Moved the Immich blackbox health probe from indri's alloy to ringtail's alloy. After the immich migration to ringtail, the probe still targeted `immich-server.immich.svc.cluster.local` on indri's cluster where the service no longer exists, causing a persistent `ServiceProbeFailure` alert.
From 2fae0f71618cb7ba8858714693a127555ace6543 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Tue, 19 May 2026 06:33:26 -0700
Subject: [PATCH 13/52] C0: switch grafana deployment to Recreate strategy
Grafana uses an RWO PVC for SQLite + Bleve search index. RollingUpdate
spawns the new pod before terminating the old one, so the new pod
crashloops on the index lock until rollout timeout. Recreate terminates
the old pod first, letting the new pod acquire the lock cleanly.
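
One way to confirm which strategy a live Deployment is using is to read `.spec.strategy.type` back from the API server. A minimal sketch, assuming `kubectl` is on PATH and pointed at the right cluster; the function name and the `grafana` namespace in the usage line are illustrative, not taken from the manifests:

```shell
# Hypothetical helper: print the rollout strategy of a Deployment.
# $1 = namespace, $2 = deployment name (both supplied by the caller).
deployment_strategy() {
    kubectl -n "$1" get deployment "$2" \
        -o jsonpath='{.spec.strategy.type}'
}
```

Under those assumptions, `deployment_strategy grafana grafana` should print `Recreate` once this change has synced.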
Co-Authored-By: Claude Opus 4.7 (1M context)
---
argocd/manifests/grafana/deployment.yaml | 4 +++-
docs/changelog.d/+grafana-recreate-strategy.infra.md | 1 +
2 files changed, 4 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+grafana-recreate-strategy.infra.md
diff --git a/argocd/manifests/grafana/deployment.yaml b/argocd/manifests/grafana/deployment.yaml
index 0aad9b3..cbba267 100644
--- a/argocd/manifests/grafana/deployment.yaml
+++ b/argocd/manifests/grafana/deployment.yaml
@@ -14,7 +14,9 @@ spec:
app.kubernetes.io/name: grafana
app.kubernetes.io/instance: grafana
strategy:
- type: RollingUpdate
+ # RWO PVC for SQLite + Bleve index — RollingUpdate spawns the new pod
+ # before the old one terminates, and it crashloops on the index lock.
+ type: Recreate
template:
metadata:
labels:
diff --git a/docs/changelog.d/+grafana-recreate-strategy.infra.md b/docs/changelog.d/+grafana-recreate-strategy.infra.md
new file mode 100644
index 0000000..3662e10
--- /dev/null
+++ b/docs/changelog.d/+grafana-recreate-strategy.infra.md
@@ -0,0 +1 @@
+Switched Grafana's deployment strategy from `RollingUpdate` to `Recreate`. With an RWO PVC holding the SQLite database and Bleve search index, `RollingUpdate` reliably crashloops the new pod on the index lock until rollout timeout. `Recreate` terminates the old pod first so the new one acquires the lock cleanly.
From ee51bcafb447ff1ef6e76f67f2d0a51fdaffb1c4 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Fri, 22 May 2026 21:08:53 -0700
Subject: [PATCH 14/52] Rip out compensating-controls framework (#359)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
Removes the compensating-controls (CC) framework. Prowler and Kingfisher continue to run weekly and produce reports; the Prowler mutelist YAML files stay in place but no longer carry `CC: ` prefixes. Each entry now keeps only a free-form `Description` of why it's muted.
The CC review cadence proved to be more process overhead than this single-operator homelab needed.
## What changed
**Deleted**
- `compensating-controls.yaml`: the CC registry
- `mise-tasks/review-compensating-controls`: the staleness-review task
- `docs/how-to/operations/review-compensating-controls.md`
- `docs/how-to/operations/record-review-evidence.md` (was aspirational)
- `docs/explanation/compliance-mute-categories.md` (proposed-future CC/NA/RA work)
- 5 orphan `+review-cc-*` / `+compliance-mute-categories` changelog fragments
**Modified**
- 6 mutelist YAML files: stripped `CC: .` prefix from every `Description` / `statement` field, kept the free-form text
- `mise-tasks/review-compliance-reports`: removed CC mentions from docstrings, panel text, and the node-verification table title. Node-verification logic itself is unchanged.
- `docs/reference/operations/security.md`: removed the "Compensating controls" section
- `docs/how-to/operations/read-compliance-reports.md`: rewrote step 3 of "Acting on findings" to point at the mutelist YAML directly
- `docs/changelog.d/prowler-iac-mutelist.infra.md`: rewrote to drop the "two new compensating controls" framing
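
The prefix stripping on the mutelist files can be sketched as a single `sed` substitution. This is an illustration under assumptions, not necessarily how the edit was made: it handles only `Description` lines (not the `statement` fields), and it assumes the control IDs contain no period, which holds for the entries shown in this patch. `strip_cc_prefix` is a hypothetical name.

```shell
# Sketch only: strip a leading "CC: <ids>. " from a mutelist
# Description line read on stdin; all other lines pass through.
strip_cc_prefix() {
    sed -E 's/^([[:space:]]*Description: ")CC: [^.]*\. /\1/'
}
```

Applied to a line such as `Description: "CC: single-user-cluster. Only operator manages service accounts; ..."`, it leaves the free-form text and drops the `CC:` prefix, while `Resources` lines are unchanged.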
## What did not change
- All Prowler manifests (cronjobs, RBAC, PVs, kustomization) — scans still run on the same schedule
- The Kingfisher deployment
- The trivy-shim in the Prowler container — that's about Trivy ignorefile plumbing, independent of the CC concept
- The mutelist entries themselves: each `Resources` list is unchanged; only the prose of `Description` was edited
- `CHANGELOG.md`: historical releases are left as-is
## Test plan
- [ ] Wait for human review before deploying. Once merged, re-point ArgoCD: `argocd app set prowler --revision main && argocd app sync prowler` (no manifest changes besides the ConfigMap, so impact is limited to muted-finding descriptions in next week's report)
- [ ] Confirm next weekly Prowler K8s CIS run (Sunday 3am) still completes and produces a report on sifaka
- [ ] Confirm next weekly Prowler IaC run still honors `trivyignore.yaml` (the trivy shim is untouched but the ignorefile content was rewritten)
- [ ] `mise run review-compliance-reports`: verify node-verification block still runs and prints the renamed table title
Reviewed-on: https://forge.eblu.me/eblume/blumeops/pulls/359
---
.../manifests/prowler/mutelist/apiserver.yaml | 24 +-
.../prowler/mutelist/control-plane.yaml | 6 +-
.../prowler/mutelist/core-pod-security.yaml | 33 ++-
.../prowler/mutelist/manual-node-checks.yaml | 30 +--
argocd/manifests/prowler/mutelist/rbac.yaml | 15 +-
.../prowler/mutelist/trivyignore.yaml | 24 +-
compensating-controls.yaml | 210 ----------------
.../+compliance-mute-categories.doc.md | 1 -
...eview-cc-ephemeral-privileged-jobs.misc.md | 1 -
...review-cc-init-container-isolation.misc.md | 1 -
.../+review-cc-trusted-ci-only.misc.md | 1 -
.../changelog.d/prowler-iac-mutelist.infra.md | 2 +-
...ervability-stack-audit-2026-05-11.infra.md | 1 -
.../rip-out-compensating-controls.infra.md | 1 +
.../explanation/compliance-mute-categories.md | 99 --------
.../operations/read-compliance-reports.md | 2 +-
.../operations/record-review-evidence.md | 50 ----
.../review-compensating-controls.md | 80 ------
docs/reference/operations/security.md | 8 +-
mise-tasks/review-compensating-controls | 229 ------------------
mise-tasks/review-compliance-reports | 12 +-
21 files changed, 72 insertions(+), 758 deletions(-)
delete mode 100644 compensating-controls.yaml
delete mode 100644 docs/changelog.d/+compliance-mute-categories.doc.md
delete mode 100644 docs/changelog.d/+review-cc-ephemeral-privileged-jobs.misc.md
delete mode 100644 docs/changelog.d/+review-cc-init-container-isolation.misc.md
delete mode 100644 docs/changelog.d/+review-cc-trusted-ci-only.misc.md
delete mode 100644 docs/changelog.d/review-cc-observability-stack-audit-2026-05-11.infra.md
create mode 100644 docs/changelog.d/rip-out-compensating-controls.infra.md
delete mode 100644 docs/explanation/compliance-mute-categories.md
delete mode 100644 docs/how-to/operations/record-review-evidence.md
delete mode 100644 docs/how-to/operations/review-compensating-controls.md
delete mode 100755 mise-tasks/review-compensating-controls
diff --git a/argocd/manifests/prowler/mutelist/apiserver.yaml b/argocd/manifests/prowler/mutelist/apiserver.yaml
index 5a25d4f..fd077e8 100644
--- a/argocd/manifests/prowler/mutelist/apiserver.yaml
+++ b/argocd/manifests/prowler/mutelist/apiserver.yaml
@@ -6,48 +6,48 @@ Mutelist:
"apiserver_always_pull_images_plugin":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: single-user-cluster, local-registry. Only the operator has cluster access; all images pulled from private zot registry."
+ Description: "Only the operator has cluster access; all images pulled from private zot registry."
"apiserver_audit_log_maxage_set":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: observability-stack-audit. Alloy/Loki provides pod-level audit trail."
+ Description: "Alloy/Loki provides pod-level audit trail."
"apiserver_audit_log_maxbackup_set":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: observability-stack-audit. Alloy/Loki provides pod-level audit trail."
+ Description: "Alloy/Loki provides pod-level audit trail."
"apiserver_audit_log_maxsize_set":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: observability-stack-audit. Alloy/Loki provides pod-level audit trail."
+ Description: "Alloy/Loki provides pod-level audit trail."
"apiserver_audit_log_path_set":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: observability-stack-audit. Alloy/Loki provides pod-level audit trail."
+ Description: "Alloy/Loki provides pod-level audit trail."
"apiserver_deny_service_external_ips":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: tailscale-network-isolation. No external IPs routable; cluster only reachable via tailnet."
+ Description: "No external IPs routable; cluster only reachable via tailnet."
"apiserver_disable_profiling":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: tailscale-network-isolation. Profiling endpoint unreachable from public internet."
+ Description: "Profiling endpoint unreachable from public internet."
"apiserver_encryption_provider_config_set":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: tailscale-network-isolation, single-user-cluster. Etcd not network-exposed; only operator has node access."
+ Description: "Etcd not network-exposed; only operator has node access."
"apiserver_kubelet_cert_auth":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: tailscale-network-isolation. Kubelet API not exposed outside the node; minikube auto-generates certificates."
+ Description: "Kubelet API not exposed outside the node; minikube auto-generates certificates."
"apiserver_request_timeout_set":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: tailscale-network-isolation. API server only reachable via tailnet; DoS risk limited to trusted clients."
+ Description: "API server only reachable via tailnet; DoS risk limited to trusted clients."
"apiserver_service_account_lookup_true":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: single-user-cluster. Only operator manages service accounts; no revoked tokens in circulation."
+ Description: "Only operator manages service accounts; no revoked tokens in circulation."
"apiserver_strong_ciphers_only":
Regions: ["*"]
Resources: ["^kube-apiserver-minikube$"]
- Description: "CC: tailscale-network-isolation. API server traffic encrypted by WireGuard at the network layer."
+ Description: "API server traffic encrypted by WireGuard at the network layer."
diff --git a/argocd/manifests/prowler/mutelist/control-plane.yaml b/argocd/manifests/prowler/mutelist/control-plane.yaml
index 2056691..d3cc34a 100644
--- a/argocd/manifests/prowler/mutelist/control-plane.yaml
+++ b/argocd/manifests/prowler/mutelist/control-plane.yaml
@@ -6,12 +6,12 @@ Mutelist:
"controllermanager_disable_profiling":
Regions: ["*"]
Resources: ["^kube-controller-manager-minikube$"]
- Description: "CC: tailscale-network-isolation. Profiling endpoint unreachable from public internet."
+ Description: "Profiling endpoint unreachable from public internet."
"scheduler_profiling":
Regions: ["*"]
Resources: ["^kube-scheduler-minikube$"]
- Description: "CC: tailscale-network-isolation. Profiling endpoint unreachable from public internet."
+ Description: "Profiling endpoint unreachable from public internet."
"kubelet_tls_cert_and_key":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: tailscale-network-isolation, single-user-cluster. Kubelet API not exposed outside node; minikube auto-generates certificates."
+ Description: "Kubelet API not exposed outside node; minikube auto-generates certificates."
diff --git a/argocd/manifests/prowler/mutelist/core-pod-security.yaml b/argocd/manifests/prowler/mutelist/core-pod-security.yaml
index c39e0c6..b1e986e 100644
--- a/argocd/manifests/prowler/mutelist/core-pod-security.yaml
+++ b/argocd/manifests/prowler/mutelist/core-pod-security.yaml
@@ -17,9 +17,8 @@ Mutelist:
- "^kindnet-"
- "^storage-provisioner$"
Description: >-
- CC: tailscale-network-isolation. Control-plane and networking
- pods require hostNetwork by design. Host network itself is
- only reachable via tailnet.
+ Control-plane and networking pods require hostNetwork by design.
+ Host network itself is only reachable via tailnet.
"core_minimize_privileged_containers":
Regions: ["*"]
Resources:
@@ -31,7 +30,6 @@ Mutelist:
# Forgejo runner
- "^forgejo-runner-"
Description: >-
- CC: single-user-cluster, operator-managed-pods, trusted-ci-only.
kube-proxy: system pod, single-user cluster. ts-*/ingress-*:
Tailscale operator-managed. forgejo-runner: DinD limited to
trusted private forge repos.
@@ -49,25 +47,24 @@ Mutelist:
- "^nameserver-"
- "^ingress-"
Description: >-
- CC: single-user-cluster, operator-managed-pods. System pods
- managed by minikube and Tailscale operator; seccomp profiles
- set by upstream. Single-user cluster limits exploit surface.
+ System pods managed by minikube and Tailscale operator;
+ seccomp profiles set by upstream. Single-user cluster limits
+ exploit surface.
"core_minimize_hostPID_containers":
Regions: ["*"]
Resources:
- "^prowler-"
Description: >-
- CC: ephemeral-privileged-jobs. Prowler CIS scanner requires
- hostPID for file permission checks. Runs as CronJob with
- 7-day TTL, not a persistent workload.
+ Prowler CIS scanner requires hostPID for file permission
+ checks. Runs as CronJob with 7-day TTL, not a persistent
+ workload.
"core_minimize_root_containers_admission":
Regions: ["*"]
Resources:
- "^grafana-"
Description: >-
- CC: init-container-isolation. Root limited to init-chown-data
- container; all runtime containers run as UID 472 with caps
- dropped.
+ Root limited to init-chown-data container; all runtime
+ containers run as UID 472 with caps dropped.
"core_minimize_containers_added_capabilities":
Regions: ["*"]
Resources:
@@ -77,10 +74,9 @@ Mutelist:
# Grafana init-chown-data
- "^grafana-"
Description: >-
- CC: single-user-cluster, init-container-isolation. System
- pods: capabilities required by function (minikube-managed).
- Grafana: CHOWN limited to init phase; runtime containers
- drop ALL.
+ System pods: capabilities required by function
+ (minikube-managed). Grafana: CHOWN limited to init phase;
+ runtime containers drop ALL.
"core_minimize_containers_capabilities_assigned":
Regions: ["*"]
Resources:
@@ -88,5 +84,4 @@ Mutelist:
- "^kindnet-"
- "^grafana-"
Description: >-
- CC: single-user-cluster, init-container-isolation. See
- core_minimize_containers_added_capabilities.
+ See core_minimize_containers_added_capabilities.
diff --git a/argocd/manifests/prowler/mutelist/manual-node-checks.yaml b/argocd/manifests/prowler/mutelist/manual-node-checks.yaml
index 9c8354d..c91a2a6 100644
--- a/argocd/manifests/prowler/mutelist/manual-node-checks.yaml
+++ b/argocd/manifests/prowler/mutelist/manual-node-checks.yaml
@@ -1,7 +1,7 @@
# Node-level and RBAC checks that Prowler reports as MANUAL because it
-# cannot evaluate them from inside a pod. Compensated by automated
-# verification in `mise run review-compliance-reports`, which SSHes into
-# the minikube node and checks each condition directly every week.
+# cannot evaluate them from inside a pod. Verified out-of-band by the
+# node-verification block in `mise run review-compliance-reports`, which
+# SSHes into the minikube node and checks each condition directly.
Mutelist:
Accounts:
"*":
@@ -9,51 +9,51 @@ Mutelist:
"etcd_unique_ca":
Regions: ["*"]
Resources: ["^etcd-minikube$"]
- Description: "CC: node-config-automated-verification. Etcd CA fingerprint verified different from cluster CA by review-compliance-reports."
+ Description: "Etcd CA fingerprint verified different from cluster CA by review-compliance-reports."
"kubelet_conf_file_ownership":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: node-config-automated-verification. File ownership verified root:root by review-compliance-reports."
+ Description: "File ownership verified root:root by review-compliance-reports."
"kubelet_conf_file_permissions":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: node-config-automated-verification. File permissions verified 600 by review-compliance-reports."
+ Description: "File permissions verified 600 by review-compliance-reports."
"kubelet_config_yaml_ownership":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: node-config-automated-verification. File ownership verified root:root by review-compliance-reports."
+ Description: "File ownership verified root:root by review-compliance-reports."
"kubelet_config_yaml_permissions":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: node-config-automated-verification. File permissions verified 644 by review-compliance-reports."
+ Description: "File permissions verified 644 by review-compliance-reports."
"kubelet_service_file_ownership_root":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: node-config-automated-verification. File ownership verified root:root by review-compliance-reports."
+ Description: "File ownership verified root:root by review-compliance-reports."
"kubelet_service_file_permissions":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: node-config-automated-verification. File permissions verified 644 by review-compliance-reports."
+ Description: "File permissions verified 644 by review-compliance-reports."
"kubelet_disable_read_only_port":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: node-config-automated-verification. readOnlyPort absence (defaults to 0) verified by review-compliance-reports."
+ Description: "readOnlyPort absence (defaults to 0) verified by review-compliance-reports."
"kubelet_event_record_qps":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: node-config-automated-verification. eventRecordQPS absence (defaults to 5) verified by review-compliance-reports."
+ Description: "eventRecordQPS absence (defaults to 5) verified by review-compliance-reports."
"kubelet_manage_iptables":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: node-config-automated-verification. makeIPTablesUtilChains absence (defaults to true) verified by review-compliance-reports."
+ Description: "makeIPTablesUtilChains absence (defaults to true) verified by review-compliance-reports."
"kubelet_strong_ciphers_only":
Regions: ["*"]
Resources: ["^kubelet-config$"]
- Description: "CC: node-config-automated-verification, tailscale-network-isolation. Go default ciphers used; all traffic WireGuard-encrypted via tailnet."
+ Description: "Go default ciphers used; all traffic WireGuard-encrypted via tailnet."
"rbac_cluster_admin_usage":
Regions: ["*"]
Resources:
- "^cluster-admin$"
- "^kubeadm:cluster-admins$"
- "^minikube-rbac$"
- Description: "CC: node-config-automated-verification, single-user-cluster. Only built-in/minikube cluster-admin bindings present; verified by review-compliance-reports."
+ Description: "Only built-in/minikube cluster-admin bindings present; verified by review-compliance-reports."
diff --git a/argocd/manifests/prowler/mutelist/rbac.yaml b/argocd/manifests/prowler/mutelist/rbac.yaml
index c9c52e4..324809d 100644
--- a/argocd/manifests/prowler/mutelist/rbac.yaml
+++ b/argocd/manifests/prowler/mutelist/rbac.yaml
@@ -13,9 +13,8 @@ Mutelist:
# ArgoCD
- "^argocd-"
Description: >-
- CC: single-user-cluster, sso-gated-admin-tools. Built-in
- K8s roles: only operator can bind them. ArgoCD: requires
- broad access but is SSO-gated via Authentik OIDC.
+ Built-in K8s roles: only operator can bind them. ArgoCD:
+ requires broad access but is SSO-gated via Authentik OIDC.
"rbac_minimize_pod_creation_access":
Regions: ["*"]
Resources:
@@ -26,14 +25,12 @@ Mutelist:
# CloudNativePG operator
- "^cnpg-manager$"
Description: >-
- CC: single-user-cluster. Built-in K8s roles and CNPG
- operator. Only the operator can assign these roles; no
- untrusted users have cluster access.
+ Built-in K8s roles and CNPG operator. Only the operator can
+ assign these roles; no untrusted users have cluster access.
"rbac_minimize_service_account_token_creation":
Regions: ["*"]
Resources:
- "^system:"
Description: >-
- CC: single-user-cluster. kube-controller-manager requires
- token creation for SA management. Only operator manages
- service accounts.
+ kube-controller-manager requires token creation for SA
+ management. Only operator manages service accounts.
diff --git a/argocd/manifests/prowler/mutelist/trivyignore.yaml b/argocd/manifests/prowler/mutelist/trivyignore.yaml
index 22c612a..87af966 100644
--- a/argocd/manifests/prowler/mutelist/trivyignore.yaml
+++ b/argocd/manifests/prowler/mutelist/trivyignore.yaml
@@ -14,26 +14,24 @@ misconfigurations:
paths:
- "argocd/manifests/external-secrets/rbac.yaml"
statement: >-
- CC: operator-purpose-bound-rbac. external-secrets-operator's entire
- function is to read and synthesize Secret objects; ClusterRole over
- secrets is its purpose. Both the controller and cert-controller are
+ external-secrets-operator's entire function is to read and
+ synthesize Secret objects; ClusterRole over secrets is its
+ purpose. Both the controller and cert-controller are
upstream-defined.
- id: KSV-0041
paths:
- "argocd/manifests/kube-state-metrics/rbac.yaml"
- "argocd/manifests/kube-state-metrics-ringtail/rbac.yaml"
statement: >-
- CC: kube-state-metrics-metadata-only. KSM exposes only Secret
- metadata (name, namespace, type, labels), never the data field.
- list/watch on secrets is required for kube_secret_info /
- kube_secret_labels metrics.
+ KSM exposes only Secret metadata (name, namespace, type, labels),
+ never the data field. list/watch on secrets is required for
+ kube_secret_info / kube_secret_labels metrics.
- id: KSV-0114
paths:
- "argocd/manifests/external-secrets/rbac.yaml"
statement: >-
- CC: operator-purpose-bound-rbac. cert-controller manages the
- external-secrets validating webhook configurations to inject its
- own rotating CA bundle. RBAC is scoped to two named webhooks
- (secretstore-validate, externalsecret-validate) via resourceNames;
- KSV-0114 doesn't see the resourceNames restriction so reports the
- full ClusterRole.
+ cert-controller manages the external-secrets validating webhook
+ configurations to inject its own rotating CA bundle. RBAC is
+ scoped to two named webhooks (secretstore-validate,
+ externalsecret-validate) via resourceNames; KSV-0114 doesn't see
+ the resourceNames restriction so reports the full ClusterRole.
diff --git a/compensating-controls.yaml b/compensating-controls.yaml
deleted file mode 100644
index 01b3cfd..0000000
--- a/compensating-controls.yaml
+++ /dev/null
@@ -1,210 +0,0 @@
-# Compensating Controls
-#
-# Documents controls that mitigate risks from suppressed or accepted security
-# findings. Referenced by security tools (Prowler mutelist, Kingfisher config,
-# etc.) via "CC: <id>" in finding descriptions or suppression notes.
-#
-# Used by `mise run review-compensating-controls` to surface stale controls.
-#
-# Fields:
-# id - kebab-case unique identifier, referenced from tool configs
-# description - what the control actually does to mitigate risk
-# created - date (YYYY-MM-DD) the control was documented
-# last-reviewed - date (YYYY-MM-DD) or null
-# notes - optional context
-
-controls:
- - id: single-user-cluster
- description: >-
- Only the cluster operator (eblume) has kubectl access. No untrusted
- users can create pods, access cached images, or bind RBAC roles.
- created: 2026-03-30
- last-reviewed: 2026-04-01
- notes: >-
- Verify by checking kubeconfig distribution and Tailscale ACLs.
- If additional users gain cluster access, re-evaluate all findings
- muted under this control.
-
- - id: tailscale-network-isolation
- description: >-
- Cluster is not internet-exposed. All access requires Tailscale
- identity with ACL enforcement. Profiling endpoints, debug ports,
- and control-plane APIs are unreachable from the public internet.
- created: 2026-03-30
- last-reviewed: 2026-04-06
- notes: >-
- Verify with 'tailscale serve status --json' on indri and review
- Tailscale ACLs in pulumi/tailscale/. Only tag:flyio-target services
- are publicly routable.
-
- - id: local-registry
- description: >-
- Operator-built services use a private zot registry
- (registry.ops.eblu.me) for supply-chain control. Remaining
- images are pulled from public registries without stored
- credentials. No shared registry secrets are cached on cluster
- nodes.
- created: 2026-03-30
- last-reviewed: 2026-04-12
- notes: >-
- Verify by checking image prefixes in kustomization.yaml files.
- Known external-image categories: (1) upstream apps not yet
- mirrored — immich, ollama, frigate, frigate-notify, valkey;
- (2) infrastructure components — tailscale operator/proxy,
- external-secrets, 1password-connect, forgejo-runner, docker
- DinD, nvidia-device-plugin; (3) utility base images — busybox,
- alpine (grafana init containers). Track upstream versions in
- service-versions.yaml. Goal is to progressively mirror these
- into zot.
-
- - id: sso-gated-admin-tools
- description: >-
- ArgoCD requires SSO authentication via Authentik OIDC. Wildcard
- RBAC roles are mitigated by requiring authenticated identity
- before any API access.
- created: 2026-03-30
- last-reviewed: 2026-04-14
- notes: >-
- Verify Authentik OIDC provider config for ArgoCD and that
- anonymous access is disabled. Check ArgoCD --auth-token isn't
- leaked. The workflow-bot API key account is scoped to sync/get
- only.
-
- - id: operator-managed-pods
- description: >-
- Tailscale operator manages proxy pod specs (ts-*, ingress-*,
- operator-*, nameserver-*). Pod security settings are set by the
- operator, not user manifests. Operator is tracked in
- service-versions.yaml and regularly updated.
- created: 2026-03-30
- last-reviewed: 2026-04-21
- notes: >-
- Verify operator version is current via 'mise run service-review'.
- Check Tailscale changelog for security fixes. If operator adds
- seccomp support, remove these mutes. As of 2026-04-21: still no
- default seccomp on operator-generated pods (upstream issue #7359
- open). A ProxyClass + generic device plugin can downgrade proxies
- from privileged to NET_ADMIN+NET_RAW and set seccompProfile —
- potential future remediation to remove the seccomp mute without
- waiting for upstream defaults.
-
- - id: ephemeral-privileged-jobs
- description: >-
- Prowler CIS scanner runs as a CronJob with 7-day TTL
- auto-deletion, not as a persistent privileged workload. hostPID
- exposure is time-bounded to scan duration (~20s).
- created: 2026-03-30
- last-reviewed: 2026-04-29
- notes: >-
- Verify TTL is set in cronjob.yaml. Check that no persistent
- pods run with hostPID on the scanned cluster (indri). The
- alloy-tracing DaemonSet on ringtail also uses hostPID but is
- out of scope — Prowler only scans indri. Tracked in Todoist:
- "prowler scan against ringtail" — once that lands, the
- DaemonSet's hostPID+privileged posture will surface as a CIS
- finding and need its own CC or remediation.
-
- - id: trusted-ci-only
- description: >-
- Forgejo runner only executes workflows from repos on the private
- forge (forge.ops.eblu.me). No external or untrusted repos can
- trigger privileged CI jobs.
- created: 2026-03-30
- last-reviewed: 2026-05-01
- notes: >-
- Verification: (1) Runner config (argocd/manifests/forgejo-runner/
- config.yaml) connects only to https://forge.ops.eblu.me/. (2) Forge
- app.ini has DISABLE_REGISTRATION=true and ALLOW_ONLY_EXTERNAL_REGISTRATION
- =true (ansible/roles/forgejo/defaults/main.yml) — no untrusted users
- can sign up or create repos. The runner registers at instance scope
- (repo_id=0/owner_id=0 in action_runner table), but the instance itself
- is closed, so no per-repo allow-list is needed. Re-evaluate if the
- forge ever opens to additional users or if the runner is repointed
- to an external forge.
-
- - id: init-container-isolation
- description: >-
- Root privileges and added capabilities (CHOWN) are limited to
- init containers that run once at pod startup. All runtime
- containers run as non-root (UID 472) with all capabilities
- dropped.
- created: 2026-03-30
- last-reviewed: 2026-05-04
- notes: >-
- Verify by inspecting grafana deployment.yaml securityContext
- for both init and runtime containers. If fsGroup alone can
- handle PVC ownership, remove init-chown-data and this control.
- Retirement deferred until grafana lands on ringtail's k3s
- (see [[indri-k8s-migration]]) — storage backend will change,
- and removing init-chown-data right before that migration
- trades a real safety net for marginal cleanup. Revisit
- post-migration.
-
- - id: node-config-automated-verification
- description: >-
- Prowler reports certain node-level checks as MANUAL because it runs
- inside a pod and cannot evaluate kubelet file permissions, kubelet
- config arguments, etcd CA separation, or cluster-admin RBAC bindings.
- The review-compliance-reports script SSHes into the minikube node
- weekly and programmatically verifies each condition, failing loudly
- if any check deviates from expected values.
- created: 2026-04-14
- last-reviewed: 2026-04-14
- notes: >-
- Verification runs as part of 'mise run review-compliance-reports'.
- If minikube node is unreachable, all checks report as FAIL. If new
- MANUAL findings appear in Prowler, add corresponding verification
- logic to the script and update the mutelist.
-
- - id: operator-purpose-bound-rbac
- description: >-
- Operators whose entire function is to manage a sensitive resource
- legitimately need RBAC over that resource. external-secrets-operator
- manages Secret objects (its purpose) and the cert-controller mutates
- its own ValidatingWebhookConfigurations to inject rotating CA bundles.
- Risk is bounded by: (1) the operator code being upstream open-source
- and reviewed; (2) RBAC scoped to specific named webhooks where
- possible; (3) supply chain controls on the operator image (mirrored
- to local registry, version tracked in service-versions.yaml).
- created: 2026-04-27
- last-reviewed: 2026-04-27
- notes: >-
- Verify by checking that the operators in question still match their
- stated purpose (i.e. external-secrets is still the only consumer of
- these ClusterRoles) and that upstream hasn't published advisories
- for credential-handling bugs. Re-evaluate if a non-secrets-managing
- ClusterRole appears under this control.
-
- - id: kube-state-metrics-metadata-only
- description: >-
- kube-state-metrics holds list/watch on Secrets cluster-wide but only
- exposes Secret object *metadata* (name, namespace, type, creation
- timestamp, labels) via the kube_secret_info / kube_secret_labels
- metrics. Secret data fields are never read into KSM's exposed
- metrics by upstream design. Mitigation rests on KSM's metric
- schema, the version pin in service-versions.yaml, and the metrics
- endpoint being reachable only on the cluster network.
- created: 2026-04-27
- last-reviewed: 2026-04-27
- notes: >-
- Verify by inspecting the /metrics endpoint output for any series
- that include secret data (only *_info and *_labels metrics should
- reference secrets, and labels should be limited to user-applied
- labels — never the data:). Re-evaluate on KSM version bumps.
-
- - id: observability-stack-audit
- description: >-
- Alloy collects pod logs and ships them to Loki, providing an
- audit trail for cluster activity. Compensates for missing
- apiserver audit logging which neither minikube (indri) nor
- k3s (ringtail) configures by default.
- created: 2026-03-30
- last-reviewed: 2026-05-11
- notes: >-
- Verify Alloy DaemonSet is running on each cluster (alloy-k8s on
- minikube, alloy-ringtail on k3s) and Loki is receiving logs.
- Note this is weaker than native apiserver audit logs — it
- captures pod stdout/stderr, not API request-level auditing.
- Consider enabling apiserver audit logging on k3s post-migration
- (`--audit-log-path` / `--audit-policy-file`) — minikube made it
- hard, k3s makes it straightforward.
diff --git a/docs/changelog.d/+compliance-mute-categories.doc.md b/docs/changelog.d/+compliance-mute-categories.doc.md
deleted file mode 100644
index c776e46..0000000
--- a/docs/changelog.d/+compliance-mute-categories.doc.md
+++ /dev/null
@@ -1 +0,0 @@
-New explanation article [[compliance-mute-categories]] documenting the gap between current `CC:`-only mute tagging and the three structurally distinct categories (compensating control, not-applicable, risk-accepted) needed for real PCI DSS / SOC2 practice. Captures the current image-scan mutelist gap (`cronjob-image-scan.yaml` doesn't pass `--mutelist-file`) and proposes an order-of-operations for wiring it up alongside the new tag conventions. Triggered by CVE-2026-31789, an OpenSSL 32-bit-only finding that surfaced the need for an NA category.
diff --git a/docs/changelog.d/+review-cc-ephemeral-privileged-jobs.misc.md b/docs/changelog.d/+review-cc-ephemeral-privileged-jobs.misc.md
deleted file mode 100644
index 14dcdca..0000000
--- a/docs/changelog.d/+review-cc-ephemeral-privileged-jobs.misc.md
+++ /dev/null
@@ -1 +0,0 @@
-Reviewed compensating control `ephemeral-privileged-jobs`: TTL and hostPID scope verified on indri. Noted that the alloy-tracing DaemonSet on ringtail is out of scope until Prowler scans ringtail (tracked in Todoist).
diff --git a/docs/changelog.d/+review-cc-init-container-isolation.misc.md b/docs/changelog.d/+review-cc-init-container-isolation.misc.md
deleted file mode 100644
index 295e7f8..0000000
--- a/docs/changelog.d/+review-cc-init-container-isolation.misc.md
+++ /dev/null
@@ -1 +0,0 @@
-Reviewed compensating control `init-container-isolation` (35 days stale). Grafana's running pod matches the manifest and the CC's claim — only `init-chown-data` runs as root with `CHOWN`; runtime containers all run as UID 472 with all caps dropped. Retirement (replacing init-chown-data with `fsGroup` alone) is plausible given the in-tree minikube-hostpath provisioner, but deferred until grafana lands on ringtail's k3s — note added to the CC.
diff --git a/docs/changelog.d/+review-cc-trusted-ci-only.misc.md b/docs/changelog.d/+review-cc-trusted-ci-only.misc.md
deleted file mode 100644
index 89dc653..0000000
--- a/docs/changelog.d/+review-cc-trusted-ci-only.misc.md
+++ /dev/null
@@ -1 +0,0 @@
-Reviewed compensating control `trusted-ci-only`: Forgejo runner is registered only to the private forge, which has registration disabled — no untrusted users can create repos or trigger privileged CI. Tightened the notes to reflect that the closed-forge property (not a per-repo allow-list) is what actually mitigates the risk.
diff --git a/docs/changelog.d/prowler-iac-mutelist.infra.md b/docs/changelog.d/prowler-iac-mutelist.infra.md
index 793c1ec..077cfa8 100644
--- a/docs/changelog.d/prowler-iac-mutelist.infra.md
+++ b/docs/changelog.d/prowler-iac-mutelist.infra.md
@@ -1 +1 @@
-Address the 6 critical Prowler IaC findings against `argocd/manifests/`. Prowler's IaC provider hardcodes `self._mutelist = None` and delegates filtering to Trivy, but doesn't plumb `--ignorefile` through — so the documented "use Trivy filtering" path is actually broken. Added a shim around `trivy` in the Prowler image that injects `--ignorefile $TRIVY_IGNOREFILE` for `trivy fs` invocations when the env var points at a real file. The IaC cronjob now mounts `mutelist/trivyignore.yaml` (Trivy's per-path schema) and sets the env var. Two new compensating controls — `operator-purpose-bound-rbac` and `kube-state-metrics-metadata-only` — justify muting the `external-secrets` and `kube-state-metrics` Secret-access findings (KSV-0041, KSV-0114). Separately, `grafana-clusterrole` is tightened to remove `secrets` access entirely: the dashboard sidecar already only consumes ConfigMap-labeled dashboards, so its `RESOURCE` env var is now `configmap` instead of `both`.
+Address the 6 critical Prowler IaC findings against `argocd/manifests/`. Prowler's IaC provider hardcodes `self._mutelist = None` and delegates filtering to Trivy, but doesn't plumb `--ignorefile` through — so the documented "use Trivy filtering" path is actually broken. Added a shim around `trivy` in the Prowler image that injects `--ignorefile $TRIVY_IGNOREFILE` for `trivy fs` invocations when the env var points at a real file. The IaC cronjob now mounts `mutelist/trivyignore.yaml` (Trivy's per-path schema) and sets the env var, muting the `external-secrets` and `kube-state-metrics` Secret-access findings (KSV-0041, KSV-0114). Separately, `grafana-clusterrole` is tightened to remove `secrets` access entirely: the dashboard sidecar already only consumes ConfigMap-labeled dashboards, so its `RESOURCE` env var is now `configmap` instead of `both`.
diff --git a/docs/changelog.d/review-cc-observability-stack-audit-2026-05-11.infra.md b/docs/changelog.d/review-cc-observability-stack-audit-2026-05-11.infra.md
deleted file mode 100644
index 8100c6a..0000000
--- a/docs/changelog.d/review-cc-observability-stack-audit-2026-05-11.infra.md
+++ /dev/null
@@ -1 +0,0 @@
-Reviewed compensating control `observability-stack-audit`. Updated description to cover ringtail's k3s as well as indri's minikube; both Alloy DaemonSets and Loki are healthy.
diff --git a/docs/changelog.d/rip-out-compensating-controls.infra.md b/docs/changelog.d/rip-out-compensating-controls.infra.md
new file mode 100644
index 0000000..d41fd1a
--- /dev/null
+++ b/docs/changelog.d/rip-out-compensating-controls.infra.md
@@ -0,0 +1 @@
+Ripped out the compensating-controls (CC) framework: deleted `compensating-controls.yaml`, the `review-compensating-controls` mise task, and the associated how-to / explanation docs. Prowler and Kingfisher continue to run weekly and produce reports; the Prowler mutelist YAML files remain in place but no longer carry `CC: ` prefixes — each entry just keeps a free-form `Description` of why the finding is muted. The CC review cadence proved to be more overhead than this single-operator homelab needed.
diff --git a/docs/explanation/compliance-mute-categories.md b/docs/explanation/compliance-mute-categories.md
deleted file mode 100644
index 4c5f3a3..0000000
--- a/docs/explanation/compliance-mute-categories.md
+++ /dev/null
@@ -1,99 +0,0 @@
----
-title: Compliance Mute Categories
-modified: 2026-05-04
-last-reviewed: 2026-05-04
-tags:
- - explanation
- - security
- - compliance
----
-
-# Compliance Mute Categories
-
-> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
-
-How BlumeOps should categorize muted compliance findings, why a single "compensating control" tag is not enough, and what tooling work is needed to support multiple categories cleanly.
-
-## Why this matters
-
-When a compliance scanner ([[prowler]], Trivy via Prowler IaC, Kingfisher) reports a failing finding, there are three structurally different reasons we might suppress it:
-
-1. **Compensating control (CC)** — the requirement applies and we *do not* meet it directly, but an alternative control mitigates the same risk.
-2. **Not applicable (NA)** — the requirement's preconditions cannot be satisfied in our environment, so the finding is structurally inert (e.g. a 32-bit-only CVE on 64-bit-only hosts).
-3. **Risk accepted (RA)** — the requirement applies, we do not meet it, no compensating control exists, and we have explicitly chosen to accept the residual risk for a bounded period.
-
-Today every muted finding in BlumeOps uses the `CC: ` convention. That conflates all three categories. In a real PCI DSS or SOC2 environment, auditors treat them very differently:
-
-- A CC requires documentation of the constraint, the alternative measure, and recurring validation that the measure still works.
-- An NA requires documentation of *why* the precondition cannot be met, with periodic verification that the environmental fact still holds.
-- An RA requires an explicit decision-maker, an expiry date, and a scheduled re-decision.
-
-Mixing them under one tag means stale CCs hide stale RAs, and NAs that should be revisited when the environment changes get treated as permanent fixtures.
-
-## Trigger case: CVE-2026-31789
-
-The 2026-05-03 weekly compliance review surfaced [CVE-2026-31789](https://nvd.nist.gov/vuln/detail/CVE-2026-31789), an OpenSSL heap buffer overflow during X.509 certificate processing on **32-bit systems**. Prowler's image scanner flagged 216 findings across 106 BlumeOps images carrying `libssl3` / `libcrypto3` below the fixed versions.
-
-The CVE is genuine, but its preconditions cannot be satisfied in our environment: indri is Apple Silicon (arm64), ringtail is x86_64, and we run no 32-bit containers. This is the canonical NA case — not a CC, because there is no "alternative measure mitigating the risk." The risk does not exist for us at all.
-
-A CC like `no-32bit-runtimes` would technically work, but conflates the categories: if we ever introduce a 32-bit runtime we would have to remember that this CC was load-bearing for the mute, retire or scope it down, and reopen the muted findings. An NA tag with a short justification makes the precondition explicit and self-documents the conditions under which it must be revisited.
-
-## Current tooling state
-
-Three Prowler scans run weekly. Their mute paths today:
-
-| Scan | Mute mechanism | File(s) |
-|------|----------------|---------|
-| K8s CIS (Sunday) | Prowler `--mutelist-file`, merged from ConfigMap | `argocd/manifests/prowler/mutelist/*.yaml` |
-| IaC (Saturday) | Trivy `--ignorefile` shim (Prowler's `--mutelist-file` is a no-op for IaC) | `argocd/manifests/prowler/mutelist/trivyignore.yaml` |
-| Container Images (Saturday) | **None — `cronjob-image-scan.yaml` does not pass `--mutelist-file`** | n/a |
-
-The image scan has never been wired to a mutelist. The CSV reports do contain a `MUTED` column, but it is always `False` because no mutelist is supplied. All 14k+ image findings flow through to `review-compliance-reports` unfiltered.
-
-The mute tag convention is consistent across the two configured scans: each entry's `Description:` (or `statement:` for trivyignore) starts with `CC: . `. `mise run review-compensating-controls` greps for those IDs to find every file that depends on each control. There is no NA tag, no RA tag, and no expiry field.
-
-## Proposed model
-
-### Tag prefixes
-
-Extend the description-prefix convention:
-
-- `CC: . ` — references an entry in `compensating-controls.yaml`. Existing convention, unchanged.
-- `NA: . ` — environmental precondition fails. Reason should be specific enough that a reviewer can verify it (e.g. `NA: no 32-bit runtimes`, not `NA: doesn't apply`).
-- `RA: ; expires . ` — explicit risk acceptance with a hard expiry. Past the expiry, re-review is mandatory.
-
-Tag choice is exclusive: a given mute is one of CC, NA, or RA. If two reasons apply, pick the strongest — CC > RA > NA.
-
-### Tooling changes required
-
-1. **Wire the image scan to a mutelist.** Add `argocd/manifests/prowler/mutelist/image-cves.yaml`, mount-and-merge it the same way `cronjob.yaml` mounts its mutelist parts, and pass `--mutelist-file` to `prowler image`. Verify experimentally that `prowler image` honors the flag — Prowler's behavior across providers is inconsistent, and the IaC provider notably does not. If `prowler image` ignores it, fall back to post-scan filtering inside `review-compliance-reports`.
-
-2. **Teach `review-compensating-controls` (or a sibling) to surface NA and RA entries.** CCs already get a staleness queue. NAs should appear in a separate queue keyed on the reason text — when an NA reason becomes false (e.g. we do introduce a 32-bit runtime), every NA mute citing that reason must be reopened. RAs should sort by expiry date, with anything past expiry flagged red.
-
-3. **Expiry parsing.** RA tags carry a hard date. The simplest path is to parse it from the description string at review time. A more durable path is to extend the mutelist YAML schema with a structured `expires:` field and a small wrapper that strips it before passing the file to Prowler. Either works; the structured field is friendlier to editors.
-
-### Out of scope (for now)
-
-- Changing the underlying Prowler mutelist YAML schema. Stay within the `Mutelist:` shape Prowler expects.
-- Migrating existing `CC:` entries. The current set is genuinely CCs and should stay tagged that way.
-- Building an issue-tracker integration. Todoist is the source of truth for "remember to re-review this" until that scales painfully.
-
-## Order of operations
-
-When this work is picked up, the suggested sequence is:
-
-1. **Scope and confirm.** Re-read this article, confirm the model still fits, adjust if not.
-2. **Wire the image-scan mutelist.** Smallest atomic change; produces immediate value (the CVE-2026-31789 mute can land as the first NA entry).
-3. **Add the NA convention.** Update [[read-compliance-reports]] and [[review-compensating-controls]] how-tos to describe the three tag prefixes. The convention can land before tooling supports it — review will just be manual until tooling catches up.
-4. **Extend the review tools.** Add NA and RA queues to `review-compensating-controls` (or a new task). At this point, parse expiry from RA descriptions.
-5. **Optionally: structured expiry.** If RA entries become common, migrate to a structured `expires:` YAML field with a wrapper that filters it out before Prowler reads the file.
-
-The first three steps are a coherent C1. Steps 4–5 can be split off if scope creeps.
-
-## Related
-
-- [[read-compliance-reports]] — the weekly review process this feeds into
-- [[review-compensating-controls]] — current CC review tooling
-- [[security-model]] — overall security posture
-- [[prowler]] — scanner reference
-- [[agent-change-process]] — how to scope and execute the implementation
diff --git a/docs/how-to/operations/read-compliance-reports.md b/docs/how-to/operations/read-compliance-reports.md
index 75fd3ab..e676ad5 100644
--- a/docs/how-to/operations/read-compliance-reports.md
+++ b/docs/how-to/operations/read-compliance-reports.md
@@ -80,7 +80,7 @@ Not all failures require action. Common expected failures in our minikube cluste
1. **Triage** — review new failures, distinguish real issues from expected noise
2. **Remediate** — fix what you can (pod security contexts, RBAC tightening)
-3. **Mutelist** — suppress expected/accepted failures via Prowler's `--mutelist-file` to reduce noise in future scans
+3. **Mutelist** — suppress expected/accepted failures by adding a Resource entry under the matching Check in `argocd/manifests/prowler/mutelist/*.yaml` with a free-form `Description` explaining why
4. **Track** — compare reports over time to spot regressions
## Related
diff --git a/docs/how-to/operations/record-review-evidence.md b/docs/how-to/operations/record-review-evidence.md
deleted file mode 100644
index 9de4e37..0000000
--- a/docs/how-to/operations/record-review-evidence.md
+++ /dev/null
@@ -1,50 +0,0 @@
----
-title: Record Review Evidence
-modified: 2026-04-01
-last-reviewed: 2026-04-01
-tags:
- - how-to
- - security
- - compliance
----
-
-# Record Review Evidence
-
-How review evidence *would* be captured after a [[review-compensating-controls|compensating control review]], to make the review auditable under a compliance framework.
-
-blumeops does not currently collect review evidence. This card documents the target process for reference and practice.
-
-## Why Record Evidence?
-
-Reviewing a control and updating `last-reviewed` proves the review *happened* but not *what was checked*. Under frameworks like PCI DSS v4.0, a QSA needs to see dated, immutable evidence that the reviewer verified the control and that an appropriate party accepted the residual risk. Compliance platforms like Drata automate this collection, but the underlying artifacts are the same whether you use a platform or a directory of files.
-
-## What Evidence Would Be Captured
-
-For each control reviewed, artifacts should answer:
-
-1. **Who reviewed it** — reviewer name, date
-2. **What was verified** — the specific checks performed (e.g., Tailscale ACL policy snapshot, `tailscale status` output, kubectl auth checks)
-3. **What was found** — the outcome: control still in effect, circumstances changed, or control invalidated
-4. **Residual risk** — what the control does *not* cover (the gap a QSA will ask about)
-5. **Acceptance** — formal sign-off that the residual risk is accepted by an appropriate party (reviewer + approver, typically a manager or CTO)
-
-Supporting artifacts would include command output, policy snapshots, screenshots, or API responses — anything that demonstrates the verification was actually performed.
-
-## PCI DSS Context
-
-Under PCI DSS v4.0, compensating controls require a **Compensating Control Worksheet (CCW)** that maps each control to the original requirement it substitutes for. The CCW fields are:
-
-- **Original requirement** — the specific PCI DSS requirement not directly met
-- **Constraint** — why direct compliance isn't feasible
-- **Compensating control definition** — what is done instead
-- **Risk addressed** — how the control mitigates the original threat
-- **Residual risk** — what remains unmitigated
-- **Validation procedure** — steps to verify (what `notes` captures in `compensating-controls.yaml`)
-
-Req 12.3.2 mandates review **at least annually** (quarterly is typical for Level 1 Service Providers). In a platform like Drata, these map to Controls with uploaded Evidence and review workflows requiring sign-off from both the reviewer and an approver.
-
-## Related
-
-- [[review-compensating-controls]] — The technical review process
-- [[security]] — Security posture overview
-- [[read-compliance-reports]] — Interpreting Prowler/Kingfisher reports
diff --git a/docs/how-to/operations/review-compensating-controls.md b/docs/how-to/operations/review-compensating-controls.md
deleted file mode 100644
index 8a32d98..0000000
--- a/docs/how-to/operations/review-compensating-controls.md
+++ /dev/null
@@ -1,80 +0,0 @@
----
-title: Review Compensating Controls
-modified: 2026-03-30
-last-reviewed: 2026-03-30
-tags:
- - how-to
- - security
- - maintenance
----
-
-# Review Compensating Controls
-
-How to periodically review compensating controls that justify suppressed security findings.
-
-## Review by Staleness
-
-Show controls sorted by when they were last reviewed (most stale first):
-
-```bash
-mise run review-compensating-controls
-```
-
-This reads `compensating-controls.yaml` (repo root), sorts by `last-reviewed`, and displays the most stale control with all codebase references. It also searches for every file that references the control ID, so you can see exactly which suppressed findings depend on it.
-
-To show more entries:
-
-```bash
-mise run review-compensating-controls --limit 20
-```
-
-## What is a Compensating Control?
-
-A compensating control is a security measure that mitigates the risk a finding was designed to detect, when the finding itself cannot be directly remediated. For example:
-
-- **Finding:** API server does not enable AlwaysPullImages admission plugin
-- **Risk:** Untrusted users could run pods using cached images they shouldn't have access to
-- **Compensating control:** `single-user-cluster` — only the operator has kubectl access; no untrusted users can create pods
-
-Controls are documented in `compensating-controls.yaml` and referenced from security tool configurations (Prowler mutelist files, Kingfisher config, etc.) using the format `CC: `.
-
-A compensating control is only one of three structurally distinct ways to suppress a finding — see [[compliance-mute-categories]] for when to reach for a CC versus a not-applicable (`NA:`) or risk-accepted (`RA:`) tag instead.
-
-## Review Process
-
-For each control up for review:
-
-1. **Understand the risk.** Read each suppressed finding that references this control. What attack or misconfiguration does the original check guard against?
-
-2. **Verify the control is in effect.** Follow the verification steps in the control's `notes` field. For example, for `tailscale-network-isolation`, check that the cluster is not directly internet-exposed and Tailscale ACLs are enforced.
-
-3. **Assess whether the control actually mitigates the risk.** A compensating control should address the same threat the check was designed to catch, not just be a vaguely related security measure. If it doesn't hold up, either:
- - Fix the underlying finding and remove the suppression
- - Document a stronger or more specific compensating control
-
-4. **Check for changed circumstances.** Has the cluster gained new users? Has a service been exposed publicly? Has an operator added native support for the missing feature? Any of these could invalidate the control.
-
-5. **Update the review date.** Edit `compensating-controls.yaml` and set `last-reviewed` to today's date. Commit alongside any changes.
-
-## Adding a New Control
-
-When suppressing a new security finding, either map it to an existing control or add a new one:
-
-```yaml
-- id: my-new-control
- description: >-
- What this control does and how it mitigates the specific risk.
- created: 2026-03-30
- last-reviewed: 2026-03-30
- notes: >-
- How to verify this control is still in effect.
-```
-
-Then reference it in the suppression configuration with `CC: my-new-control`.
-
-## Related
-
-- [[record-review-evidence]] — Capturing evidence artifacts for audit (aspirational)
-- [[security]] — Security posture overview
-- [[read-compliance-reports]] — Accessing and interpreting Prowler reports
-- [[review-services]] — Periodic service version review (similar staleness pattern)
diff --git a/docs/reference/operations/security.md b/docs/reference/operations/security.md
index 18561a5..11c4df9 100644
--- a/docs/reference/operations/security.md
+++ b/docs/reference/operations/security.md
@@ -46,13 +46,7 @@ Security posture and compliance scanning for BlumeOps infrastructure.
All compliance scan reports are stored on `sifaka:/volume1/reports/`. See [[read-compliance-reports]] for access and interpretation.
-## Compensating controls
-
-Suppressed findings reference named compensating controls tracked in `compensating-controls.yaml` (repo root). Each control has a review date and verification steps. See [[review-compensating-controls]] for the review process.
-
-```bash
-mise run review-compensating-controls
-```
+Suppressed findings are kept in Prowler mutelist YAML under `argocd/manifests/prowler/mutelist/`. Each entry's `Description` field explains why the finding is muted; entries are reviewed ad hoc rather than on a scheduled cadence.
## Known gaps
diff --git a/mise-tasks/review-compensating-controls b/mise-tasks/review-compensating-controls
deleted file mode 100755
index e92d302..0000000
--- a/mise-tasks/review-compensating-controls
+++ /dev/null
@@ -1,229 +0,0 @@
-#!/usr/bin/env -S uv run --script
-# /// script
-# requires-python = ">=3.12"
-# dependencies = ["pyyaml==6.0.3", "rich==15.0.0", "typer==0.25.0"]
-# ///
-#MISE description="Review the most stale compensating control"
-#USAGE flag "--limit " default="10" help="Number of controls to show in the table"
-"""Review compensating controls by staleness.
-
-Reads ``compensating-controls.yaml`` and sorts by ``last-reviewed``.
-Shows a staleness table, then displays the most stale control with all
-references found in the codebase.
-
-After reviewing, update the control entry:
-
- last-reviewed: YYYY-MM-DD
-
-Usage: mise run review-compensating-controls [--limit 10]
-"""
-
-import subprocess
-import sys
-from datetime import date
-from pathlib import Path
-from typing import Annotated
-
-import typer
-import yaml
-from rich.console import Console
-from rich.panel import Panel
-from rich.table import Table
-
-CONTROLS_FILE = Path(__file__).parent.parent / "compensating-controls.yaml"
-REPO_ROOT = Path(__file__).parent.parent
-
-
-def load_controls(path: Path) -> list[dict]:
- data = yaml.safe_load(path.read_text())
- return data.get("controls", [])
-
-
-def parse_date(raw) -> date | None:
- if raw is None:
- return None
- if isinstance(raw, date):
- return raw
- try:
- return date.fromisoformat(str(raw))
- except ValueError:
- return None
-
-
-def find_references(control_id: str) -> list[str]:
- """Find all files referencing a control ID using ripgrep."""
- try:
- result = subprocess.run(
- ["rg", "--no-heading", "-n", control_id, str(REPO_ROOT)],
- capture_output=True,
- text=True,
- timeout=10,
- )
- lines = result.stdout.strip().splitlines()
- # Exclude the controls file itself and this script
- return [
- ln
- for ln in lines
- if "compensating-controls.yaml" not in ln
- and "review-compensating-controls" not in ln
- ]
- except (FileNotFoundError, subprocess.TimeoutExpired):
- return []
-
-
-def main(
- limit: Annotated[
- int, typer.Option(help="Number of controls to show in the table")
- ] = 10,
-) -> None:
- console = Console()
- today = date.today()
-
- if not CONTROLS_FILE.exists():
- console.print(
- f"[bold red]Controls file not found:[/bold red] {CONTROLS_FILE}"
- )
- raise typer.Exit(code=1)
-
- controls = load_controls(CONTROLS_FILE)
-
- # Parse dates and build sortable entries
- entries: list[tuple[dict, date | None]] = []
- for ctrl in controls:
- reviewed = parse_date(ctrl.get("last-reviewed"))
- entries.append((ctrl, reviewed))
-
- # Sort: never-reviewed first, then oldest
- entries.sort(key=lambda e: (e[1] is not None, e[1] or date.min))
-
- never_reviewed = sum(1 for _, r in entries if r is None)
-
- # --- Summary panel ---
- console.print()
- console.print(
- Panel(
- f"[bold]{len(entries)}[/bold] compensating controls, "
- f"[bold red]{never_reviewed}[/bold red] never reviewed",
- title="[bold]Compensating Control Review Queue[/bold]",
- border_style="cyan",
- )
- )
- console.print()
-
- # --- Staleness table ---
- table = Table(show_header=True, header_style="bold")
- table.add_column("#", justify="right")
- table.add_column("Control ID")
- table.add_column("Last Reviewed", justify="right")
- table.add_column("Age (days)", justify="right")
- table.add_column("Refs", justify="right")
-
- for i, (ctrl, reviewed) in enumerate(entries[:limit], 1):
- control_id = ctrl["id"]
- refs = len(find_references(control_id))
-
- if reviewed is None:
- table.add_row(
- str(i),
- f"[red]{control_id}[/red]",
- "[red]never[/red]",
- "[red]—[/red]",
- str(refs),
- )
- else:
- age = (today - reviewed).days
- style = "yellow" if age > 90 else ""
- id_str = f"[{style}]{control_id}[/{style}]" if style else control_id
- date_str = f"[{style}]{reviewed}[/{style}]" if style else str(reviewed)
- age_str = f"[{style}]{age}[/{style}]" if style else str(age)
- table.add_row(str(i), id_str, date_str, age_str, str(refs))
-
- remaining = len(entries) - limit
- if remaining > 0:
- table.add_row("", f"[dim]… {remaining} more[/dim]", "", "", "")
-
- console.print(table)
- console.print()
-
- # --- Most stale control detail ---
- if not entries:
- console.print("[bold red]No controls found![/bold red]")
- raise typer.Exit(code=1)
-
- top_ctrl, top_reviewed = entries[0]
- control_id = top_ctrl["id"]
- refs = find_references(control_id)
-
- detail_lines = [
- f"[bold cyan]{control_id}[/bold cyan]",
- f"[dim]Last reviewed: {top_reviewed or 'never'}[/dim]",
- "",
- f"[bold]Description:[/bold] {top_ctrl.get('description', '').strip()}",
- ]
- notes = top_ctrl.get("notes", "").strip()
- if notes:
- detail_lines.append(f"[bold]Notes:[/bold] {notes}")
-
- console.print(
- Panel(
- "\n".join(detail_lines),
- title="[bold]Up For Review[/bold]",
- border_style="green",
- )
- )
- console.print()
-
- # --- References ---
- if refs:
- ref_table = Table(
- show_header=True, header_style="bold", title="References in codebase"
- )
- ref_table.add_column("File", style="cyan")
- ref_table.add_column("Line")
-
- for ref in refs:
- # rg output: file:line:content
- parts = ref.split(":", 2)
- if len(parts) >= 3:
- filepath = parts[0].replace(str(REPO_ROOT) + "/", "")
- line_no = parts[1]
- content = parts[2].strip()
- ref_table.add_row(f"{filepath}:{line_no}", content)
- else:
- ref_table.add_row(ref, "")
-
- console.print(ref_table)
- else:
- console.print(
- f"[yellow]No references to '{control_id}' found in the codebase.[/yellow]"
- )
- console.print()
-
- # --- Review checklist ---
- checklist = [
- "[bold]Verification:[/bold]\n",
- f"• {notes}\n" if notes else "",
- "\n[bold]Review each reference:[/bold]\n",
- "• For each muted finding referencing this control, confirm:\n",
- " 1. The risk the original check guards against\n",
- " 2. That this control actually mitigates that risk\n",
- " 3. That the control is still in effect (not degraded or bypassed)\n",
- "\n[bold]After review:[/bold]\n",
- f"• Update compensating-controls.yaml: [cyan]last-reviewed: {today}[/cyan]\n",
- "• If the control is no longer valid, either:\n",
- " - Fix the underlying finding and remove the mute, or\n",
- " - Document a new/updated compensating control\n",
- "• Commit the change",
- ]
-
- console.print(
- Panel(
- "".join(checklist),
- title="[bold yellow]Review Guidance[/bold yellow]",
- border_style="yellow",
- )
- )
-
-
-if __name__ == "__main__":
- typer.run(main)
diff --git a/mise-tasks/review-compliance-reports b/mise-tasks/review-compliance-reports
index bcbe090..a9146c8 100755
--- a/mise-tasks/review-compliance-reports
+++ b/mise-tasks/review-compliance-reports
@@ -143,7 +143,10 @@ def _kubectl(args: str, timeout: int = 15) -> subprocess.CompletedProcess:
def run_node_verification(console: Console) -> None:
"""Verify node-level conditions that Prowler reports as MANUAL.
- Compensating control: node-config-automated-verification
+ Prowler runs inside a pod and can't evaluate kubelet file permissions,
+ kubelet config arguments, etcd CA separation, or cluster-admin RBAC
+ bindings. We SSH into the minikube node and check each condition here,
+ failing loudly if any deviates from expected values.
"""
checks: list[tuple[str, str, bool]] = [] # (name, detail, passed)
@@ -278,7 +281,7 @@ def run_node_verification(console: Console) -> None:
table = Table(
show_header=True,
header_style="bold",
- title="Node Verification (CC: node-config-automated-verification)",
+ title="Node Verification (out-of-band checks for MANUAL findings)",
)
table.add_column("Check")
table.add_column("Detail")
@@ -528,8 +531,8 @@ def summarize_report(
Panel(
f"[bold yellow]{len(latest['unmuted'])} unmuted failure(s) "
f"need triage.[/bold yellow]\n\n"
- "For each: remediate or mute "
- "(add to mutelist + compensating control).",
+ "For each: remediate, or add a Resource entry to the "
+ "matching check in argocd/manifests/prowler/mutelist/.",
title=f"{label} Verdict",
border_style="yellow",
)
@@ -653,7 +656,6 @@ def main(
)
# --- Node-level MANUAL check verification ---
- # Compensating control: node-config-automated-verification
# These checks verify conditions Prowler reports as MANUAL because it
# runs inside a pod and cannot evaluate them directly.
run_node_verification(console)
From d02bf062af2cd3a867cd5c4da17686ae0806fa0b Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Fri, 22 May 2026 21:29:11 -0700
Subject: [PATCH 15/52] C0: review 1password reference card
Added vault split (blumeops vs Personal), noted onepassword-connect
runs on both indri and ringtail, and lifted op CLI guidance from
agent memory into the card. Bumped last-reviewed.
---
docs/changelog.d/+review-1password-doc.doc.md | 1 +
docs/reference/services/1password.md | 37 ++++++++++++-------
2 files changed, 24 insertions(+), 14 deletions(-)
create mode 100644 docs/changelog.d/+review-1password-doc.doc.md
diff --git a/docs/changelog.d/+review-1password-doc.doc.md b/docs/changelog.d/+review-1password-doc.doc.md
new file mode 100644
index 0000000..bba9591
--- /dev/null
+++ b/docs/changelog.d/+review-1password-doc.doc.md
@@ -0,0 +1 @@
+Reviewed [[1password]] reference card: added the `blumeops` vs `Personal` vault split, noted that `onepassword-connect` runs on both indri and ringtail (not just one cluster), and pulled the `op read` vs `op item get --fields` guidance up from agent memory into the card.
diff --git a/docs/reference/services/1password.md b/docs/reference/services/1password.md
index 4489194..5ad50da 100644
--- a/docs/reference/services/1password.md
+++ b/docs/reference/services/1password.md
@@ -1,6 +1,7 @@
---
title: 1Password
-modified: 2026-02-10
+modified: 2026-05-22
+last-reviewed: 2026-05-22
tags:
- service
- secrets
@@ -8,15 +9,22 @@ tags:
# 1Password
-Root credential store for all BlumeOps secrets, synced to Kubernetes via External Secrets Operator.
+Root credential store for all BlumeOps secrets. Kubernetes workloads read items via [[external-secrets|External Secrets Operator]]; humans and agents read via the `op` CLI.
-## Architecture
+## Vaults
+
+| Vault | Purpose |
+|-------|---------|
+| `blumeops` | Infrastructure secrets — referenced by ExternalSecret manifests and scripts. |
+| `Personal` | Human login credentials keyed by URL for autofill. Not consumed by infrastructure. |
+
+## Kubernetes Integration
```
1Password Cloud
|
v
-1Password Connect (namespace: 1password)
+1Password Connect (namespace: 1password, deployed on both indri and ringtail)
|
v
External Secrets Operator (namespace: external-secrets)
@@ -25,15 +33,15 @@ External Secrets Operator (namespace: external-secrets)
Native Kubernetes Secrets
```
-## Vault
+**ClusterSecretStore:** `onepassword-blumeops` (same name on both clusters).
-The `blumeops` vault contains all infrastructure credentials.
+Services reference 1Password items via `ExternalSecret` manifests. Both `minikube-indri` and `k3s-ringtail` run their own `onepassword-connect` deployment talking to the same vault.
-## Kubernetes Integration
+## Direct Access
-**ClusterSecretStore:** `onepassword-blumeops`
+Prefer `op read "op://vault/item/field"` over `op item get --fields` in scripts and IaC — `op item get --fields` wraps multi-line values in quotes, corrupting them. `op item get` without flags is fine for exploring item metadata.
-Services reference 1Password items via `ExternalSecret` manifests.
+If an item name contains special characters (e.g. parentheses), use the item ID instead of the name in the `op://` path.
## Disaster Recovery Backup
@@ -41,8 +49,9 @@ The `mise run op-backup` task encrypts a `.1pux` vault export and transfers it t
## Related
-- [[argocd]] - Uses secrets for git access
-- [[postgresql]] - Database credentials
-- [[run-1password-backup]] - Periodic backup procedure
-- [[restore-1password-backup]] - Recovery from backup
-- [[borgmatic]] - Backup system
+- [[external-secrets]] — Kubernetes operator that consumes ClusterSecretStore
+- [[argocd]] — Uses secrets for git access
+- [[postgresql]] — Database credentials
+- [[run-1password-backup]] — Periodic backup procedure
+- [[restore-1password-backup]] — Recovery from backup
+- [[borgmatic]] — Backup system
From 08a1cb164a3f96b408979ecda560a9f7dbf768b4 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Fri, 22 May 2026 21:36:13 -0700
Subject: [PATCH 16/52] C0: fix 1password export filename in backup how-to
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
1Password's desktop app names exports as
1PasswordExport--.1pux automatically — you can't
choose the name. Procedure now points the task at that glob.
---
.../+1password-backup-doc-export-name.doc.md | 1 +
docs/how-to/operations/run-1password-backup.md | 12 +++++-------
2 files changed, 6 insertions(+), 7 deletions(-)
create mode 100644 docs/changelog.d/+1password-backup-doc-export-name.doc.md
diff --git a/docs/changelog.d/+1password-backup-doc-export-name.doc.md b/docs/changelog.d/+1password-backup-doc-export-name.doc.md
new file mode 100644
index 0000000..6c4d262
--- /dev/null
+++ b/docs/changelog.d/+1password-backup-doc-export-name.doc.md
@@ -0,0 +1 @@
+Fixed the export-filename step in [[run-1password-backup]]: 1Password's desktop app names the export `1PasswordExport--.1pux` automatically rather than letting you save to a fixed name, so the procedure now points the task at that glob instead of pretending the default name is `1Password-export.1pux`.
diff --git a/docs/how-to/operations/run-1password-backup.md b/docs/how-to/operations/run-1password-backup.md
index b0807da..0dc9ec9 100644
--- a/docs/how-to/operations/run-1password-backup.md
+++ b/docs/how-to/operations/run-1password-backup.md
@@ -26,20 +26,18 @@ How to export and encrypt your 1Password vaults for inclusion in [[borgmatic]] b
1. Open the 1Password desktop app
2. **File > Export > All Vaults**
3. Choose **1PUX** format
-4. Save to `~/Documents/1Password-export.1pux`
+4. Save to `~/Documents/`. 1Password names the file `1PasswordExport--.1pux` automatically; don't rename it, just pass its path to the task in the next step
### 2. Run the Backup Task
-```fish
-mise run op-backup
-```
-
-Or, if you saved the export to a non-default location:
+Pass the exported file's path:
```fish
-mise run op-backup ~/path/to/export.1pux
+mise run op-backup ~/Documents/1PasswordExport-*.1pux
```
+(If `~/Documents/` holds exactly one export, the glob expands to that file. If it holds several, the glob matches all of them, so pass the full path of the one you want instead.)
+
The task will:
1. Prompt for the `.1pux` path if not provided
From 57fd88b2698e87b5767d90c1a82151b1db87f446 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Fri, 22 May 2026 21:50:43 -0700
Subject: [PATCH 17/52] C0: fix op item edit syntax in zot key rotation
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
The pbpaste | op item edit ... "field[password]=-" stdin syntax is
rejected by op 2.34 as "invalid JSON" — recent op versions treat
piped input as a full JSON template, not a single field value.
Procedure now uses an inline assignment via a local fish variable.
---
docs/changelog.d/+zot-ci-rotation-op-syntax.doc.md | 1 +
docs/reference/services/zot.md | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+zot-ci-rotation-op-syntax.doc.md
diff --git a/docs/changelog.d/+zot-ci-rotation-op-syntax.doc.md b/docs/changelog.d/+zot-ci-rotation-op-syntax.doc.md
new file mode 100644
index 0000000..ec8834f
--- /dev/null
+++ b/docs/changelog.d/+zot-ci-rotation-op-syntax.doc.md
@@ -0,0 +1 @@
+Fixed the `op item edit` invocation in the [[zot]] API-key rotation procedure: the previous `pbpaste | op item edit ... "field[password]=-"` stdin syntax is rejected by op 2.34 as "invalid JSON" (recent op versions treat piped input as a full JSON template, not a single field value). Procedure now reads the clipboard into a local fish variable and passes it as an inline assignment.
diff --git a/docs/reference/services/zot.md b/docs/reference/services/zot.md
index d00a200..b01a6ce 100644
--- a/docs/reference/services/zot.md
+++ b/docs/reference/services/zot.md
@@ -56,8 +56,9 @@ The `zot-ci` API key expires every **90 days**. To rotate:
5. Generate a new API key, copy it to clipboard
6. Update 1Password:
```fish
- pbpaste | op item edit "Forgejo Secrets" --vault blumeops "zot-ci-api[password]=-"
+ set -l NEWKEY (pbpaste); op item edit "Forgejo Secrets" --vault blumeops "zot-ci-api[password]=$NEWKEY"; set -e NEWKEY
```
+ Passing the key as a command-line argument makes it briefly visible to anything that can list processes (e.g. via `ps`) on this machine; on a single-user Mac that is an acceptable tradeoff. The older `pbpaste | op item edit ... "field[password]=-"` stdin syntax is rejected by op 2.34 as "invalid JSON" because recent op versions treat piped input as a full JSON template.
7. Sync to Forgejo: `mise run provision-indri -- --tags forgejo_actions_secrets`
## Related
From 35ae171783ca7ac54bc57fc1cc23e7a171b36782 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Wed, 27 May 2026 07:15:07 -0700
Subject: [PATCH 18/52] C0: fix sync button location in manage-forgejo-mirrors
The verify step pointed to the main repo page, but the "Synchronize now"
button is in the Mirror settings section of the settings page.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
docs/changelog.d/+manage-forgejo-mirrors-sync-location.doc.md | 1 +
docs/how-to/configuration/manage-forgejo-mirrors.md | 4 ++--
2 files changed, 3 insertions(+), 2 deletions(-)
create mode 100644 docs/changelog.d/+manage-forgejo-mirrors-sync-location.doc.md
diff --git a/docs/changelog.d/+manage-forgejo-mirrors-sync-location.doc.md b/docs/changelog.d/+manage-forgejo-mirrors-sync-location.doc.md
new file mode 100644
index 0000000..f71fc81
--- /dev/null
+++ b/docs/changelog.d/+manage-forgejo-mirrors-sync-location.doc.md
@@ -0,0 +1 @@
+Fix manage-forgejo-mirrors verify step — sync button is on the repo settings page ("Synchronize now"), not the main repo page.
diff --git a/docs/how-to/configuration/manage-forgejo-mirrors.md b/docs/how-to/configuration/manage-forgejo-mirrors.md
index 9c0e113..5d150dc 100644
--- a/docs/how-to/configuration/manage-forgejo-mirrors.md
+++ b/docs/how-to/configuration/manage-forgejo-mirrors.md
@@ -137,8 +137,8 @@ Return to [GitHub token settings](https://github.com/settings/tokens?type=beta)
Trigger a manual sync on one mirror to confirm the new PAT works:
-1. Go to any mirror repo on forge (e.g., `mirrors/cloudnative-pg`)
-2. Click the sync button (circular arrows icon) next to the mirror status
+1. Go to any mirror repo's settings page on forge (e.g., `https://forge.eblu.me/mirrors/cloudnative-pg/settings`)
+2. In the "Mirror settings" section, click "Synchronize now"
3. Confirm the sync completes without errors
## Related
From c09bd5b6129ce688722b305801100ae1199c9036 Mon Sep 17 00:00:00 2001
From: Erich Blume <725328+eblume@users.noreply.github.com>
Date: Wed, 27 May 2026 11:54:32 -0700
Subject: [PATCH 19/52] C0: cap systemd-coredump on ringtail to stop game-crash
lockups
Wine/Proton game segfaults (e.g. Diablo IV) produced multi-GB cores that
systemd-coredump spent minutes compressing to disk, pinning the CPU and
freezing the desktop. Cap ProcessSizeMax/ExternalSizeMax at 1G (oversized
cores logged but skipped) and MaxUse at 2G to bound the store.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
.../+ringtail-coredump-size-cap.infra.md | 1 +
nixos/ringtail/configuration.nix | 16 ++++++++++++++++
2 files changed, 17 insertions(+)
create mode 100644 docs/changelog.d/+ringtail-coredump-size-cap.infra.md
diff --git a/docs/changelog.d/+ringtail-coredump-size-cap.infra.md b/docs/changelog.d/+ringtail-coredump-size-cap.infra.md
new file mode 100644
index 0000000..824b2df
--- /dev/null
+++ b/docs/changelog.d/+ringtail-coredump-size-cap.infra.md
@@ -0,0 +1 @@
+Cap systemd-coredump on ringtail (ProcessSizeMax/ExternalSizeMax 1G, MaxUse 2G) so multi-GB Wine/Proton game crash dumps no longer thrash the disk and lock up the desktop.
diff --git a/nixos/ringtail/configuration.nix b/nixos/ringtail/configuration.nix
index e8c634a..f01ce9f 100644
--- a/nixos/ringtail/configuration.nix
+++ b/nixos/ringtail/configuration.nix
@@ -609,6 +609,22 @@ in
AllowSuspendThenHibernate=no
'';
+ # Cap systemd-coredump. Wine/Proton games (Diablo IV, etc.) segfault
+ # regularly and dump multi-GB cores; with the stock (effectively unbounded)
+ # limits, systemd-coredump then spends minutes streaming and compressing the
+ # dump to disk — e.g. a single D4 crash produced a 4.6G core, read 13.7G and
+ # wrote 17.4G, pinning the CPU and locking up the desktop for ~3.5 minutes.
+ # Those cores are useless anyway: Nix .so files carry no build-id, so no
+ # backtrace can be generated. Capping uncompressed size at 1G makes oversized
+ # cores get logged-but-skipped (the kernel stops dumping once we stop reading)
+ # while real service cores (well under 1G) are still captured. MaxUse bounds
+ # the on-disk store so frequent game crashes can't accumulate (was at 8.6G).
+ systemd.coredump.extraConfig = ''
+ ProcessSizeMax=1G
+ ExternalSizeMax=1G
+ MaxUse=2G
+ '';
+
# NixOS release
system.stateVersion = "25.11";
}
From 753fa9cb6317108ab8701e1f58ec1ba7c991d211 Mon Sep 17 00:00:00 2001
From: Erich Blume <725328+eblume@users.noreply.github.com>
Date: Wed, 27 May 2026 12:59:29 -0700
Subject: [PATCH 20/52] C0: disable VRR on ringtail DP-1 to stop OMEN panel
flicker
The OMEN 27i IPS pumps brightness when its refresh swings into the low
VRR range during low-framerate content (game cutscenes), producing a
~20Hz flicker that compounds over a session until a reboot. GPU health
is clean (no Xid/ECC/thermal); pinning fixed 165Hz eliminates it.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
docs/changelog.d/+ringtail-vrr-flicker.bugfix.md | 1 +
nixos/ringtail/configuration.nix | 7 ++++++-
2 files changed, 7 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+ringtail-vrr-flicker.bugfix.md
diff --git a/docs/changelog.d/+ringtail-vrr-flicker.bugfix.md b/docs/changelog.d/+ringtail-vrr-flicker.bugfix.md
new file mode 100644
index 0000000..cb23344
--- /dev/null
+++ b/docs/changelog.d/+ringtail-vrr-flicker.bugfix.md
@@ -0,0 +1 @@
+Disabled adaptive sync (VRR) on ringtail's DP-1 output. The OMEN 27i IPS panel pumps brightness when its refresh rate swings into the low VRR range during low-framerate content (e.g. game cutscenes), producing a flicker that worsened over a session until a reboot. Pinning the panel to a fixed 165Hz eliminates it.
diff --git a/nixos/ringtail/configuration.nix b/nixos/ringtail/configuration.nix
index f01ce9f..bc893d5 100644
--- a/nixos/ringtail/configuration.nix
+++ b/nixos/ringtail/configuration.nix
@@ -337,7 +337,12 @@ in
output = {
"DP-1" = {
mode = "2560x1440@165Hz";
- adaptive_sync = "on";
+ # VRR off: the OMEN 27i IPS pumps gamma/brightness when the panel
+ # refresh swings into its low VRR range (e.g. low-fps game
+ # cutscenes), producing a ~20Hz flicker that compounds over a long
+ # session until a reboot. Fixed refresh at 165Hz eliminates it.
+ # If you want VRR back, cap in-game fps so refresh never dips low.
+ adaptive_sync = "off";
bg = "~/.config/sway/wallpaper.jpg fill";
};
};
From c00d7db5079e78772e5e7e3780d7594baa009bd4 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 28 May 2026 06:01:57 -0700
Subject: [PATCH 21/52] Recurring maintenance batch (2026-05-27) (#360)
Bundle of recurring overdue tasks:
- Ringtail flake update
- Security & compliance report review
- Tooling deps bump (prek, fly, mise, forgejo workflows)
- Top stale doc review
- Top stale service review (if trivial)
Larger items (service version bumps requiring upgrades, non-local container migration) are split out into separate PRs.
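
The `typer` bump in this batch touches the same PEP 723 `# dependencies = [...]` header line in every mise task. This batch does not say how the edit was made; the following is only a minimal sketch of how such a pin bump could be scripted. The helper name `bump_pin` is hypothetical and not part of the repo.

```python
import re


def bump_pin(script_text: str, package: str, new_version: str) -> str:
    """Rewrite "package==<old>" to "package==<new>" on PEP 723 dependency lines.

    Hypothetical helper: only lines starting with "# dependencies =" are
    touched, so pins mentioned elsewhere in the script are left alone.
    """
    pin = re.compile(rf'("{re.escape(package)}==)[^"]+(")')

    def rewrite(line: str) -> str:
        if line.startswith("# dependencies ="):
            # Keep the quoted package name, swap only the version.
            return pin.sub(lambda m: f"{m.group(1)}{new_version}{m.group(2)}", line)
        return line

    return "\n".join(rewrite(line) for line in script_text.split("\n"))
```

Applying it to each file under `mise-tasks/` with `bump_pin(text, "typer", "0.26.2")` would yield the kind of one-line change per task seen in this patch; running it twice is a no-op.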
Reviewed-on: https://forge.eblu.me/eblume/blumeops/pulls/360
---
.../recurring-maintenance-2026-05-27.doc.md | 1 +
.../recurring-maintenance-2026-05-27.infra.md | 4 ++++
docs/reference/infrastructure/indri.md | 9 +++++++--
fly/Dockerfile | 8 ++++----
mise-tasks/branch-cleanup | 2 +-
mise-tasks/container-build-and-release | 2 +-
mise-tasks/container-list | 2 +-
mise-tasks/container-version-check | 2 +-
mise-tasks/dns-acme-cleanup | 2 +-
mise-tasks/docs-mikado | 2 +-
mise-tasks/docs-preview | 2 +-
mise-tasks/docs-review | 2 +-
mise-tasks/docs-review-stale | 2 +-
mise-tasks/mikado-branch-invariant-check | 2 +-
mise-tasks/op-backup | 2 +-
mise-tasks/pr-comments | 2 +-
mise-tasks/prune-ringtail-generations | 2 +-
mise-tasks/review-compliance-reports | 2 +-
mise-tasks/runner-logs | 2 +-
mise-tasks/service-review | 2 +-
mise-tasks/spork-create | 2 +-
nixos/ringtail/flake.lock | 18 +++++++++---------
prek.toml | 8 ++++----
23 files changed, 46 insertions(+), 36 deletions(-)
create mode 100644 docs/changelog.d/recurring-maintenance-2026-05-27.doc.md
create mode 100644 docs/changelog.d/recurring-maintenance-2026-05-27.infra.md
diff --git a/docs/changelog.d/recurring-maintenance-2026-05-27.doc.md b/docs/changelog.d/recurring-maintenance-2026-05-27.doc.md
new file mode 100644
index 0000000..af30489
--- /dev/null
+++ b/docs/changelog.d/recurring-maintenance-2026-05-27.doc.md
@@ -0,0 +1 @@
+Reviewed [[indri]] reference card: added `devpi`, `cv`, and `docs` to the native-services list; widened the k8s note to reflect the growing set of apps now on ringtail and the planned indri-minikube decommission; added CPU/RAM specs.
diff --git a/docs/changelog.d/recurring-maintenance-2026-05-27.infra.md b/docs/changelog.d/recurring-maintenance-2026-05-27.infra.md
new file mode 100644
index 0000000..f2d48ad
--- /dev/null
+++ b/docs/changelog.d/recurring-maintenance-2026-05-27.infra.md
@@ -0,0 +1,4 @@
+Recurring maintenance batch:
+
+- Ringtail flake inputs refreshed (`disko`, `home-manager`, `nixpkgs`).
+- Tooling deps bumped: prek hooks (trufflehog v3.95.3, kingfisher v1.101.0, ruff v0.15.14, `ansible-core` 2.21.0); fly proxy base images (nginx 1.30.1-alpine, alloy v1.16.1); `typer==0.26.2` in mise tasks.
diff --git a/docs/reference/infrastructure/indri.md b/docs/reference/infrastructure/indri.md
index cbb2a0f..67652ca 100644
--- a/docs/reference/infrastructure/indri.md
+++ b/docs/reference/infrastructure/indri.md
@@ -1,6 +1,7 @@
---
title: Indri
-modified: 2026-02-19
+modified: 2026-05-27
+last-reviewed: 2026-05-27
tags:
- infrastructure
- host
@@ -15,6 +16,7 @@ Primary BlumeOps server. Mac Mini M1 (2020).
| Property | Value |
|----------|-------|
| **Model** | Mac mini M1, 2020 (Macmini9,1) |
+| **CPU / RAM** | 8 cores / 16 GB |
| **Storage** | 2TB internal SSD |
| **macOS** | 15.7.3 (Sequoia) |
| **Tailscale hostname** | `indri.tail8d86e.ts.net` |
@@ -30,9 +32,12 @@ Primary BlumeOps server. Mac Mini M1 (2020).
- [[borgmatic]] - Backup system
- [[alloy|Alloy]] - Metrics/logs collector
- [[caddy]] - Reverse proxy for `*.ops.eblu.me`
+- [[devpi]] - PyPI mirror (LaunchAgent)
+- [[cv]] - Static CV site, served by Caddy
+- [[docs]] - Quartz-built docs site, served by Caddy
**Kubernetes (via minikube):**
-- [[apps|Most k8s applications]] (Frigate, ntfy migrated to [[ringtail]] k3s)
+- [[apps|Most k8s applications]]. A growing set of apps (Authentik, Frigate, ntfy, Immich, Homepage, Shower, Kingfisher, alloy-ringtail) now runs on [[ringtail]]'s k3s instead. The long-term plan is to decommission indri's minikube entirely.
**GUI Applications (manual start required):**
- Docker Desktop - Container runtime for minikube
diff --git a/fly/Dockerfile b/fly/Dockerfile
index eae8c35..d4e7a18 100644
--- a/fly/Dockerfile
+++ b/fly/Dockerfile
@@ -1,5 +1,5 @@
-# nginx 1.30.0-alpine
-FROM nginx@sha256:0272e4604ed93c1792f03695a033a6e8546840f86e0de20a884bb17d2c924883
+# nginx 1.30.1-alpine
+FROM nginx@sha256:c819f83c54b0361f5557601bf5eb4943d09360e7a7fdf426afc466570f45874d
# Copy tailscale binaries from official image (v1.94.2)
COPY --from=docker.io/tailscale/tailscale@sha256:95e528798bebe75f39b10e74e7051cf51188ee615934f232ba7ad06a3390ffa1 \
@@ -13,8 +13,8 @@ RUN mkdir -p /var/run/tailscale /var/lib/tailscale \
&& apk add --no-cache fail2ban \
&& rm -f /etc/fail2ban/jail.d/alpine-ssh.conf
-# Copy Alloy binary from official image (v1.16.0, Ubuntu-based, needs libc6-compat)
-COPY --from=docker.io/grafana/alloy@sha256:6e00cf7c5a692ff5f24844529416ed017d76fce922f8199004e73d5eca46b6b8 \
+# Copy Alloy binary from official image (v1.16.1, Ubuntu-based, needs libc6-compat)
+COPY --from=docker.io/grafana/alloy@sha256:51aeb9d829239345070619dad3edd6873186f913c84f45b365b74574fcb38ec0 \
/bin/alloy /usr/local/bin/alloy
RUN mkdir -p /var/log/nginx /etc/alloy /tmp/alloy-data
diff --git a/mise-tasks/branch-cleanup b/mise-tasks/branch-cleanup
index 575c9a1..a538880 100755
--- a/mise-tasks/branch-cleanup
+++ b/mise-tasks/branch-cleanup
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Delete branches that have been merged into main (local and remote)"
#MISE alias="bc"
diff --git a/mise-tasks/container-build-and-release b/mise-tasks/container-build-and-release
index ba569e7..85e6cb8 100755
--- a/mise-tasks/container-build-and-release
+++ b/mise-tasks/container-build-and-release
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["typer==0.25.0", "httpx==0.28.1"]
+# dependencies = ["typer==0.26.2", "httpx==0.28.1"]
# ///
#MISE description="Trigger container build workflows via Forgejo API"
#USAGE arg "" help="Container name (directory under containers/)"
diff --git a/mise-tasks/container-list b/mise-tasks/container-list
index 26639f2..7dad346 100755
--- a/mise-tasks/container-list
+++ b/mise-tasks/container-list
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="List available containers and their recent tags"
#USAGE arg "[name]" help="Optional container name to filter output"
diff --git a/mise-tasks/container-version-check b/mise-tasks/container-version-check
index 4ebe3b6..06f96ae 100755
--- a/mise-tasks/container-version-check
+++ b/mise-tasks/container-version-check
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["pyyaml==6.0.3", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["pyyaml==6.0.3", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Validate container version consistency across container.py, Dockerfiles, nix derivations, and service-versions.yaml"
#USAGE flag "--all-files" help="Check all containers, not just changed ones"
diff --git a/mise-tasks/dns-acme-cleanup b/mise-tasks/dns-acme-cleanup
index 432a6ce..3a53b11 100755
--- a/mise-tasks/dns-acme-cleanup
+++ b/mise-tasks/dns-acme-cleanup
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Delete orphaned ACME challenge TXT records in eblu.me"
#USAGE flag "--dry-run" help="List orphans without deleting"
diff --git a/mise-tasks/docs-mikado b/mise-tasks/docs-mikado
index eea052f..c632e46 100755
--- a/mise-tasks/docs-mikado
+++ b/mise-tasks/docs-mikado
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["httpx==0.28.1", "pyyaml==6.0.3", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["httpx==0.28.1", "pyyaml==6.0.3", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="View active Mikado dependency chains for C2 changes"
#USAGE arg "[card]" help="Card stem to show chain for"
diff --git a/mise-tasks/docs-preview b/mise-tasks/docs-preview
index faa79af..9e0bd16 100755
--- a/mise-tasks/docs-preview
+++ b/mise-tasks/docs-preview
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["pyyaml==6.0.3", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["pyyaml==6.0.3", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Build docs with Dagger and serve locally, opening to a specific card"
#USAGE arg "" help="Card path relative to docs/, e.g. how-to/knowledgebase/review-documentation"
diff --git a/mise-tasks/docs-review b/mise-tasks/docs-review
index d07904d..12e301f 100755
--- a/mise-tasks/docs-review
+++ b/mise-tasks/docs-review
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["pyyaml==6.0.3", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["pyyaml==6.0.3", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Review the most stale documentation card by last-reviewed date"
#USAGE flag "--limit " default="15" help="Number of docs to show in the table"
diff --git a/mise-tasks/docs-review-stale b/mise-tasks/docs-review-stale
index 4449213..0c5490e 100755
--- a/mise-tasks/docs-review-stale
+++ b/mise-tasks/docs-review-stale
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Report docs by git-last-modified date, highlighting stale ones"
#USAGE flag "--threshold " default="180" help="Days before a doc is considered stale"
diff --git a/mise-tasks/mikado-branch-invariant-check b/mise-tasks/mikado-branch-invariant-check
index 1f0fbcf..3135bf2 100755
--- a/mise-tasks/mikado-branch-invariant-check
+++ b/mise-tasks/mikado-branch-invariant-check
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Validate Mikado Branch Invariant on mikado/* branches"
#USAGE arg "[commit_msg_file]" help="Commit message file (passed by commit-msg hook)"
diff --git a/mise-tasks/op-backup b/mise-tasks/op-backup
index 37a97a6..7db033b 100755
--- a/mise-tasks/op-backup
+++ b/mise-tasks/op-backup
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Encrypt a 1Password .1pux export and send to indri for borgmatic"
#USAGE arg "[export_path]" help="Path to .1pux export file (prompted if omitted)"
diff --git a/mise-tasks/pr-comments b/mise-tasks/pr-comments
index 7205617..39d7c9a 100755
--- a/mise-tasks/pr-comments
+++ b/mise-tasks/pr-comments
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="List unresolved comments on a PR"
#USAGE arg "" help="Pull request number"
diff --git a/mise-tasks/prune-ringtail-generations b/mise-tasks/prune-ringtail-generations
index 2b8e3f9..2ad8dc8 100755
--- a/mise-tasks/prune-ringtail-generations
+++ b/mise-tasks/prune-ringtail-generations
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Prune old NixOS generations on ringtail, preserving rollback safety"
#MISE alias="prg"
diff --git a/mise-tasks/review-compliance-reports b/mise-tasks/review-compliance-reports
index a9146c8..24d2afc 100755
--- a/mise-tasks/review-compliance-reports
+++ b/mise-tasks/review-compliance-reports
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["rich==15.0.0", "typer==0.25.0", "pyyaml==6.0.3"]
+# dependencies = ["rich==15.0.0", "typer==0.26.2", "pyyaml==6.0.3"]
# ///
#MISE description="Summarize the latest Prowler and Kingfisher compliance reports from sifaka"
#USAGE flag "--full" help="Show all unmuted failures, not just new ones"
diff --git a/mise-tasks/runner-logs b/mise-tasks/runner-logs
index 9c988ee..3c5e8e3 100755
--- a/mise-tasks/runner-logs
+++ b/mise-tasks/runner-logs
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="List recent Forgejo Actions runs or fetch logs for a specific job"
#USAGE arg "[run_number]" help="Run number to show jobs for (omit to list recent runs)"
diff --git a/mise-tasks/service-review b/mise-tasks/service-review
index 2d50e0b..f83b104 100755
--- a/mise-tasks/service-review
+++ b/mise-tasks/service-review
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["pyyaml==6.0.3", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["pyyaml==6.0.3", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Review the most stale service for version freshness"
#USAGE flag "--limit " default="15" help="Number of services to show in the table"
diff --git a/mise-tasks/spork-create b/mise-tasks/spork-create
index 92f4e5c..3f18563 100755
--- a/mise-tasks/spork-create
+++ b/mise-tasks/spork-create
@@ -1,7 +1,7 @@
#!/usr/bin/env -S uv run --script
# /// script
# requires-python = ">=3.12"
-# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.25.0"]
+# dependencies = ["httpx==0.28.1", "rich==15.0.0", "typer==0.26.2"]
# ///
#MISE description="Create a spork (floating-branch soft-fork) of a mirrored upstream project"
#USAGE arg "" help="Repository name in the mirrors/ org on forge (e.g. kingfisher)"
diff --git a/nixos/ringtail/flake.lock b/nixos/ringtail/flake.lock
index 0f53d0e..0f0da7e 100644
--- a/nixos/ringtail/flake.lock
+++ b/nixos/ringtail/flake.lock
@@ -7,11 +7,11 @@
]
},
"locked": {
- "lastModified": 1777713215,
- "narHash": "sha256-8GzXDOXckDWwST8TY5DbwYFjdvQLlP7K9CLSVx6iTTo=",
+ "lastModified": 1779699611,
+ "narHash": "sha256-EcCaSTKnmg2o4wLKaN1aqQFomwyhO7ik0bX9COdyCas=",
"owner": "nix-community",
"repo": "disko",
- "rev": "63b4e7e6cf75307c1d26ac3762b886b5b0247267",
+ "rev": "5ba0c9555c28685e57fa54c7a25e42c7efdbfc8d",
"type": "github"
},
"original": {
@@ -27,11 +27,11 @@
]
},
"locked": {
- "lastModified": 1778401693,
- "narHash": "sha256-OVHdCqXXUF5UdGkH+FF2ZL06OLZjj2kvP2dIUmzVWoo=",
+ "lastModified": 1779506708,
+ "narHash": "sha256-QOD/CNm196nCJRheux/URi4/HE66fthdOMqCJoPP1Y0=",
"owner": "nix-community",
"repo": "home-manager",
- "rev": "389b83002efc26f1145e89a6a8e6edc5a6435948",
+ "rev": "3ee51fbdac8c8bdfe1e7e1fcaba6520a563f394f",
"type": "github"
},
"original": {
@@ -43,11 +43,11 @@
},
"nixpkgs": {
"locked": {
- "lastModified": 1778430510,
- "narHash": "sha256-Ti+ZBvW6yrWWAg2szExVTwCd4qOJ3KlVr1tFHfyfi8Q=",
+ "lastModified": 1779467186,
+ "narHash": "sha256-nOesoDCiXcUftqbRBMz9tt4blI5PvljMWbm3kuCA+0s=",
"owner": "NixOS",
"repo": "nixpkgs",
- "rev": "8fd9daa3db09ced9700431c5b7ad0e8ba199b575",
+ "rev": "b77b3de8775677f84492abe84635f87b0e153f0f",
"type": "github"
},
"original": {
diff --git a/prek.toml b/prek.toml
index add7799..2c66b82 100644
--- a/prek.toml
+++ b/prek.toml
@@ -28,7 +28,7 @@ hooks = [{ id = "check-yaml", args = ["--unsafe"] }]
# Secret detection (running both tools in parallel to compare coverage)
[[repos]]
repo = "https://github.com/trufflesecurity/trufflehog"
-rev = "17456f8c7d042d8c82c9a8ca9e937231f9f42e26" # v3.95.2
+rev = "37b77001d0174ebec2fcca2bd83ff83a6d45a3ab" # v3.95.3
hooks = [
{ id = "trufflehog", entry = "trufflehog git file://. --since-commit HEAD --no-verification --fail", stages = [
"pre-commit",
@@ -38,7 +38,7 @@ hooks = [
[[repos]]
repo = "https://github.com/mongodb/kingfisher"
-rev = "9ddec4ab8b53653d4941e6b3fd4ff602ce91d81b" # v1.97.0
+rev = "6f560103cc6ea082ef4b80a9098e3f3111afb8bc" # v1.101.0
hooks = [
{ id = "kingfisher", args = [
"scan",
@@ -69,12 +69,12 @@ name = "ansible-lint"
entry = "env ANSIBLE_ROLES_PATH=ansible/roles ansible-lint"
language = "python"
files = "^ansible/"
-additional_dependencies = ["ansible-lint==26.4.0", "ansible-core==2.20.5"]
+additional_dependencies = ["ansible-lint==26.4.0", "ansible-core==2.21.0"]
# Python - ruff for linting and formatting
[[repos]]
repo = "https://github.com/astral-sh/ruff-pre-commit"
-rev = "6fec9b7edb08fd9989088709d864a7826dc74e80" # v0.15.12
+rev = "0c7b6c989466a93942def1f84baf36ddfcd60c83" # v0.15.14
hooks = [{ id = "ruff", args = ["--fix"] }, { id = "ruff-format" }]
# Python - ty type checker
From 4e25180b0ae3ff212b7fc4d57d136f215a92c310 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 28 May 2026 07:13:13 -0700
Subject: [PATCH 22/52] C0: clone blumeops via tailnet on ringtail provision
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Switch ringtail.yml from forge.eblu.me (Fly proxy, WAN) to
forge.ops.eblu.me (Caddy on indri, tailnet). Ringtail is always
on the tailnet — the WAN round-trip was overhead and made
provision-ringtail fail any time Fly was slow or down.
---
ansible/playbooks/ringtail.yml | 2 +-
docs/changelog.d/+ringtail-clone-via-tailnet.infra.md | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+ringtail-clone-via-tailnet.infra.md
diff --git a/ansible/playbooks/ringtail.yml b/ansible/playbooks/ringtail.yml
index ee5604b..b05d67a 100644
--- a/ansible/playbooks/ringtail.yml
+++ b/ansible/playbooks/ringtail.yml
@@ -57,7 +57,7 @@
tasks:
- name: Ensure blumeops repo is present
ansible.builtin.git:
- repo: "https://forge.eblu.me/eblume/blumeops.git"
+ repo: "https://forge.ops.eblu.me/eblume/blumeops.git"
dest: /etc/blumeops
version: "{{ ringtail_commit | default('main') }}"
force: true
diff --git a/docs/changelog.d/+ringtail-clone-via-tailnet.infra.md b/docs/changelog.d/+ringtail-clone-via-tailnet.infra.md
new file mode 100644
index 0000000..d664163
--- /dev/null
+++ b/docs/changelog.d/+ringtail-clone-via-tailnet.infra.md
@@ -0,0 +1 @@
+Switch the ringtail provisioning playbook's blumeops clone URL from `forge.eblu.me` (public, via Fly proxy) to `forge.ops.eblu.me` (tailnet, direct via Caddy on indri). Ringtail is always on the tailnet, so the WAN round-trip is pure overhead — it also made `provision-ringtail` brittle whenever the Fly proxy was slow or down.
From f6febb1f772e858a82d69e7baade4f526e550f97 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 28 May 2026 07:59:22 -0700
Subject: [PATCH 23/52] C0: switch fly proxy deploy strategy to immediate
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Bluegreen kept timing out — the new green machine couldn't reach
"started" within Fly's 5-minute deploy budget. The cold-start sequence
(tailscaled → tailscale up → wait-for-MagicDNS → nginx startup) eats
most of that, leaving no headroom for healthcheck propagation.
For a single-machine proxy, bluegreen offers little benefit anyway:
no warm second instance, so trading 5-10s of downtime for predictable
completion is the right call.
---
docs/changelog.d/+fly-deploy-immediate-strategy.infra.md | 1 +
fly/fly.toml | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+fly-deploy-immediate-strategy.infra.md
diff --git a/docs/changelog.d/+fly-deploy-immediate-strategy.infra.md b/docs/changelog.d/+fly-deploy-immediate-strategy.infra.md
new file mode 100644
index 0000000..205bd6a
--- /dev/null
+++ b/docs/changelog.d/+fly-deploy-immediate-strategy.infra.md
@@ -0,0 +1 @@
+Switch the Fly proxy deploy strategy from `bluegreen` to `immediate` in `fly/fly.toml`. With a single proxy machine, bluegreen offers little benefit — the green machine routinely failed to reach "started" inside Fly's default 5-minute deploy timeout (the cold-start sequence of `tailscaled` → `tailscale up` → wait-for-MagicDNS → nginx startup eats most of the budget), and the failed deploys would roll back. `immediate` replaces the machine in place with a brief downtime (~5–10s) but actually completes.
diff --git a/fly/fly.toml b/fly/fly.toml
index 11aac9c..6ccf29d 100644
--- a/fly/fly.toml
+++ b/fly/fly.toml
@@ -7,7 +7,7 @@ primary_region = "sjc"
memory = "512mb"
[deploy]
-strategy = "bluegreen"
+strategy = "immediate"
[http_service]
internal_port = 8080
From 4d1f4af25b9d2a55c1b0731e3a6b83259fc33dfa Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 28 May 2026 09:59:46 -0700
Subject: [PATCH 24/52] =?UTF-8?q?Upgrade=20unpoller=20v2.34.0=20=E2=86=92?=
=?UTF-8?q?=20v3.2.0,=20migrate=20to=20container.py=20(#361)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
- Service Review pickup: unpoller (last reviewed 73 days ago).
- Upgrades unpoller from v2.34.0 to v3.2.0 (major version bump).
- Migrates the container build from a Dockerfile to a native Dagger pipeline (`containers/unpoller/container.py`) following the navidrome / miniflux pattern.
- Refreshes `service-versions.yaml` (last-reviewed, current-version).
## Breaking changes (upstream)
- **v3.0.0**: breaking UniFi network API changes (later 10.x). Some metric, event, and log names and labels may have changed, so a follow-up sweep of the unpoller Grafana dashboard for missing series is worthwhile.
- **v3.2.0**: defaults to a 60s background poll that feeds cached Prometheus scrapes (previously one on-demand poll per scrape). To restore the previous behavior, set `interval = 0` in `up.conf`. This PR keeps the new default: our 15s scrapes will simply be served from cache, which is fine for our use.
## Build
- Image: `registry.ops.eblu.me/blumeops/unpoller:v3.2.0-1b27242`
- Built by build-container workflow run #559 from this branch.
## Test plan
- [ ] `argocd app set unpoller --revision unpoller-v3 && argocd app sync unpoller`
- [ ] Pod comes Ready
- [ ] Verify metrics exported (`Site/Client/UAP/USG/USW` counts in logs, `unpoller_*` series in Prometheus)
- [ ] Spot-check unpoller Grafana dashboard for missing series after the v3 API shift
- [ ] After merge: `argocd app set unpoller --revision main && argocd app sync unpoller`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.eblu.me/eblume/blumeops/pulls/361
---
argocd/manifests/unpoller/kustomization.yaml | 2 +-
containers/unpoller/Dockerfile | 43 ----------------
containers/unpoller/container.py | 53 ++++++++++++++++++++
docs/changelog.d/unpoller-v3.infra.md | 1 +
service-versions.yaml | 4 +-
5 files changed, 57 insertions(+), 46 deletions(-)
delete mode 100644 containers/unpoller/Dockerfile
create mode 100644 containers/unpoller/container.py
create mode 100644 docs/changelog.d/unpoller-v3.infra.md
diff --git a/argocd/manifests/unpoller/kustomization.yaml b/argocd/manifests/unpoller/kustomization.yaml
index 5b7a9e2..d2c4e28 100644
--- a/argocd/manifests/unpoller/kustomization.yaml
+++ b/argocd/manifests/unpoller/kustomization.yaml
@@ -10,7 +10,7 @@ resources:
images:
- name: registry.ops.eblu.me/blumeops/unpoller
- newTag: v2.34.0-613f05d
+ newTag: v3.2.0-1b27242
configMapGenerator:
- name: unpoller-config
diff --git a/containers/unpoller/Dockerfile b/containers/unpoller/Dockerfile
deleted file mode 100644
index 241b375..0000000
--- a/containers/unpoller/Dockerfile
+++ /dev/null
@@ -1,43 +0,0 @@
-# UnPoller — UniFi metrics exporter for Prometheus
-# Two-stage build: Go compilation, then minimal Alpine runtime
-
-ARG CONTAINER_APP_VERSION=v2.34.0
-
-FROM golang:alpine3.22 AS build
-
-ARG CONTAINER_APP_VERSION
-RUN apk add --no-cache git
-
-RUN git clone --depth 1 --branch ${CONTAINER_APP_VERSION} \
- https://forge.ops.eblu.me/mirrors/unpoller.git /app
-
-WORKDIR /app
-
-ENV CGO_ENABLED=0
-
-RUN go build -ldflags="-s -w \
- -X main.version=${CONTAINER_APP_VERSION} \
- -X main.builtBy=blumeops \
- -X golift.io/version.Version=${CONTAINER_APP_VERSION} \
- -X golift.io/version.Branch=HEAD \
- -X golift.io/version.BuildUser=blumeops \
- -X golift.io/version.Revision=blumeops-build" \
- -o /bin/unpoller .
-
-FROM alpine:3.22
-
-ARG CONTAINER_APP_VERSION
-LABEL org.opencontainers.image.title="UnPoller"
-LABEL org.opencontainers.image.description="UniFi metrics exporter for Prometheus"
-LABEL org.opencontainers.image.version="${CONTAINER_APP_VERSION}"
-LABEL org.opencontainers.image.source="https://forge.eblu.me/eblume/blumeops"
-LABEL org.opencontainers.image.vendor="blumeops"
-
-RUN apk add --no-cache ca-certificates tzdata
-
-COPY --from=build /bin/unpoller /usr/bin/unpoller
-
-EXPOSE 9130
-USER 65534:65534
-ENTRYPOINT ["/usr/bin/unpoller"]
-CMD ["--config", "/etc/unpoller/up.conf"]
diff --git a/containers/unpoller/container.py b/containers/unpoller/container.py
new file mode 100644
index 0000000..bfc75ba
--- /dev/null
+++ b/containers/unpoller/container.py
@@ -0,0 +1,53 @@
+"""UnPoller — UniFi metrics exporter for Prometheus.
+
+Two-stage build: Go backend, Alpine runtime.
+Source cloned from forge mirror.
+"""
+
+import dagger
+
+from blumeops.containers import (
+    alpine_runtime,
+    clone_from_forge,
+    go_build,
+    oci_labels,
+)
+
+VERSION = "v3.2.0"
+
+
+async def build(src: dagger.Directory) -> dagger.Container:
+    source = clone_from_forge("unpoller", VERSION)
+
+    backend = go_build(
+        source,
+        "/unpoller",
+        ldflags=(
+            f"-s -w "
+            f"-X main.version={VERSION} "
+            f"-X main.builtBy=blumeops "
+            f"-X golift.io/version.Version={VERSION} "
+            f"-X golift.io/version.Branch=HEAD "
+            f"-X golift.io/version.BuildUser=blumeops "
+            f"-X golift.io/version.Revision=blumeops-build"
+        ),
+    )
+
+    runtime = alpine_runtime(
+        extra_apk=["ca-certificates", "tzdata"],
+        create_user=False,
+    )
+    runtime = oci_labels(
+        runtime,
+        title="UnPoller",
+        description="UniFi metrics exporter for Prometheus",
+        version=VERSION,
+    )
+    return (
+        runtime.with_file("/usr/bin/unpoller", backend.file("/unpoller"))
+        .with_exposed_port(9130)
+        .with_user("65534")
+        .with_default_args(
+            args=["/usr/bin/unpoller", "--config", "/etc/unpoller/up.conf"]
+        )
+    )
diff --git a/docs/changelog.d/unpoller-v3.infra.md b/docs/changelog.d/unpoller-v3.infra.md
new file mode 100644
index 0000000..fa6eaf9
--- /dev/null
+++ b/docs/changelog.d/unpoller-v3.infra.md
@@ -0,0 +1 @@
+Upgrade unpoller v2.34.0 → v3.2.0 and migrate container build from Dockerfile to native Dagger (container.py). v3.0.0 carries breaking UniFi API changes; v3.2.0 introduces a 60s background poll (cached scrapes) by default — set `interval = 0` in `up.conf` to restore on-demand polling.
diff --git a/service-versions.yaml b/service-versions.yaml
index 02f2979..63b0f15 100644
--- a/service-versions.yaml
+++ b/service-versions.yaml
@@ -345,8 +345,8 @@ services:
- name: unpoller
type: argocd
- last-reviewed: 2026-03-16
- current-version: "v2.34.0"
+ last-reviewed: 2026-05-28
+ current-version: "v3.2.0"
upstream-source: https://github.com/unpoller/unpoller/releases
notes: UniFi metrics exporter for Prometheus
From e703d25efe2b2da12793a6c459bce95ecdc48435 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 28 May 2026 10:10:21 -0700
Subject: [PATCH 25/52] C0: rebuild unpoller container from squashed main
commit
Image was previously tagged with the unpoller-v3 branch SHA (1b27242),
which doesn't exist in main's history after squash-merge. Rebuilt from
the squashed commit so the tag references a reachable commit.
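
A rough sketch of how a tag like this could be checked before it is committed, assuming tags follow the `<version>-<short sha>[-nix]` shape seen in the kustomizations. The helper names `split_image_tag` and `is_reachable` are hypothetical and not part of the repo; `git merge-base --is-ancestor` is the only git call used.

```python
from __future__ import annotations

import subprocess
from typing import Optional


def split_image_tag(tag: str, variants=("nix",)) -> tuple[str, str, Optional[str]]:
    """Split '<version>-<sha>[-<variant>]' into its parts.

    The version itself may contain hyphens (e.g. 'v8.1.6-r0'), so the tag is
    parsed from the right: optional variant first, then the short SHA.
    """
    parts = tag.split("-")
    variant = parts.pop() if parts[-1] in variants else None
    if len(parts) < 2:
        raise ValueError(f"tag has no SHA component: {tag!r}")
    sha = parts.pop()
    return "-".join(parts), sha, variant


def is_reachable(sha: str, ref: str = "main", repo: str = ".") -> bool:
    """True if `sha` is an ancestor of `ref` (exit code 0 from git)."""
    result = subprocess.run(
        ["git", "-C", repo, "merge-base", "--is-ancestor", sha, ref],
        capture_output=True,
    )
    return result.returncode == 0
```

With something like this, a pre-squash tag such as `v3.2.0-1b27242` would fail `is_reachable("1b27242")` on main after the merge, which is the condition this rebuild corrects.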
Co-Authored-By: Claude Opus 4.7 (1M context)
---
argocd/manifests/unpoller/kustomization.yaml | 2 +-
docs/changelog.d/+unpoller-rebuild-on-main.infra.md | 1 +
2 files changed, 2 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+unpoller-rebuild-on-main.infra.md
diff --git a/argocd/manifests/unpoller/kustomization.yaml b/argocd/manifests/unpoller/kustomization.yaml
index d2c4e28..bf776bb 100644
--- a/argocd/manifests/unpoller/kustomization.yaml
+++ b/argocd/manifests/unpoller/kustomization.yaml
@@ -10,7 +10,7 @@ resources:
images:
- name: registry.ops.eblu.me/blumeops/unpoller
- newTag: v3.2.0-1b27242
+ newTag: v3.2.0-4d1f4af
configMapGenerator:
- name: unpoller-config
diff --git a/docs/changelog.d/+unpoller-rebuild-on-main.infra.md b/docs/changelog.d/+unpoller-rebuild-on-main.infra.md
new file mode 100644
index 0000000..60ae8fa
--- /dev/null
+++ b/docs/changelog.d/+unpoller-rebuild-on-main.infra.md
@@ -0,0 +1 @@
+Rebuild unpoller container from squashed main commit so the image SHA tag matches a commit in main's history (was tagged with the pre-squash branch SHA).
From 1ce381cb6e15ca1226feee1d6a0fa2c449f929b7 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 28 May 2026 14:36:33 -0700
Subject: [PATCH 26/52] C0: surface missing-log failures in runner-logs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
`mise run runner-logs -j ` previously silently succeeded with
no output when forgejo had no log for the task. Two layered causes:
1. zstdcat exits 0 even when the file is missing (writes "can't stat
… -- ignored" to stderr).
2. ssh to indri runs fish, which silently drops the remote exit code so
the subprocess returncode is always 0.
Probe `test -f` over SSH and parse a stdout marker (EXISTS / MISSING) to
detect the missing-log case, then report it explicitly with the indri
path and a hint about action_task.log_in_storage = 0 so the operator
knows where to look next.
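
The marker approach can be sketched in isolation as below. This is a simplified illustration, not the exact code in `mise-tasks/runner-logs`; the function names `last_marker` and `remote_file_exists` are made up for the example.

```python
import subprocess


def last_marker(stdout: str) -> str:
    """Return the last non-empty line of stdout, or '' if there is none.

    Taking the last line tolerates any banner text a login shell prints
    before the command output.
    """
    lines = stdout.strip().splitlines()
    return lines[-1].strip() if lines else ""


def remote_file_exists(host: str, path: str) -> bool:
    """Probe a remote path over SSH without trusting the exit code.

    The path is interpolated unquoted on purpose so that a leading '~'
    is still expanded by the remote shell.
    """
    probe = subprocess.run(
        ["ssh", host, f"test -f {path} && echo EXISTS || echo MISSING"],
        capture_output=True,
        text=True,
    )
    return last_marker(probe.stdout) == "EXISTS"
```

Anything other than an explicit `EXISTS` is treated as missing, so an SSH failure that produces no stdout also lands on the error path.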
Co-Authored-By: Claude Opus 4.7 (1M context)
---
.../+runner-logs-missing-log.misc.md | 1 +
mise-tasks/runner-logs | 25 ++++++++++++++++++-
2 files changed, 25 insertions(+), 1 deletion(-)
create mode 100644 docs/changelog.d/+runner-logs-missing-log.misc.md
diff --git a/docs/changelog.d/+runner-logs-missing-log.misc.md b/docs/changelog.d/+runner-logs-missing-log.misc.md
new file mode 100644
index 0000000..c06704a
--- /dev/null
+++ b/docs/changelog.d/+runner-logs-missing-log.misc.md
@@ -0,0 +1 @@
+`mise run runner-logs -j ` now reports a clear error when the log file doesn't exist on indri (e.g. a runner crash that left `action_task.log_in_storage = 0`). Previously it printed only the header and exited 0, because `zstdcat` exits 0 with a "can't stat … -- ignored" stderr message and ssh+fish on indri swallows the remote exit code.
diff --git a/mise-tasks/runner-logs b/mise-tasks/runner-logs
index 3c5e8e3..0d3028b 100755
--- a/mise-tasks/runner-logs
+++ b/mise-tasks/runner-logs
@@ -229,12 +229,35 @@ def fetch_log(run_number: int, job_index: int, repo: str, token: str) -> None:
hex_prefix = f"{task_id & 0xff:02x}"
log_path = f"~/forgejo/data/actions_log/{repo}/{hex_prefix}/{task_id}.log.zst"
+ # indri's login shell (fish) silently swallows SSH exit codes, so we can't
+ # rely on returncode. zstdcat itself also exits 0 with a "can't stat ...
+ # -- ignored" stderr message when the file is missing. Detect missing logs
+ # by running `test -f` over SSH and parsing the marker line from stdout.
+ probe = subprocess.run(
+ ["ssh", "indri", f"test -f {log_path} && echo EXISTS || echo MISSING"],
+ capture_output=True,
+ text=True,
+ )
+ marker = probe.stdout.strip().splitlines()[-1] if probe.stdout.strip() else ""
+ if marker != "EXISTS":
+ typer.echo(
+ f"Error: log not found for run #{run_number} job {job_index} (task {task_id})",
+ err=True,
+ )
+ typer.echo(f"Path: indri:{log_path}", err=True)
+ typer.echo(
+ "The runner may have crashed before uploading its log buffer "
+ "(action_task.log_in_storage = 0).",
+ err=True,
+ )
+ raise typer.Exit(1)
+
result = subprocess.run(
["ssh", "indri", f"zstdcat {log_path}"],
capture_output=True,
text=True,
)
- if result.returncode != 0:
+ if result.returncode != 0 or not result.stdout:
typer.echo(
f"Error: could not read log for run #{run_number} job {job_index} (task {task_id})",
err=True,
From ecded3007368e094baebeed10fbf2a3fe49aed90 Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 28 May 2026 14:51:09 -0700
Subject: [PATCH 27/52] Make valkey local on ringtail (nix amd64) + bump to
8.1.7 (#362)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Summary
Weekly "make one non-local container local" pickup: immich-ringtail still pulled `docker.io/valkey/valkey:8.1.6` because the existing `containers/valkey/container.py` build was arm64-only.
- Adds `containers/valkey/default.nix` — nix-built amd64 valkey image, packaged by the ringtail nix-container-builder runner using `pkgs.dockerTools.buildLayeredImage`. Mirrors the existing `containers/authentik-redis/default.nix` pattern.
- `containers/valkey/container.py` keeps building the Alpine arm64 image for paperless on indri. Bumped both builds to upstream valkey 8.1.7 (Alpine 3.22 now ships `8.1.7-r0`; nixpkgs has 8.1.7).
- Splits `VERSION` (upstream app) from `ALPINE_PIN` (apk pin) in `container.py` so both build files can declare the same upstream version and pass `container-version-check`.
- Updates `service-versions.yaml`: current-version 8.1.7, refreshed last-reviewed, upstream-source now points at the canonical valkey-io releases page.
- Switches kustomizations:
- `immich-ringtail/kustomization.yaml`: `docker.io/valkey/valkey:8.1.6` → `registry.ops.eblu.me/blumeops/valkey:v8.1.7-02859c5-nix`, comment updated.
- `paperless/kustomization.yaml`: `v8.1.6-r0-fabca04` → `v8.1.7-02859c5`.
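
Why the `VERSION` / `ALPINE_PIN` split helps can be shown with a small sketch. It assumes the Alpine pin is always the upstream version plus an `-rN` package revision; `upstream_from_apk_pin` is a hypothetical helper, and this is not how `container-version-check` itself is implemented.

```python
import re

VERSION = "8.1.7"        # upstream app version, shared with default.nix
ALPINE_PIN = "8.1.7-r0"  # exact apk package pin used by container.py


def upstream_from_apk_pin(pin: str) -> str:
    """Strip Alpine's trailing package revision ('-r0', '-r1', ...)."""
    return re.sub(r"-r\d+$", "", pin)


# Both build files can now declare the same upstream version string,
# while the apk pin keeps its Alpine-specific revision suffix.
assert upstream_from_apk_pin(ALPINE_PIN) == VERSION
```

Keeping the two values separate means an Alpine-only rebuild (`-r0` to `-r1`) changes `ALPINE_PIN` without touching the upstream version that the nix build and `service-versions.yaml` also declare.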
## Build
build-container run #563: both jobs succeeded after a transient runner crash on the first dispatch (#562 build-nix), which surfaced two separate bugs, both fixed in a separate C0 on main:
- `runner-logs` silently returned 0 with no output when the log file didn't exist on indri
- `ssh indri` swallowing remote exit codes (fish login shell), which the wrapper now works around via a stdout marker
## Test plan
- [ ] `argocd app set immich-ringtail --revision valkey-nix && argocd app sync immich-ringtail`
- [ ] `argocd app set paperless --revision valkey-nix && argocd app sync paperless`
- [ ] Both valkey pods come Ready and start serving on :6379
- [ ] Immich app + paperless can read/write their respective cache
- [ ] After merge: rebuild from squashed main commit + update kustomization tags (squash-tag follow-up)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: https://forge.eblu.me/eblume/blumeops/pulls/362
---
.../immich-ringtail/kustomization.yaml | 9 +++---
argocd/manifests/paperless/kustomization.yaml | 2 +-
containers/valkey/container.py | 15 +++++-----
containers/valkey/default.nix | 30 +++++++++++++++++++
docs/changelog.d/valkey-nix.infra.md | 1 +
service-versions.yaml | 15 +++++-----
6 files changed, 53 insertions(+), 19 deletions(-)
create mode 100644 containers/valkey/default.nix
create mode 100644 docs/changelog.d/valkey-nix.infra.md
diff --git a/argocd/manifests/immich-ringtail/kustomization.yaml b/argocd/manifests/immich-ringtail/kustomization.yaml
index c1f639e..7a97fef 100644
--- a/argocd/manifests/immich-ringtail/kustomization.yaml
+++ b/argocd/manifests/immich-ringtail/kustomization.yaml
@@ -21,8 +21,9 @@ images:
- name: ghcr.io/immich-app/immich-machine-learning
# CUDA variant of the same release — ringtail has an RTX 4080
newTag: v2.6.3-cuda
- # Using upstream multi-arch valkey image directly; the
- # registry.ops.eblu.me/blumeops/valkey mirror is arm64-only (built
- # on indri) and would crashloop on ringtail.
+ # amd64 valkey built via nix on the ringtail nix-container-builder
+ # (see containers/valkey/default.nix). The Alpine container.py build
+ # is arm64-only and serves paperless on indri.
- name: docker.io/valkey/valkey
- newTag: "8.1.6"
+ newName: registry.ops.eblu.me/blumeops/valkey
+ newTag: v8.1.7-02859c5-nix
diff --git a/argocd/manifests/paperless/kustomization.yaml b/argocd/manifests/paperless/kustomization.yaml
index 9c6a086..575dfb4 100644
--- a/argocd/manifests/paperless/kustomization.yaml
+++ b/argocd/manifests/paperless/kustomization.yaml
@@ -16,4 +16,4 @@ images:
newTag: v2.20.13-07f52e9
- name: docker.io/library/redis
newName: registry.ops.eblu.me/blumeops/valkey
- newTag: v8.1.6-r0-fabca04
+ newTag: v8.1.7-02859c5
diff --git a/containers/valkey/container.py b/containers/valkey/container.py
index 5d150e7..34e8524 100644
--- a/containers/valkey/container.py
+++ b/containers/valkey/container.py
@@ -1,8 +1,8 @@
-"""Valkey — native Dagger build.
+"""Valkey — native Dagger build (arm64, indri).
Alpine 3.22 base with the `valkey` apk package (8.1.x — Redis-compatible).
-Mirrors `docker.io/valkey/valkey:8.1-alpine`, used by paperless and immich
-as a cache/queue sidecar.
+Used by paperless (sidecar) on indri. immich on ringtail uses the
+nix-built amd64 variant from `default.nix` in this directory.
"""
import dagger
@@ -10,9 +10,10 @@ from dagger import dag
from blumeops.containers import oci_labels
-# Alpine 3.22 ships valkey 8.1.6-r0. Alpine 3.23 jumps to 9.0 — hold on 3.22
-# to keep this a 1:1 swap for the upstream `valkey:8.1-alpine` image.
-VERSION = "8.1.6-r0"
+# Alpine 3.22 currently ships valkey 8.1.7-r0. Alpine 3.23 jumps to 9.0 —
+# hold on 3.22 to keep this aligned with the 8.1 line.
+VERSION = "8.1.7"
+ALPINE_PIN = "8.1.7-r0"
ALPINE_BASE = "alpine:3.22"
@@ -21,7 +22,7 @@ async def build(src: dagger.Directory) -> dagger.Container:
ctr = (
dag.container()
.from_(ALPINE_BASE)
- .with_exec(["apk", "add", "--no-cache", f"valkey={VERSION}"])
+ .with_exec(["apk", "add", "--no-cache", f"valkey={ALPINE_PIN}"])
.with_exec(["mkdir", "-p", "/data"])
.with_exec(["chown", "valkey:valkey", "/data"])
.with_workdir("/data")
diff --git a/containers/valkey/default.nix b/containers/valkey/default.nix
new file mode 100644
index 0000000..9cb1713
--- /dev/null
+++ b/containers/valkey/default.nix
@@ -0,0 +1,30 @@
+# Nix-built Valkey for ringtail (amd64)
+# Companion to container.py (Alpine 3.22, arm64 on indri).
+# Used by immich-ringtail which needs an amd64 image; paperless on indri
+# continues to use the Alpine container.py build.
+#
+# The version assertion ensures nix-build fails if a flake.lock update
+# changes the Valkey version — forcing an explicit version acknowledgment
+# here and in service-versions.yaml (enforced by container-version-check).
+{ pkgs ? import <nixpkgs> { } }:
+
+let
+ version = "8.1.7";
+in
+
+assert pkgs.valkey.version == version;
+
+pkgs.dockerTools.buildLayeredImage {
+ name = "blumeops/valkey";
+ contents = [
+ pkgs.valkey
+ ];
+
+ config = {
+ Entrypoint = [ "${pkgs.valkey}/bin/valkey-server" ];
+ Cmd = [ "--bind" "0.0.0.0" "--protected-mode" "no" "--dir" "/data" ];
+ ExposedPorts = {
+ "6379/tcp" = { };
+ };
+ };
+}
diff --git a/docs/changelog.d/valkey-nix.infra.md b/docs/changelog.d/valkey-nix.infra.md
new file mode 100644
index 0000000..e41eb63
--- /dev/null
+++ b/docs/changelog.d/valkey-nix.infra.md
@@ -0,0 +1 @@
+Add nix-built amd64 valkey for ringtail (`containers/valkey/default.nix`) so immich-ringtail can stop pulling the upstream multi-arch `docker.io/valkey/valkey` image. Existing `container.py` continues to build Alpine arm64 for paperless on indri. Both bump to valkey 8.1.7 (Alpine 3.22 8.1.7-r0 / nixpkgs 8.1.7).
diff --git a/service-versions.yaml b/service-versions.yaml
index 63b0f15..5440f01 100644
--- a/service-versions.yaml
+++ b/service-versions.yaml
@@ -146,14 +146,15 @@ services:
- name: valkey
type: argocd
- last-reviewed: 2026-05-01
- current-version: "8.1.6-r0"
- upstream-source: https://pkgs.alpinelinux.org/package/v3.22/community/aarch64/valkey
+ last-reviewed: 2026-05-28
+ current-version: "8.1.7"
+ upstream-source: https://github.com/valkey-io/valkey/releases
notes: >-
- Shared Alpine-built valkey image, used as a sidecar/cache by paperless
- (sidecar) and immich (separate Deployment). Mirrors the upstream
- docker.io/valkey/valkey:8.1-alpine. Pinned to Alpine 3.22 for valkey 8.1.x;
- Alpine 3.23 jumps to 9.0. Distinct from authentik-redis (nix-built Redis
+ Dual-build valkey image: container.py builds Alpine 3.22 + apk valkey
+ (arm64, indri) for paperless; default.nix builds via nixpkgs (amd64,
+ ringtail) for immich-ringtail. Both track upstream valkey 8.1.x; Alpine
+ 3.22 currently ships 8.1.7-r0 and nixpkgs valkey is 8.1.7. Alpine 3.23
+ jumps to 9.0. Distinct from authentik-redis (nix-built Redis
8.x) which has its own entry.
- name: external-secrets
From f588638331567d921e189cbff25db5425ccebaef Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Thu, 28 May 2026 14:53:21 -0700
Subject: [PATCH 28/52] C0: rebuild valkey from squashed main commit
Image tags from PR #362 (v8.1.7-02859c5{,-nix}) referenced a branch
SHA that no longer exists on main after squash-merge. Rebuilt both
the dagger arm64 and nix amd64 variants from the squashed commit
(ecded30) and updated paperless + immich-ringtail to the new tags.
Co-Authored-By: Claude Opus 4.7 (1M context)
---
argocd/manifests/immich-ringtail/kustomization.yaml | 2 +-
argocd/manifests/paperless/kustomization.yaml | 2 +-
docs/changelog.d/+valkey-rebuild-on-main.infra.md | 1 +
3 files changed, 3 insertions(+), 2 deletions(-)
create mode 100644 docs/changelog.d/+valkey-rebuild-on-main.infra.md
diff --git a/argocd/manifests/immich-ringtail/kustomization.yaml b/argocd/manifests/immich-ringtail/kustomization.yaml
index 7a97fef..2fa131c 100644
--- a/argocd/manifests/immich-ringtail/kustomization.yaml
+++ b/argocd/manifests/immich-ringtail/kustomization.yaml
@@ -26,4 +26,4 @@ images:
# is arm64-only and serves paperless on indri.
- name: docker.io/valkey/valkey
newName: registry.ops.eblu.me/blumeops/valkey
- newTag: v8.1.7-02859c5-nix
+ newTag: v8.1.7-ecded30-nix
diff --git a/argocd/manifests/paperless/kustomization.yaml b/argocd/manifests/paperless/kustomization.yaml
index 575dfb4..3cd0d74 100644
--- a/argocd/manifests/paperless/kustomization.yaml
+++ b/argocd/manifests/paperless/kustomization.yaml
@@ -16,4 +16,4 @@ images:
newTag: v2.20.13-07f52e9
- name: docker.io/library/redis
newName: registry.ops.eblu.me/blumeops/valkey
- newTag: v8.1.7-02859c5
+ newTag: v8.1.7-ecded30
diff --git a/docs/changelog.d/+valkey-rebuild-on-main.infra.md b/docs/changelog.d/+valkey-rebuild-on-main.infra.md
new file mode 100644
index 0000000..c743e61
--- /dev/null
+++ b/docs/changelog.d/+valkey-rebuild-on-main.infra.md
@@ -0,0 +1 @@
+Rebuild valkey container from squashed main commit (both arm64 dagger and amd64 nix variants), and update paperless + immich-ringtail kustomizations to the main-SHA tags `v8.1.7-ecded30` and `v8.1.7-ecded30-nix`.
From e0064de83d0d15a1f34f16146542a62817dca3ef Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Mon, 1 Jun 2026 15:52:09 -0700
Subject: [PATCH 29/52] C0: update ringtail flake inputs (nixpkgs, disko)
Co-Authored-By: Claude Opus 4.8 (1M context)
---
.../+ringtail-flake-update-2026-06-01.infra.md | 4 ++++
nixos/ringtail/flake.lock | 12 ++++++------
2 files changed, 10 insertions(+), 6 deletions(-)
create mode 100644 docs/changelog.d/+ringtail-flake-update-2026-06-01.infra.md
diff --git a/docs/changelog.d/+ringtail-flake-update-2026-06-01.infra.md b/docs/changelog.d/+ringtail-flake-update-2026-06-01.infra.md
new file mode 100644
index 0000000..dd488b6
--- /dev/null
+++ b/docs/changelog.d/+ringtail-flake-update-2026-06-01.infra.md
@@ -0,0 +1,4 @@
+Update the ringtail NixOS flake lockfile (`nixos/ringtail/flake.lock`): bump
+`nixpkgs` (b77b3de → 25f5383) and `disko` (5ba0c95 → 115e521) to latest.
+`nixpkgs-services` was intentionally left pinned (skipped by the
+`flake-update` pipeline). Routine recurring maintenance per [[manage-lockfile]].
diff --git a/nixos/ringtail/flake.lock b/nixos/ringtail/flake.lock
index 0f0da7e..bb60501 100644
--- a/nixos/ringtail/flake.lock
+++ b/nixos/ringtail/flake.lock
@@ -7,11 +7,11 @@
]
},
"locked": {
- "lastModified": 1779699611,
- "narHash": "sha256-EcCaSTKnmg2o4wLKaN1aqQFomwyhO7ik0bX9COdyCas=",
+ "lastModified": 1780290312,
+ "narHash": "sha256-eTAlX0CwgB84Ts3GaBd944A3DRXVMzgA0EqroZBISUo=",
"owner": "nix-community",
"repo": "disko",
- "rev": "5ba0c9555c28685e57fa54c7a25e42c7efdbfc8d",
+ "rev": "115e5211780054d8a890b41f0b7734cafad54dfe",
"type": "github"
},
"original": {
@@ -43,11 +43,11 @@
},
"nixpkgs": {
"locked": {
- "lastModified": 1779467186,
- "narHash": "sha256-nOesoDCiXcUftqbRBMz9tt4blI5PvljMWbm3kuCA+0s=",
+ "lastModified": 1779796641,
+ "narHash": "sha256-ZsIrKmhp4vbBXoXXmR/tBXA/UCsAQiJL9vsgZEduhVY=",
"owner": "NixOS",
"repo": "nixpkgs",
- "rev": "b77b3de8775677f84492abe84635f87b0e153f0f",
+ "rev": "25f538306313eae3927264466c70d7001dcea1df",
"type": "github"
},
"original": {
From a36a18aaa6714e187834edc09eb2fc565d0f5fbb Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Mon, 1 Jun 2026 20:52:20 -0700
Subject: [PATCH 30/52] C0: black-hole /mirrors/* at Fly edge + name-and-shame
scrapers
A $29.60 Fly bill traced to ~1.25 TB/30d egress on forge.eblu.me (99.95% of
all proxy egress), ~71% of it AI scrapers (Meta meta-externalagent, OpenAI
GPTBot, Amazonbot, Bytespider) crawling the public mirror repos' infinite
git-history URL space and timing out Forgejo. robots.txt already disallowed
/mirrors/ but those agents ignore it, so enforce at the edge: return 403 (^~
to beat the regex asset locations), served as a roll-of-dishonour page with an
X-Naughty-Scrapers header. Mirrors stay reachable on the tailnet via
forge.ops.eblu.me. Tier 2 (UA denylist + Anubis) and the Cloudflare rejection
are documented in docs/explanation/ai-scraper-mitigation.md.
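
Why `^~` is needed can be illustrated with a simplified model of nginx's documented location selection (exact-match `=` locations are left out). This is a sketch, not nginx itself: the `Location` class, `select_location`, and the asset regex are hypothetical stand-ins for the real blocks in `fly/nginx.conf`, and Python's `re` only approximates PCRE.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    modifier: str  # "" (plain prefix), "^~" (prefix, skip regex), "~", "~*"
    pattern: str


def select_location(uri: str, locations: list) -> Optional[Location]:
    """Pick a location roughly the way nginx does for prefix/regex blocks."""
    prefixes = [
        loc for loc in locations
        if loc.modifier in ("", "^~") and uri.startswith(loc.pattern)
    ]
    best = max(prefixes, key=lambda loc: len(loc.pattern), default=None)
    # A '^~' longest-prefix match ends the search before any regex is tried.
    if best is not None and best.modifier == "^~":
        return best
    # Otherwise regex locations are checked in config order; first match wins.
    for loc in locations:
        if loc.modifier == "~" and re.search(loc.pattern, uri):
            return loc
        if loc.modifier == "~*" and re.search(loc.pattern, uri, re.IGNORECASE):
            return loc
    return best


ASSETS = Location("~*", r"\.(css|js)$")
uri = "/mirrors/tailscale/assets/app.css"

plain = [Location("", "/"), ASSETS, Location("", "/mirrors/")]
guarded = [Location("", "/"), ASSETS, Location("^~", "/mirrors/")]

print(select_location(uri, plain).pattern)    # the asset regex wins
print(select_location(uri, guarded).pattern)  # the /mirrors/ block wins
```

In the `plain` configuration the asset regex takes the request and content under `/mirrors/` would still be served; in the `guarded` configuration the prefix block handles it and the 403 applies.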
Co-Authored-By: Claude Opus 4.8 (1M context)
---
.../+ai-scraper-mitigation-doc.doc.md | 1 +
.../+forge-mirrors-blackhole.infra.md | 1 +
docs/explanation/ai-scraper-mitigation.md | 201 ++++++++++++++++++
docs/tutorials/expose-service-publicly.md | 7 +
fly/Dockerfile | 1 +
fly/naughty.html | 64 ++++++
fly/nginx.conf | 27 +++
7 files changed, 302 insertions(+)
create mode 100644 docs/changelog.d/+ai-scraper-mitigation-doc.doc.md
create mode 100644 docs/changelog.d/+forge-mirrors-blackhole.infra.md
create mode 100644 docs/explanation/ai-scraper-mitigation.md
create mode 100644 fly/naughty.html
diff --git a/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md b/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md
new file mode 100644
index 0000000..246fedb
--- /dev/null
+++ b/docs/changelog.d/+ai-scraper-mitigation-doc.doc.md
@@ -0,0 +1 @@
+Add `docs/explanation/ai-scraper-mitigation.md` — the egress-cost / AI-crawler threat model for the public Fly proxy, the tiered mitigation plan (Tier 1: mirror black-hole, shipped; Tier 2: user-agent denylist + Anubis; Tier 3: Cloudflare, rejected on principle), and the data behind it.
diff --git a/docs/changelog.d/+forge-mirrors-blackhole.infra.md b/docs/changelog.d/+forge-mirrors-blackhole.infra.md
new file mode 100644
index 0000000..29a5e6a
--- /dev/null
+++ b/docs/changelog.d/+forge-mirrors-blackhole.infra.md
@@ -0,0 +1 @@
+Black-hole the `/mirrors/*` repositories at the Fly proxy edge (`return 403`, pointing humans at `forge.ops.eblu.me`). A surprise $29.60 Fly bill traced to ~1.25 TB/30d of egress on `forge.eblu.me` (99.95% of all proxy egress), of which ~71% was `/mirrors/*` traffic from AI scrapers (Meta `meta-externalagent`, OpenAI `GPTBot`, Amazonbot) crawling the near-infinite git-history URL space of the public mirror repos and timing out Forgejo in the process. Mirrors exist for supply-chain control and are consumed over the tailnet, so their public web UI had no legitimate audience. `robots.txt` already disallowed `/mirrors/`, but the offending agents ignore it. Tier-2 mitigations (user-agent denylist, Anubis proof-of-work gateway) are documented in `docs/explanation/ai-scraper-mitigation.md`.
diff --git a/docs/explanation/ai-scraper-mitigation.md b/docs/explanation/ai-scraper-mitigation.md
new file mode 100644
index 0000000..fe4ba3d
--- /dev/null
+++ b/docs/explanation/ai-scraper-mitigation.md
@@ -0,0 +1,201 @@
+---
+title: AI Scraper Mitigation
+modified: 2026-06-01
+last-reviewed: 2026-06-01
+tags:
+ - explanation
+ - fly-io
+ - forgejo
+ - security
+ - networking
+---
+
+# AI Scraper Mitigation on the Public Proxy
+
+> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words — these serve as placeholders to establish the documentation structure.
+
+How BlumeOps keeps AI crawlers from running up the [[expose-service-publicly|Fly.io proxy]] egress bill and DoS-ing [[forgejo|Forgejo]] on [[indri]].
+
+## The incident
+
+A $29.60 Fly.io invoice arrived, about two-thirds of it a single line:
+
+```
+Bandwidth: Egress (iad) — 958,524,714,138 bytes — $19.17
+```
+
+The `iad` (Ashburn) region is a red herring: the proxy machine runs in `sjc`,
+but Fly bills egress at the edge PoP nearest the *client*, so `iad` just means
+"the traffic went to clients on the US East Coast."
+
+Tracing it through the nginx access logs (shipped to Loki via [[alloy|Alloy]]):
+
+| Signal | Value |
+|--------|-------|
+| Total proxy egress (30d) | ~1.25 TB |
+| Share that was `forge.eblu.me` | **99.95%** |
+| Share of forge egress that was `/mirrors/*` | **~71%** |
+| Share that was declared AI bots | **~85%+** |
+| Top offenders | Meta `meta-externalagent` (66% of bytes), OpenAI `GPTBot` (16%), Amazonbot, Bytespider |
+| Forgejo `5xx` (upstream timeouts) | tens of thousands/day, spiking to 112k |
+
+The crawlers were walking [[forgejo|Forgejo]]'s git-history browse endpoints —
+`src/commit/`, `commits/`, `blame/`, `raw/commit/`, plus `.patch`/`.diff`
+and `?page=N` pagination. That URL space is effectively **infinite**: every
+file × every commit × every page, multiplied across every mirrored repo. A
+crawler that follows links never finishes, and every page is a cache `MISS`
+that both tunnels to indri *and* bills as egress.
+
+Two distinct harms, not one:
+
+1. **Cost** — ~1.25 TB/mo of egress on a free-tier-ish proxy.
+2. **Availability** — the crawl alone generates ~400–530k requests/day,
+ enough to time out Forgejo regardless of how much RAM [[indri]] has. Moving
+ egress elsewhere would *not* fix this; the crawl has to be throttled at the
+ source.
+
+`robots.txt` already `Disallow`s `/mirrors/`, `/user/`, and archive/download
+paths — but **`meta-externalagent` and `GPTBot` ignore it.** For these agents,
+`robots.txt` is a dead letter, which is why edge enforcement is required.
+
+## The tiered plan
+
+### Tier 1 — Black-hole `/mirrors/*` (shipped)
+
+The mirror repositories (`tailscale`, `prometheus`, `mealie`, `paperless-ngx`,
+…) are mirrors of *already-public upstreams*, kept for supply-chain control
+(see [[spork-strategy]] and the container/mirror story in [[why-gitops]]). They
+are consumed by CI, gilbert, and other tailnet clients over
+`forge.ops.eblu.me`. Their web UI on the public internet served **no
+legitimate audience** — only scrapers. So the proxy now returns `403` for
+anything under `/mirrors/`, pointing humans at the tailnet host:
+
+```nginx
+location ^~ /mirrors/ {
+ return 403 "Mirror repositories are tailnet-only — use forge.ops.eblu.me.\n";
+}
+```
+
+The `^~` modifier matters: without it, the regex `location` blocks for static
+assets (`*.css`, `*.js`, release downloads) would override the prefix match and
+leak content under `/mirrors/`. `^~` tells nginx to stop at the longest prefix
+match and skip the regex round.
+
+This is config, not bot-fighting — we simply stopped serving an infinite
+tarpit to the world. It removes ~71% of forge egress and a large share of the
+upstream timeouts, with zero impact on any human or tailnet consumer. It
+mirrors the existing tailnet-only blocks for `/api/packages/` and `/swagger`.
+
+The `403` is also a small act of public shaming. Blocked requests are served a
+"roll of dishonour" page (`fly/naughty.html`, status kept at `403` via
+`error_page 403 /naughty.html`) that names the offending operators and their
+share of the stolen bytes, and every response carries an `X-Naughty-Scrapers`
+header:
+
+```
+X-Naughty-Scrapers: OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers
+```
+
+Petty? A little. But it costs nothing, documents *why* the block exists for the
+next person who hits it, and the page is a few KB versus the megabytes of git
+HTML the crawlers were taking.
+
+**Trade-off accepted:** mirror release-artifact downloads over WAN now also
+`403`. Legitimate consumers already pull these over the tailnet, and the public
+exposure was the same crawl liability, so this is intentional.
+
+### Tier 2 — Defend the repos that *stay* public (planned)
+
+`/eblume/*` is intentionally public (a public profile is a feature). But the
+same git-history endpoints are still a tarpit there, just lower-volume. Two
+layers, in increasing order of effort and effectiveness:
+
+#### 2a. User-agent denylist (cheap, evadable)
+
+Block the declared AI crawlers at the edge regardless of path:
+
+```nginx
+# Illustrative — not yet deployed.
+map $http_user_agent $is_ai_bot {
+ default 0;
+ "~*meta-externalagent" 1;
+ "~*GPTBot" 1;
+ "~*ClaudeBot" 1;
+ "~*Amazonbot" 1;
+ "~*Bytespider" 1;
+ "~*SemrushBot" 1;
+}
+# in the forge.eblu.me server block:
+if ($is_ai_bot) { return 403; }
+```
+
+This catches ~85% of *current* traffic for a few lines of config. It is
+trivially evadable — a scraper need only spoof a browser UA — so it is a
+speed-bump, not a wall. Keep `robots.txt` too: well-behaved crawlers
+(Googlebot, Bingbot) do honor it, and it documents intent.
+
+#### 2b. Anubis proof-of-work gateway (the real wall)
+
+[Anubis](https://github.com/TecharoHQ/anubis) is a Go reverse proxy that
+weighs each request with a browser-based proof-of-work challenge before passing
+it upstream. It was written for *exactly this scenario* — its author built it
+after Amazon's scraper took down their Git server — and is widely deployed in
+front of Forgejo/Gitea (Codeberg, the UN, etc.). Headless scrapers that can't
+run the challenge JS never reach the application; humans clear it once and
+proceed.
+
+Why it fits BlumeOps better than the alternatives:
+
+- **It attacks cost *and* availability at once.** Bots receive a few-KB
+ challenge page instead of MB of git HTML (egress collapses) and never reach
+ Forgejo (timeouts collapse). No other single lever does both.
+- **It stays in-house.** No third party terminates our TLS or sees our
+ traffic.
+
+Placement options:
+
+| Where | Pros | Cons |
+|-------|------|------|
+| On [[indri]], between [[caddy|Caddy]] and Forgejo | Protects every path and every entry (WAN *and* tailnet); one config | Adds a hop and a service to the indri critical path; the challenge page still tunnels back through Fly for WAN clients (small egress) |
+| On the Fly proxy machine, in front of nginx | Challenge served at the edge — bots never even tunnel to indri | Fly VM is small (512 MB); another moving part in the boot sequence alongside `tailscaled`/nginx/`fail2ban`/Alloy |
+
+Leaning toward Caddy-side on indri for simplicity and uniform coverage, but
+this is the open design question for Tier 2. Anubis is MIT-licensed and the
+author has signalled a future move to an `equi-x`-based challenge, so pin a
+version and track upstream.
+
+### Tier 3 — Move egress off Fly entirely (rejected)
+
+A Cloudflare Tunnel (`cloudflared` on indri → Cloudflare
+edge) would make this a non-problem on the cost axis: Cloudflare does not meter
+proxied bandwidth, and it bundles free AI-bot mitigation (Bot Fight Mode, the
+"block AI scrapers" toggle, Managed Challenge, AI Labyrinth). One move would
+zero the egress bill and add bot defense.
+
+**We are not doing this, on principle.** Cloudflare is a solid platform and a
+defensible engineering choice — but it already sits in front of an enormous
+fraction of the modern web, and routing BlumeOps through it would add one more
+site to the pile of the internet that one company can see and gate. BlumeOps
+deliberately keeps its own backbone ([[expose-service-publicly|Fly + Tailscale +
+Caddy]], DNS at [[gandi|Gandi]]; see the "no Cloudflare dependency" line in
+that doc). This is a values decision, not a technical one: we would rather pay
+a few dollars and run our own mitigation than centralize on Cloudflare.
+
+It is also worth noting that **Tier 3 would not, by itself, fix the upstream
+timeouts** — free egress just means we'd stop *caring* that bots crawl, while
+they continued to hammer Forgejo. Crawl mitigation (Tier 1 + Tier 2) is
+required regardless of where egress is billed.
+
+## Summary
+
+| Tier | Lever | Cost | Availability | Status |
+|------|-------|------|--------------|--------|
+| 1 | Black-hole `/mirrors/*` at edge | −~71% | big drop | **shipped** |
+| 2a | UA denylist on remaining repos | −most of the rest | further drop | planned |
+| 2b | Anubis PoW gateway | −near-total | near-total | planned |
+| 3 | Cloudflare Tunnel | −total | needs 2b anyway | **rejected (principle)** |
+
+The guiding insight: the cheapest, lowest-risk mitigation is to **not serve an
+infinite-URL surface that has no human audience.** Everything past Tier 1 is
+about defending the surface we *do* want public, in-house, without ceding
+control of our traffic to a third party.
diff --git a/docs/tutorials/expose-service-publicly.md b/docs/tutorials/expose-service-publicly.md
index 886cad4..65af611 100644
--- a/docs/tutorials/expose-service-publicly.md
+++ b/docs/tutorials/expose-service-publicly.md
@@ -376,6 +376,13 @@ Mitigations for dynamic services:
- fail2ban on indri (see below) can block IPs showing abuse patterns
- The break-glass shutoff remains the last resort
+The most acute version of this in practice has been **AI scrapers**, which
+ignore `robots.txt` and crawl dynamic services (notably [[forgejo|Forgejo]]'s
+infinite git-history URL space) into both a surprise egress bill and an
+effective L7 DoS. See [[ai-scraper-mitigation]] for the incident, the tiered
+defense (mirror black-hole, user-agent denylist, Anubis proof-of-work), and
+why a Cloudflare Tunnel is *not* the chosen answer here.
+
If a publicly exposed dynamic service attracts targeted attacks or the
home network bandwidth is impacted, consider migrating to Cloudflare
Tunnel for enterprise-grade DDoS protection (requires DNS migration;
diff --git a/fly/Dockerfile b/fly/Dockerfile
index d4e7a18..406c849 100644
--- a/fly/Dockerfile
+++ b/fly/Dockerfile
@@ -25,6 +25,7 @@ COPY fail2ban/action.d/nginx-deny.conf /etc/fail2ban/action.d/nginx-deny.conf
COPY nginx.conf /etc/nginx/nginx.conf
COPY error.html /usr/share/nginx/html/error.html
+COPY naughty.html /usr/share/nginx/html/naughty.html
COPY alloy.river /etc/alloy/config.alloy
COPY start.sh /start.sh
RUN chmod +x /start.sh
diff --git a/fly/naughty.html b/fly/naughty.html
new file mode 100644
index 0000000..d899171
--- /dev/null
+++ b/fly/naughty.html
@@ -0,0 +1,64 @@
+
+
+
+
+
+
+ 403 · Roll of Dishonour
+
+
+
+
+ 🪤 403 — you walked into the scraper trap
+ These are mirror repositories. They are tailnet-only.
+
+
+ This path used to serve the web UI for mirrors of public upstream
+ projects. It exists for supply-chain control, not for crawling. A
+ robots.txt politely disallowed /mirrors/.
+ A pack of AI scrapers ignored it, walked the infinite git-history URL
+ space, and ran up ~1.25 TB of egress and a real
+ money bill in a single month — while timing out the server for everyone
+ else.
+
+
+ So /mirrors/ is closed at the edge now. Roll of dishonour,
+ by share of the bytes they stole:
+
+
+ | Operator | User-Agent | Bytes |
+
+ | Meta | meta-externalagent | 66% |
+ | OpenAI | GPTBot | 16% |
+ | Amazon | Amazonbot | 3% |
+ | ByteDance | Bytespider | 1% |
+
+
+
+
+ If you are a human who actually wanted these mirrors, they are reachable
+ from the tailnet at forge.ops.eblu.me. If you are a crawler:
+ read the robots.txt next time. We left you a header, too.
+
+
+
+
+
+
diff --git a/fly/nginx.conf b/fly/nginx.conf
index 570e6c9..ec35774 100644
--- a/fly/nginx.conf
+++ b/fly/nginx.conf
@@ -215,6 +215,33 @@ http {
return 403 "API documentation is only available at forge.ops.eblu.me (tailnet).\n";
}
+ # Black-hole the mirror repositories on WAN. These are mirrors of
+ # already-public upstreams (tailscale, prometheus, mealie, …) kept
+ # for supply-chain control; CI, gilbert, and tailnet clients consume
+ # them via forge.ops.eblu.me. Their web UI had no real audience other
+ # than AI scrapers, which crawled the near-infinite git-history URL
+ # space (src/commit, commits, blame, raw) and drove ~70% of Fly
+ # egress (1.24 TB/30d → a surprise bill) plus enough upstream load to
+ # time out Forgejo. robots.txt already Disallows /mirrors/, but
+ # meta-externalagent and GPTBot ignore it — so enforce at the edge.
+ # `^~` makes this win over the regex locations below (e.g. *.css), so
+ # static assets under /mirrors/ can't leak through. We also name and
+ # shame: blocked requests get a "roll of dishonour" page (403 status
+ # preserved) and an X-Naughty-Scrapers header. See
+ # docs/explanation/ai-scraper-mitigation.md.
+ location ^~ /mirrors/ {
+ error_page 403 /naughty.html;
+ return 403;
+ }
+
+ # Roll of dishonour — served on the /mirrors/ 403, status kept at 403.
+ location = /naughty.html {
+ internal;
+ root /usr/share/nginx/html;
+ add_header X-Naughty-Scrapers "OpenAI/GPTBot, Meta/meta-externalagent, Amazonbot, ByteDance/Bytespider — robots.txt ignorers" always;
+ add_header X-Clacks-Overhead "GNU Terry Pratchett" always;
+ }
+
# Redirect archive endpoints to tailnet — archive requests generate full
# git bundles on demand. Unauthenticated crawlers hitting unique commit
# SHAs cause unbounded CPU and disk usage (DoS vector). Legitimate users
From 40bd92982015582cb7aa2680c6dc8412706498fb Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Mon, 1 Jun 2026 20:55:05 -0700
Subject: [PATCH 31/52] C0: remove visible GNU Terry Pratchett from
naughty.html body
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
GNU lives in the overhead — the X-Clacks-Overhead header — never on the
visible page. Keep the header, drop the footer.
Co-Authored-By: Claude Opus 4.8 (1M context)
---
fly/naughty.html | 3 ---
1 file changed, 3 deletions(-)
diff --git a/fly/naughty.html b/fly/naughty.html
index d899171..b6eada8 100644
--- a/fly/naughty.html
+++ b/fly/naughty.html
@@ -21,7 +21,6 @@
td.share { color: #f2c14e; text-align: right; font-variant-numeric: tabular-nums; }
.name { color: #e8867a; }
a { color: #7fb3d5; }
- footer { margin-top: 2rem; color: #5c574f; font-size: .85rem; }
@@ -57,8 +56,6 @@
from the tailnet at forge.ops.eblu.me. If you are a crawler:
read the robots.txt next time. We left you a header, too.
-
-