diff --git a/AGENTS.md b/AGENTS.md index 510176d..c64af40 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -12,9 +12,10 @@ blumeops is Erich Blume's GitOps repository for personal infrastructure, orchest ## Rules -1. **Start every task by finding and reading the relevant docs** - Search `docs/` for cards related to the change area (grep for titles/tags, follow `[[wiki-links]]`) and read what you find before acting. Wiki-links refer to cards under `docs/` by filename stem. - For problems with a very large surface area, `mise run ai-sources` concatenates all non-doc source files (~270K tokens) — opt-in only, confirm with the user before loading it wholesale; targeted reading is usually better. +1. **Always run `mise run ai-docs` at session start** + This will refresh your context with important information you will be assumed to know and follow. + **Read the full output** — never truncate, pipe to `head`/`tail`, or skip sections. + For problems with a large surface area, ask the user if `mise run ai-sources` should also be run — it concatenates all non-doc source files (~270K tokens) for deep codebase context. 2. **Always use `--context=minikube-indri` with kubectl** (or `--context=k3s-ringtail` for ringtail services) - work contexts must never be touched **NEVER run `minikube delete`** — it destroys all PVs, etcd, and cluster state. Use `minikube stop`/`minikube start` for restarts. If minikube is stuck, see [[restart-indri]]. Full rebuild from scratch requires the DR procedure in [[rebuild-minikube-cluster]]. 3. **Classify the change as C0/C1/C2 before starting** (see below) — this determines branching and PR requirements @@ -68,7 +69,7 @@ See [[agent-change-process]] for the full methodology. ~/code/3rd/ # mirrored external projects ~/code/work # FORBIDDEN ``` -This is just an overview — explore `docs/` for the rest. When you +Other code paths will be listed via ai-docs, this is just an overview. When you encounter wiki-links (`[[like-this]]`) it is referring to docs/ cards. ## Service Deployment @@ -147,42 +148,13 @@ Create a new spork: `mise run spork-create ` ## Task Discovery BlumeOps tasks live in [hephaestus](https://github.com/eblume/hephaestus) (`heph`), -the user's self-hosted context/task system. The CLI is a thin client of the -local `hephd` daemon. (This replaced the retired `blumeops-tasks` mise task, -which read from Todoist.) - -### Reading tasks +the user's self-hosted context/task system. Fetch them with the CLI: ```fish -heph list --project Blumeops --json # outstanding Blumeops tasks as JSON -heph next # tactical "what is next?" ranking -heph show # one task with its scalars -heph context # print the task's canonical-context doc -heph log # print the task's latest log entries +heph list --project Blumeops --json # outstanding Blumeops tasks as JSON ``` -JSON rows carry `node_id` (use this as `` in all commands below), `title`, -`state`, `do_date`/`late_on` (epoch ms), `recurrence` (RFC-5545), and -`attention` (red|orange|white|blue — a1–a4 urgency tiers; blue = on-deck). - -### Manipulating tasks - -```fish -heph done # mark done (recurring tasks roll forward) -heph drop # mark dropped -heph skip # skip a recurring task's current occurrence -heph log "text" # append a log entry -heph context --append "…" # append to the canonical-context doc (--body replaces; `-` reads stdin) -heph edit --do-date +3d # reschedule; also --late-on/--recur/--attention/--project (`none` clears) -heph task "Title" --project Blumeops --do-date fri --attention white # create a task -``` - -Date forms: `today|tomorrow|+3d|fri|YYYY-MM-DD`. Recurrence: presets -(`daily|weekly|monthly|yearly|weekdays`) or natural language (`"every 3 days"`). - -Conventions: don't save TODOs to agent memory — file them as heph tasks under -the Blumeops project. When completing a recurring chore (e.g. "BlumeOps doc -review"), `heph log` a short note of what was done, then `heph done` it. +(This replaced the retired `blumeops-tasks` mise task, which read from Todoist.) Most operational scripts are stored in `./mise-tasks/`. For scripts with any logic or complexity, use uv run --script 's with explicit dependencies. Complex diff --git a/argocd/manifests/prowler/cronjob-iac-scan.yaml b/argocd/manifests/prowler/cronjob-iac-scan.yaml new file mode 100644 index 0000000..c1303a5 --- /dev/null +++ b/argocd/manifests/prowler/cronjob-iac-scan.yaml @@ -0,0 +1,54 @@ +--- +apiVersion: batch/v1 +kind: CronJob +metadata: + name: prowler-iac-scan + namespace: prowler +spec: + schedule: "0 2 * * 6" # Saturday 2am + concurrencyPolicy: Forbid + jobTemplate: + spec: + ttlSecondsAfterFinished: 604800 # Auto-delete after 7 days + template: + spec: + securityContext: + seccompProfile: + type: RuntimeDefault + containers: + - name: prowler + image: registry.ops.eblu.me/blumeops/prowler:kustomized + command: ["/bin/sh", "-c"] + # Prowler's --mutelist-file is a no-op for the IaC provider + # (it delegates to Trivy). The Prowler image's trivy shim + # injects --ignorefile $TRIVY_IGNOREFILE when set; see + # containers/prowler/Dockerfile. + env: + - name: TRIVY_IGNOREFILE + value: /mutelist/trivyignore.yaml + args: + - | + DATEDIR=/reports/prowler-iac/$(date +%Y-%m-%d) + mkdir -p "$DATEDIR" + prowler iac \ + --scan-repository-url https://forge.ops.eblu.me/eblume/blumeops.git \ + -z \ + --output-formats html csv json-ocsf \ + --output-directory "$DATEDIR" + volumeMounts: + - name: reports + mountPath: /reports + - name: mutelist + mountPath: /mutelist + readOnly: true + restartPolicy: OnFailure + volumes: + - name: reports + persistentVolumeClaim: + claimName: prowler-reports + - name: mutelist + configMap: + name: prowler-mutelist + items: + - key: trivyignore.yaml + path: trivyignore.yaml diff --git a/argocd/manifests/prowler/cronjob-image-scan.yaml b/argocd/manifests/prowler/cronjob-image-scan.yaml new file mode 100644 index 0000000..b779d08 --- /dev/null +++ b/argocd/manifests/prowler/cronjob-image-scan.yaml @@ -0,0 +1,39 @@ +--- +apiVersion: batch/v1 +kind: CronJob +metadata: + name: prowler-image-scan + namespace: prowler +spec: + schedule: "0 3 * * 6" # Saturday 3am + concurrencyPolicy: Forbid + jobTemplate: + spec: + ttlSecondsAfterFinished: 604800 # Auto-delete after 7 days + template: + spec: + securityContext: + seccompProfile: + type: RuntimeDefault + containers: + - name: prowler + image: registry.ops.eblu.me/blumeops/prowler:kustomized + command: ["/bin/sh", "-c"] + args: + - | + DATEDIR=/reports/prowler-images/$(date +%Y-%m-%d) + mkdir -p "$DATEDIR" + prowler image \ + --registry https://registry.ops.eblu.me \ + --image-filter "^blumeops/" \ + -z \ + --output-formats html csv json-ocsf \ + --output-directory "$DATEDIR" + volumeMounts: + - name: reports + mountPath: /reports + restartPolicy: OnFailure + volumes: + - name: reports + persistentVolumeClaim: + claimName: prowler-reports diff --git a/argocd/manifests/prowler/kustomization.yaml b/argocd/manifests/prowler/kustomization.yaml index 38295a3..1d92a6b 100644 --- a/argocd/manifests/prowler/kustomization.yaml +++ b/argocd/manifests/prowler/kustomization.yaml @@ -10,6 +10,8 @@ resources: - pv-nfs.yaml - pvc.yaml - cronjob.yaml + - cronjob-image-scan.yaml + - cronjob-iac-scan.yaml configMapGenerator: - name: prowler-mutelist @@ -21,6 +23,7 @@ configMapGenerator: - mutelist/core-pod-security.yaml - mutelist/manual-node-checks.yaml - mutelist/rbac.yaml + - mutelist/trivyignore.yaml images: - name: registry.ops.eblu.me/blumeops/prowler diff --git a/argocd/manifests/prowler/mutelist/trivyignore.yaml b/argocd/manifests/prowler/mutelist/trivyignore.yaml new file mode 100644 index 0000000..87af966 --- /dev/null +++ b/argocd/manifests/prowler/mutelist/trivyignore.yaml @@ -0,0 +1,37 @@ +# Trivy ignorefile for Prowler IaC scan. +# +# Prowler's `--mutelist-file` flag is a no-op for the IaC provider +# (iac_provider.py sets self._mutelist = None and delegates to Trivy). +# Trivy in turn does not auto-discover this YAML form from cwd, so the +# Prowler image ships a shim wrapper around `trivy` that injects +# --ignorefile $TRIVY_IGNOREFILE when the env var is set. The cronjob +# mounts this file and sets TRIVY_IGNOREFILE accordingly. +# +# Schema: https://trivy.dev/latest/docs/configuration/filtering/ +# IDs use the hyphenated form Trivy displays (KSV-0041, not KSV0041). +misconfigurations: + - id: KSV-0041 + paths: + - "argocd/manifests/external-secrets/rbac.yaml" + statement: >- + external-secrets-operator's entire function is to read and + synthesize Secret objects; ClusterRole over secrets is its + purpose. Both the controller and cert-controller are + upstream-defined. + - id: KSV-0041 + paths: + - "argocd/manifests/kube-state-metrics/rbac.yaml" + - "argocd/manifests/kube-state-metrics-ringtail/rbac.yaml" + statement: >- + KSM exposes only Secret metadata (name, namespace, type, labels), + never the data field. list/watch on secrets is required for + kube_secret_info / kube_secret_labels metrics. + - id: KSV-0114 + paths: + - "argocd/manifests/external-secrets/rbac.yaml" + statement: >- + cert-controller manages the external-secrets validating webhook + configurations to inject its own rotating CA bundle. RBAC is + scoped to two named webhooks (secretstore-validate, + externalsecret-validate) via resourceNames; KSV-0114 doesn't see + the resourceNames restriction so reports the full ClusterRole. diff --git a/argocd/manifests/tailscale-operator-ringtail/kustomization.yaml b/argocd/manifests/tailscale-operator-ringtail/kustomization.yaml index 25c3545..2d9ceb2 100644 --- a/argocd/manifests/tailscale-operator-ringtail/kustomization.yaml +++ b/argocd/manifests/tailscale-operator-ringtail/kustomization.yaml @@ -9,19 +9,12 @@ resources: - proxygroup-ingress.yaml - external-secret.yaml -# Rewrite the operator image to the locally nix-built (amd64) mirror. -# The name must match the post-base-render image (base already rewrites -# tailscale/k8s-operator -> docker.io/tailscale/k8s-operator). -images: - - name: docker.io/tailscale/k8s-operator - newName: registry.ops.eblu.me/blumeops/tailscale-operator - newTag: v1.94.2-d03ed33-nix - -# Rewrite the proxyclass image to our local nix-built mirror (indri's overlay -# carries the equivalent dagger/arm64 patch). A strategic merge patch is used -# instead of kustomize's `images:` directive because that directive only -# rewrites images in standard k8s container fields, not custom-resource fields -# like ProxyClass.spec.statefulSet.pod.tailscaleContainer.image. +# Rewrite the proxyclass image to our local nix-built mirror. +# Scoped to ringtail only; indri's tailscale-operator/kustomization.yaml still +# pulls from upstream docker.io. A strategic merge patch is used instead of +# kustomize's `images:` directive because that directive only rewrites images +# in standard k8s container fields, not custom-resource fields like +# ProxyClass.spec.statefulSet.pod.tailscaleContainer.image. patches: - path: proxyclass-image.yaml target: diff --git a/argocd/manifests/tailscale-operator/kustomization.yaml b/argocd/manifests/tailscale-operator/kustomization.yaml index 239f7ea..f1d6f89 100644 --- a/argocd/manifests/tailscale-operator/kustomization.yaml +++ b/argocd/manifests/tailscale-operator/kustomization.yaml @@ -14,23 +14,3 @@ resources: # Endpoints). Apply manually: # kubectl --context=minikube-indri apply -f endpoints-forge.yaml - ingress-forge.yaml - -# Rewrite the operator image to the locally dagger-built (arm64) mirror. -# The name must match the post-base-render image (base already rewrites -# tailscale/k8s-operator -> docker.io/tailscale/k8s-operator). -images: - - name: docker.io/tailscale/k8s-operator - newName: registry.ops.eblu.me/blumeops/tailscale-operator - newTag: v1.94.2-d03ed33 - -# Rewrite the proxyclass image to the local mirror. A strategic merge patch -# is used instead of kustomize's `images:` directive because that directive -# only rewrites standard k8s container fields, not custom-resource fields -# like ProxyClass.spec.statefulSet.pod.tailscaleContainer.image. -patches: - - path: proxyclass-image.yaml - target: - group: tailscale.com - version: v1alpha1 - kind: ProxyClass - name: default diff --git a/argocd/manifests/tailscale-operator/proxyclass-image.yaml b/argocd/manifests/tailscale-operator/proxyclass-image.yaml deleted file mode 100644 index 82a7e0b..0000000 --- a/argocd/manifests/tailscale-operator/proxyclass-image.yaml +++ /dev/null @@ -1,11 +0,0 @@ -apiVersion: tailscale.com/v1alpha1 -kind: ProxyClass -metadata: - name: default -spec: - statefulSet: - pod: - tailscaleContainer: - image: registry.ops.eblu.me/blumeops/tailscale:v1.94.2-d03ed33 - tailscaleInitContainer: - image: registry.ops.eblu.me/blumeops/tailscale:v1.94.2-d03ed33 diff --git a/containers/tailscale-operator/container.py b/containers/tailscale-operator/container.py deleted file mode 100644 index ff63845..0000000 --- a/containers/tailscale-operator/container.py +++ /dev/null @@ -1,53 +0,0 @@ -"""Tailscale Kubernetes operator — native Dagger build. - -Single Go binary (cmd/k8s-operator) from the forge mirror, mirroring -upstream's build_docker.sh mkctr recipe: binary at /usr/local/bin/operator, -go tags ts_kube + ts_package_container, version stamps in ldflags. - -Consumed by the tailscale-operator app on indri's minikube (arm64); the -ringtail app uses the -nix tag from default.nix instead. -""" - -import dagger - -from blumeops.containers import ( - alpine_runtime, - clone_from_forge, - go_build, - oci_labels, -) - -VERSION = "v1.94.2" - - -async def build(src: dagger.Directory) -> dagger.Container: - source = clone_from_forge("tailscale", VERSION) - semver = VERSION.removeprefix("v") - - builder = go_build( - source, - "/out/operator", - cmd_path="./cmd/k8s-operator", - tags="ts_kube,ts_package_container", - ldflags=( - "-w -s" - f" -X tailscale.com/version.longStamp={semver}" - f" -X tailscale.com/version.shortStamp={semver}" - ), - ) - - # Upstream runs the operator as root on a minimal base; only CA certs - # are needed at runtime (operator talks to the k8s API and Tailscale - # control plane over HTTPS). - runtime = alpine_runtime(extra_apk=["ca-certificates"], create_user=False) - runtime = oci_labels( - runtime, - title="Tailscale Kubernetes Operator", - description="Tailscale operator for Kubernetes Ingress/egress proxies", - version=VERSION, - ) - return runtime.with_file( - "/usr/local/bin/operator", - builder.file("/out/operator"), - permissions=0o555, - ).with_entrypoint(["/usr/local/bin/operator"]) diff --git a/containers/tailscale-operator/default.nix b/containers/tailscale-operator/default.nix deleted file mode 100644 index 8b279d5..0000000 --- a/containers/tailscale-operator/default.nix +++ /dev/null @@ -1,67 +0,0 @@ -# Nix-built tailscale k8s-operator for ringtail's tailscale-operator app. -# Builds cmd/k8s-operator v1.94.2 from the forge mirror, mirroring upstream's -# build_docker.sh mkctr recipe (binary at /usr/local/bin/operator, ts_kube + -# ts_package_container go tags). Built on the ringtail nix-container-builder. -{ pkgs ? import { } }: - -let - version = "1.94.2"; - - src = pkgs.fetchgit { - url = "https://forge.ops.eblu.me/mirrors/tailscale.git"; - rev = "v${version}"; - hash = "sha256-qjWVB8xWVgIVUgrf27F6hwiFIE+4ERXWeHv26ugg/x4="; - }; - - operator = pkgs.buildGoModule { - inherit src version; - pname = "tailscale-operator"; - vendorHash = "sha256-WeMTOkERj4hvdg4yPaZ1gRgKnhRIBXX55kUVbX/k/xM="; - - subPackages = [ "cmd/k8s-operator" ]; - - tags = [ - "ts_kube" - "ts_package_container" - ]; - - ldflags = [ - "-s" - "-w" - "-X tailscale.com/version.longStamp=${version}" - "-X tailscale.com/version.shortStamp=${version}" - ]; - - doCheck = false; - - meta = with pkgs.lib; { - description = "Tailscale operator for Kubernetes"; - homepage = "https://tailscale.com"; - license = licenses.bsd3; - }; - }; -in - -pkgs.dockerTools.buildLayeredImage { - name = "blumeops/tailscale-operator"; - tag = "v${version}"; - - contents = [ - operator - pkgs.cacert - ]; - - # buildGoModule names the binary after the package dir (k8s-operator); - # upstream's image expects /usr/local/bin/operator. - extraCommands = '' - mkdir -p usr/local/bin - ln -s /bin/k8s-operator usr/local/bin/operator - ''; - - config = { - Entrypoint = [ "/usr/local/bin/operator" ]; - Env = [ - "SSL_CERT_FILE=${pkgs.cacert}/etc/ssl/certs/ca-bundle.crt" - ]; - }; -} diff --git a/containers/tailscale/container.py b/containers/tailscale/container.py deleted file mode 100644 index 8e3e509..0000000 --- a/containers/tailscale/container.py +++ /dev/null @@ -1,104 +0,0 @@ -"""Tailscale proxy image (containerboot) — native Dagger build. - -Builds cmd/tailscale, cmd/tailscaled, and cmd/containerboot from the forge -mirror, mirroring the upstream Dockerfile: Alpine runtime with iptables -(legacy symlinked over the default, per upstream issue #17854), iproute2, -and the /tailscale/run.sh compat symlink. - -Consumed by the tailscale-operator ProxyClass on indri's minikube (arm64); -ringtail's ProxyClass uses the -nix tag from default.nix instead. -""" - -import dagger - -from blumeops.containers import ( - alpine_runtime, - clone_from_forge, - go_build, - oci_labels, -) - -VERSION = "v1.94.2" - - -async def build(src: dagger.Directory) -> dagger.Container: - source = clone_from_forge("tailscale", VERSION) - semver = VERSION.removeprefix("v") - - ldflags = ( - "-w -s" - f" -X tailscale.com/version.longStamp={semver}" - f" -X tailscale.com/version.shortStamp={semver}" - ) - builder = go_build( - source, - "/out/tailscale", - cmd_path="./cmd/tailscale", - ldflags=ldflags, - ) - builder = builder.with_exec( - [ - "go", - "build", - f"-ldflags={ldflags}", - "-o", - "/out/tailscaled", - "./cmd/tailscaled", - ] - ).with_exec( - [ - "go", - "build", - f"-ldflags={ldflags}", - "-o", - "/out/containerboot", - "./cmd/containerboot", - ] - ) - - runtime = alpine_runtime( - extra_apk=["ca-certificates", "iptables", "iproute2", "ip6tables"], - create_user=False, - ) - runtime = oci_labels( - runtime, - title="Tailscale", - description="Tailscale containerboot proxy image for the k8s operator", - version=VERSION, - ) - return ( - runtime - # Match upstream Dockerfile: nftables-backed iptables misbehaves in - # some environments, force the legacy backend (tailscale/tailscale#17854). - .with_exec( - [ - "sh", - "-c", - "rm /usr/sbin/iptables && ln -s /usr/sbin/iptables-legacy /usr/sbin/iptables" - " && rm /usr/sbin/ip6tables && ln -s /usr/sbin/ip6tables-legacy /usr/sbin/ip6tables", - ] - ) - .with_file( - "/usr/local/bin/tailscale", - builder.file("/out/tailscale"), - permissions=0o555, - ) - .with_file( - "/usr/local/bin/tailscaled", - builder.file("/out/tailscaled"), - permissions=0o555, - ) - .with_file( - "/usr/local/bin/containerboot", - builder.file("/out/containerboot"), - permissions=0o555, - ) - .with_exec( - [ - "sh", - "-c", - "mkdir /tailscale && ln -s /usr/local/bin/containerboot /tailscale/run.sh", - ] - ) - .with_entrypoint(["/usr/local/bin/containerboot"]) - ) diff --git a/docs/changelog.d/+1password-export-menu-wording.doc.md b/docs/changelog.d/+1password-export-menu-wording.doc.md deleted file mode 100644 index 1236ffc..0000000 --- a/docs/changelog.d/+1password-export-menu-wording.doc.md +++ /dev/null @@ -1 +0,0 @@ -Corrected the 1Password backup how-to: the desktop app's export menu item is named after the account ("File > Export > Blume/Davis"), not "All Vaults". Verified an account export contains all four vaults (Private, blumeops, Payrix, Shared). diff --git a/docs/changelog.d/+jellyfin-10-11-11.bugfix.md b/docs/changelog.d/+jellyfin-10-11-11.bugfix.md deleted file mode 100644 index 779a042..0000000 --- a/docs/changelog.d/+jellyfin-10-11-11.bugfix.md +++ /dev/null @@ -1 +0,0 @@ -Upgraded Jellyfin on indri from 10.11.6 to 10.11.11, picking up the security fixes in 10.11.7 (disclosed CVEs/GHSAs, flagged "upgrade immediately") and 10.11.10 (three further GHSAs). Noted the recurring gotcha in the service-versions tracking: after a `brew upgrade --cask jellyfin`, the re-quarantined `.app` makes the launchd-spawned process hang silently until the Gatekeeper first-launch dialog is approved on indri's GUI console — removing the quarantine xattr over SSH is blocked by macOS TCC. diff --git a/docs/changelog.d/+ringtail-flake-update.infra.md b/docs/changelog.d/+ringtail-flake-update.infra.md deleted file mode 100644 index 1d806df..0000000 --- a/docs/changelog.d/+ringtail-flake-update.infra.md +++ /dev/null @@ -1 +0,0 @@ -Updated ringtail NixOS flake inputs (nixpkgs `nixos-25.11`, disko) to latest via `dagger call flake-update`. diff --git a/docs/changelog.d/+service-review-automounter.misc.md b/docs/changelog.d/+service-review-automounter.misc.md deleted file mode 100644 index 31e5644..0000000 --- a/docs/changelog.d/+service-review-automounter.misc.md +++ /dev/null @@ -1 +0,0 @@ -Service review: AutoMounter on indri is current at 1.13.0 (App Store auto-updated from the tracked 1.11.0); all sifaka SMB mounts verified healthy. Fixed the stale tracking-file path shown by `mise run service-review`. diff --git a/docs/changelog.d/+tailscale-operator-doc-review.doc.md b/docs/changelog.d/+tailscale-operator-doc-review.doc.md deleted file mode 100644 index 8f7d5a3..0000000 --- a/docs/changelog.d/+tailscale-operator-doc-review.doc.md +++ /dev/null @@ -1 +0,0 @@ -Reviewed the tailscale-operator reference card: documented the dual indri/ringtail deployment, corrected the ArgoCD apps list, pinned the upstream version, and added the ProxyGroup Ingress `host:` caveat. diff --git a/docs/changelog.d/doc-review-stalest-five.ai.md b/docs/changelog.d/doc-review-stalest-five.ai.md deleted file mode 100644 index 95da490..0000000 --- a/docs/changelog.d/doc-review-stalest-five.ai.md +++ /dev/null @@ -1 +0,0 @@ -Retired the `ai-docs` mise task and its mandatory session-start rule: the concatenated docs corpus had grown to ~130K tokens, too large to ingest wholesale. Agents now start tasks by finding and reading the relevant docs (grep + wiki-links); `ai-sources` remains for opt-in deep codebase context. Also documented the full `heph` CLI task workflow (read, log, complete, create) in AGENTS.md. diff --git a/docs/changelog.d/doc-review-stalest-five.doc.md b/docs/changelog.d/doc-review-stalest-five.doc.md deleted file mode 100644 index 8353e3d..0000000 --- a/docs/changelog.d/doc-review-stalest-five.doc.md +++ /dev/null @@ -1 +0,0 @@ -Reviewed the five stalest documentation cards (argocd, authentik, grafana, unifi, plan-a-meal): brought ArgoCD's SSO/dual-cluster/sync-policy story up to date, expanded Authentik's blueprint and OIDC client inventory to all eight clients, fixed Grafana's TeslaMate datasource target and dashboard list, and noted UnPoller's locally-built image. diff --git a/docs/changelog.d/localize-tailscale-operator.infra.md b/docs/changelog.d/localize-tailscale-operator.infra.md deleted file mode 100644 index 324eac6..0000000 --- a/docs/changelog.d/localize-tailscale-operator.infra.md +++ /dev/null @@ -1 +0,0 @@ -Localized the Tailscale operator stack: the k8s-operator image (both clusters) and the ProxyClass proxy image (indri, completing parity with ringtail) are now built from the forge mirror instead of pulled from Docker Hub. diff --git a/docs/changelog.d/retire-prowler-image-iac-scans.infra.md b/docs/changelog.d/retire-prowler-image-iac-scans.infra.md deleted file mode 100644 index 9afd261..0000000 --- a/docs/changelog.d/retire-prowler-image-iac-scans.infra.md +++ /dev/null @@ -1 +0,0 @@ -Retired the Prowler container-image CVE scan and IaC scan, keeping only the K8s CIS benchmark scan. The two retired scans generated tens of thousands of un-actioned, un-muted findings every week (~20,000 image findings and growing, mostly unpatchable upstream-image CVEs; ~650 systemic Trivy KSV pod-security warnings) — the weekly `mise run review-compliance-reports` re-surfaced them all as "action needed" though none were ever triaged. The K8s CIS scan is fully mutelisted and runs clean, so it stays. Removed the two CronJobs, the now-unused `trivyignore.yaml` mutelist, and the grouped-findings rendering in the review tool that existed solely for the high-volume scans. diff --git a/docs/explanation/agent-change-process.md b/docs/explanation/agent-change-process.md index a6d8684..5141950 100644 --- a/docs/explanation/agent-change-process.md +++ b/docs/explanation/agent-change-process.md @@ -1,6 +1,6 @@ --- title: Agent Change Process -modified: 2026-06-09 +modified: 2026-03-15 last-reviewed: 2026-02-23 tags: - explanation @@ -25,13 +25,13 @@ Before starting work, classify the change: When in doubt, start at C1. Upgrade to C2 if complexity spirals or the user requests it. -**Context loading:** All change classes start by finding and reading the docs relevant to the change area — grep `docs/` and follow wiki-links. For problems with a very large surface area, `mise run ai-sources` concatenates all non-doc source files (~270K tokens); confirm with the user before loading it wholesale. +**Context loading:** All change classes start with `mise run ai-docs` (~85K tokens of documentation). For problems with a large surface area, ask the user if `mise run ai-sources` should also be run — it concatenates all non-doc source files (~270K tokens). Together they cover the full codebase without overlap. ## C0 — Quick Fix A change where the risk is low enough that problems can be quickly fixed forward. -1. Find and read the docs relevant to the change area +1. Run `mise run ai-docs` to load context 2. Implement the change directly on main 3. Add a changelog fragment if the change is user-visible or noteworthy (`docs/changelog.d/+..md`) 4. Commit and push @@ -46,7 +46,7 @@ A change with enough complexity or risk that a human should review it, but not s ### Process -1. Find and read the docs relevant to the change area +1. Run `mise run ai-docs` to load context 2. **Search related docs** — read existing documentation and reference cards related to the change area 3. **Create a feature branch** and open a PR early (draft is fine) 4. **Documentation first** — commit doc changes reflecting the desired end state before writing code. This helps the reviewer understand intent and catches design issues early @@ -77,7 +77,7 @@ A complex, multi-session change managed through the [Mikado method](https://mika Before writing any code, invest in understanding the problem: -1. Find and read the docs relevant to the change area +1. Run `mise run ai-docs` to load context 2. Search related docs, reference cards, and existing how-to guides for the change area 3. Think through the dependency graph — what prerequisites exist? What could go wrong? 4. Create Mikado cards for everything you can anticipate (you'll discover more later — that's the point of the method) @@ -220,7 +220,7 @@ When the final leaf node is closed and no `status: active` cards remain: When starting a new session to continue C2 work: -1. Find and read the docs relevant to the change area +1. Run `mise run ai-docs` to load context 2. Run `mise run docs-mikado --resume` — this will: - Detect the current branch and match it to an active chain - Show the chain state, ready leaf nodes, and current position in the invariant diff --git a/docs/how-to/mealie/plan-a-meal.md b/docs/how-to/mealie/plan-a-meal.md index 10de2cb..1e6eb48 100644 --- a/docs/how-to/mealie/plan-a-meal.md +++ b/docs/how-to/mealie/plan-a-meal.md @@ -1,7 +1,6 @@ --- title: Plan a Meal -modified: 2026-06-09 -last-reviewed: 2026-06-09 +modified: 2026-03-17 tags: - how-to - mealie diff --git a/docs/how-to/operations/deploy-prowler.md b/docs/how-to/operations/deploy-prowler.md index 1475680..75dced2 100644 --- a/docs/how-to/operations/deploy-prowler.md +++ b/docs/how-to/operations/deploy-prowler.md @@ -1,6 +1,6 @@ --- title: Deploy Prowler CIS Scanner -modified: 2026-06-08 +modified: 2026-03-24 last-reviewed: 2026-03-24 tags: - how-to @@ -11,20 +11,7 @@ tags: # Deploy Prowler CIS Scanner -Prowler runs a weekly CIS Kubernetes Benchmark scan against minikube-indri and writes HTML/CSV/JSON reports to the NFS share on sifaka. - -## Why only the K8s CIS scan - -Prowler originally ran three CronJobs: K8s CIS, container-image CVE scanning, and IaC scanning. The image and IaC scans were **retired in 2026-06**. - -Both were pure toil with no realized value: - -- **Image scan** produced ~20,000 unmuted findings per run and growing, none ever triaged or muted. They were overwhelmingly CVEs in *upstream* base images we don't control and can't patch, and the job re-scanned every historical tag still in the registry, multiplying the count. -- **IaC scan** produced ~650 Trivy KSV findings (`runAsNonRoot`, `readOnlyRootFilesystem`, drop-capabilities, …) against our own manifests — real but systemic, homelab-acceptable, and likewise never muted, so the weekly review re-surfaced all of them indefinitely. - -The K8s CIS scan, by contrast, is fully mutelisted and runs clean (0 unmuted findings week over week), so it stays. The guiding principle matches [[ai-scraper-mitigation]]: don't keep generating a firehose of output that has no audience. If image-CVE signal is wanted later, the right shape is critical-severity-only, currently-deployed-tags-only, alert-on-new — a rebuild, not a revival (tracked as the "Trivy for image/IaC scanning" task). - -Note that the K8s CIS scan itself is tied to minikube-indri, which is slated for retirement; on k3s only ~22 of 70 checks produce results (no static pods). Re-pointing a lean posture check at ringtail is tracked separately ("prowler scan against ringtail"). +Prowler runs weekly CIS Kubernetes Benchmark scans against minikube-indri and writes HTML/CSV/JSON reports to the NFS share on sifaka. ## What it checks @@ -46,6 +33,38 @@ Prowler's Kubernetes provider runs ~70 checks from the CIS Kubernetes Benchmark **k3s note:** k3s embeds the control plane in a single binary — no static pods exist. Only core + RBAC checks (~22 of 70) produce results. Consider `kube-bench` for k3s control plane checks. +### Image vulnerability scanning (Saturday 3am) + +Prowler's image provider scans all `blumeops/*` container images in `registry.ops.eblu.me` for: + +- **CVEs** — known vulnerabilities from NVD, Alpine SecDB, Debian Security Tracker, and other sources +- **Embedded secrets** — credentials or API keys baked into image layers +- **Misconfigurations** — Dockerfile best practices (running as root, missing HEALTHCHECK, etc.) + +Uses Trivy under the hood. Reports are written to `sifaka:/volume1/reports/prowler-images/`. + +To run an ad-hoc image scan: + +```fish +kubectl create job --from=cronjob/prowler-image-scan prowler-image-manual -n prowler --context=minikube-indri +``` + +### IaC scanning (Saturday 2am) + +Prowler's IaC provider scans the blumeops repository (cloned at scan time) for misconfigurations in: + +- **Dockerfiles** — running as root, using `latest` tags, missing `HEALTHCHECK` +- **Kubernetes manifests** — missing resource limits, privileged containers, insecure settings +- **Other IaC files** — Terraform, CloudFormation, etc. if present + +Uses Trivy under the hood. Reports are written to `sifaka:/volume1/reports/prowler-iac/`. + +To run an ad-hoc IaC scan: + +```fish +kubectl create job --from=cronjob/prowler-iac-scan prowler-iac-manual -n prowler --context=minikube-indri +``` + ## Reports Reports are written to `sifaka:/volume1/reports/prowler/` with timestamped filenames. See [[read-compliance-reports]] for how to access and interpret them. diff --git a/docs/how-to/operations/read-compliance-reports.md b/docs/how-to/operations/read-compliance-reports.md index 2990026..e676ad5 100644 --- a/docs/how-to/operations/read-compliance-reports.md +++ b/docs/how-to/operations/read-compliance-reports.md @@ -1,6 +1,6 @@ --- title: Read Compliance Reports -modified: 2026-06-08 +modified: 2026-04-06 last-reviewed: 2026-04-06 tags: - how-to @@ -27,13 +27,8 @@ Reports are stored on sifaka at `/volume1/reports/`. Each scanner writes to its | Scanner | Path | Schedule | |---------|------|----------| | [[prowler]] K8s CIS | `sifaka:/volume1/reports/prowler/` | Weekly (Sunday 3am) | - -> **Retired (2026-06):** the Prowler **image** (`prowler-images/`) and **IaC** -> (`prowler-iac/`) scans were retired. They produced tens of thousands of -> un-actioned, un-muted findings every week — mostly unpatchable upstream-image -> CVEs and systemic pod-security KSV warnings — and nobody triaged them. See -> [[deploy-prowler#Why only the K8s CIS scan]] for the rationale. Their stale -> report directories may linger on sifaka until manually removed. +| [[prowler]] Image | `sifaka:/volume1/reports/prowler-images/` | Weekly (Saturday 3am) | +| [[prowler]] IaC | `sifaka:/volume1/reports/prowler-iac/` | Weekly (Saturday 2am) | Copy reports to your local machine (remember `scp -O` for sifaka): diff --git a/docs/how-to/operations/run-1password-backup.md b/docs/how-to/operations/run-1password-backup.md index 2f8c88a..0dc9ec9 100644 --- a/docs/how-to/operations/run-1password-backup.md +++ b/docs/how-to/operations/run-1password-backup.md @@ -1,7 +1,7 @@ --- title: Run 1Password Backup -modified: 2026-06-09 -last-reviewed: 2026-06-09 +modified: 2026-03-11 +last-reviewed: 2026-03-16 tags: - how-to - operations @@ -24,7 +24,7 @@ How to export and encrypt your 1Password vaults for inclusion in [[borgmatic]] b ### 1. Export Vaults From 1Password 1. Open the 1Password desktop app -2. **File > Export > Blume/Davis** (the menu item is named after the account, not "All Vaults" — exporting the account covers all vaults: Private, blumeops, Payrix, and Shared) +2. **File > Export > All Vaults** 3. Choose **1PUX** format 4. Save to `~/Documents/` — 1Password names the file `1PasswordExport--.1pux` automatically; don't bother renaming it, pass the path to the task in the next step diff --git a/docs/reference/infrastructure/unifi.md b/docs/reference/infrastructure/unifi.md index 43297e7..6182880 100644 --- a/docs/reference/infrastructure/unifi.md +++ b/docs/reference/infrastructure/unifi.md @@ -1,7 +1,6 @@ --- title: UniFi -modified: 2026-06-09 -last-reviewed: 2026-06-09 +modified: 2026-03-16 tags: - infrastructure - networking @@ -72,7 +71,7 @@ Attempted Feb 2026 with the `ubiquiti-community/unifi` Terraform provider via Pu ## Monitoring -UniFi metrics are exported to Prometheus via [UnPoller](https://github.com/unpoller/unpoller), running as a k8s deployment in the `monitoring` namespace on indri's minikube (`argocd/manifests/unpoller/`, locally-built image `registry.ops.eblu.me/blumeops/unpoller`). UnPoller polls the UX7 controller API using an API key and exposes metrics on port 9130. +UniFi metrics are exported to Prometheus via [UnPoller](https://github.com/unpoller/unpoller), running as a k8s deployment in the `monitoring` namespace on indri. UnPoller polls the UX7 controller API using an API key and exposes metrics on port 9130. - **Prometheus job:** `unpoller` - **Metrics prefix:** `unifi_` diff --git a/docs/reference/kubernetes/tailscale-operator.md b/docs/reference/kubernetes/tailscale-operator.md index ba03014..c102e02 100644 --- a/docs/reference/kubernetes/tailscale-operator.md +++ b/docs/reference/kubernetes/tailscale-operator.md @@ -1,7 +1,6 @@ --- title: Tailscale Operator -modified: 2026-06-09 -last-reviewed: 2026-06-09 +modified: 2026-02-08 tags: - kubernetes - tailscale @@ -16,48 +15,8 @@ The Tailscale operator enables Kubernetes services to be exposed directly on the | Property | Value | |----------|-------| | **Namespace** | `tailscale` | -| **Upstream** | `mirrors/tailscale` on forge (static manifest, pinned `v1.94.2`) | -| **ArgoCD Apps** | `tailscale-operator` (indri/minikube), `tailscale-operator-ringtail` (ringtail/k3s) | - -The operator runs on **both** clusters — indri's minikube and ringtail's k3s. -Both apps layer on the shared `tailscale-operator-base` kustomize directory -(operator manifest, `ProxyClass`, `dnsconfig`); each cluster supplies its own -`ProxyGroup` (indri: 2 replicas, ringtail: 1) and OAuth `ExternalSecret`. See -[[ringtail]] and [[migrate-wave1-ringtail]] for the ongoing migration of k8s -workloads onto ringtail. - -## Local Images - -Both the operator and the proxy run locally-built images from the forge -mirror (`mirrors/tailscale`), not Docker Hub: - -| Image | Build | Used by | -|-------|-------|---------| -| `blumeops/tailscale-operator` | `containers/tailscale-operator/` (`container.py` for indri/arm64, `default.nix` `-nix` tag for ringtail/amd64) | operator Deployment, via each overlay's `images:` override | -| `blumeops/tailscale` | `containers/tailscale/` (same dual build) | `ProxyClass` proxy pods, via a strategic-merge patch in each overlay | - -The ProxyClass image must be set with a **patch**, not kustomize's `images:` -directive — that directive only rewrites standard container fields, not -custom-resource fields like `ProxyClass.spec.statefulSet.pod.tailscaleContainer.image`. - -The `dnsconfig` nameserver image (`tailscale/k8s-nameserver:stable`) is still -upstream — a known follow-up. - -## Rollout Safety (device identity) - -Proxy and operator tailnet identity lives in Kubernetes state Secrets in the -`tailscale` namespace, not in pods or images. An image swap rolls the -Deployment/StatefulSets but pods re-authenticate with their existing node -keys — devices keep their names. Shadow devices (`foo-1` suffixes) appear only -when a pod registers *fresh* while a stale device record still holds the name -(deleted state Secrets, cluster rebuilds). When rolling out image changes: - -1. Never delete the `tailscale` namespace state Secrets. -2. Verify after sync: pods healthy, device names unchanged in the admin - console, `mise run services-check` green. -3. If a collision does occur: delete the stale device in the admin console - AND the affected state Secret, then restart the pod (see - [[rebuild-minikube-cluster]]). +| **Upstream** | `mirrors/tailscale` on forge (static manifest) | +| **ArgoCD Apps** | `tailscale-operator-base` (upstream), `tailscale-operator` (config) | ## How It Works @@ -68,13 +27,7 @@ Ingresses use a shared ProxyGroup (`ingress`) rather than per-service Tailscale 3. Service becomes accessible at `.tail8d86e.ts.net` 4. TLS is handled automatically via Tailscale -Two requirements for VIP routing to work: - -1. Tailnet clients must have `--accept-routes` enabled to route to VIP addresses. -2. Ingress rules must **not** set an explicit `host:` field. The ProxyGroup - proxy receives the FQDN as the `Host` header (e.g. - `prometheus.tail8d86e.ts.net`), which won't match a short name. Use - `host: "*"` or omit `host:` entirely. +Tailnet clients must have `--accept-routes` enabled to route to VIP addresses. Services can be individually tagged (e.g., `tag:flyio-target`) via Ingress annotations to control which ACL grants apply. See [[expose-service-publicly]] for the tagging workflow. diff --git a/docs/reference/operations/security.md b/docs/reference/operations/security.md index 86b3d3b..11c4df9 100644 --- a/docs/reference/operations/security.md +++ b/docs/reference/operations/security.md @@ -1,6 +1,6 @@ --- title: Security & Compliance -modified: 2026-06-08 +modified: 2026-03-24 last-reviewed: 2026-03-24 tags: - operations @@ -21,7 +21,7 @@ Security posture and compliance scanning for BlumeOps infrastructure. ## Scanning tools -- [[prowler]] — CIS Kubernetes Benchmark scanner (weekly CronJob). The container-image CVE scan and IaC scan were retired in 2026-06 (un-actioned noise — see [[deploy-prowler#Why only the K8s CIS scan]]); only the K8s CIS scan remains. +- [[prowler]] — CIS Kubernetes Benchmark scanner (weekly CronJob) - [[deploy-prowler]] — deployment and ad-hoc scan how-to - [[read-compliance-reports]] — accessing and interpreting reports - [[kingfisher]] — Secret detection and live validation for Forgejo repos (weekly CronJob + prek hook) @@ -52,5 +52,5 @@ Suppressed findings are kept in Prowler mutelist YAML under `argocd/manifests/pr - No SOC 2 compliance mapping for Kubernetes (Prowler only maps SOC 2 for AWS/Azure/GCP) - k3s control plane checks produce no results (embedded binary, no static pods) — consider kube-bench -- No container-image CVE scanning (the Prowler image scan was retired 2026-06 as un-actioned noise). If reintroduced, scope it to critical-severity, currently-deployed tags, alert-on-new -- No automated IaC misconfiguration scanning (the Prowler IaC scan was retired 2026-06). Manifest pod-security hardening is now an accept-and-document decision rather than a weekly report +- Container image scanning covers `blumeops/*` images only — upstream images (ollama, immich, etc.) are not scanned +- IaC scanning covers the blumeops repo only — no scanning of third-party Helm charts or vendored manifests diff --git a/docs/reference/services/argocd.md b/docs/reference/services/argocd.md index 2eaecb5..e890cc5 100644 --- a/docs/reference/services/argocd.md +++ b/docs/reference/services/argocd.md @@ -1,7 +1,6 @@ --- title: ArgoCD -modified: 2026-06-09 -last-reviewed: 2026-06-09 +modified: 2026-02-07 tags: - service - gitops @@ -19,38 +18,22 @@ GitOps continuous delivery platform for the [[cluster|Kubernetes cluster]]. | **Tailscale URL** | https://argocd.tail8d86e.ts.net | | **Namespace** | `argocd` | | **Git Source** | `ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git` | -| **Manifests Path** | `argocd/apps/` (Applications), `argocd/manifests/` (workloads) | - -## Clusters - -A single ArgoCD instance (on indri's minikube) manages both clusters: - -| Cluster | Destination | Apps | -|---------|-------------|------| -| minikube (indri) | `https://kubernetes.default.svc` | Most services | -| k3s ([[ringtail]]) | `https://ringtail.tail8d86e.ts.net:6443` | GPU workloads and `*-ringtail` apps | +| **Manifests Path** | `argocd/` | ## Sync Policy -All applications use **manual sync** — including the `apps` app-of-apps root. To pick up newly added Application manifests, sync `apps` explicitly: +| Application | Sync Policy | Rationale | +|-------------|-------------|-----------| +| `apps` | Automated | Picks up new Application manifests | +| All workloads | Manual | Explicit control over deployments | -```bash -argocd app sync apps -``` +## Credentials -This gives explicit control over every deployment; nothing rolls out on push alone. - -## Authentication - -- **SSO via [[authentik]]** — OIDC with a public PKCE client (`argocd`), shared by the web UI and CLI: `argocd login argocd.ops.eblu.me --sso`. The Authentik `admins` group maps to `role:admin` via the RBAC ConfigMap; the default policy grants no access. -- **Local admin** — break-glass password in 1Password (blumeops vault), for when Authentik is down. - -The git deploy key (SSH) is injected via [[external-secrets]]. +- Admin password: 1Password (blumeops vault) +- Git deploy key (SSH): 1Password ## Related - [[argocd-cli]] - CLI usage and deployment workflows - [[apps|Apps]] - Full application registry - [[forgejo]] - Git source -- [[authentik]] - OIDC identity provider for SSO -- [[federated-login]] - How authentication works across BlumeOps diff --git a/docs/reference/services/authentik.md b/docs/reference/services/authentik.md index 480f59b..89a17cc 100644 --- a/docs/reference/services/authentik.md +++ b/docs/reference/services/authentik.md @@ -1,7 +1,6 @@ --- title: Authentik -modified: 2026-06-09 -last-reviewed: 2026-06-09 +modified: 2026-02-20 tags: - service - security @@ -43,7 +42,9 @@ Authentik configuration is managed via Blueprints (YAML) stored as a ConfigMap m - **`common.yaml`** — shared identity resources (`admins` group) - **`mfa.yaml`** — MFA enforcement on the default authentication flow (`not_configured_action: configure`) -- One blueprint per OIDC client (provider, application, and policy binding): `grafana.yaml`, `forgejo.yaml`, `zot.yaml`, `argocd.yaml`, `jellyfin.yaml`, `mealie.yaml`, `paperless.yaml`, `heph.yaml` +- **`grafana.yaml`** — Grafana OAuth2 provider, application, and policy binding +- **`forgejo.yaml`** — Forgejo OAuth2 provider, application, and policy binding +- **`zot.yaml`** — Zot registry OAuth2 provider, application, and policy binding Group membership is included in the `profile` scope claim (Authentik built-in). Services use `--group-claim-name groups` to read it. @@ -51,18 +52,13 @@ Blueprint file: `argocd/manifests/authentik/configmap-blueprint.yaml` ## OIDC Clients -| Client | Type | -|--------|------| -| [[grafana]] | Confidential | -| [[forgejo]] | Confidential | -| [[zot]] | Confidential | -| [[argocd]] | Public (PKCE, shared by web UI and CLI) | -| [[jellyfin]] | Confidential | -| [[mealie]] | Confidential | -| [[paperless]] | Confidential | -| heph | Public (PKCE, with `offline_access` for spoke sync refresh tokens) | +| Client | Status | +|--------|--------| +| [[grafana]] | Active | +| [[forgejo]] | Active | +| [[zot]] | Active | -Future clients: [[miniflux]] +Future clients: [[argocd]], [[miniflux]] ## Secrets @@ -71,10 +67,11 @@ Injected via [[external-secrets]] from the "Authentik (blumeops)" 1Password item | 1Password Field | Purpose | |-----------------|---------| | `secret-key` | Authentik secret key | -| `postgresql-host` / `-port` / `-name` / `-user` / `-password` | PostgreSQL connection | -| `-client-secret` | OIDC client secret, one per confidential client (grafana, forgejo, zot, jellyfin, mealie, paperless) | - -The item also holds an `api-token` field (Authentik API access for admin scripting); it is not synced into the cluster. +| `db-password` | PostgreSQL password | +| `grafana-client-secret` | OIDC client secret for Grafana | +| `forgejo-client-secret` | OIDC client secret for Forgejo | +| `zot-client-secret` | OIDC client secret for Zot | +| `api-token` | Authentik API token | ## Container Image diff --git a/docs/reference/services/grafana.md b/docs/reference/services/grafana.md index d6b812c..3a9ae01 100644 --- a/docs/reference/services/grafana.md +++ b/docs/reference/services/grafana.md @@ -1,7 +1,6 @@ --- title: Grafana -modified: 2026-06-09 -last-reviewed: 2026-06-09 +modified: 2026-02-28 tags: - service - observability @@ -26,7 +25,7 @@ Dashboards and visualization for BlumeOps observability. Grafana supports two login methods: -- **SSO via [[authentik]]** — OIDC login through Authentik (`auth.generic_oauth`). Members of the Authentik `admins` group get the Admin role; everyone else gets Viewer (`role_attribute_path` in `grafana.ini`). +- **SSO via [[authentik]]** — OIDC login through Authentik (`auth.generic_oauth`). Users click "Sign in with Authentik", authenticate at Authentik, and are redirected back as Admin. - **Local admin** — break-glass login using the password from 1Password ("Grafana (blumeops)"). Always available if Authentik is down. The OIDC client secret is injected via [[external-secrets]] (`grafana-authentik-oauth` secret in monitoring namespace). @@ -38,7 +37,7 @@ The OIDC client secret is injected via [[external-secrets]] (`grafana-authentik- | Prometheus | prometheus | `prometheus.monitoring.svc.cluster.local:9090` | | Loki | loki | `loki.monitoring.svc.cluster.local:3100` | | Tempo | tempo | `tempo.monitoring.svc.cluster.local:3200` | -| TeslaMate | postgres | `pg.ops.eblu.me:5434` (TeslaMate's database on [[ringtail]], via Caddy L4) | +| TeslaMate | postgres | `blumeops-pg-rw.databases.svc.cluster.local:5432` | ## Dashboard Provisioning @@ -50,9 +49,13 @@ Optional annotation: `grafana_folder: "FolderName"` ## Key Dashboards -Provisioned dashboards live in `argocd/manifests/grafana-config/dashboards/` (one ConfigMap per dashboard). Coverage as of 2026-06: alerts, borgmatic, CV APM, devpi, docs APM, fly.io proxy, forgejo, frigate, jellyfin, kubernetes, loki, macOS (indri host), postgresql, ringtail, shower APM, sifaka disks, snowflake proxy, tempo, transmission, zot. - -TeslaMate's dashboards are not in the repo — an init container fetches them from the forge mirror at a pinned tag (`TESLAMATE_VERSION` in `argocd/manifests/grafana/deployment.yaml`). +- macOS System - Host metrics for indri +- Minikube - Kubernetes cluster overview +- Borgmatic Backups - Backup status and trends +- Services Health - HTTP probe results +- Docs APM - Request rate, latency, cache for docs.eblu.me +- Fly.io Proxy Health - Aggregate proxy health across all upstream services +- TeslaMate (18 dashboards) - Vehicle data ## Related diff --git a/docs/reference/services/jellyfin.md b/docs/reference/services/jellyfin.md index c7b3074..bbdfafd 100644 --- a/docs/reference/services/jellyfin.md +++ b/docs/reference/services/jellyfin.md @@ -1,7 +1,7 @@ --- title: Jellyfin -modified: 2026-06-08 -last-reviewed: 2026-06-08 +modified: 2026-02-07 +last-reviewed: 2026-03-23 tags: - service - media @@ -41,24 +41,6 @@ Dashboard > Playback: 2. Allow hardware encoding: Enabled 3. VPP Tone mapping: Enabled -## Upgrades - -Installed via Homebrew cask (`state: present`, unpinned), so the Ansible role -won't bump an already-installed cask. To upgrade, run on indri: - -```bash -brew upgrade --cask jellyfin -``` - -**Gatekeeper gotcha:** a cask upgrade replaces `/Applications/Jellyfin.app` and -re-applies the `com.apple.quarantine` xattr. When launchd respawns the service, -the new binary hangs silently — process alive but ~0 CPU, no logs, no listening -socket — because Gatekeeper is holding the first launch pending approval. -Removing the xattr over SSH fails (`xattr -dr com.apple.quarantine ...` → -"Operation not permitted", blocked by macOS TCC). Approve the first-launch -dialog on indri's GUI console (or run the `xattr` removal from a local Terminal -with Full Disk Access), then reload the LaunchAgent. - ## Observability - Metrics: `jellyfin_metrics` ansible role diff --git a/docs/reference/services/prowler.md b/docs/reference/services/prowler.md index 9f7e4b3..f45955f 100644 --- a/docs/reference/services/prowler.md +++ b/docs/reference/services/prowler.md @@ -1,6 +1,6 @@ --- title: Prowler -modified: 2026-06-08 +modified: 2026-03-24 last-reviewed: 2026-03-24 tags: - service @@ -17,20 +17,20 @@ CIS Kubernetes Benchmark scanner for compliance posture reporting. |----------|-------| | **Namespace** | `prowler` | | **Image** | `registry.ops.eblu.me/blumeops/prowler` (see `argocd/manifests/prowler/kustomization.yaml` for current tag) | -| **Schedule** | K8s CIS: Sunday 3am | -| **Reports** | `sifaka:/volume1/reports/prowler/` (NFS) | +| **Schedule** | K8s CIS: Sunday 3am / Image: Saturday 3am / IaC: Saturday 2am | +| **Reports** | `sifaka:/volume1/reports/prowler/`, `prowler-images/`, `prowler-iac/` (NFS) | | **Manifests** | `argocd/manifests/prowler/` | ## What it does -Runs Prowler 5 as a single CronJob: +Runs Prowler 5 as two CronJobs: - **K8s CIS scan** (Sunday) — CIS Kubernetes Benchmark v1.11 checks across pod security, RBAC, apiserver, etcd, kubelet, controller-manager, and scheduler +- **Image scan** (Saturday) — CVE, secret, and misconfiguration scanning of all `blumeops/*` container images in the registry via Trivy +- **IaC scan** (Saturday) — static analysis of Dockerfiles, K8s manifests, and other IaC files in the repo via Trivy Reports are written in HTML, CSV, and JSON-OCSF to the NFS share on sifaka. -The **image** and **IaC** scans (formerly Saturday CronJobs) were retired in 2026-06 — they generated tens of thousands of un-actioned findings weekly. See [[deploy-prowler#Why only the K8s CIS scan]]. - ## See also - [[security]] — security & compliance posture overview diff --git a/docs/reference/tools/mise-tasks.md b/docs/reference/tools/mise-tasks.md index f777aa5..b614cb1 100644 --- a/docs/reference/tools/mise-tasks.md +++ b/docs/reference/tools/mise-tasks.md @@ -1,6 +1,6 @@ --- title: Mise Tasks -modified: 2026-06-09 +modified: 2026-04-11 tags: - reference - tools @@ -17,6 +17,7 @@ Run `mise tasks --sort name` for the live list with descriptions. | Task | Description | |------|-------------| +| `ai-docs` | All documentation concatenated for AI context (~85K tokens) | | `ai-sources` | All non-doc source files for deep AI context (~270K tokens) | | `docs-check-frontmatter` | Check required frontmatter fields | | `docs-check-links` | Validate wiki-links resolve correctly (supports path-based links) | diff --git a/docs/tutorials/ai-assistance-guide.md b/docs/tutorials/ai-assistance-guide.md index d3e23d7..4f0c595 100644 --- a/docs/tutorials/ai-assistance-guide.md +++ b/docs/tutorials/ai-assistance-guide.md @@ -1,6 +1,6 @@ --- title: AI Assistance Guide -modified: 2026-06-09 +modified: 2026-02-23 tags: - tutorials - ai @@ -17,7 +17,7 @@ This guide provides context for AI agents assisting with BlumeOps operations, an These are non-negotiable for AI agents working in this repo: 1. **Always use `--context=minikube-indri` with kubectl** - Work contexts exist that must never be touched -2. **Start every task by finding and reading the relevant docs** - Grep `docs/` and follow wiki-links +2. **Run `mise run ai-docs` at session start** - Review current infrastructure state 3. **Never commit secrets** - The repo is public at github.com/eblume/blumeops 4. **Wait for user review before deploying** - Create PRs, don't auto-deploy 5. **Never merge PRs without explicit request** - The user merges after review @@ -91,7 +91,8 @@ BlumeOps operations are driven by mise tasks. Run `mise tasks` to list all avail | Task | When to Use | |------|-------------| -| `ai-sources` | Deep context - all non-doc source files (~270K tokens). Ask user before running; useful for problems with a large surface area (see [[mise-tasks]]) | +| `ai-docs` | At session start - all documentation concatenated for AI context (~85K tokens, see [[mise-tasks]]) | +| `ai-sources` | Deep context - all non-doc source files (~270K tokens). Ask user before running; useful for problems with a large surface area | | `docs-mikado` | View active Mikado dependency chains for C2 changes | | `docs-mikado --resume` | Resume a C2 chain: detect branch, show state and next steps | | `provision-indri` | Deploy changes to [[indri]]-hosted services via Ansible | diff --git a/docs/tutorials/exploring-the-docs.md b/docs/tutorials/exploring-the-docs.md index 43966ec..2fd5f66 100644 --- a/docs/tutorials/exploring-the-docs.md +++ b/docs/tutorials/exploring-the-docs.md @@ -1,6 +1,6 @@ --- title: Exploring the Docs -modified: 2026-06-09 +modified: 2026-02-10 tags: - tutorials - getting-started @@ -31,6 +31,7 @@ You probably want quick access to operational details: - [How-to](/how-to/) guides for common operations (deploy, troubleshoot, update ACLs) - [Reference](/reference/) has service URLs, commands, and config locations - [[ai-assistance-guide]] explains how to work effectively with AI agents +- Run `mise run ai-docs` to prime AI context with key documentation ### For AI Agents @@ -74,7 +75,13 @@ Prek hooks validate that all wiki-links resolve to existing files and flag ambig ## AI Context Priming -AI agents prime themselves by searching `docs/` for cards relevant to the task at hand and following wiki-links from there. (The retired `ai-docs` mise task used to concatenate every doc for this purpose, but the corpus outgrew a context window.) For deep codebase questions, `mise run ai-sources` concatenates all non-doc source files. +The `ai-docs` mise task concatenates key documentation files for AI context: + +```bash +mise run ai-docs +``` + +This outputs key documentation files and a full tree listing of all docs, providing an agent with essential context for BlumeOps operations. ## Related diff --git a/mise-tasks/ai-docs b/mise-tasks/ai-docs new file mode 100755 index 0000000..66e11d7 --- /dev/null +++ b/mise-tasks/ai-docs @@ -0,0 +1,13 @@ +#!/usr/bin/env bash +#MISE description="Prime AI context with all BlumeOps documentation" + +set -euo pipefail + +DOCS_DIR="$(cd "$(dirname "$0")/.." && pwd)/docs" + +# Concatenate all docs (excluding changelog fragments) +find "$DOCS_DIR" -name '*.md' -not -path '*/changelog.d/*' | sort | while read -r f; do + printf '=== %s ===\n' "${f#"$DOCS_DIR/"}" + cat "$f" + printf '\n' +done diff --git a/mise-tasks/op-backup b/mise-tasks/op-backup index a8a5dc2..7db033b 100755 --- a/mise-tasks/op-backup +++ b/mise-tasks/op-backup @@ -86,7 +86,7 @@ def get_export_path(argv_path: str | None) -> Path | None: else: console.print("Export your vaults from the 1Password desktop app:") console.print(" 1. Open 1Password") - console.print(" 2. File > Export > (exports all vaults in the account)") + console.print(" 2. File > Export > All Vaults (or select specific vaults)") console.print(f" 3. Save as 1PUX format to: [cyan]{EXPORT_DIR}[/cyan]") console.print() raw = console.input("Path to .1pux file: ").strip() diff --git a/mise-tasks/review-compliance-reports b/mise-tasks/review-compliance-reports index f2a0a54..24d2afc 100755 --- a/mise-tasks/review-compliance-reports +++ b/mise-tasks/review-compliance-reports @@ -10,19 +10,19 @@ Covers: - Prowler K8s CIS (in-cluster): per-finding detail + - Prowler container image scans: grouped by check + resource + - Prowler IaC manifest scans: grouped by check + resource - Kingfisher secret scanning: TODO — pending upstream JSON/CSV output support (currently HTML-only; contribute from spork) -The Prowler container-image CVE scan and IaC scan were retired in 2026-06 -(see docs/how-to/operations/deploy-prowler.md) — they produced tens of -thousands of un-actioned findings weekly. Only the K8s CIS scan remains. - -For the Prowler scan, copies the two most recent CSV reports, parses +For each Prowler scan, copies the two most recent CSV reports, parses them, and displays: 1. Overall status (pass/fail/manual/muted counts) 2. Unmuted failures by severity 3. Delta from the previous report (new vs resolved) - 4. Actionable unmuted failures (per-finding detail) + 4. Actionable unmuted failures (per-finding for in-cluster; grouped + by check ID and resource for image/IaC because they have far too + many findings to list individually) This is the primary tool for the weekly compliance report review. """ @@ -39,9 +39,11 @@ from rich.console import Console from rich.panel import Panel from rich.table import Table -PROWLER_SCANS: list[tuple[str, str]] = [ - # (label, sifaka base path) - ("K8s CIS (In-Cluster)", "/volume1/reports/prowler"), +PROWLER_SCANS: list[tuple[str, str, bool]] = [ + # (label, sifaka base path, group_findings) + ("K8s CIS (In-Cluster)", "/volume1/reports/prowler", False), + ("Container Images", "/volume1/reports/prowler-images", True), + ("IaC (manifests)", "/volume1/reports/prowler-iac", True), ] console = Console() @@ -332,8 +334,14 @@ def summarize_report( tmpdir: str, *, show_muted: bool = False, + group_findings: bool = False, ) -> None: - """Fetch and summarize the latest Prowler report under `base`.""" + """Fetch and summarize the latest Prowler report under `base`. + + When `group_findings` is True, top-N CHECK_ID and RESOURCE_NAME tables + are shown instead of a per-finding detail table — appropriate for + image and IaC scans that produce thousands of findings. + """ console.rule(f"[bold]{label}[/bold]") csvs = list_reports(base) if not csvs: @@ -450,29 +458,36 @@ def summarize_report( ) console.print() - if new_keys: - console.print("[bold red]New Unmuted Failures:[/bold red]") - for k in sorted(new_keys): - r = curr_keys[k] - console.print( - f" [{r['SEVERITY']}] {r['CHECK_ID']}: " - f"{r['STATUS_EXTENDED'][:120]}" - ) - console.print() + # For grouped scans the new/resolved listings are too noisy + # (potentially thousands of lines). Skip the listings; the count + # is in the panel above and detail is in the grouped tables. + if not group_findings: + if new_keys: + console.print("[bold red]New Unmuted Failures:[/bold red]") + for k in sorted(new_keys): + r = curr_keys[k] + console.print( + f" [{r['SEVERITY']}] {r['CHECK_ID']}: " + f"{r['STATUS_EXTENDED'][:120]}" + ) + console.print() - if resolved_keys: - console.print("[bold green]Resolved:[/bold green]") - for k in sorted(resolved_keys): - r = prev_keys[k] - console.print( - f" [dim][{r['SEVERITY']}] {r['CHECK_ID']}: " - f"{r['STATUS_EXTENDED'][:120]}[/dim]" - ) - console.print() + if resolved_keys: + console.print("[bold green]Resolved:[/bold green]") + for k in sorted(resolved_keys): + r = prev_keys[k] + console.print( + f" [dim][{r['SEVERITY']}] {r['CHECK_ID']}: " + f"{r['STATUS_EXTENDED'][:120]}[/dim]" + ) + console.print() - # --- Unmuted failure details --- + # --- Unmuted failure details (grouped or per-finding) --- if latest["unmuted"]: - _print_findings_detail(latest["unmuted"]) + if group_findings: + _print_grouped_findings(latest["unmuted"]) + else: + _print_findings_detail(latest["unmuted"]) # --- Muted findings summary --- if show_muted and latest["muted"]: @@ -551,6 +566,75 @@ def _print_findings_detail(unmuted: list[dict]) -> None: console.print() +def _worst_severity(rows: list[dict]) -> str: + """Return the most severe severity label across `rows`.""" + if not rows: + return "" + return min( + (r["SEVERITY"] for r in rows), + key=lambda s: severity_sort({"SEVERITY": s}), + ) + + +def _print_grouped_findings(unmuted: list[dict], top_n: int = 15) -> None: + """Top-N tables grouped by CHECK_ID and RESOURCE_NAME. + + Used for image and IaC scans where per-finding tables would be too + large to be useful. Shows count and worst severity for each group. + """ + by_check: dict[str, list[dict]] = {} + by_resource: dict[str, list[dict]] = {} + for r in unmuted: + by_check.setdefault(r["CHECK_ID"], []).append(r) + by_resource.setdefault(r.get("RESOURCE_NAME", "") or "(no resource)", []).append(r) + + check_table = Table( + show_header=True, + header_style="bold", + title=f"Top {top_n} Checks by Unmuted Finding Count", + ) + check_table.add_column("Worst Sev") + check_table.add_column("Check ID") + check_table.add_column("Count", justify="right") + + for check, rows in sorted( + by_check.items(), key=lambda kv: -len(kv[1]) + )[:top_n]: + worst = _worst_severity(rows) + style = _sev_style(worst) + check_table.add_row( + f"[{style}]{worst}[/{style}]" if style else worst, + check, + str(len(rows)), + ) + + console.print(check_table) + console.print() + + res_table = Table( + show_header=True, + header_style="bold", + title=f"Top {top_n} Resources by Unmuted Finding Count", + ) + res_table.add_column("Worst Sev") + res_table.add_column("Resource") + res_table.add_column("Count", justify="right") + + for resource, rows in sorted( + by_resource.items(), key=lambda kv: -len(kv[1]) + )[:top_n]: + worst = _worst_severity(rows) + style = _sev_style(worst) + res_table.add_row( + f"[{style}]{worst}[/{style}]" if style else worst, + resource[:80], + str(len(rows)), + ) + + console.print(res_table) + console.print() + + def main( full: Annotated[ bool, typer.Option(help="(reserved) currently a no-op; all unmuted failures already shown") @@ -562,12 +646,13 @@ def main( del full # historical flag, kept for backwards compatibility with tempfile.TemporaryDirectory() as tmpdir: - for label, base in PROWLER_SCANS: + for label, base, group in PROWLER_SCANS: summarize_report( label, base, tmpdir, show_muted=show_muted, + group_findings=group, ) # --- Node-level MANUAL check verification --- diff --git a/mise-tasks/service-review b/mise-tasks/service-review index d22097f..f83b104 100755 --- a/mise-tasks/service-review +++ b/mise-tasks/service-review @@ -8,7 +8,7 @@ #USAGE flag "--type " help="Filter by service type (argocd, ansible, nixos, fly, mise)" """Review the most stale service for version freshness. -Reads ``service-versions.yaml`` (repo root) and sorts services +Reads ``docs/reference/services/service-versions.yaml`` and sorts services by the ``last-reviewed`` field. Services without the field (or null) are treated as never-reviewed and float to the top. Displays a staleness table and then shows the most stale service with a review checklist. @@ -210,7 +210,7 @@ def main( "• Verify the service is running and healthy\n", "• Check logs for errors or warnings\n", "\n[bold]After Review:[/bold]\n", - "• Update the tracking file: [cyan]service-versions.yaml[/cyan] (repo root)\n", + "• Update the tracking file: [cyan]docs/reference/services/service-versions.yaml[/cyan]\n", f"• Set [cyan]last-reviewed: {today}[/cyan] and [cyan]current-version[/cyan]\n", "• Commit the change (along with any upgrades)", ] diff --git a/nixos/ringtail/flake.lock b/nixos/ringtail/flake.lock index 340bd9d..bb60501 100644 --- a/nixos/ringtail/flake.lock +++ b/nixos/ringtail/flake.lock @@ -7,11 +7,11 @@ ] }, "locked": { - "lastModified": 1780894562, - "narHash": "sha256-c3430xwxwhHipl3jigUGMMBfpaMylDqytW/kdmB3ZGs=", + "lastModified": 1780290312, + "narHash": "sha256-eTAlX0CwgB84Ts3GaBd944A3DRXVMzgA0EqroZBISUo=", "owner": "nix-community", "repo": "disko", - "rev": "24fed06cac83bcc44ac8efbb57cab1a82fa0bedc", + "rev": "115e5211780054d8a890b41f0b7734cafad54dfe", "type": "github" }, "original": { @@ -43,11 +43,11 @@ }, "nixpkgs": { "locked": { - "lastModified": 1780511130, - "narHash": "sha256-2v9lT4ya59Lh1FqPeLnz1MoX9y/wz2huqfe9RtQZITk=", + "lastModified": 1779796641, + "narHash": "sha256-ZsIrKmhp4vbBXoXXmR/tBXA/UCsAQiJL9vsgZEduhVY=", "owner": "NixOS", "repo": "nixpkgs", - "rev": "535f3e6942cb1cead3929c604320d3db54b542b9", + "rev": "25f538306313eae3927264466c70d7001dcea1df", "type": "github" }, "original": { diff --git a/service-versions.yaml b/service-versions.yaml index 95b9e44..866c687 100644 --- a/service-versions.yaml +++ b/service-versions.yaml @@ -440,20 +440,14 @@ services: - name: jellyfin type: ansible - last-reviewed: 2026-06-08 - current-version: "10.11.11" + last-reviewed: 2026-03-17 + current-version: "10.11.6" upstream-source: https://github.com/jellyfin/jellyfin/releases - notes: >- - Homebrew cask (state: present, unpinned). Upgrade with - `brew upgrade --cask jellyfin` on indri. After upgrade the .app is - re-quarantined; launchd-spawned launch hangs silently until the - Gatekeeper first-launch dialog is approved on indri's GUI console - (xattr removal over SSH is blocked by TCC). - name: automounter type: ansible - last-reviewed: 2026-06-09 - current-version: "1.13.0" + last-reviewed: 2026-03-17 + current-version: "1.11.0" upstream-source: https://www.pixeleyes.co.nz/automounter/ notes: Mac App Store app, no Ansible role. Updates via App Store.