C1: docs-first removal of compensating-controls framework

Deletes the CC how-to and explanation docs, and the orphan changelog fragments describing CC reviews. Updates security.md and read-compliance-reports.md to describe muting in terms of the mutelist files only. Adds the branch changelog fragment. Mutelist YAML files, the Prowler CronJobs, and the review-compliance-reports task all stay — they're updated in the next commit.
2026-05-22 20:09:28 -07:00 · 69737dc915
commit 69737dc915
parent 2fae0f7161
12 changed files with 4 additions and 243 deletions
--- a/docs/changelog.d/+compliance-mute-categories.doc.md
+++ b/docs/changelog.d/+compliance-mute-categories.doc.md
@@ -1 +0,0 @@
 New explanation article [[compliance-mute-categories]] documenting the gap between current `CC:`-only mute tagging and the three structurally distinct categories (compensating control, not-applicable, risk-accepted) needed for real PCI DSS / SOC2 practice. Captures the current image-scan mutelist gap (`cronjob-image-scan.yaml` doesn't pass `--mutelist-file`) and proposes an order-of-operations for wiring it up alongside the new tag conventions. Triggered by CVE-2026-31789, an OpenSSL 32-bit-only finding that surfaced the need for an NA category.
--- a/docs/changelog.d/+review-cc-ephemeral-privileged-jobs.misc.md
+++ b/docs/changelog.d/+review-cc-ephemeral-privileged-jobs.misc.md
@@ -1 +0,0 @@
 Reviewed compensating control `ephemeral-privileged-jobs`: TTL and hostPID scope verified on indri. Noted that the alloy-tracing DaemonSet on ringtail is out of scope until Prowler scans ringtail (tracked in Todoist).
--- a/docs/changelog.d/+review-cc-init-container-isolation.misc.md
+++ b/docs/changelog.d/+review-cc-init-container-isolation.misc.md
@@ -1 +0,0 @@
 Reviewed compensating control `init-container-isolation` (35 days stale). Grafana's running pod matches the manifest and the CC's claim — only `init-chown-data` runs as root with `CHOWN`; runtime containers all run as UID 472 with all caps dropped. Retirement (replacing init-chown-data with `fsGroup` alone) is plausible given the in-tree minikube-hostpath provisioner, but deferred until grafana lands on ringtail's k3s — note added to the CC.
--- a/docs/changelog.d/+review-cc-trusted-ci-only.misc.md
+++ b/docs/changelog.d/+review-cc-trusted-ci-only.misc.md
@@ -1 +0,0 @@
 Reviewed compensating control `trusted-ci-only`: Forgejo runner is registered only to the private forge, which has registration disabled — no untrusted users can create repos or trigger privileged CI. Tightened the notes to reflect that the closed-forge property (not a per-repo allow-list) is what actually mitigates the risk.
--- a/docs/changelog.d/prowler-iac-mutelist.infra.md
+++ b/docs/changelog.d/prowler-iac-mutelist.infra.md
@@ -1 +1 @@
-Address the 6 critical Prowler IaC findings against `argocd/manifests/`. Prowler's IaC provider hardcodes `self._mutelist = None` and delegates filtering to Trivy, but doesn't plumb `--ignorefile` through — so the documented "use Trivy filtering" path is actually broken. Added a shim around `trivy` in the Prowler image that injects `--ignorefile $TRIVY_IGNOREFILE` for `trivy fs` invocations when the env var points at a real file. The IaC cronjob now mounts `mutelist/trivyignore.yaml` (Trivy's per-path schema) and sets the env var. Two new compensating controls — `operator-purpose-bound-rbac` and `kube-state-metrics-metadata-only` — justify muting the `external-secrets` and `kube-state-metrics` Secret-access findings (KSV-0041, KSV-0114). Separately, `grafana-clusterrole` is tightened to remove `secrets` access entirely: the dashboard sidecar already only consumes ConfigMap-labeled dashboards, so its `RESOURCE` env var is now `configmap` instead of `both`.
+Address the 6 critical Prowler IaC findings against `argocd/manifests/`. Prowler's IaC provider hardcodes `self._mutelist = None` and delegates filtering to Trivy, but doesn't plumb `--ignorefile` through — so the documented "use Trivy filtering" path is actually broken. Added a shim around `trivy` in the Prowler image that injects `--ignorefile $TRIVY_IGNOREFILE` for `trivy fs` invocations when the env var points at a real file. The IaC cronjob now mounts `mutelist/trivyignore.yaml` (Trivy's per-path schema) and sets the env var, muting the `external-secrets` and `kube-state-metrics` Secret-access findings (KSV-0041, KSV-0114). Separately, `grafana-clusterrole` is tightened to remove `secrets` access entirely: the dashboard sidecar already only consumes ConfigMap-labeled dashboards, so its `RESOURCE` env var is now `configmap` instead of `both`.
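The argument injection this fragment describes can be sketched as follows. This is an illustration in Python only: the commit does not show the real shim, so its language, the `inject_ignorefile` name, the "first argument is `fs`" test, and the skip-if-already-present check are all assumptions. Only the `--ignorefile` flag and the `TRIVY_IGNOREFILE` variable come from the fragment itself.

```python
import os
from typing import Mapping, Sequence


def inject_ignorefile(argv: Sequence[str], env: Mapping[str, str]) -> list[str]:
    """Return trivy's argument list, adding --ignorefile for `trivy fs` runs.

    argv excludes the program name, e.g. ["fs", "--format", "json", "."].
    """
    args = list(argv)
    ignorefile = env.get("TRIVY_IGNOREFILE", "")

    # Assumption: the subcommand is always the first argument.
    is_fs_scan = bool(args) and args[0] == "fs"

    # Assumption: never add the flag twice if the caller already passed it.
    already_set = any(
        arg == "--ignorefile" or arg.startswith("--ignorefile=") for arg in args
    )

    # "Points at a real file": the variable is set and names an existing file.
    if is_fs_scan and ignorefile and os.path.isfile(ignorefile) and not already_set:
        # Insert right after the subcommand so the scan target stays last.
        return [args[0], "--ignorefile", ignorefile, *args[1:]]
    return args
```

A wrapper built on this would compute the new argument list and then exec the real `trivy` binary with it; every other subcommand passes through untouched.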
--- a/docs/changelog.d/review-cc-observability-stack-audit-2026-05-11.infra.md
+++ b/docs/changelog.d/review-cc-observability-stack-audit-2026-05-11.infra.md
@@ -1 +0,0 @@
 Reviewed compensating control `observability-stack-audit`. Updated description to cover ringtail's k3s as well as indri's minikube; both Alloy DaemonSets and Loki are healthy.
--- a/docs/changelog.d/rip-out-compensating-controls.infra.md
+++ b/docs/changelog.d/rip-out-compensating-controls.infra.md
@@ -0,0 +1 @@
 Ripped out the compensating-controls (CC) framework: deleted `compensating-controls.yaml`, the `review-compensating-controls` mise task, and the associated how-to / explanation docs. Prowler and Kingfisher continue to run weekly and produce reports; the Prowler mutelist YAML files remain in place but no longer carry `CC: <id>` prefixes — each entry just keeps a free-form `Description` of why the finding is muted. The CC review cadence proved to be more overhead than this single-operator homelab needed.
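Since the mutelist files only lose their `CC: <id>` prefixes in the follow-up commit, a small check for leftovers may be useful. The sketch below is a hypothetical helper, not part of this repo. It walks an already-parsed mutelist (a plain dict, however the YAML was loaded) and reports every `Description` that still starts with a `CC:` tag. The check names in the usage example are invented.

```python
import re
from typing import Any, Iterator

# Matches a leading "CC: some-control-id" tag at the start of a description.
CC_PREFIX = re.compile(r"^\s*CC:\s*[\w-]+")


def iter_descriptions(node: Any, path: tuple[str, ...] = ()) -> Iterator[tuple[str, str]]:
    """Yield (location, text) for every string-valued "Description" key."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "Description" and isinstance(value, str):
                yield "/".join(path), value
            else:
                yield from iter_descriptions(value, path + (str(key),))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from iter_descriptions(item, path + (str(index),))


def leftover_cc_entries(mutelist: dict[str, Any]) -> list[str]:
    """Return the locations whose Description still carries a CC: prefix."""
    return [where for where, text in iter_descriptions(mutelist) if CC_PREFIX.match(text)]


if __name__ == "__main__":
    sample = {"Mutelist": {"Accounts": {"*": {"Checks": {
        "example_check_a": {"Resources": ["x"],
                            "Description": "CC: single-user-cluster. Only the operator has access."},
        "example_check_b": {"Resources": ["y"],
                            "Description": "Init container needs CHOWN."},
    }}}}}
    for location in leftover_cc_entries(sample):
        print(location)
```

Walking the structure generically, rather than hardcoding the nesting, keeps the check independent of the exact mutelist layout.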
--- a/docs/explanation/compliance-mute-categories.md
+++ b/docs/explanation/compliance-mute-categories.md
@@ -1,99 +0,0 @@
 ---
 title: Compliance Mute Categories
 modified: 2026-05-04
 last-reviewed: 2026-05-04
 tags:
  - explanation
  - security
  - compliance
 ---
 # Compliance Mute Categories
 > **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
 How BlumeOps should categorize muted compliance findings, why a single "compensating control" tag is not enough, and what tooling work is needed to support multiple categories cleanly.
 ## Why this matters
 When a compliance scanner ([[prowler]], Trivy via Prowler IaC, Kingfisher) reports a failing finding, there are three structurally different reasons we might suppress it:
 1. **Compensating control (CC)** — the requirement applies and we *do not* meet it directly, but an alternative control mitigates the same risk.
 2. **Not applicable (NA)** — the requirement's preconditions cannot be satisfied in our environment, so the finding is structurally inert (e.g. a 32-bit-only CVE on 64-bit-only hosts).
 3. **Risk accepted (RA)** — the requirement applies, we do not meet it, no compensating control exists, and we have explicitly chosen to accept the residual risk for a bounded period.
 Today every muted finding in BlumeOps uses the `CC: <id>` convention. That conflates all three categories. In a real PCI DSS or SOC2 environment, auditors treat them very differently:
 - A CC requires documentation of the constraint, the alternative measure, and recurring validation that the measure still works.
 - An NA requires documentation of *why* the precondition cannot be met, with periodic verification that the environmental fact still holds.
 - An RA requires an explicit decision-maker, an expiry date, and a scheduled re-decision.
 Mixing them under one tag means stale CCs hide stale RAs, and NAs that should be revisited when the environment changes get treated as permanent fixtures.
 ## Trigger case: CVE-2026-31789
 The 2026-05-03 weekly compliance review surfaced [CVE-2026-31789](https://nvd.nist.gov/vuln/detail/CVE-2026-31789), an OpenSSL heap buffer overflow during X.509 certificate processing on **32-bit systems**. Prowler's image scanner flagged 216 findings across 106 BlumeOps images carrying `libssl3` / `libcrypto3` below the fixed versions.
 The CVE is genuine, but its preconditions cannot be satisfied in our environment: indri is Apple Silicon (arm64), ringtail is x86_64, and we run no 32-bit containers. This is the canonical NA case — not a CC, because there is no "alternative measure mitigating the risk." The risk does not exist for us at all.
 A CC like `no-32bit-runtimes` would technically work, but conflates the categories: if we ever introduce a 32-bit runtime we would have to remember that this CC was load-bearing for the mute, retire or scope it down, and reopen the muted findings. An NA tag with a short justification makes the precondition explicit and self-documents the conditions under which it must be revisited.
 ## Current tooling state
 Three Prowler scans run weekly. Their mute paths today:
 | Scan | Mute mechanism | File(s) |
 |------|----------------|---------|
 | K8s CIS (Sunday) | Prowler `--mutelist-file`, merged from ConfigMap | `argocd/manifests/prowler/mutelist/*.yaml` |
 | IaC (Saturday) | Trivy `--ignorefile` shim (Prowler's `--mutelist-file` is a no-op for IaC) | `argocd/manifests/prowler/mutelist/trivyignore.yaml` |
 | Container Images (Saturday) | **None — `cronjob-image-scan.yaml` does not pass `--mutelist-file`** | n/a |
 The image scan has never been wired to a mutelist. The CSV reports do contain a `MUTED` column, but it is always `False` because no mutelist is supplied. All 14k+ image findings flow through to `review-compliance-reports` unfiltered.
 The mute tag convention is consistent across the two configured scans: each entry's `Description:` (or `statement:` for trivyignore) starts with `CC: <id>. <freeform>`. `mise run review-compensating-controls` greps for those IDs to find every file that depends on each control. There is no NA tag, no RA tag, and no expiry field.
 ## Proposed model
 ### Tag prefixes
 Extend the description-prefix convention:
 - `CC: <control-id>. <description>` — references an entry in `compensating-controls.yaml`. Existing convention, unchanged.
 - `NA: <reason>. <description>` — environmental precondition fails. Reason should be specific enough that a reviewer can verify it (e.g. `NA: no 32-bit runtimes`, not `NA: doesn't apply`).
 - `RA: <reason>; expires <YYYY-MM-DD>. <description>` — explicit risk acceptance with a hard expiry. Past the expiry, re-review is mandatory.
 Tag choice is exclusive: a given mute is one of CC, NA, or RA. If two reasons apply, pick the strongest — CC > RA > NA.
 ### Tooling changes required
 1. **Wire the image scan to a mutelist.** Add `argocd/manifests/prowler/mutelist/image-cves.yaml`, mount-and-merge it the same way `cronjob.yaml` mounts its mutelist parts, and pass `--mutelist-file` to `prowler image`. Verify experimentally that `prowler image` honors the flag — Prowler's behavior across providers is inconsistent, and the IaC provider notably does not. If `prowler image` ignores it, fall back to post-scan filtering inside `review-compliance-reports`.
 2. **Teach `review-compensating-controls` (or a sibling) to surface NA and RA entries.** CCs already get a staleness queue. NAs should appear in a separate queue keyed on the reason text — when an NA reason becomes false (e.g. we do introduce a 32-bit runtime), every NA mute citing that reason must be reopened. RAs should sort by expiry date, with anything past expiry flagged red.
 3. **Expiry parsing.** RA tags carry a hard date. The simplest path is to parse it from the description string at review time. A more durable path is to extend the mutelist YAML schema with a structured `expires:` field and a small wrapper that strips it before passing the file to Prowler. Either works; the structured field is friendlier to editors.
 ### Out of scope (for now)
 - Changing the underlying Prowler mutelist YAML schema. Stay within the `Mutelist:` shape Prowler expects.
 - Migrating existing `CC:` entries. The current set is genuinely CCs and should stay tagged that way.
 - Building an issue-tracker integration. Todoist is the source of truth for "remember to re-review this" until that scales painfully.
 ## Order of operations
 When this work is picked up, the suggested sequence is:
 1. **Scope and confirm.** Re-read this article, confirm the model still fits, adjust if not.
 2. **Wire the image-scan mutelist.** Smallest atomic change; produces immediate value (the CVE-2026-31789 mute can land as the first NA entry).
 3. **Add the NA convention.** Update [[read-compliance-reports]] and [[review-compensating-controls]] how-tos to describe the three tag prefixes. The convention can land before tooling supports it — review will just be manual until tooling catches up.
 4. **Extend the review tools.** Add NA and RA queues to `review-compensating-controls` (or a new task). At this point, parse expiry from RA descriptions.
 5. **Optionally: structured expiry.** If RA entries become common, migrate to a structured `expires:` YAML field with a wrapper that filters it out before Prowler reads the file.
 The first three steps are a coherent C1. Steps 4–5 can be split off if scope creeps.
 ## Related
 - [[read-compliance-reports]] — the weekly review process this feeds into
 - [[review-compensating-controls]] — current CC review tooling
 - [[security-model]] — overall security posture
 - [[prowler]] — scanner reference
 - [[agent-change-process]] — how to scope and execute the implementation
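For reference, the three tag prefixes proposed in the deleted article, and its "parse the expiry from the description string" option, could look roughly like this. It is a sketch under assumptions: `MuteTag`, `parse_mute_tag`, and `is_expired` are hypothetical names, the regex assumes the reason text contains no `.` or `;`, and nothing like this was ever implemented in the repo.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Optional

# "<KIND>: <reason>[; expires YYYY-MM-DD]. <free-form description>"
TAG_RE = re.compile(
    r"^(?P<kind>CC|NA|RA):\s*(?P<reason>[^.;]+?)"
    r"(?:;\s*expires\s+(?P<expires>\d{4}-\d{2}-\d{2}))?"
    r"\.\s*(?P<text>.*)$",
    re.DOTALL,
)


@dataclass(frozen=True)
class MuteTag:
    kind: str                # "CC", "NA", or "RA"
    reason: str              # control id (CC) or reason text (NA / RA)
    expires: Optional[date]  # only set for RA
    text: str                # remaining free-form description


def parse_mute_tag(description: str) -> Optional[MuteTag]:
    """Parse a tagged description; return None if it is untagged or malformed."""
    match = TAG_RE.match(description.strip())
    if match is None:
        return None
    kind, expires = match.group("kind"), match.group("expires")
    # Only RA carries an expiry, and an RA without one is treated as malformed.
    if (kind == "RA") != (expires is not None):
        return None
    try:
        expiry = date.fromisoformat(expires) if expires else None
    except ValueError:  # e.g. month 13
        return None
    return MuteTag(kind, match.group("reason").strip(), expiry, match.group("text").strip())


def is_expired(tag: MuteTag, today: date) -> bool:
    """An RA is due for re-review once today is past its expiry date."""
    return tag.expires is not None and tag.expires < today
```

Sorting RA tags by `expires` and flagging those where `is_expired` holds would give the review queue the article proposed.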
--- a/docs/how-to/operations/read-compliance-reports.md
+++ b/docs/how-to/operations/read-compliance-reports.md
@@ -80,7 +80,7 @@ Not all failures require action. Common expected failures in our minikube cluste
 1. **Triage** — review new failures, distinguish real issues from expected noise
 2. **Remediate** — fix what you can (pod security contexts, RBAC tightening)
-3. **Mutelist** — suppress expected/accepted failures via Prowler's `--mutelist-file` to reduce noise in future scans
+3. **Mutelist** — suppress expected/accepted failures by adding a Resource entry under the matching Check in `argocd/manifests/prowler/mutelist/*.yaml` with a free-form `Description` explaining why
 4. **Track** — compare reports over time to spot regressions
 ## Related
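The shape of the mute entry that the reworded step 3 asks for can be illustrated on a parsed mutelist. The sketch below assumes the nesting used by Prowler's mutelist format (`Mutelist` → `Accounts` → account → `Checks` → check → `Regions` / `Resources` / `Description`). The `add_mute` helper, the `"*"` account default, and the names in the usage note are hypothetical, so check the layout against the existing files under `argocd/manifests/prowler/mutelist/` before relying on it.

```python
from typing import Any


def add_mute(
    mutelist: dict[str, Any],
    check: str,
    resource: str,
    description: str,
    account: str = "*",
) -> dict[str, Any]:
    """Add `resource` under `check`, with a free-form reason for the mute."""
    if not description.strip():
        raise ValueError("a mute needs a Description explaining why")

    checks = (
        mutelist.setdefault("Mutelist", {})
        .setdefault("Accounts", {})
        .setdefault(account, {})
        .setdefault("Checks", {})
    )
    # Create the check entry on first use; "*" for Regions is an assumption.
    entry = checks.setdefault(check, {"Regions": ["*"], "Resources": []})
    if resource not in entry["Resources"]:
        entry["Resources"].append(resource)
    # Keep an existing Description rather than silently overwriting it.
    entry.setdefault("Description", description.strip())
    return mutelist
```

For example, `add_mute({}, "example_check", "example-pod", "Init container needs CHOWN; runtime containers drop all caps.")` yields one check with one resource and a plain-language reason, with no `CC:` prefix.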
--- a/docs/how-to/operations/record-review-evidence.md
+++ b/docs/how-to/operations/record-review-evidence.md
@@ -1,50 +0,0 @@
 ---
 title: Record Review Evidence
 modified: 2026-04-01
 last-reviewed: 2026-04-01
 tags:
  - how-to
  - security
  - compliance
 ---
 # Record Review Evidence
 How review evidence *would* be captured after a [[review-compensating-controls|compensating control review]], to make the review auditable under a compliance framework.
 blumeops does not currently collect review evidence. This card documents the target process for reference and practice.
 ## Why Record Evidence?
 Reviewing a control and updating `last-reviewed` proves the review *happened* but not *what was checked*. Under frameworks like PCI DSS v4.0, a QSA needs to see dated, immutable evidence that the reviewer verified the control and that an appropriate party accepted the residual risk. Compliance platforms like Drata automate this collection, but the underlying artifacts are the same whether you use a platform or a directory of files.
 ## What Evidence Would Be Captured
 For each control reviewed, artifacts should answer:
 1. **Who reviewed it** — reviewer name, date
 2. **What was verified** — the specific checks performed (e.g., Tailscale ACL policy snapshot, `tailscale status` output, kubectl auth checks)
 3. **What was found** — the outcome: control still in effect, circumstances changed, or control invalidated
 4. **Residual risk** — what the control does *not* cover (the gap a QSA will ask about)
 5. **Acceptance** — formal sign-off that the residual risk is accepted by an appropriate party (reviewer + approver, typically a manager or CTO)
 Supporting artifacts would include command output, policy snapshots, screenshots, or API responses — anything that demonstrates the verification was actually performed.
 ## PCI DSS Context
 Under PCI DSS v4.0, compensating controls require a **Compensating Control Worksheet (CCW)** that maps each control to the original requirement it substitutes for. The CCW fields are:
 - **Original requirement** — the specific PCI DSS requirement not directly met
 - **Constraint** — why direct compliance isn't feasible
 - **Compensating control definition** — what is done instead
 - **Risk addressed** — how the control mitigates the original threat
 - **Residual risk** — what remains unmitigated
 - **Validation procedure** — steps to verify (what `notes` captures in `compensating-controls.yaml`)
 Req 12.3.2 mandates review **at least annually** (quarterly is typical for Level 1 Service Providers). In a platform like Drata, these map to Controls with uploaded Evidence and review workflows requiring sign-off from both the reviewer and an approver.
 ## Related
 - [[review-compensating-controls]] — The technical review process
 - [[security]] — Security posture overview
 - [[read-compliance-reports]] — Interpreting Prowler/Kingfisher reports
--- a/docs/how-to/operations/review-compensating-controls.md
+++ b/docs/how-to/operations/review-compensating-controls.md
@@ -1,80 +0,0 @@
 ---
 title: Review Compensating Controls
 modified: 2026-03-30
 last-reviewed: 2026-03-30
 tags:
  - how-to
  - security
  - maintenance
 ---
 # Review Compensating Controls
 How to periodically review compensating controls that justify suppressed security findings.
 ## Review by Staleness
 Show controls sorted by when they were last reviewed (most stale first):
 ```bash
 mise run review-compensating-controls
 ```
 This reads `compensating-controls.yaml` (repo root), sorts by `last-reviewed`, and displays the most stale control with all codebase references. It also searches for every file that references the control ID, so you can see exactly which suppressed findings depend on it.
 To show more entries:
 ```bash
 mise run review-compensating-controls --limit 20
 ```
 ## What is a Compensating Control?
 A compensating control is a security measure that mitigates the risk a finding was designed to detect, when the finding itself cannot be directly remediated. For example:
 - **Finding:** API server does not enable AlwaysPullImages admission plugin
 - **Risk:** Untrusted users could run pods using cached images they shouldn't have access to
 - **Compensating control:** `single-user-cluster` — only the operator has kubectl access; no untrusted users can create pods
 Controls are documented in `compensating-controls.yaml` and referenced from security tool configurations (Prowler mutelist files, Kingfisher config, etc.) using the format `CC: <control-id>`.
 A compensating control is only one of three structurally distinct ways to suppress a finding — see [[compliance-mute-categories]] for when to reach for a CC versus a not-applicable (`NA:`) or risk-accepted (`RA:`) tag instead.
 ## Review Process
 For each control up for review:
 1. **Understand the risk.** Read each suppressed finding that references this control. What attack or misconfiguration does the original check guard against?
 2. **Verify the control is in effect.** Follow the verification steps in the control's `notes` field. For example, for `tailscale-network-isolation`, check that the cluster is not directly internet-exposed and Tailscale ACLs are enforced.
 3. **Assess whether the control actually mitigates the risk.** A compensating control should address the same threat the check was designed to catch, not just be a vaguely related security measure. If it doesn't hold up, either:
   - Fix the underlying finding and remove the suppression
   - Document a stronger or more specific compensating control
 4. **Check for changed circumstances.** Has the cluster gained new users? Has a service been exposed publicly? Has an operator added native support for the missing feature? Any of these could invalidate the control.
 5. **Update the review date.** Edit `compensating-controls.yaml` and set `last-reviewed` to today's date. Commit alongside any changes.
 ## Adding a New Control
 When suppressing a new security finding, either map it to an existing control or add a new one:
 ```yaml
 - id: my-new-control
  description: >-
    What this control does and how it mitigates the specific risk.
  created: 2026-03-30
  last-reviewed: 2026-03-30
  notes: >-
    How to verify this control is still in effect.
 ```
 Then reference it in the suppression configuration with `CC: my-new-control`.
 ## Related
 - [[record-review-evidence]] — Capturing evidence artifacts for audit (aspirational)
 - [[security]] — Security posture overview
 - [[read-compliance-reports]] — Accessing and interpreting Prowler reports
 - [[review-services]] — Periodic service version review (similar staleness pattern)
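For context on what is being removed: the staleness ordering that the deleted how-to attributes to `mise run review-compensating-controls` amounts to a sort on `last-reviewed`. The sketch below is not the removed task's code. `rank_by_staleness` is a hypothetical name, the default of one entry is inferred from "displays the most stale control", and the dates in the usage example are invented.

```python
from datetime import date
from typing import Any


def rank_by_staleness(
    controls: list[dict[str, Any]], today: date, limit: int = 1
) -> list[tuple[str, int]]:
    """Return (control id, days since last review), most stale first."""
    ranked = sorted(controls, key=lambda control: control["last-reviewed"])
    return [
        (control["id"], (today - control["last-reviewed"]).days)
        for control in ranked[:limit]
    ]


if __name__ == "__main__":
    # Invented review dates, for illustration only.
    controls = [
        {"id": "trusted-ci-only", "last-reviewed": date(2026, 5, 1)},
        {"id": "init-container-isolation", "last-reviewed": date(2026, 4, 6)},
    ]
    for control_id, age in rank_by_staleness(controls, date(2026, 5, 11), limit=20):
        print(f"{control_id}: {age} days since last review")
```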
--- a/docs/reference/operations/security.md
+++ b/docs/reference/operations/security.md
@@ -46,13 +46,7 @@ Security posture and compliance scanning for BlumeOps infrastructure.
 All compliance scan reports are stored on `sifaka:/volume1/reports/`. See [[read-compliance-reports]] for access and interpretation.
-## Compensating controls
+Suppressed findings are kept in Prowler mutelist YAML under `argocd/manifests/prowler/mutelist/`. Each entry's `Description` field explains why the finding is muted; entries are reviewed ad-hoc rather than on a scheduled cadence.
-Suppressed findings reference named compensating controls tracked in `compensating-controls.yaml` (repo root). Each control has a review date and verification steps. See [[review-compensating-controls]] for the review process.
-```bash
-mise run review-compensating-controls
-```
 ## Known gaps