# blumeops/compensating-controls.yaml

# Compensating Controls
#
# Documents controls that mitigate risks from suppressed or accepted security
# findings. Referenced by security tools (Prowler mutelist, Kingfisher config,
# etc.) via "CC: <id>" in finding descriptions or suppression notes.
#
# Used by `mise run review-compensating-controls` to surface stale controls.
#
# Fields:
#   id              - kebab-case unique identifier, referenced from tool configs
#   description     - what the control actually does to mitigate risk
#   created         - date (YYYY-MM-DD) the control was documented
#   last-reviewed   - date (YYYY-MM-DD) of the most recent review, or null if never reviewed
#   notes           - optional context
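#
# Example (a sketch, not copied from any real tool config): a new entry
# would follow the field layout above. The id, dates, and text below are
# hypothetical placeholders.
#
#   - id: example-control
#     description: >-
#       What the control does to mitigate the accepted risk.
#     created: 2026-01-01
#     last-reviewed: null
#     notes: >-
#       How to verify that the control still holds.
#
# A suppression in a tool config would then cite the control by id in a
# free-text description or note. The field name depends on the tool and is
# an assumption here:
#
#   Description: "Accepted on a single-operator cluster. CC: single-user-cluster"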

controls:
  - id: single-user-cluster
    description: >-
      Only the cluster operator (eblume) has kubectl access. No untrusted
      users can create pods, access cached images, or bind RBAC roles.
    created: 2026-03-30
    last-reviewed: 2026-04-01
    notes: >-
      Verify by checking kubeconfig distribution and Tailscale ACLs.
      If additional users gain cluster access, re-evaluate all findings
      muted under this control.

  - id: tailscale-network-isolation
    description: >-
      Cluster is not internet-exposed. All access requires Tailscale
      identity with ACL enforcement. Profiling endpoints, debug ports,
      and control-plane APIs are unreachable from the public internet.
    created: 2026-03-30
    last-reviewed: 2026-04-06
    notes: >-
      Verify with 'tailscale serve status --json' on indri and review
      Tailscale ACLs in pulumi/tailscale/. Only tag:flyio-target services
      are publicly routable.

  - id: local-registry
    description: >-
      Operator-built services use a private zot registry
      (registry.ops.eblu.me) for supply-chain control. Remaining
      images are pulled from public registries without stored
      credentials. No shared registry secrets are cached on cluster
      nodes.
    created: 2026-03-30
    last-reviewed: 2026-04-12
    notes: >-
      Verify by checking image prefixes in kustomization.yaml files.
      Known external-image categories: (1) upstream apps not yet
      mirrored — immich, ollama, frigate, frigate-notify, valkey;
      (2) infrastructure components — tailscale operator/proxy,
      external-secrets, 1password-connect, forgejo-runner, docker
      DinD, nvidia-device-plugin; (3) utility base images — busybox,
      alpine (grafana init containers). Track upstream versions in
      service-versions.yaml. Goal is to progressively mirror these
      into zot.

  - id: sso-gated-admin-tools
    description: >-
      ArgoCD requires SSO authentication via Authentik OIDC. Wildcard
      RBAC roles are mitigated by requiring authenticated identity
      before any API access.
    created: 2026-03-30
    last-reviewed: 2026-04-14
    notes: >-
      Verify Authentik OIDC provider config for ArgoCD and that
      anonymous access is disabled. Check that the ArgoCD --auth-token
      has not leaked. The workflow-bot API key account is scoped to
      sync/get only.

  - id: operator-managed-pods
    description: >-
      Tailscale operator manages proxy pod specs (ts-*, ingress-*,
      operator-*, nameserver-*). Pod security settings are set by the
      operator, not user manifests. Operator is tracked in
      service-versions.yaml and regularly updated.
    created: 2026-03-30
    last-reviewed: 2026-04-21
    notes: >-
      Verify operator version is current via 'mise run service-review'.
      Check Tailscale changelog for security fixes. If operator adds
      seccomp support, remove these mutes. As of 2026-04-21: still no
      default seccomp on operator-generated pods (upstream issue #7359
      open). A ProxyClass + generic device plugin can downgrade proxies
      from privileged to NET_ADMIN+NET_RAW and set seccompProfile —
      potential future remediation to remove the seccomp mute without
      waiting for upstream defaults.
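    # Sketch of the ProxyClass remediation described above. This is an
    # untested assumption, not current cluster config: the ProxyClass name
    # and the device-plugin resource name are hypothetical, and the field
    # paths should be checked against the installed operator's ProxyClass
    # CRD before use.
    #
    #   apiVersion: tailscale.com/v1alpha1
    #   kind: ProxyClass
    #   metadata:
    #     name: unprivileged-proxy            # hypothetical name
    #   spec:
    #     statefulSet:
    #       pod:
    #         tailscaleContainer:
    #           securityContext:
    #             privileged: false
    #             capabilities:
    #               add: ["NET_ADMIN", "NET_RAW"]
    #             seccompProfile:
    #               type: RuntimeDefault
    #           resources:
    #             limits:
    #               example.com/tun: "1"      # hypothetical device-plugin resource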

  - id: ephemeral-privileged-jobs
    description: >-
      Prowler CIS scanner runs as a CronJob with 7-day TTL
      auto-deletion, not as a persistent privileged workload. hostPID
      exposure is time-bounded to scan duration (~20s).
    created: 2026-03-30
    last-reviewed: 2026-04-29
    notes: >-
      Verify TTL is set in cronjob.yaml. Check that no persistent
      pods run with hostPID on the scanned cluster (indri). The
      alloy-tracing DaemonSet on ringtail also uses hostPID but is
      out of scope — Prowler only scans indri. Tracked in Todoist:
      "prowler scan against ringtail" — once that lands, the
      DaemonSet's hostPID+privileged posture will surface as a CIS
      finding and need its own CC or remediation.
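    # Sketch for the TTL check above, assuming cronjob.yaml is a standard
    # batch/v1 CronJob. This is a fragment only; the name and schedule are
    # hypothetical. The field to look for is ttlSecondsAfterFinished on the
    # Job template (7 days = 604800 seconds).
    #
    #   apiVersion: batch/v1
    #   kind: CronJob
    #   metadata:
    #     name: prowler-scan          # hypothetical name
    #   spec:
    #     schedule: "0 3 * * 0"       # hypothetical schedule
    #     jobTemplate:
    #       spec:
    #         ttlSecondsAfterFinished: 604800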

  - id: trusted-ci-only
    description: >-
      Forgejo runner only executes workflows from repos on the private
      forge (forge.ops.eblu.me). No external or untrusted repos can
      trigger privileged CI jobs.
    created: 2026-03-30
    last-reviewed: 2026-05-01
    notes: >-
      Verification: (1) Runner config
      (argocd/manifests/forgejo-runner/config.yaml) connects only to
      https://forge.ops.eblu.me/. (2) Forge app.ini has
      DISABLE_REGISTRATION=true and ALLOW_ONLY_EXTERNAL_REGISTRATION=true
      (ansible/roles/forgejo/defaults/main.yml), so no untrusted users
      can sign up or create repos. The runner registers at instance scope
      (repo_id=0/owner_id=0 in action_runner table), but the instance itself
      is closed, so no per-repo allow-list is needed. Re-evaluate if the
      forge ever opens to additional users or if the runner is repointed
      to an external forge.

  - id: init-container-isolation
    description: >-
      Root privileges and added capabilities (CHOWN) are limited to
      init containers that run once at pod startup. All runtime
      containers run as non-root (UID 472) with all capabilities
      dropped.
    created: 2026-03-30
    last-reviewed: 2026-05-04
    notes: >-
      Verify by inspecting grafana deployment.yaml securityContext
      for both init and runtime containers. If fsGroup alone can
      handle PVC ownership, remove init-chown-data and this control.
      Retirement deferred until grafana lands on ringtail's k3s
      (see [[indri-k8s-migration]]) — storage backend will change,
      and removing init-chown-data right before that migration
      trades a real safety net for marginal cleanup. Revisit
      post-migration.

  - id: node-config-automated-verification
    description: >-
      Prowler reports certain node-level checks as MANUAL because it runs
      inside a pod and cannot evaluate kubelet file permissions, kubelet
      config arguments, etcd CA separation, or cluster-admin RBAC bindings.
      The review-compliance-reports script SSHes into the minikube node
      weekly and programmatically verifies each condition, failing loudly
      if any check deviates from expected values.
    created: 2026-04-14
    last-reviewed: 2026-04-14
    notes: >-
      Verification runs as part of 'mise run review-compliance-reports'.
      If the minikube node is unreachable, all checks report as FAIL. If new
      MANUAL findings appear in Prowler, add corresponding verification
      logic to the script and update the mutelist.

  - id: operator-purpose-bound-rbac
    description: >-
      Operators whose entire function is to manage a sensitive resource
      legitimately need RBAC over that resource. external-secrets-operator
      manages Secret objects (its purpose) and the cert-controller mutates
      its own ValidatingWebhookConfigurations to inject rotating CA bundles.
      Risk is bounded by: (1) the operator code being upstream open-source
      and reviewed; (2) RBAC scoped to specific named webhooks where
      possible; (3) supply chain controls on the operator image (mirrored
      to local registry, version tracked in service-versions.yaml).
    created: 2026-04-27
    last-reviewed: 2026-04-27
    notes: >-
      Verify by checking that the operators in question still match their
      stated purpose (i.e. external-secrets is still the only consumer of
      these ClusterRoles) and that upstream hasn't published advisories
      for credential-handling bugs. Re-evaluate if a non-secrets-managing
      ClusterRole appears under this control.

  - id: kube-state-metrics-metadata-only
    description: >-
      kube-state-metrics holds list/watch on Secrets cluster-wide but only
      exposes Secret object *metadata* (name, namespace, type, creation
      timestamp, labels) via the kube_secret_info / kube_secret_labels
      metrics. Secret data fields are never read into KSM's exposed
      metrics by upstream design. Mitigation rests on KSM's metric
      schema, the version pin in service-versions.yaml, and the metrics
      endpoint being reachable only on the cluster network.
    created: 2026-04-27
    last-reviewed: 2026-04-27
    notes: >-
      Verify by inspecting the /metrics endpoint output for any series
      that include secret data. Only *_info and *_labels metrics should
      reference secrets, and their labels should be limited to metadata
      and user-applied labels, never values from the data: field.
      Re-evaluate on KSM version bumps.

  - id: observability-stack-audit
    description: >-
      Alloy collects pod logs and ships them to Loki, providing an
      audit trail for cluster activity. This compensates for missing
      apiserver audit logging, which neither minikube (indri) nor
      k3s (ringtail) configures by default.
    created: 2026-03-30
    last-reviewed: 2026-05-11
    notes: >-
      Verify Alloy DaemonSet is running on each cluster (alloy-k8s on
      minikube, alloy-ringtail on k3s) and Loki is receiving logs.
      Note this is weaker than native apiserver audit logs: it
      captures pod stdout/stderr, not API request-level auditing.
      Consider enabling apiserver audit logging on k3s post-migration
      (`--audit-log-path` / `--audit-policy-file`); minikube made it
      hard, whereas k3s makes it straightforward.
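    # Sketch of a minimal audit policy for the k3s option above, assuming
    # the standard audit.k8s.io/v1 Policy format. The Metadata level
    # records who did what to which resource, without request or response
    # bodies.
    #
    #   apiVersion: audit.k8s.io/v1
    #   kind: Policy
    #   rules:
    #     - level: Metadata
    #
    # k3s passes apiserver flags through kube-apiserver-arg, for example in
    # /etc/rancher/k3s/config.yaml. Both file paths below are hypothetical:
    #
    #   kube-apiserver-arg:
    #     - "audit-log-path=/var/log/k3s-audit.log"
    #     - "audit-policy-file=/etc/rancher/k3s/audit.yaml"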