Commit graph

818 commits

Author SHA1 Message Date
fc34a7da5b Review postgresql.md: add authentik user/db, immich-pg borgmatic secret
Doc review found the authentik database, user, and external secret were
missing, along with the immich-pg borgmatic secret. Added Cluster column
to Users table for clarity. Set last-reviewed: 2026-04-07.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 15:21:48 -07:00
1fd8aae8f6 Upgrade ArgoCD v3.3.2 → v3.3.6, SHA-pin install manifest
Patch upgrade with bug fixes (diff normalization, installation ID cache).
Pin the upstream manifest URL to commit SHA for supply chain integrity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-07 08:21:11 -07:00
e85c71e73f Add changelog fragments for seccomp hardening and bracket fix
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:38:41 -07:00
d3d67272a7 Fix blumeops-tasks swallowing bracket content in descriptions
Rich markup parser interprets [text] as style tags, stripping
wiki-links like [[review-compensating-controls]] to empty [].
Escape description lines with rich.markup.escape().

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:37:40 -07:00
59f3422d3e Review compensating control: tailscale-network-isolation
Verified: tailscale serve status shows only svc:k8s, ACLs restrict
tag:flyio-target to port 443 with admin/operator ownership only,
indri has no flyio-target tag. All 10 muted findings remain valid.

Noted gap: no automated alerting on new flyio-target devices.
Tracked in Todoist as MC4 (Manual Compliance Control Check CronJob).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:35:13 -07:00
18fe172a54 Add seccomp RuntimeDefault profiles to alloy-k8s and immich pods
Resolves 4 unmuted Prowler core_seccomp_profile_docker_default
findings on alloy, immich-server, immich-machine-learning, and
immich-valkey.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:21:23 -07:00
a059d81314 Add review-compliance-reports task and reorganize report storage
New mise task fetches Prowler reports from sifaka, parses with proper
muted/unmuted distinction, shows week-over-week delta, and includes
a scaffold for Kingfisher once JSON/CSV output is available upstream.

Moved all legacy top-level reports on sifaka into date subdirectories
to match the current CronJob output structure. Updated
read-compliance-reports doc with task reference and links.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 10:16:46 -07:00
54213ab810 Fix flake-update pipeline and update ringtail flake inputs
The `--exclude` flag added in #321 never existed in nix — it was
introduced broken and never tested. Replace with dynamic input
discovery: query `nix flake metadata --json` for all input names,
filter out skip_inputs (default: nixpkgs-services), pass the rest
as positional args. Also bump NIX_IMAGE 2.33.3 → 2.34.4.

Updated inputs: nixpkgs, home-manager, disko.
nixpkgs-services stays pinned.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 08:27:31 -07:00
Forgejo Actions
370a3574b2 Update docs release to v1.15.4
- Built changelog from towncrier fragments

[skip ci]
2026-04-06 07:53:54 -07:00
0eaf8680fd Rewrite observability stack tutorial to match actual practices v1.15.4
Replace generic Helm install instructions with kustomize/ArgoCD patterns
that reflect how BlumeOps actually deploys Prometheus, Loki, Grafana, and
Alloy. Fix "BluemeOps" typos, document Alloy as a core (not optional)
component, remove hardcoded admin password, add proper prerequisites and
cross-references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 07:52:35 -07:00
f42fa2d558 Remove stale Helm chart mirror references from forgejo docs
All Helm chart mirrors (grafana-helm-charts, connect-helm-charts,
cloudnative-pg-charts) have been deleted from forge.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-06 07:37:21 -07:00
c7e5af6d51 Migrate 1Password Connect from Helm to kustomize (1.8.1 → 1.8.2) (#326)
## Summary

- Renders manifests from `connect-helm-charts v2.4.1` as plain kustomize (deployment + service)
- Bumps 1Password Connect from 1.8.1 → 1.8.2
- Completes the no-helm-policy migration — all services now use kustomize
- Retains all production hardening from the Helm chart (securityContext, runAsNonRoot, drop ALL, seccomp, resource limits)

## Changes

- **New:** `deployment.yaml`, `service.yaml`, `kustomization.yaml` in `argocd/manifests/1password-connect/`
- **Rewritten:** Both ArgoCD app definitions (indri + ringtail) — single source kustomize instead of multi-source Helm
- **Deleted:** `values.yaml` (Helm values no longer needed)
- **Updated:** `no-helm-policy.md`, `service-versions.yaml`, `README.md`

## Deployment plan

1. Sync `apps` app to pick up the new app definitions
2. `argocd app set 1password-connect --revision 1password-connect-kustomize`
3. `argocd app sync 1password-connect` — verify on indri
4. Repeat for ringtail
5. After merge: reset revision to main, re-sync both

## Test plan

- [ ] `kubectl kustomize` renders cleanly (verified locally)
- [ ] ArgoCD diff shows expected changes (Helm labels removed, images bumped)
- [ ] Pods come up healthy on indri
- [ ] External Secrets still resolves 1Password items
- [ ] Repeat on ringtail

Reviewed-on: #326
2026-04-06 07:31:40 -07:00
Forgejo Actions
facb803010 Update docs release to v1.15.3
- Built changelog from towncrier fragments

[skip ci]
2026-04-05 21:24:25 -07:00
f9397b7fa0 Review core-services tutorial: add SSH rationale, runner example, TODO notice v1.15.3
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-05 21:22:00 -07:00
5597e02467 Fix Homepage pod-selector for Immich (Helm labels → kustomize labels)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 12:12:48 -07:00
0f1143a5bd Merge pull request 'Migrate Immich from Helm to kustomize (v2.5.6 → v2.6.3)' (#324) from immich-kustomize-v2.6.3 into main 2026-04-04 12:09:40 -07:00
6cab5091ea Add storage-provisioner health check to minikube Ansible role
The storage-provisioner is a bare Pod with no controller. If the node
restarts via Docker Desktop (rather than `minikube start`), kubelet
restores static pods but bare pods are lost. Detect this and re-run
`minikube start` to restore addons.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 12:04:25 -07:00
3c819cf16e Fix Homepage migration date: 2025-12 → 2026-02 (per git history)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:53:48 -07:00
6e06efa6d0 Add AI-drafted disclaimer to no-helm-policy explanation doc
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:50:53 -07:00
64200a55c5 Migrate Immich from Helm chart to kustomize manifests (v2.5.6 → v2.6.3)
Replace the Helm chart deployment with plain kustomize manifests following
the Authentik pattern (separate deployments per component). Consolidate
the immich-storage ArgoCD app into the main immich app. Add no-helm-policy
doc establishing kustomize as the standard deployment mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:42:25 -07:00
464e3222d2 Document upstream fix for Prowler --registry bug (pending release)
PR #10470 merged 2026-03-30; initContainer workaround stays until a
Prowler release includes the fix (latest is 5.22.0).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 20:21:19 -07:00
afb184fefc Add Sway fullscreen rule for RDR2 (gamescope broken on NVIDIA 580.x)
Gamescope 3.16.17 segfaults on NVIDIA 580.x in nested Wayland/Sway due
to explicit sync issues (ValveSoftware/gamescope#1662). Use a Sway
window rule to force RDR2 fullscreen instead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 16:04:29 -07:00
5de2ed9f96 Add gaming.nix for ringtail: gamescope + consolidate Steam config
Move Steam config from configuration.nix to a dedicated gaming.nix module
and add gamescope for fullscreen/resolution management with Proton games.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 15:48:36 -07:00
306f580bdb Point Tempo at main-built container v2.10.3-75f9ba4
C0 follow-up: update tag from branch-built image to main-SHA image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 13:45:57 -07:00
75f9ba4943 Build Tempo container from source (2.10.3) (#323)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-dockerfile (tempo) (push) Successful in 6s
## Summary
- Add `containers/tempo/Dockerfile` — two-stage Go build from forge mirror, modeled on loki
- Switch kustomization from upstream `grafana/tempo` to `registry.ops.eblu.me/blumeops/tempo`
- Bump Tempo 2.10.1 → 2.10.3

## Test plan
- [ ] Kick off container build via `mise run container-build-and-release tempo`
- [ ] Update kustomization `newTag` with built image tag
- [ ] Deploy from branch: `argocd app set tempo --revision local-tempo-container && argocd app sync tempo`
- [ ] Verify Tempo health: `curl tempo.ops.eblu.me/ready`
- [ ] Verify traces flowing in Grafana Tempo datasource

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #323
2026-04-02 13:45:02 -07:00
b1e2811077 Upgrade Grafana 12.3.3 → 12.4.2 (#322)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-dockerfile (grafana) (push) Successful in 7s
## Summary

- Bumps Grafana from 12.3.3 to 12.4.2
- Patches 7 CVEs, notably CVE-2026-27880 (unauthenticated OOM DoS, CVSS 7.5) and CVE-2026-27879 (authenticated OOM via resample queries)
- No config changes required — reviewed alerting, datasources, OIDC, and feature toggles against 12.4.x breaking changes

## Breaking changes reviewed

| Change | Impact |
|--------|--------|
| Alerting: pending period applies to NoData/Error | Net positive — reduces noise from transient blips |
| Default notification uses empty receiver | No impact — we explicitly set `ntfy-infra` |
| Removed feature toggles (4) | No impact — none configured |
| OAuth ID token signature validation | Low risk — verify OIDC login post-deploy |
| OpsGenie deprecated | No impact — using webhook |

## Test plan

- [ ] Container build completes at forge
- [ ] Update kustomization.yaml with new image tag
- [ ] `argocd app set grafana --revision upgrade/grafana-12.4.2 && argocd app sync grafana`
- [ ] Verify Grafana UI loads at grafana.ops.eblu.me
- [ ] Verify OIDC login via Authentik
- [ ] Verify dashboards and datasources load
- [ ] Check alerting rules are intact

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #322
2026-04-02 11:33:19 -07:00
08d57ef4d4 Review pulumi reference doc: fix Tailscale auth, add how-to links
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-02 10:55:57 -07:00
1e67975acb Add changelog fragment for compensating control review
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 22:04:38 -07:00
baee7ae54b Review single-user-cluster control and add evidence collection card
Stamp single-user-cluster last-reviewed to 2026-04-01 after verifying
Tailscale ACLs and kubeconfig distribution. Add aspirational how-to card
documenting what PCI DSS evidence collection would look like (CCW,
artifacts, Drata workflow). Link from existing review process card.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 22:01:57 -07:00
a18a424866 Pin NixOS service versions via nixpkgs-services overlay (#321)
## Summary

- Add `nixpkgs-services` flake input pinned to a specific nixpkgs commit, with an overlay that pulls `forgejo-runner`, `snowflake`, and `k3s` from it instead of the rolling `nixpkgs`
- Dagger `flake-update` pipeline now excludes `nixpkgs-services` via `--exclude`
- Fix stale nix-container-builder version in service-versions.yaml (was 12.6.4, actually running 12.7.2)
- Add k3s and minikube to service-versions.yaml tracking
- Document the pinning approach in review-services how-to and ringtail reference

## Motivation

During service review, discovered that flake updates had silently upgraded forgejo-runner from 12.6.4 → 12.7.2 without updating service-versions.yaml. This "sneak-in upgrade" bypasses the service review process. The overlay ensures these three services only change versions deliberately.

## Test plan

- [ ] Verify `nix flake update` from `nixos/ringtail/` does not change `nixpkgs-services` lock entry
- [ ] Verify `mise run provision-ringtail` builds successfully with the overlay
- [ ] Confirm running service versions unchanged after deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #321
2026-04-01 21:37:57 -07:00
cfbf4cadbd Review argocd-cli reference doc: stamp last-reviewed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-01 20:35:52 -07:00
Forgejo Actions
2b7b21dc9b Update docs release to v1.15.2
- Built changelog from towncrier fragments

[skip ci]
2026-03-30 17:48:40 -07:00
4059b3d27b Add compensating controls framework and date-based report dirs (#320) v1.15.2
## Summary

- Add `compensating-controls.yaml` tracking 9 named controls that justify suppressed security findings
- Update all Prowler mutelist descriptions with `CC: <id>` references to named controls
- Add `mise run review-compensating-controls` task — surfaces stalest control with all codebase references
- Add [[review-compensating-controls]] how-to doc
- Organize Prowler and Kingfisher reports into `YYYY-MM-DD` subdirectories

### Compensating controls

| ID | Mitigates |
|----|-----------|
| `single-user-cluster` | Image cache abuse, RBAC breadth, system pod privileges |
| `tailscale-network-isolation` | Profiling endpoints, weak TLS, debug ports |
| `local-registry` | AlwaysPullImages gap |
| `sso-gated-admin-tools` | ArgoCD wildcard RBAC |
| `operator-managed-pods` | Tailscale proxy pod security settings |
| `ephemeral-privileged-jobs` | Prowler hostPID exposure |
| `trusted-ci-only` | Forgejo runner DinD |
| `init-container-isolation` | Grafana root init container |
| `observability-stack-audit` | Missing apiserver audit logging |

## Test plan

- [ ] `mise run review-compensating-controls` shows table and references
- [ ] `kubectl kustomize argocd/manifests/prowler/` renders correctly
- [ ] Sync prowler and kingfisher, verify next scan writes to dated subdirectory
- [ ] Grep for `CC:` in mutelist files — every muted finding should have at least one

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #320
2026-03-30 17:44:11 -07:00
a76e471d54 Add Prowler mutelist and fix kube-state-metrics seccomp (#319)
## Summary

- Add mutelist files to suppress expected/accepted Prowler CIS findings from components we don't control
- Mutelist files stored in `mutelist/` directory, grouped by category, merged at runtime via initContainer
- Fix missing seccomp `RuntimeDefault` profile on kube-state-metrics deployment

### Mutelist categories

| File | Checks | Covers |
|------|--------|--------|
| `apiserver.yaml` | 12 | Minikube apiserver flags |
| `control-plane.yaml` | 3 | Scheduler, controller-manager, kubelet |
| `core-pod-security.yaml` | 7 | System pods, Tailscale operator, Grafana init, Prowler hostPID, forgejo-runner |
| `rbac.yaml` | 3 | Built-in K8s roles, ArgoCD, CNPG |

Muted findings appear as `status=MUTED` in reports (not hidden), preserving audit trail.

### Not muted (follow-up)

- Alloy, Immich pods missing seccomp — need separate investigation (Helm/operator-managed)

## Test plan

- [ ] `kubectl kustomize argocd/manifests/prowler/` renders cleanly
- [ ] Trigger manual scan: `kubectl --context=minikube-indri -n prowler create job prowler-mutelist-test --from=cronjob/prowler`
- [ ] Verify initContainer merges successfully (check pod logs)
- [ ] Verify muted findings show as `MUTED` in report
- [ ] Sync kube-state-metrics and verify pod starts with seccomp profile

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #319
2026-03-30 17:22:31 -07:00
1e391f96bb Upgrade forgejo-runner 12.7.0 → 12.7.3, add service card
Patch upgrade picks up idempotent FetchTask API, offline registration
fix, cloudflare/circl security dep update, and custom gRPC user-agent.
No config defaults changed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 16:31:06 -07:00
0ec0246847 Remove doc-reviewer agent 2026-03-30 16:12:48 -07:00
77eebe507e Review Ansible reference doc: add missing roles, clarify IaC positioning
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 16:10:24 -07:00
c069f889d2 Harden borgmatic photos backup: restrict dirs, add keepalives + checkpoints
Restrict backup to library/ and upload/ only (skip regenerable encoded-video/,
thumbs/, backups/). Add SSH ServerAliveInterval to prevent broken pipe on long
transfers, and checkpoint_interval so interrupted backups save progress.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 10:30:28 -07:00
b000efd6c3 Fix Kingfisher CronJob exit code handling
Kingfisher exits 200 (findings) or 205 (validated findings) on success.
Normalize these to 0 so the CronJob completes instead of restarting.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 07:16:02 -07:00
457ab19416 Scope Kingfisher scan to eblume user only on ringtail
Mirror repos cause scan failures (likely ephemeral storage or timeout).
Scan only eblume/ repos until we investigate the root cause.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 07:11:52 -07:00
2c1f0abefc Deploy Kingfisher v165768b-0fe0eed-nix (tmp permissions fix)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:54:57 -07:00
0fe0eed35a Fix Kingfisher container: make /tmp world-writable
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 22s
Container runs as user 65534 (nobody) but /tmp was owned by root.
Set sticky bit + world-writable (1777) like a standard /tmp.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:53:34 -07:00
14f366f993 Deploy Kingfisher v165768b-c494b62-nix (/tmp fix)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:51:17 -07:00
c494b62713 Fix Kingfisher container: add /tmp directory
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 24s
Kingfisher needs a writable temp directory for git clones and scanning.
Nix containers don't create /tmp by default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:49:59 -07:00
b01afb1c1d Deploy Kingfisher v165768b-aa9cc70-nix (bash fix)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:47:26 -07:00
aa9cc709ec Fix Kingfisher container: add bash and coreutils for CronJob shell
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 22s
Nix containers don't include a shell by default. The CronJob needs
/bin/bash for the inline script that generates timestamped filenames.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:45:39 -07:00
f0c6845f0f Deploy custom Kingfisher container v165768b-f9206bf-nix
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-30 06:42:24 -07:00
f9206bf10b Build custom Kingfisher container from sporked deploy branch (#318)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 12s
## Summary

- Add Dockerfile for Kingfisher built from source (sporked deploy branch)
- Multi-stage: Rust build with Boost/vectorscan, debian-slim runtime
- Switch CronJob from upstream `ghcr.io/mongodb/kingfisher` to `registry.ops.eblu.me/blumeops/kingfisher`
- Add kingfisher to service-versions.yaml (version tracks upstream main SHA)
- Document spork workflow in CLAUDE.md

## Test plan

- [ ] Build container: `mise run container-build-and-release kingfisher 1d37d29`
- [ ] Verify image on registry: `mise run container-list`
- [ ] Update kustomization newTag
- [ ] Sync ArgoCD kingfisher app from branch
- [ ] Trigger manual CronJob and verify scan completes
- [ ] Verify reports on sifaka

Reviewed-on: #318
2026-03-30 06:34:49 -07:00
99a1a49175 Revert kingfisher skip in container build workflow
Kingfisher will build via Nix on ringtail instead of Dockerfile on
indri, so the skip is no longer needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 21:42:38 -07:00
a842b9c1e8 Skip kingfisher in CI container builds
Kingfisher's Rust + Boost/vectorscan build exhausts indri's memory
(aws-sdk-ec2 alone needs 2-3GB for rustc). Build locally on Gilbert
and push manually until we have a beefier build host.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 20:52:07 -07:00