Add snowflake-proxy as a native systemd service on ringtail to help
censored users reach the Tor network. This is a bridge proxy, not an
exit node — traffic exits through Tor exit nodes elsewhere.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The kubectl image lacks curl/python3. Use the prowler image
(which has Python) with a pure-Python urllib script instead.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prowler's --registry flag doesn't work (registry args not passed
to ImageProvider constructor, prowler-cloud/prowler PR #10128
regression). Use an init container to enumerate images from the
zot catalog API and generate an image list file instead.
See: https://github.com/eblume/prowler/tree/fix/image-provider-registry-args
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Image scan: add https:// scheme to registry URL.
IaC scan: use --scan-repository-url (Prowler clones the repo
itself), removing the need for an init container. The flag
is --scan-path for local dirs, --scan-repository-url for git.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clone repo in init container, scan Dockerfiles and K8s manifests
with Prowler's IaC provider (Trivy). Reports written to
sifaka:/volume1/reports/prowler-iac/.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add Trivy to the Prowler container for image and IaC scanning.
New CronJob (Saturday 3am) scans all blumeops/* images in the
registry for CVEs, embedded secrets, and Dockerfile misconfigs.
Reports written to sifaka:/volume1/reports/prowler-images/.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
C0 follow-up to #309: update kustomization newTag for all containers
rebuilt by the merge (authentik, authentik-redis, ntfy, alloy).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Replace upstream `docker.io/library/redis:7-alpine` (Redis 7.4.8) with a nix-built container using Redis 8.2.3 from nixpkgs
- Introduce **attached service pattern**: `parent` field in service-versions.yaml, `<parent>-<component>` naming convention, and `assert pkgs.redis.version == version` in default.nix to prevent silent version drift on `flake.lock` updates
- Document the pattern in [[review-services]] so future attached services slot in cleanly
- Backfill `parent: grafana` on existing `grafana-sidecar` entry
## Version drift protection
1. `flake.lock` update bumps nixpkgs redis → `assert` in `default.nix` breaks `nix-build`
2. Developer updates `version` in `default.nix` → prek's `container-version-check` demands matching `service-versions.yaml` update
3. Both must agree before commit succeeds
## Test plan
- [ ] Build container from branch on ringtail (`mise run container-build-and-release authentik-redis`)
- [ ] Update kustomization `newTag` to branch-built image tag
- [ ] Sync authentik ArgoCD app from branch (`argocd app set authentik --revision localize-redis && argocd app sync authentik`)
- [ ] Verify Authentik login, session persistence, and task queue still work
- [ ] After merge: C0 follow-up to update `newTag` to the main-built image tag
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #309
Bump from RC to latest stable (security fixes for config endpoint and
cross-camera auth). Add new 0.17 motion retention tier at 365 days,
reduce continuous from 180 to 30 days.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Uses the grafana_folder annotation to place the dashboard in the
existing folder created by alert rule provisioning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two panels: currently firing alerts (firing/pending/noData/error) and
recent state changes. Refreshes every 30s. Uses Grafana's built-in
alertlist panel type — no datasource needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dramatiq defaults to one worker process per CPU core. On ringtail (16 cores)
this spawned 16 processes, each loading the full Django app, exceeding the
1Gi memory limit and causing a crash loop (228 restarts over 7 days).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Merges `build-container.yaml` and `build-container-nix.yaml` into a single workflow
- Detect job classifies each changed container by presence of `Dockerfile` and/or `default.nix`
- Dockerfile containers build on `k8s` (indri) via Dagger; Nix containers build on `nix-container-builder` (ringtail) via nix-build + skopeo
- Containers with both build files (alloy, nettest, ntfy) get built on both runners
## Test plan
- [ ] Push a change to a Dockerfile-only container (e.g. grafana) — verify it builds on k8s only
- [ ] Push a change to a nix-only container (e.g. jobsync) — verify it builds on nix-container-builder only
- [ ] Push a change to a dual container (e.g. ntfy) — verify it builds on both runners
- [ ] Test workflow_dispatch with a specific container name
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #306
Replace hardcoded image tags in Quick Reference tables with pointers to
kustomization manifests (tags drift with every container release). Fix
Prometheus CNPG scrape target, remove misleading .ts.net URLs, expand
external-secrets stub, add backup/disaster-recovery cross-references.
Limit doc-reviewer agent to one doc per cycle.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove `group: ""` from ignoreDifferences in tailscale-operator and
tailscale-operator-ringtail — ArgoCD normalizes away the empty string
field, so the live state never matches git.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The Fly container pulls from tailscale/tailscale:stable which is still
v1.94.2. The `tailscale wait` command doesn't exist until v1.96.2.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
v1.96.3 exists as a GitHub release but Docker Hub images for both
tailscale/tailscale and tailscale/k8s-operator haven't been published
yet (v1.94.2 is still latest). Revert the image tags; the fly/start.sh
`tailscale wait` improvement and review date stamps are retained.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
- Bump Tailscale operator, proxy containers, and init containers from v1.94.2 to v1.96.3 across both clusters (indri + ringtail via shared base kustomization)
- Replace hand-rolled `until tailscale status` polling loop in `fly/start.sh` with `tailscale wait --timeout 60s` (new in v1.96.2)
- Stamp kube-state-metrics review date (already current at v2.18.0)
## Notable upstream changes (v1.94.2 → v1.96.3)
- Go upgraded from 1.25 to 1.26
- `tailscale wait` command — blocks until daemon is running + interface has IP
- AuthKey policy now applies only when users are not logged in (behavioral change)
- Peer Relay improvements (metrics, EC2 IMDS, UDP socket scaling)
- UPnP stability fixes
## Deploy plan
1. Merge PR
2. Sync tailscale-operator on indri: `argocd app sync tailscale-operator`
3. Sync tailscale-operator on ringtail: `argocd app sync tailscale-operator-ringtail --server ringtail...`
4. Verify proxy pods roll with new image: `kubectl --context=minikube-indri -n tailscale get pods`
5. Verify ingress connectivity (spot-check a few `*.tail8d86e.ts.net` services)
6. Rebuild + deploy Fly proxy container (separate step, picks up `tailscale wait` change)
## Test plan
- [ ] ArgoCD diff looks clean for both apps before sync
- [ ] Proxy pods on indri come up healthy with v1.96.3 images
- [ ] Proxy pods on ringtail come up healthy with v1.96.3 images
- [ ] Tailscale ingress services remain reachable (e.g., grafana, prometheus)
- [ ] Fly proxy rebuild deploys successfully with `tailscale wait`
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #304
## Summary
Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.
**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes
**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Reviewed-on: #303
Replace single aggregate camera_fps check with per-camera FPS validation
and NFS storage accessibility check. Motivated by an outage where Frigate
API responded OK but NFS mount was inaccessible, causing "no frames" in UI.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Also catches kiwix's transmission sidecar up from v4.0.6-r4 to
v4.1.1-r1, matching the torrent service (upgraded in PR #282 but
the kiwix sidecar was missed). No breaking changes — old RPC
protocol is supported through 4.x.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Point all services at the 613f05d images which carry the new
consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4
version, no matching build) and docs/quartz (no 613f05d build).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Every container now carries title, description, version, source, and
vendor labels per the OCI image spec. Version is derived from the
existing CONTAINER_APP_VERSION ARG at build time.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The top-level "uid" in Grafana dashboard JSON is at 2-space indent
near the end of the file, not the first occurrence. Match on ^ "uid"
to avoid clobbering nested datasource uid references.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous sed replaced ALL "uid" fields in dashboard JSON files,
including datasource references inside panels, causing dashboards to
go dark. Scope the replacement to only the first occurrence (the
top-level dashboard UID) using GNU sed 0,/pattern/ addressing.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add init container to pre-populate ConfigMap dashboards before Grafana
starts, eliminating the race between the sidecar and the provisioner
that caused dashboard DB records to be deleted and re-created with new
IDs. Also stamp stable UIDs on TeslaMate and UnPoller dashboards
fetched from upstream.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Four project-scoped subagents that formalize existing mise task
workflows as constrained, specialized AI agents:
- infra-health: background health monitor (wraps services-check)
- doc-reviewer: persistent-memory documentation reviewer
- change-classifier: C0/C1/C2 triage before work begins
- mikado-navigator: C2 chain state advisor (wraps docs-mikado)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>