blumeops

Author	SHA1	Message	Date
Erich Blume	646fb4f2dc	Add custom Kingfisher container built from sporked feature branches - Dockerfile: deterministic build from pinned CONTAINER_APP_VERSION + FEATURES - Merges named feature branches at specific SHAs for reproducibility - Switch CronJob to custom image with --clone-url-base and --all-organizations - Add kingfisher to service-versions.yaml (version tracks upstream main SHA) - Document spork container builds in new how-to card - Document spork workflow in CLAUDE.md - Update kingfisher service docs for custom image Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 17:10:59 -07:00
Erich Blume	924325ebd5	Fix DinD seccomp profile broken by RuntimeDefault rollout The pod-level RuntimeDefault seccomp profile (`07e9c81`) overrides the DinD sidecar's privileged flag in newer Kubernetes versions, blocking Docker daemon syscalls. Set Unconfined explicitly on the DinD container while keeping RuntimeDefault on the runner container. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-29 17:09:57 -07:00
Erich Blume	bb60369956	Simplify Kingfisher CronJob to HTML-only output Remove the second scan pass for JSON — one format is enough for now. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 21:50:54 -07:00
Erich Blume	35705faca2	Add Kingfisher secret scanner CronJob (#317) ## Summary - Deploys MongoDB Kingfisher as a weekly CronJob on minikube-indri - Scans all Forgejo repos (eblume + all orgs) for leaked secrets with live validation - Produces timestamped HTML and JSON reports on sifaka NFS (`/volume1/reports/kingfisher/`) - Forgejo API token sourced from 1Password via ExternalSecret - Uses official `ghcr.io/mongodb/kingfisher:1.91.0` container image - Runs Sunday 4am (after Prowler's 3am k8s scan) ## Resources - CronJob, PV/PVC (sifaka NFS), ExternalSecret - ArgoCD Application with manual sync + CreateNamespace ## Test plan - [x] Sync ArgoCD `apps` app to pick up new kingfisher Application - [x] Set `--revision feature/kingfisher-cronjob` on kingfisher app - [x] Verify ExternalSecret creates the `kingfisher-forgejo-token` Secret - [x] Trigger manual job: `kubectl create job --from=cronjob/kingfisher kingfisher-manual -n kingfisher --context=minikube-indri` - [ ] Verify reports appear on sifaka at `/volume1/reports/kingfisher/` - [ ] After merge: set `--revision main` and re-sync Reviewed-on: #317	2026-03-28 21:39:55 -07:00
Forgejo Actions	7fb6eff388	Update docs release to v1.15.1 - Built changelog from towncrier fragments [skip ci]	2026-03-28 09:15:21 -07:00
Erich Blume	b632cd9ffb	Fix Immich resource limits and probe timeouts Resources were under wrong Helm value keys (server.resources, machine-learning.resources) and never applied to pods. Move to correct bjw-s chart paths (*.controllers.main.containers.main.resources). Increase liveness/readiness probe timeouts from 1s to 5s to prevent kubelet from killing healthy-but-busy pods during ML inference load. Remove CPU limits (keep requests only) to avoid throttling. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 22:36:32 -07:00
Erich Blume	c78b86c72c	Add offsite backup for immich photo library to BorgBase (#315) ## Summary - Adds a second borgmatic config (`photos.yaml`) that backs up `/Volumes/photos` (sifaka SMB mount, ~128 GB) to a dedicated BorgBase repo (`immich-photos`), running daily at 4 AM - Separate launchd agent (`mcquack.eblume.borgmatic-photos`) so photo backups run independently from the main backup - Refactors `borgmatic_metrics` script to support multiple repos with a `repo` Prometheus label - Updates Grafana "Borg Backups" dashboard with a `repo` template variable so you can filter/compare repos - Docs updated: `backups.md`, `borgmatic.md` ## Prerequisites (manual) - [x] Create `immich-photos` repo on BorgBase with same SSH key - [ ] Upgrade BorgBase plan to Small ($24/yr) if currently on free tier (128 GB exceeds 10 GB limit) - [ ] After deploy: `borg init` the new repo (borgmatic does this automatically on first run) ## Test plan - [ ] Dry run: `mise run provision-indri -- --check --diff --tags borgmatic,borgmatic_metrics` - [ ] Deploy borgmatic role and verify both configs deployed - [ ] Run `borgmatic --config ~/.config/borgmatic/photos.yaml create --verbosity 1` manually for first backup (will take hours) - [ ] Verify metrics script collects from both repos: `~/.local/bin/borgmatic-metrics && cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom` - [ ] Sync grafana-config in ArgoCD and verify dashboard repo selector works 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #315	2026-03-27 19:43:05 -07:00
Erich Blume	ca0c9354ee	Add borgmatic backups for authentik and immich databases (#314) ## Summary - Add `authentik` database (blumeops-pg cluster) to borgmatic pg_dump backups - Add `immich` database (immich-pg cluster) to borgmatic pg_dump backups - For immich-pg: new borgmatic managed role with `pg_read_all_data`, ExternalSecret, Tailscale LoadBalancer service, and Caddy L4 TCP proxy on port 5433 - Update backup docs to reflect all four CNPG databases + mealie SQLite ## Deploy plan Deploy order matters — k8s resources must exist before ansible can route to them: 1. ArgoCD (databases app): sync to pick up immich-pg borgmatic role, ExternalSecret, and Tailscale service ``` argocd app set blumeops-pg --revision feature/borgmatic-all-pg-backups argocd app sync blumeops-pg ``` 2. Wait for `immich-pg-tailscale` service to get a Tailscale IP and `immich-pg.tail8d86e.ts.net` to resolve 3. Ansible (caddy): deploy Caddy L4 route for port 5433 ``` mise run provision-indri -- --tags caddy ``` 4. Ansible (borgmatic): deploy updated config and .pgpass ``` mise run provision-indri -- --tags borgmatic ``` 5. Verify: trigger a manual borgmatic run and check all four pg_dump streams succeed ``` borgmatic --verbosity 1 2>&1 | grep -E '(Dumping|ERROR)' ``` ## Test plan - [x] `kubectl kustomize` builds cleanly - [x] `ansible --check --diff` for borgmatic and caddy show expected changes - [ ] ArgoCD sync succeeds for databases app - [ ] `immich-pg.tail8d86e.ts.net` resolves - [ ] `pg.ops.eblu.me:5433` accepts connections - [ ] `borgmatic --verbosity 1` dumps all four databases without errors Reviewed-on: #314	2026-03-27 16:59:58 -07:00
Erich Blume	831b82950a	Upgrade nvidia-device-plugin v0.18.2 → v0.19.0 and add reference card Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 07:19:24 -07:00
Erich Blume	2c1652604b	Reduce PodNotReady alert lookback from 5m to 60s The 5-minute lookback window kept stale data from terminated pods visible during rollouts, causing the alert to sit in Pending for ~5 minutes after every routine deployment. 60s still covers two scrape cycles (30s interval) while clearing stale data much faster. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 19:48:37 -07:00
Erich Blume	a37012385f	Tighten ArgoCDAppOutOfSync alert timing to clear faster after sync Reduced `for` from 30m to 5m and lookback window from 5m to 1m. The old values caused alerts to linger long after apps returned to Synced state. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 15:44:09 -07:00
Erich Blume	f97b5c9d5d	Deploy Homepage v1.11.0-e375859 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 10:25:07 -07:00
Erich Blume	796baaa41a	Upgrade External Secrets Operator v2.2.0 + migrate Helm to kustomize (#312) ## Summary - Upgrade External Secrets Operator from v1.3.2 (helm-chart-2.0.0) to v2.2.0 - Migrate from Helm chart deployment to static kustomize manifests, matching the repo's kustomize-first pattern - Merge separate `-config` ArgoCD apps into the main operator apps (6 → 4 apps) - Clean up Helm-specific labels (`helm.sh/chart`, `managed-by: Helm`) - Update README example from v1beta1 to v1 API ## Breaking changes assessment Low risk — v2.0.0 removed Alibaba and Device42 providers (we use neither). No templating changes affect us. All ExternalSecrets already use v1 API. ## Deployment steps 1. Sync CRDs first on both clusters (new CRD version) 2. Sync operator apps (now kustomize-based) 3. Verify ClusterSecretStore and all ExternalSecrets are healthy 4. Delete orphaned config apps: `argocd app delete external-secrets-config` and `-config-ringtail` 5. `mise run services-check` Reviewed-on: #312	2026-03-25 15:56:41 -07:00
Erich Blume	b97e37543f	Deploy Tor Snowflake proxy on ringtail (#311) ## Summary - Add Snowflake proxy as a native systemd service on ringtail (NixOS) - Uses `pkgs.snowflake` from nixpkgs (v2.11.0) - Hardened systemd unit with DynamicUser, ProtectSystem=strict, 512MB memory limit - Prometheus metrics enabled on localhost:9999 ## What is Snowflake? A Tor pluggable transport that helps censored users reach the Tor network via WebRTC. This is NOT a Tor exit node — traffic exits through Tor exit nodes operated by others. The proxy operator cannot see traffic content (double-encrypted) and destination servers never see the proxy's IP. ## Changes - `nixos/ringtail/configuration.nix` — new systemd service definition - `docs/reference/services/snowflake-proxy.md` — service reference card - `docs/reference/infrastructure/ringtail.md` — updated systemd services section - `service-versions.yaml` — added entry (type: nixos) ## Deploy plan After review, deploy via `mise run provision-ringtail`. Service starts automatically. ## Test plan - [ ] `mise run provision-ringtail` succeeds - [ ] `ssh ringtail 'systemctl status snowflake-proxy'` shows active - [ ] `ssh ringtail 'journalctl -u snowflake-proxy --no-pager -n 20'` shows broker connections - [ ] `ssh ringtail 'curl -s localhost:9999/metrics'` returns Prometheus metrics Reviewed-on: #311	2026-03-24 20:51:40 -07:00
Forgejo Actions	243a862901	Update docs release to v1.15.0 - Built changelog from towncrier fragments [skip ci]	2026-03-24 19:51:17 -07:00
Erich Blume	a1c2e0833d	Include link to upstream prowler issue	2026-03-24 19:48:43 -07:00
Erich Blume	75fd5b029d	Use prowler image for registry enumeration init container The kubectl image lacks curl/python3. Use the prowler image (which has Python) with a pure-Python urllib script instead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:36:02 -07:00
Erich Blume	d365e79068	Add kubectl image tag to prowler kustomization The image scan init container uses the kubectl image for curl/python. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:34:24 -07:00
Erich Blume	d90be355dd	Work around Prowler --registry bug with init container Prowler's --registry flag doesn't work (registry args not passed to ImageProvider constructor, prowler-cloud/prowler PR #10128 regression). Use an init container to enumerate images from the zot catalog API and generate an image list file instead. See: https://github.com/eblume/prowler/tree/fix/image-provider-registry-args Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 17:29:25 -07:00
Erich Blume	7d1ae1a57e	Fix prowler image and IaC scan arguments Image scan: add https:// scheme to registry URL. IaC scan: use --scan-repository-url (Prowler clones the repo itself), removing the need for an init container. The flag is --scan-path for local dirs, --scan-repository-url for git. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:58:33 -07:00
Erich Blume	7f2d53bc77	Fix prowler image scan registry URL (add https:// scheme) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:57:05 -07:00
Erich Blume	38281a35fd	Update prowler container tag to `6960243` (with Trivy) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:54:36 -07:00
Erich Blume	fe201a495c	Add Prowler IaC scanning of blumeops repo (Saturday 2am) Clone repo in init container, scan Dockerfiles and K8s manifests with Prowler's IaC provider (Trivy). Reports written to sifaka:/volume1/reports/prowler-iac/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:49:38 -07:00
Erich Blume	696024306c	Add Prowler image vulnerability scanning for blumeops containers Add Trivy to the Prowler container for image and IaC scanning. New CronJob (Saturday 3am) scans all blumeops/* images in the registry for CVEs, embedded secrets, and Dockerfile misconfigs. Reports written to sifaka:/volume1/reports/prowler-images/. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:43:08 -07:00
Erich Blume	07e9c810ca	Add RuntimeDefault seccomp profiles to all managed workloads Addresses 32 CIS Kubernetes Benchmark failures from Prowler scan (core_seccomp_profile_docker_default). Applied pod-level seccomp RuntimeDefault to 18 deployments/statefulsets and 2 cronjobs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:19:40 -07:00
Erich Blume	87f56f78b3	Update container tags to `d021b35` (post-merge rebuild) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 16:09:07 -07:00
Erich Blume	d021b3534f	Deploy Prowler CIS scanner (#310) ## Summary - Deploy Prowler 5 as a weekly CronJob on minikube-indri for CIS Kubernetes Benchmark v1.11 scanning - Custom slim container build (strips PowerShell, Trivy, and non-K8s providers from upstream) - Reports (HTML, CSV, JSON-OCSF) written to NFS share on sifaka at `/volume1/reports/prowler/` - Read-only ClusterRole for pod, RBAC, and control plane inspection - Host path mounts + hostPID for kubelet file permission checks ## Follow-ups - Mirror prowler-cloud/prowler on forge for supply chain control - Build and push container image, update kustomization.yaml newTag - Consider adding k3s-ringtail scanning (core + RBAC checks only) ## Test plan - [ ] Build container: `mise run container-release prowler v5.22.0` - [ ] Update `argocd/manifests/prowler/kustomization.yaml` newTag to built image tag - [ ] Sync ArgoCD: `argocd app sync apps && argocd app set prowler --revision deploy-prowler && argocd app sync prowler` - [ ] Trigger manual job: `kubectl create job --from=cronjob/prowler prowler-manual -n prowler --context=minikube-indri` - [ ] Verify reports appear on sifaka NFS share - [ ] `mise run services-check` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #310	2026-03-24 16:08:09 -07:00
Erich Blume	3b7abbd689	Update container tags to `fd0bebb` (post-merge rebuild) C0 follow-up to #309: update kustomization newTag for all containers rebuilt by the merge (authentik, authentik-redis, ntfy, alloy). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 13:39:26 -07:00
Erich Blume	fd0bebb0fc	Localize authentik-redis container (#309) ## Summary - Replace upstream `docker.io/library/redis:7-alpine` (Redis 7.4.8) with a nix-built container using Redis 8.2.3 from nixpkgs - Introduce attached service pattern: `parent` field in service-versions.yaml, `<parent>-<component>` naming convention, and `assert pkgs.redis.version == version` in default.nix to prevent silent version drift on `flake.lock` updates - Document the pattern in [[review-services]] so future attached services slot in cleanly - Backfill `parent: grafana` on existing `grafana-sidecar` entry ## Version drift protection 1. `flake.lock` update bumps nixpkgs redis → `assert` in `default.nix` breaks `nix-build` 2. Developer updates `version` in `default.nix` → prek's `container-version-check` demands matching `service-versions.yaml` update 3. Both must agree before commit succeeds ## Test plan - [ ] Build container from branch on ringtail (`mise run container-build-and-release authentik-redis`) - [ ] Update kustomization `newTag` to branch-built image tag - [ ] Sync authentik ArgoCD app from branch (`argocd app set authentik --revision localize-redis && argocd app sync authentik`) - [ ] Verify Authentik login, session persistence, and task queue still work - [ ] After merge: C0 follow-up to update `newTag` to the main-built image tag 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #309	2026-03-24 13:27:36 -07:00
Erich Blume	fc45989a6c	Decommission JobSync service (#308) ## Summary - Remove all JobSync infrastructure: ArgoCD app, k8s manifests, container build (nix), Caddy reverse proxy entry, Homepage dashboard entry, service-versions tracking, and all documentation - Runtime teardown already completed: ArgoCD app cascade-deleted (removes deployment, PVC, service, ingress, external-secret), forge mirror deleted, 1Password item archived, local clone removed ## Motivation Replacing JobSync with a datasette-based job tracking pipeline driven by mise tasks and a Claude agent frontend. JobSync's Next.js server actions don't expose a useful API for automation. ## Remaining manual steps after merge - Provision Caddy to remove the stale proxy route: `mise run provision-indri -- --tags caddy` - Sync Homepage: `argocd app sync homepage` - Verify namespace cleanup on ringtail: `kubectl get ns jobsync --context=k3s-ringtail` (should be gone) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #308	2026-03-24 08:44:23 -07:00
Erich Blume	bec554110a	Upgrade Frigate 0.17.0-rc2 → 0.17.1, add motion retention tier Bump from RC to latest stable (security fixes for config endpoint and cross-camera auth). Add new 0.17 motion retention tier at 365 days, reduce continuous from 180 to 30 days. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 07:30:18 -07:00
Erich Blume	b96b4dad47	Move Alerts dashboard into Infrastructure Alerts folder Uses the grafana_folder annotation to place the dashboard in the existing folder created by alert rule provisioning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 21:20:14 -07:00
Erich Blume	9024d41230	Add Grafana alerts dashboard for mobile-friendly alert overview Two panels: currently firing alerts (firing/pending/noData/error) and recent state changes. Refreshes every 30s. Uses Grafana's built-in alertlist panel type — no datasource needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 21:16:54 -07:00
Erich Blume	9efe5c97fe	Fix authentik worker OOMKill: limit concurrency to 2 Dramatiq defaults to one worker process per CPU core. On ringtail (16 cores) this spawned 16 processes, each loading the full Django app, exceeding the 1Gi memory limit and causing a crash loop (228 restarts over 7 days). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 21:05:16 -07:00
Erich Blume	4cc26ed5eb	Update ntfy tag to main build v2.19.2-d1dac0c-nix C0 fix-forward: switch from branch-built image to main-built image. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-23 10:36:34 -07:00
Erich Blume	d1dac0c241	Upgrade ntfy v2.17.0 → v2.19.2 (#305) ## Summary - Upgrade ntfy from v2.17.0 to v2.19.2 - Update Dockerfile and Nix build definitions with new version, commit SHA, and hashes - Add `subPackages = [ "." ]` to Nix build to handle new `tools/loadtest` module in upstream ## Upstream changes (no breaking changes) - v2.18.0: Experimental PostgreSQL backend support - v2.19.0: PostgreSQL read replica support, notification sound throttling - v2.19.1-2: PostgreSQL bug fixes, web push race condition fix ## Test plan - [ ] Container builds complete on Forgejo Actions (both Dockerfile and Nix) - [ ] Update kustomization.yaml `newTag` to the built nix image tag - [ ] `argocd app set ntfy --revision upgrade/ntfy-v2.19.2 && argocd app sync ntfy` - [ ] Verify ntfy health: `curl https://ntfy.ops.eblu.me/v1/health` - [ ] Send a test notification Reviewed-on: #305	2026-03-23 10:32:06 -07:00
Erich Blume	3750428b58	Fix ArgoCD apps app permanent OutOfSync Remove `group: ""` from ignoreDifferences in tailscale-operator and tailscale-operator-ringtail — ArgoCD normalizes away the empty string field, so the live state never matches git. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 20:42:37 -07:00
Erich Blume	e9b8e3d80b	Revert Tailscale operator to v1.94.2 — images not yet published v1.96.3 exists as a GitHub release but Docker Hub images for both tailscale/tailscale and tailscale/k8s-operator haven't been published yet (v1.94.2 is still latest). Revert the image tags; the fly/start.sh `tailscale wait` improvement and review date stamps are retained. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-22 19:41:40 -07:00
Erich Blume	2e46f99820	Upgrade Tailscale operator v1.94.2 → v1.96.3 (#304) ## Summary - Bump Tailscale operator, proxy containers, and init containers from v1.94.2 to v1.96.3 across both clusters (indri + ringtail via shared base kustomization) - Replace hand-rolled `until tailscale status` polling loop in `fly/start.sh` with `tailscale wait --timeout 60s` (new in v1.96.2) - Stamp kube-state-metrics review date (already current at v2.18.0) ## Notable upstream changes (v1.94.2 → v1.96.3) - Go upgraded from 1.25 to 1.26 - `tailscale wait` command — blocks until daemon is running + interface has IP - AuthKey policy now applies only when users are not logged in (behavioral change) - Peer Relay improvements (metrics, EC2 IMDS, UDP socket scaling) - UPnP stability fixes ## Deploy plan 1. Merge PR 2. Sync tailscale-operator on indri: `argocd app sync tailscale-operator` 3. Sync tailscale-operator on ringtail: `argocd app sync tailscale-operator-ringtail --server ringtail...` 4. Verify proxy pods roll with new image: `kubectl --context=minikube-indri -n tailscale get pods` 5. Verify ingress connectivity (spot-check a few `*.tail8d86e.ts.net` services) 6. Rebuild + deploy Fly proxy container (separate step, picks up `tailscale wait` change) ## Test plan - [ ] ArgoCD diff looks clean for both apps before sync - [ ] Proxy pods on indri come up healthy with v1.96.3 images - [ ] Proxy pods on ringtail come up healthy with v1.96.3 images - [ ] Tailscale ingress services remain reachable (e.g., grafana, prometheus) - [ ] Fly proxy rebuild deploys successfully with `tailscale wait` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #304	2026-03-22 19:31:22 -07:00
Forgejo Actions	262299c82a	Update docs release to v1.14.3 - Built changelog from towncrier fragments [skip ci]	2026-03-22 18:20:41 -07:00
Erich Blume	6d65e6928c	C2: Deploy infrastructure alerting pipeline (#303) ## Summary Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications. Design: - Grafana Unified Alerting evaluates rules against Prometheus/Loki - ntfy webhook contact point delivers iOS notifications - Anti-noise policy: page once per 24h per alert group - Every alert links to a runbook in `docs/how-to/alerts/` - services-check eventually queries the alerting API instead of doing its own probes Chain (bottom-up): 1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy 2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure 3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks 4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API 5. `deploy-infra-alerting` — goal card 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: #303	2026-03-22 14:52:56 -07:00
Erich Blume	531a49abeb	C0 update deployment for loki to 3.6.7	2026-03-20 16:06:29 -07:00
Erich Blume	0f0ee2a319	Update docs and kiwix kustomization tags to `613f05d` builds Also catches kiwix's transmission sidecar up from v4.0.6-r4 to v4.1.1-r1, matching the torrent service (upgraded in PR #282 but the kiwix sidecar was missed). No breaking changes — old RPC protocol is supported through 4.x. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 06:40:49 -07:00
Erich Blume	3d2a97aaf9	Update kustomization tags to OCI-labeled builds (`613f05d`) Point all services at the `613f05d` images which carry the new consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4 version, no matching build) and docs/quartz (no `613f05d` build). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 06:34:12 -07:00
Erich Blume	c92b949a20	Fix UID sed to target root-level dashboard uid only The top-level "uid" in Grafana dashboard JSON is at 2-space indent near the end of the file, not the first occurrence. Match on ^ "uid" to avoid clobbering nested datasource uid references. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 12:56:50 -07:00
Erich Blume	334fbbb9e3	Fix TeslaMate/UnPoller dashboard UID sed clobbering datasource refs The previous sed replaced ALL "uid" fields in dashboard JSON files, including datasource references inside panels, causing dashboards to go dark. Scope the replacement to only the first occurrence (the top-level dashboard UID) using GNU sed 0,/pattern/ addressing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 12:53:00 -07:00
Erich Blume	6f88baeb91	Fix Grafana starred dashboards lost on pod restart Add init container to pre-populate ConfigMap dashboards before Grafana starts, eliminating the race between the sidecar and the provisioner that caused dashboard DB records to be deleted and re-created with new IDs. Also stamp stable UIDs on TeslaMate and UnPoller dashboards fetched from upstream. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 12:40:44 -07:00
Erich Blume	86220b7b88	Update Prometheus deployment to v3.10.0-0d27797 C0 fix-forward: update kustomization newTag and mark service reviewed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 08:46:07 -07:00
Erich Blume	21ddc74cdc	Revert PVC size changes, add hostpath comment StatefulSet volumeClaimTemplates are immutable and minikube's hostpath provisioner doesn't enforce PVC size limits anyway. Add comments noting the data grows freely on the 1.8TB backing disk. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:46:17 -07:00
Erich Blume	ef199b70f0	Increase Prometheus and Loki data retention Prometheus: 15d → 10y (3650d), PVC 20Gi → 200Gi Loki: 31d (744h) → 365d (8760h), PVC 20Gi → 50Gi Indri has 1.6 TB free on the minikube backing disk — the previous 15-day Prometheus retention was losing valuable long-term metrics data. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-18 06:44:00 -07:00

352 commits