Commit graph

418 commits

Author SHA1 Message Date
2bd1611ac1 Document sifaka NFS/Tailscale TUN troubleshooting
Sifaka's Tailscale can revert to userspace networking after package
updates, causing NFS mounts to fail because the NFS daemon sees
127.0.0.1 instead of the client's Tailscale IP. Added troubleshooting
how-to doc and updated sifaka reference card with frigate export and
TUN requirement.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 09:12:00 -07:00
3017f759a7 Migrate Forgejo from Homebrew to source build (#316)
## Summary

- Migrate Forgejo from Homebrew to source-built binary with mcquack LaunchAgent
- Matches the established pattern used by zot, caddy, and alloy
- Upgrades to v14.0.3 (7 security fixes: PKCE bypass, OAuth scope bypass, open redirect, and more)

## Changes

- **Ansible role**: Replace brew install/services with binary stat check + LaunchAgent
- **Paths**: `/opt/homebrew/var/forgejo` → `~/forgejo`, binary at `~/code/3rd/forgejo/forgejo`
- **Run user**: `forgejo` → `erichblume` (LaunchAgent user; SSH git user stays `forgejo`)
- **Docs**: Updated Forgejo reference card, restart-indri guide
- **Service review**: Stamped frigate-notify, cloudnative-pg, blumeops-pg as current

## One-time migration steps (manual, on indri)

1. Clone from Codeberg, add forge mirror remote
2. Check out v14.0.3, build with `make build && make forgejo`
3. Stop brew, `cp -a` data to `~/forgejo`, fix ownership
4. Run `provision-indri --tags forgejo`
5. Verify, then `brew uninstall forgejo`

## Data safety

- `cp -a` preserves everything (repos, SQLite DB, LFS, sessions, OAuth config)
- Brew version stays installed as rollback until verification passes
- No schema changes between 14.0.2 → 14.0.3

Reviewed-on: #316
2026-03-28 08:19:23 -07:00
3cb4303a54 Add changelog for Immich resource/probe fix
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 22:37:15 -07:00
c78b86c72c Add offsite backup for immich photo library to BorgBase (#315)
## Summary

- Adds a second borgmatic config (`photos.yaml`) that backs up `/Volumes/photos` (sifaka SMB mount, ~128 GB) to a dedicated BorgBase repo (`immich-photos`), running daily at 4 AM
- Separate launchd agent (`mcquack.eblume.borgmatic-photos`) so photo backups run independently from the main backup
- Refactors `borgmatic_metrics` script to support multiple repos with a `repo` Prometheus label
- Updates Grafana "Borg Backups" dashboard with a `repo` template variable so you can filter/compare repos
- Docs updated: `backups.md`, `borgmatic.md`

## Prerequisites (manual)

- [x] Create `immich-photos` repo on BorgBase with same SSH key
- [ ] Upgrade BorgBase plan to Small ($24/yr) if currently on free tier (128 GB exceeds 10 GB limit)
- [ ] After deploy: `borg init` the new repo (borgmatic does this automatically on first run)

## Test plan

- [ ] Dry run: `mise run provision-indri -- --check --diff --tags borgmatic,borgmatic_metrics`
- [ ] Deploy borgmatic role and verify both configs deployed
- [ ] Run `borgmatic --config ~/.config/borgmatic/photos.yaml create --verbosity 1` manually for first backup (will take hours)
- [ ] Verify metrics script collects from both repos: `~/.local/bin/borgmatic-metrics && cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom`
- [ ] Sync grafana-config in ArgoCD and verify dashboard repo selector works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #315
2026-03-27 19:43:05 -07:00
ca0c9354ee Add borgmatic backups for authentik and immich databases (#314)
## Summary

- Add `authentik` database (blumeops-pg cluster) to borgmatic pg_dump backups
- Add `immich` database (immich-pg cluster) to borgmatic pg_dump backups
- For immich-pg: new borgmatic managed role with `pg_read_all_data`, ExternalSecret, Tailscale LoadBalancer service, and Caddy L4 TCP proxy on port 5433
- Update backup docs to reflect all four CNPG databases + mealie SQLite

## Deploy plan

Deploy order matters — k8s resources must exist before ansible can route to them:

1. **ArgoCD (databases app):** sync to pick up immich-pg borgmatic role, ExternalSecret, and Tailscale service
   ```
   argocd app set blumeops-pg --revision feature/borgmatic-all-pg-backups
   argocd app sync blumeops-pg
   ```
2. **Wait** for `immich-pg-tailscale` service to get a Tailscale IP and `immich-pg.tail8d86e.ts.net` to resolve
3. **Ansible (caddy):** deploy Caddy L4 route for port 5433
   ```
   mise run provision-indri -- --tags caddy
   ```
4. **Ansible (borgmatic):** deploy updated config and .pgpass
   ```
   mise run provision-indri -- --tags borgmatic
   ```
5. **Verify:** trigger a manual borgmatic run and check all four pg_dump streams succeed
   ```
   borgmatic --verbosity 1 2>&1 | grep -E '(Dumping|ERROR)'
   ```

## Test plan

- [x] `kubectl kustomize` builds cleanly
- [x] `ansible --check --diff` for borgmatic and caddy show expected changes
- [ ] ArgoCD sync succeeds for databases app
- [ ] `immich-pg.tail8d86e.ts.net` resolves
- [ ] `pg.ops.eblu.me:5433` accepts connections
- [ ] `borgmatic --verbosity 1` dumps all four databases without errors

Reviewed-on: #314
2026-03-27 16:59:58 -07:00
33463764d1 Add QArt Tuner: QR code art generator with interactive web UI
Single-file Go tool implementing the QArt technique (Russ Cox, 2012)
using only the public rsc.io/qr API. Generates QR codes whose data
modules form a recognizable image by exploiting error correction
freedom via GF(2) Gaussian elimination.

Includes a web UI with live-updating sliders for version, mask,
rotation, dx/dy offset, and scale. Keyboard shortcuts for rapid
iteration. Also works as a CLI for batch generation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 15:33:36 -07:00
66a47738dd Add ringtail post-deploy maintenance: kernel check, generation pruning, GC
Update manage-lockfile doc with post-deploy steps (kernel update detection,
reboot guidance, generation pruning). Add prune-ringtail-generations mise
task that keeps the 5 most recent generations plus the most recent one
matching the booted kernel for safe rollback.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:55:45 -07:00
a5b33591d3 Update ringtail flake inputs (nixpkgs, home-manager)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:37:43 -07:00
831b82950a Upgrade nvidia-device-plugin v0.18.2 → v0.19.0 and add reference card
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:19:24 -07:00
687e972713 Review CV doc and close build-dep review gap
Fix stale CV service doc (URL, forge domain, container tag) and add
guidance for reviewing build-time dependencies in private forge repos
during service reviews.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-27 07:12:38 -07:00
2c1652604b Reduce PodNotReady alert lookback from 5m to 60s
The 5-minute lookback window kept stale data from terminated pods
visible during rollouts, causing the alert to sit in Pending for
~5 minutes after every routine deployment. 60s still covers two
scrape cycles (30s interval) while clearing stale data much faster.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 19:48:37 -07:00
a37012385f Tighten ArgoCDAppOutOfSync alert timing to clear faster after sync
Reduced `for` from 30m to 5m and lookback window from 5m to 1m. The old
values caused alerts to linger long after apps returned to Synced state.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:44:09 -07:00
fc8d2cdb12 Add preserve/* branch protection and document Pyroscope blocker
branch-cleanup: Add PROTECTED_PREFIXES with preserve/* exclusion so
preserved work-in-progress branches are never deleted.

observability.md: Document Pyroscope profiling work on branch
preserve/pyroscope-profiling/pr-313, blocked on ringtail kernel
sysctl settings (kptr_restrict=0, perf_event_paranoid≤1). Also
document Faro/RUM as future potential with privacy considerations.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 15:32:25 -07:00
e375859221 Upgrade Homepage container to v1.11.0
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dockerfile (homepage) (push) Successful in 5m47s
Minor release with new widgets (Tracearr, SparklyFitness), Seerr rename,
and dependency bumps. No breaking changes for our config.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 10:17:36 -07:00
a5e51bd600 Review tailscale-setup tutorial: fix inaccuracies
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-26 07:44:36 -07:00
4ae55f9bf4 Review kubernetes-bootstrap tutorial: fix inaccuracies
- Fix k3s table entry (BlumeOps uses k3s on ringtail)
- Fix broken tailscale serve command (minikube ip returns IP, not port)
- Rewrite NFS section to match actual static PV/PVC binding pattern
- Fix "BluemeOps" typo
- Add last-reviewed frontmatter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-25 16:16:33 -07:00
796baaa41a Upgrade External Secrets Operator v2.2.0 + migrate Helm to kustomize (#312)
## Summary

- Upgrade External Secrets Operator from v1.3.2 (helm-chart-2.0.0) to v2.2.0
- Migrate from Helm chart deployment to static kustomize manifests, matching the repo's kustomize-first pattern
- Merge separate `-config` ArgoCD apps into the main operator apps (6 → 4 apps)
- Clean up Helm-specific labels (`helm.sh/chart`, `managed-by: Helm`)
- Update README example from v1beta1 to v1 API

## Breaking changes assessment

Low risk — v2.0.0 removed Alibaba and Device42 providers (we use neither). No templating changes affect us. All ExternalSecrets already use v1 API.

## Deployment steps

1. Sync CRDs first on both clusters (new CRD version)
2. Sync operator apps (now kustomize-based)
3. Verify ClusterSecretStore and all ExternalSecrets are healthy
4. Delete orphaned config apps: `argocd app delete external-secrets-config` and `-config-ringtail`
5. `mise run services-check`

Reviewed-on: #312
2026-03-25 15:56:41 -07:00
b97e37543f Deploy Tor Snowflake proxy on ringtail (#311)
## Summary

- Add Snowflake proxy as a native systemd service on ringtail (NixOS)
- Uses `pkgs.snowflake` from nixpkgs (v2.11.0)
- Hardened systemd unit with DynamicUser, ProtectSystem=strict, 512MB memory limit
- Prometheus metrics enabled on localhost:9999

## What is Snowflake?

A Tor pluggable transport that helps censored users reach the Tor network via WebRTC. **This is NOT a Tor exit node** — traffic exits through Tor exit nodes operated by others. The proxy operator cannot see traffic content (double-encrypted) and destination servers never see the proxy's IP.

## Changes

- `nixos/ringtail/configuration.nix` — new systemd service definition
- `docs/reference/services/snowflake-proxy.md` — service reference card
- `docs/reference/infrastructure/ringtail.md` — updated systemd services section
- `service-versions.yaml` — added entry (type: nixos)

## Deploy plan

After review, deploy via `mise run provision-ringtail`. Service starts automatically.

## Test plan

- [ ] `mise run provision-ringtail` succeeds
- [ ] `ssh ringtail 'systemctl status snowflake-proxy'` shows active
- [ ] `ssh ringtail 'journalctl -u snowflake-proxy --no-pager -n 20'` shows broker connections
- [ ] `ssh ringtail 'curl -s localhost:9999/metrics'` returns Prometheus metrics

Reviewed-on: #311
2026-03-24 20:51:40 -07:00
Forgejo Actions
243a862901 Update docs release to v1.15.0
- Built changelog from towncrier fragments

[skip ci]
2026-03-24 19:51:17 -07:00
fe201a495c Add Prowler IaC scanning of blumeops repo (Saturday 2am)
Clone repo in init container, scan Dockerfiles and K8s manifests
with Prowler's IaC provider (Trivy). Reports written to
sifaka:/volume1/reports/prowler-iac/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:49:38 -07:00
696024306c Add Prowler image vulnerability scanning for blumeops containers
All checks were successful
Build Container / detect (push) Successful in 39s
Build Container / build-dockerfile (prowler) (push) Successful in 10m15s
Add Trivy to the Prowler container for image and IaC scanning.
New CronJob (Saturday 3am) scans all blumeops/* images in the
registry for CVEs, embedded secrets, and Dockerfile misconfigs.
Reports written to sifaka:/volume1/reports/prowler-images/.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:43:08 -07:00
07e9c810ca Add RuntimeDefault seccomp profiles to all managed workloads
Addresses 32 CIS Kubernetes Benchmark failures from Prowler scan
(core_seccomp_profile_docker_default). Applied pod-level seccomp
RuntimeDefault to 18 deployments/statefulsets and 2 cronjobs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 16:19:40 -07:00
d021b3534f Deploy Prowler CIS scanner (#310)
All checks were successful
Build Container / detect (push) Successful in 4s
Build Container / build-dockerfile (prowler) (push) Successful in 10s
## Summary
- Deploy Prowler 5 as a weekly CronJob on minikube-indri for CIS Kubernetes Benchmark v1.11 scanning
- Custom slim container build (strips PowerShell, Trivy, and non-K8s providers from upstream)
- Reports (HTML, CSV, JSON-OCSF) written to NFS share on sifaka at `/volume1/reports/prowler/`
- Read-only ClusterRole for pod, RBAC, and control plane inspection
- Host path mounts + hostPID for kubelet file permission checks

## Follow-ups
- Mirror prowler-cloud/prowler on forge for supply chain control
- Build and push container image, update kustomization.yaml newTag
- Consider adding k3s-ringtail scanning (core + RBAC checks only)

## Test plan
- [ ] Build container: `mise run container-release prowler v5.22.0`
- [ ] Update `argocd/manifests/prowler/kustomization.yaml` newTag to built image tag
- [ ] Sync ArgoCD: `argocd app sync apps && argocd app set prowler --revision deploy-prowler && argocd app sync prowler`
- [ ] Trigger manual job: `kubectl create job --from=cronjob/prowler prowler-manual -n prowler --context=minikube-indri`
- [ ] Verify reports appear on sifaka NFS share
- [ ] `mise run services-check`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #310
2026-03-24 16:08:09 -07:00
fd0bebb0fc Localize authentik-redis container (#309)
All checks were successful
Build Container / detect (push) Successful in 3s
Build Container / build-dockerfile (alloy) (push) Successful in 12s
Build Container / build-dockerfile (ntfy) (push) Successful in 11s
Build Container / build-nix (alloy) (push) Successful in 20s
Build Container / build-nix (authentik) (push) Successful in 6m10s
Build Container / build-nix (authentik-redis) (push) Successful in 20s
Build Container / build-nix (ntfy) (push) Successful in 6s
## Summary

- Replace upstream `docker.io/library/redis:7-alpine` (Redis 7.4.8) with a nix-built container using Redis 8.2.3 from nixpkgs
- Introduce **attached service pattern**: `parent` field in service-versions.yaml, `<parent>-<component>` naming convention, and `assert pkgs.redis.version == version` in default.nix to prevent silent version drift on `flake.lock` updates
- Document the pattern in [[review-services]] so future attached services slot in cleanly
- Backfill `parent: grafana` on existing `grafana-sidecar` entry

## Version drift protection

1. `flake.lock` update bumps nixpkgs redis → `assert` in `default.nix` breaks `nix-build`
2. Developer updates `version` in `default.nix` → prek's `container-version-check` demands matching `service-versions.yaml` update
3. Both must agree before commit succeeds

## Test plan

- [ ] Build container from branch on ringtail (`mise run container-build-and-release authentik-redis`)
- [ ] Update kustomization `newTag` to branch-built image tag
- [ ] Sync authentik ArgoCD app from branch (`argocd app set authentik --revision localize-redis && argocd app sync authentik`)
- [ ] Verify Authentik login, session persistence, and task queue still work
- [ ] After merge: C0 follow-up to update `newTag` to the main-built image tag

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #309
2026-03-24 13:27:36 -07:00
fc45989a6c Decommission JobSync service (#308)
All checks were successful
Build Container / detect (push) Successful in 3s
## Summary

- Remove all JobSync infrastructure: ArgoCD app, k8s manifests, container build (nix), Caddy reverse proxy entry, Homepage dashboard entry, service-versions tracking, and all documentation
- Runtime teardown already completed: ArgoCD app cascade-deleted (removes deployment, PVC, service, ingress, external-secret), forge mirror deleted, 1Password item archived, local clone removed

## Motivation

Replacing JobSync with a datasette-based job tracking pipeline driven by mise tasks and a Claude agent frontend. JobSync's Next.js server actions don't expose a useful API for automation.

## Remaining manual steps after merge

- Provision Caddy to remove the stale proxy route: `mise run provision-indri -- --tags caddy`
- Sync Homepage: `argocd app sync homepage`
- Verify namespace cleanup on ringtail: `kubectl get ns jobsync --context=k3s-ringtail` (should be gone)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #308
2026-03-24 08:44:23 -07:00
0d422f5234 Update tooling dependencies (March 2026) (#307)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 2m51s
## Summary

Monthly tooling dependency update per [[update-tooling-dependencies]].

- **Prek hooks:** trufflehog v3.93.4→v3.94.0, ruff v0.15.2→v0.15.7, shfmt v3.12.0-2→v3.13.0-1, ansible-lint floor→26.3.0, ansible-core floor→2.18
- **Fly.io proxy:** nginx 1.28.2→1.29.6, Grafana Alloy v1.13.1→v1.14.1
- **Forgejo workflows:** actions/checkout v4.3.1→v6.0.2 (SHA-pinned across all 5 workflows)
- **Mise tasks:** tightened Python lower bounds — rich≥14.0.0, typer≥0.24.0, httpx≥0.28.1, pyyaml≥6.0.2

## Test plan

- [x] `prek run --all-files` passes
- [ ] Verify Fly.io deploy succeeds after merge (nginx minor bump + Alloy bump)
- [ ] Spot-check a workflow run with the new actions/checkout v6

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #307
2026-03-24 08:11:46 -07:00
0698013355 Review ArgoCD config tutorial: fix sync policy, typo, add cross-references
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 07:55:00 -07:00
bec554110a Upgrade Frigate 0.17.0-rc2 → 0.17.1, add motion retention tier
Bump from RC to latest stable (security fixes for config endpoint and
cross-camera auth). Add new 0.17 motion retention tier at 365 days,
reduce continuous from 180 to 30 days.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 07:30:18 -07:00
9024d41230 Add Grafana alerts dashboard for mobile-friendly alert overview
Two panels: currently firing alerts (firing/pending/noData/error) and
recent state changes. Refreshes every 30s. Uses Grafana's built-in
alertlist panel type — no datasource needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 21:16:54 -07:00
9efe5c97fe Fix authentik worker OOMKill: limit concurrency to 2
Dramatiq defaults to one worker process per CPU core. On ringtail (16 cores)
this spawned 16 processes, each loading the full Django app, exceeding the
1Gi memory limit and causing a crash loop (228 restarts over 7 days).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 21:05:16 -07:00
bd0ff30d3f Unify container build workflows (#306)
All checks were successful
Build Container / detect (push) Successful in 3s
## Summary
- Merges `build-container.yaml` and `build-container-nix.yaml` into a single workflow
- Detect job classifies each changed container by presence of `Dockerfile` and/or `default.nix`
- Dockerfile containers build on `k8s` (indri) via Dagger; Nix containers build on `nix-container-builder` (ringtail) via nix-build + skopeo
- Containers with both build files (alloy, nettest, ntfy) get built on both runners

## Test plan
- [ ] Push a change to a Dockerfile-only container (e.g. grafana) — verify it builds on k8s only
- [ ] Push a change to a nix-only container (e.g. jobsync) — verify it builds on nix-container-builder only
- [ ] Push a change to a dual container (e.g. ntfy) — verify it builds on both runners
- [ ] Test workflow_dispatch with a specific container name

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #306
2026-03-23 20:55:50 -07:00
d1dac0c241 Upgrade ntfy v2.17.0 → v2.19.2 (#305)
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 3s
Build Container (Nix) / build (ntfy) (push) Successful in 4s
Build Container / build (ntfy) (push) Successful in 11s
## Summary
- Upgrade ntfy from v2.17.0 to v2.19.2
- Update Dockerfile and Nix build definitions with new version, commit SHA, and hashes
- Add `subPackages = [ "." ]` to Nix build to handle new `tools/loadtest` module in upstream

## Upstream changes (no breaking changes)
- **v2.18.0:** Experimental PostgreSQL backend support
- **v2.19.0:** PostgreSQL read replica support, notification sound throttling
- **v2.19.1-2:** PostgreSQL bug fixes, web push race condition fix

## Test plan
- [ ] Container builds complete on Forgejo Actions (both Dockerfile and Nix)
- [ ] Update kustomization.yaml `newTag` to the built nix image tag
- [ ] `argocd app set ntfy --revision upgrade/ntfy-v2.19.2 && argocd app sync ntfy`
- [ ] Verify ntfy health: `curl https://ntfy.ops.eblu.me/v1/health`
- [ ] Send a test notification

Reviewed-on: #305
2026-03-23 10:32:06 -07:00
06e721841c Review 12 reference docs: fix stale image refs, expand stubs, add cross-refs
Replace hardcoded image tags in Quick Reference tables with pointers to
kustomization manifests (tags drift with every container release). Fix
Prometheus CNPG scrape target, remove misleading .ts.net URLs, expand
external-secrets stub, add backup/disaster-recovery cross-references.
Limit doc-reviewer agent to one doc per cycle.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 09:51:57 -07:00
3750428b58 Fix ArgoCD apps app permanent OutOfSync
Remove `group: ""` from ignoreDifferences in tailscale-operator and
tailscale-operator-ringtail — ArgoCD normalizes away the empty string
field, so the live state never matches git.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 20:42:37 -07:00
e9b8e3d80b Revert Tailscale operator to v1.94.2 — images not yet published
v1.96.3 exists as a GitHub release but Docker Hub images for both
tailscale/tailscale and tailscale/k8s-operator haven't been published
yet (v1.94.2 is still latest). Revert the image tags; the fly/start.sh
`tailscale wait` improvement and review date stamps are retained.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 19:41:40 -07:00
2e46f99820 Upgrade Tailscale operator v1.94.2 → v1.96.3 (#304)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 7m0s
## Summary

- Bump Tailscale operator, proxy containers, and init containers from v1.94.2 to v1.96.3 across both clusters (indri + ringtail via shared base kustomization)
- Replace hand-rolled `until tailscale status` polling loop in `fly/start.sh` with `tailscale wait --timeout 60s` (new in v1.96.2)
- Stamp kube-state-metrics review date (already current at v2.18.0)

## Notable upstream changes (v1.94.2 → v1.96.3)

- Go upgraded from 1.25 to 1.26
- `tailscale wait` command — blocks until daemon is running + interface has IP
- AuthKey policy now applies only when users are not logged in (behavioral change)
- Peer Relay improvements (metrics, EC2 IMDS, UDP socket scaling)
- UPnP stability fixes

## Deploy plan

1. Merge PR
2. Sync tailscale-operator on indri: `argocd app sync tailscale-operator`
3. Sync tailscale-operator on ringtail: `argocd app sync tailscale-operator-ringtail --server ringtail...`
4. Verify proxy pods roll with new image: `kubectl --context=minikube-indri -n tailscale get pods`
5. Verify ingress connectivity (spot-check a few `*.tail8d86e.ts.net` services)
6. Rebuild + deploy Fly proxy container (separate step, picks up `tailscale wait` change)

## Test plan

- [ ] ArgoCD diff looks clean for both apps before sync
- [ ] Proxy pods on indri come up healthy with v1.96.3 images
- [ ] Proxy pods on ringtail come up healthy with v1.96.3 images
- [ ] Tailscale ingress services remain reachable (e.g., grafana, prometheus)
- [ ] Fly proxy rebuild deploys successfully with `tailscale wait`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #304
2026-03-22 19:31:22 -07:00
Forgejo Actions
262299c82a Update docs release to v1.14.3
- Built changelog from towncrier fragments

[skip ci]
2026-03-22 18:20:41 -07:00
2cb0ce4289 Review and correct Tailscale reference doc
Fix wrong ACL path, add missing device tags (ringtail, per-service tags,
ci-gateway, flyio-proxy), correct access matrix (PyPI→DevPI, homelab
grants), add homelab→homelab SSH rule, document auto approvers section,
and add last-reviewed frontmatter.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 18:18:45 -07:00
6d65e6928c C2: Deploy infrastructure alerting pipeline (#303)
## Summary

Mikado chain to replace `mise run services-check` with Grafana Unified Alerting backed by ntfy push notifications.

**Design:**
- Grafana Unified Alerting evaluates rules against Prometheus/Loki
- ntfy webhook contact point delivers iOS notifications
- Anti-noise policy: page once per 24h per alert group
- Every alert links to a runbook in `docs/how-to/alerts/`
- services-check eventually queries the alerting API instead of doing its own probes

**Chain (bottom-up):**
1. `configure-grafana-alerting-pipeline` — enable alerting, ntfy contact point, notification policy
2. `first-alert-and-runbook` — end-to-end proof of concept with blackbox probe failure
3. `port-services-check-alerts` — migrate all services-check probes to alert rules + runbooks
4. `refactor-services-check-to-query-alerts` — rewrite services-check to query Grafana API
5. `deploy-infra-alerting` — goal card

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #303
2026-03-22 14:52:56 -07:00
f1620abb17 Improve Frigate health checks to catch NFS and camera failures
Replace single aggregate camera_fps check with per-camera FPS validation
and NFS storage accessibility check. Motivated by an outage where Frigate
API responded OK but NFS mount was inaccessible, causing "no frames" in UI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 09:55:53 -07:00
613f05dfde Add consistent OCI labels to all container Dockerfiles
All checks were successful
Build Container (Nix) / build (miniflux) (push) Successful in 2s
Build Container (Nix) / build (navidrome) (push) Successful in 2s
Build Container / build (devpi) (push) Successful in 41s
Build Container (Nix) / build (nettest) (push) Successful in 15s
Build Container / build (grafana-sidecar) (push) Successful in 1m27s
Build Container / build (grafana) (push) Successful in 3m23s
Build Container (Nix) / build (ntfy) (push) Successful in 3m19s
Build Container (Nix) / build (prometheus) (push) Successful in 1s
Build Container (Nix) / build (quartz) (push) Successful in 1s
Build Container (Nix) / build (runner-job-image) (push) Successful in 1s
Build Container (Nix) / build (teslamate) (push) Successful in 2s
Build Container (Nix) / build (transmission) (push) Successful in 2s
Build Container (Nix) / build (transmission-exporter) (push) Successful in 1s
Build Container (Nix) / build (unpoller) (push) Successful in 1s
Build Container / build (kiwix-serve) (push) Successful in 1m17s
Build Container / build (kubectl) (push) Successful in 41s
Build Container / build (homepage) (push) Successful in 8m21s
Build Container / build (mealie) (push) Successful in 1m1s
Build Container / build (loki) (push) Successful in 8m21s
Build Container / build (miniflux) (push) Successful in 2m24s
Build Container / build (nettest) (push) Successful in 14s
Build Container / build (ntfy) (push) Successful in 8m33s
Build Container / build (prometheus) (push) Successful in 37s
Build Container / build (quartz) (push) Successful in 19s
Build Container / build (navidrome) (push) Successful in 10m36s
Build Container / build (runner-job-image) (push) Successful in 3m18s
Build Container / build (transmission) (push) Successful in 20s
Build Container / build (transmission-exporter) (push) Successful in 21s
Build Container / build (unpoller) (push) Successful in 11s
Build Container / build (teslamate) (push) Successful in 4m42s
Every container now carries title, description, version, source, and
vendor labels per the OCI image spec. Version is derived from the
existing CONTAINER_APP_VERSION ARG at build time.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 20:42:00 -07:00
0dffdb9974 Add Claude Code subagents for infrastructure workflows
Four project-scoped subagents that formalize existing mise task
workflows as constrained, specialized AI agents:
- infra-health: background health monitor (wraps services-check)
- doc-reviewer: persistent-memory documentation reviewer
- change-classifier: C0/C1/C2 triage before work begins
- mikado-navigator: C2 chain state advisor (wraps docs-mikado)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:57:36 -07:00
ef8c2118a1 Standardize USAGE pragmas and typer parsing across mise tasks
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 11:42:01 -07:00
0d2779762a Upgrade Prometheus to v3.10.0 (#301)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (prometheus) (push) Successful in 2s
Build Container / build (prometheus) (push) Successful in 44m11s
## Summary
- Bump Prometheus from v3.9.1 to v3.10.0 in custom container Dockerfile
- v3.10.0 adds distroless Docker image variants, new PromQL `fill` operators, and performance improvements
- Dagger build tested locally — builds cleanly

## Remaining after merge
- Update `kustomization.yaml` newTag with the auto-built image tag
- Update `service-versions.yaml` (last-reviewed + current-version)
- ArgoCD sync

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #301
2026-03-18 07:47:46 -07:00
528d3da327 Review power.md: add ringtail, mark reviewed
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 07:37:31 -07:00
e0dbcbd997 Update retention changelog to reflect final PVC decision
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:46:55 -07:00
ef199b70f0 Increase Prometheus and Loki data retention
Prometheus: 15d → 10y (3650d), PVC 20Gi → 200Gi
Loki: 31d (744h) → 365d (8760h), PVC 20Gi → 50Gi

Indri has 1.6 TB free on the minikube backing disk — the previous
15-day Prometheus retention was losing valuable long-term metrics data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:44:00 -07:00
3e9873d669 Fix borgmatic backup: use correct kubectl context on indri
The Mealie SQLite dump hook used `minikube-indri` (the context name on
gilbert), but on indri itself the context is just `minikube`. This caused
the before_backup hook to fail, aborting all backups since the hook was added.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:07:44 -07:00
cfe3391f1a Bump Frigate retention and add recording health check
Increase retention: continuous 3→180d, detections 14→30d, alerts 30→730d.
Plenty of NFS headroom (~9.4 TiB free, ~6.6 GB/day for one camera).

Add frigate-recording check to services-check that verifies camera_fps > 0,
which would have caught the 6-day outage from the mqtt config removal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 18:24:11 -07:00
6617e44e5b Fix Frigate crash: re-add required mqtt config section
Frigate's config schema requires an `mqtt` field even when MQTT isn't
used. Commit 40f1568 removed it along with Mosquitto, causing Frigate
to fail validation on startup. Add `mqtt.enabled: false` to satisfy
the schema without needing a broker.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 18:10:23 -07:00