blumeops

Author	SHA1	Message	Date
Erich Blume	95364dcb48	Simplify runner image (Dagger Phase 3) (#162 ) All checks were successful Build Container / build (push) Successful in 1m13s Details forgejo-runner-v3.0.0 ## Summary With Phases 1 and 2 complete, the runner image no longer needs most of its bundled tools. This PR strips it down and adds what was missing. Removed (now inside Dagger containers): - Node.js 24.x - Docker CLI + buildx plugin - skopeo - gnupg, lsb-release, xz-utils Added: - `tzdata` — fixes the TZ env var (#159, #160, #161) so `TZ=America/Los_Angeles` actually works - `flyctl` — was being installed from scratch every release Workflow changes: - Remove "Ensure Dagger CLI" bootstrap steps from both workflows (Dagger is in the image) - Remove "Install flyctl" step from build-blumeops (flyctl is in the image) - Remove job-level `TZ` from build-blumeops (moved to runner configmap `runner.envs`) - Set `TZ: America/Los_Angeles` in runner configmap so all job containers inherit it ## Deployment After merge: 1. Build and release the new runner image: `mise run container-release forgejo-runner v3.0.0` 2. Sync the runner: `argocd app sync forgejo-runner` 3. Verify: `kubectl -n forgejo-runner exec deploy/forgejo-runner -c runner -- date` (but the real test is running a docs release and checking the changelog date) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/162	2026-02-11 17:24:20 -08:00
Forgejo Actions	996afbcf6f	Update docs release to v1.6.5 [skip ci]	2026-02-11 17:10:29 -08:00
Erich Blume	e84ffb7d7f	Set TZ on build-blumeops workflow job (#161 ) v1.6.5 ## Summary The runner pod's `TZ` env var (#159, #160) doesn't propagate to workflow job containers — jobs run inside Docker containers spawned by the DinD sidecar, not in the runner process itself. Set `TZ: America/Los_Angeles` at the job level so `uvx towncrier build` uses the correct timezone. This is the actual fix for the Feb 12 changelog dates. The runner pod TZ is still useful for runner daemon logs but doesn't affect job execution. Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/161	2026-02-11 17:06:44 -08:00
Forgejo Actions	6ce03df819	Update docs release to v1.6.4 - Built changelog from towncrier fragments [skip ci]	2026-02-12 01:01:23 +00:00
Erich Blume	2a04ab26b7	Mount host zoneinfo into runner for TZ support (#160 ) v1.6.4 ## Summary The `TZ=America/Los_Angeles` env var from #159 has no effect because the `forgejo/runner` image doesn't ship tzdata. Mount the node's `/usr/share/zoneinfo` into the container so the timezone database is available. ## Deployment After merge, sync forgejo-runner and verify: ``` argocd app sync forgejo-runner kubectl -n forgejo-runner exec deploy/forgejo-runner -c runner -- date # Should show PST/PDT, not UTC ``` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/160	2026-02-11 16:57:11 -08:00
Erich Blume	42ebc2b122	Fix Forgejo runner timezone (UTC -> America/Los_Angeles) (#159 ) ## Summary - Set `TZ=America/Los_Angeles` on the Forgejo runner container The runner pod defaults to UTC. When releases are cut in the evening PST, towncrier stamps changelog entries with tomorrow's date (e.g., v1.6.2 shows 2026-02-12 despite being released on the evening of Feb 11 PST). ## Deployment After merge, sync the forgejo-runner ArgoCD app: ``` argocd app sync forgejo-runner ``` The runner pod will restart with the new timezone. Note: the v1.6.2 changelog entry will remain dated 2026-02-12; future entries will use PST dates, so dates may appear non-sequential once. Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/159	2026-02-11 16:53:41 -08:00
Forgejo Actions	e5d1e795e0	Update docs release to v1.6.3 [skip ci]	2026-02-12 00:46:35 +00:00
Erich Blume	b0bac91ca9	Fix frontmatter field name for Quartz date display (#158 ) v1.6.3 ## Summary - Rename `date-modified` -> `modified` in all 80 docs and the `docs-check-frontmatter` task Quartz's `CreatedModifiedDate` plugin recognizes `modified`, `lastmod`, `updated`, and `last-modified` — but not `date-modified`. The wrong field name caused Quartz to ignore frontmatter dates entirely and fall through to filesystem timestamps (UTC inside Dagger), showing Feb 12 on pages built late on Feb 11 PST. ## Test plan - [x] `mise run docs-check-frontmatter` passes - [ ] Kick off docs release after merge — verify rendered dates match frontmatter values Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/158	2026-02-11 16:45:12 -08:00
Forgejo Actions	a75089d8ef	Update docs release to v1.6.2 - Built changelog from towncrier fragments [skip ci]	2026-02-12 00:35:02 +00:00
Erich Blume	b197bd5f58	Adopt Dagger CI for docs build (Phase 2) (#157 ) v1.6.2 ## Summary Migrates the docs build pipeline to Dagger (Phase 2 of the Dagger CI adoption plan). - Backfill `date-modified` frontmatter on all 80 docs — Dagger's `--src=.` excludes `.git`, so Quartz can't use git history for page dates. Frontmatter dates work with or without git. - New `docs-check-frontmatter` mise task + pre-commit hook — validates all docs have `title`, `tags`, and `date-modified` - New Dagger functions — `build_changelog` (towncrier in Python container) and `build_docs` (chains changelog → Quartz build in Node container, returns tarball) - Simplified CI workflow — the ~44-line inline Quartz build (clone, npm ci, build, tar, cleanup) is replaced by `dagger call build-docs`. Changelog step remains local on the runner since towncrier needs to modify the host working tree for the git commit. ### Design decisions - Towncrier runs twice in CI: once inside Dagger (for the docs tarball) and once on the runner (for the git commit). This is intentional — Dagger's directory export is additive and can't delete the consumed changelog fragments from the host. - Artifact hosting stays on Forgejo Releases (not migrated to Forgejo Packages as the plan doc originally suggested). That migration can happen independently. - `date-modified` frontmatter preserved even though `build_changelog` installs git — the git there is only for towncrier's `git add` call, not for history. The local iteration story (`dagger call build-docs --src=. --version=dev` with uncommitted changes) depends on frontmatter dates. ### Local iteration ```bash dagger call build-docs --src=. --version=dev export --path=./docs-dev.tar.gz tar tf docs-dev.tar.gz | head -20 ``` ## Deployment and Testing - [x] `dagger call build-docs --src=. --version=dev` produces valid 1.1MB tarball (149 HTML pages) - [x] Pre-commit hooks pass (including new `docs-check-frontmatter`) - [ ] Full `workflow_dispatch` run after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/157	2026-02-11 16:33:16 -08:00
Erich Blume	738b39a321	Mark Phase 1 verification checklist items as complete Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 15:49:49 -08:00
Erich Blume	1bc2b421a8	Adopt Dagger CI for container builds (Phase 1) (#156 ) All checks were successful Build Container / build (push) Successful in 13s Details forgejo-runner-v2.6.0 nettest-v0.13.0 ## Summary - Add Dagger Python module (`.dagger/`) with `build` and `publish` functions for container images - Replace Docker buildx + skopeo composite action with `dagger call publish` in `build-container.yaml` - BuildKit's native push is compatible with Zot — skopeo workaround eliminated - Add Dagger CLI (v0.19.11) to forgejo-runner Dockerfile, bump runner to v2.6.0 - Bootstrap step in workflow curl-installs dagger if not in runner (for first build on v2.5.1 runner) - Delete old `.forgejo/actions/build-push-image/` composite action - Add GPLv3 LICENSE ## Verified locally - `dagger call build --src=. --container-name=nettest` — builds ✓ - `dagger call publish --src=. --container-name=nettest --version=dagger-test` — pushed to Zot ✓ - `dagger call build --src=. --container-name=forgejo-runner` — new runner image builds ✓ - Dagger CLI accessible inside built runner image ✓ ## Deployment sequence (after merge) 1. `mise run container-tag-and-release forgejo-runner v2.6.0` — old runner bootstraps dagger via curl, builds new runner 2. `argocd app sync forgejo-runner` — runner restarts with v2.6.0 (dagger baked in) 3. `mise run container-tag-and-release nettest v0.13.0` — end-to-end test of new pipeline 4. `mise run container-list` — verify tags ## Not included (future phases) - Phase 2: docs build + Forgejo packages migration - Phase 3: runner simplification (remove skopeo, Node.js, etc.) - Phase 4: future workflows Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/156	2026-02-11 15:38:31 -08:00
Erich Blume	faebc98c3c	Fix blumeops-tasks for Todoist API v1 migration (#155 ) ## Summary - Migrate from deprecated Todoist REST API v2 (`410 Gone`) to new unified API v1 - Add cursor-based pagination for project and task listing endpoints - Switch 1Password credential retrieval from `op item get --fields` to `op read` ## Testing - [x] `mise run blumeops-tasks` returns all 9 tasks successfully - [x] Pre-commit hooks pass Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/155	2026-02-11 14:33:37 -08:00
Forgejo Actions	362ae22ab7	Update docs release to v1.6.1 - Built changelog from towncrier fragments [skip ci]	2026-02-11 21:37:34 +00:00
Erich Blume	3c4b5b6c10	Add changelog fragment for cache purge BusyBox fix v1.6.1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 13:36:44 -08:00
Erich Blume	cef7611cba	Wrap fly ssh cache purge in sh -c for BusyBox. fly ssh console -C doesn't run through a shell, so && was passed as literal arguments to rm. Wrap in sh -c to get proper shell parsing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-11 13:35:11 -08:00
Forgejo Actions	eca01a9546	Update docs release to v1.6.0 - Built changelog from towncrier fragments [skip ci]	2026-02-11 21:33:57 +00:00
Erich Blume	0efcce2984	Purge Fly.io proxy cache after docs release (#154 ) v1.6.0 ## Summary - The Fly.io nginx proxy caches docs responses for 24h (`proxy_cache_valid 200 1d`) - After a release, docs.eblu.me kept serving stale content until the cache expired - This caused v1.5.4 to show v1.5.3 on the CHANGELOG page - Adds `flyctl` install and `fly ssh console` cache purge steps to the build workflow, running after the ArgoCD deploy completes ## Test plan - [ ] Next release should show the correct version on docs.eblu.me/CHANGELOG immediately - [ ] Verify the `fly ssh console` command succeeds in the workflow logs Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/154	2026-02-11 13:33:26 -08:00
Forgejo Actions	ab6661f5dd	Update docs release to v1.5.4 - Built changelog from towncrier fragments [skip ci]	2026-02-11 20:17:12 +00:00
Erich Blume	a59ff04249	Review security-model.md (#153 ) v1.5.4 ## Summary - Fix Ansible secret example: replaced incorrect `op item get --fields` with `op read` to match project convention - Add new "Tailscale Operator Privileges" section documenting the operator's namespaced RBAC and OAuth client permissions - Stamp `last-reviewed: 2026-02-11` ## Review Notes First review of this doc (previously never reviewed). Verified: - All wiki-links resolve - ACL structure matches actual `pulumi/tailscale/policy.hujson` - TruffleHog pre-commit config exists as documented - Ansible `op read` pattern matches actual usage in playbooks/roles Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/153	2026-02-11 12:16:32 -08:00
Erich Blume	834c9fa57b	Bump Fly.io proxy VM to 512MB, fix TruffleHog scanning (#152 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m37s Details ## Summary - Bump Fly.io proxy VM memory from 256MB to 512MB — Alloy was OOM-killed, causing the Grafana Fly.io dashboard to lose metrics - Fix TruffleHog pre-commit hook to scan only staged changes (`--since-commit HEAD`) instead of full repo history - Sanitize example credential URL in Reolink camera plan doc ## Deployment and Testing - [ ] Fly.io deploy triggers automatically on merge (workflow watches `fly/**`) - [ ] After deploy, verify Alloy is running: `fly ssh console -a blumeops-proxy -C "ps aux"` should show alloy process - [ ] Grafana Fly.io dashboard should start populating within ~1 minute Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/152	2026-02-11 12:03:51 -08:00
Erich Blume	651fed8f1a	Transcribe backlog tasks into plan documents (#151 ) ## Summary - adopt-oidc-provider: Dex-based OIDC identity provider for SSO across services (status: Planning — service dependency/recovery design needed) - harden-zot-registry: OIDC + API key auth and tag immutability for zot (depends on OIDC provider + Dagger CI) - forgejo-actions-dashboard: Custom textfile Prometheus exporter + Grafana dashboard for Forgejo Actions CI metrics - operationalize-reolink-camera: Cloud-free Frigate NVR with ONNX detection, NFS ring buffer recording to sifaka (depends on network segmentation) - add-unifi-pulumi-stack: Expanded with NFS security motivation, BlumeOps Services subnet, IoT/appliance segregation, firewall rules ## Test plan - [x] Pre-commit hooks pass (all 3 commits) - [x] `docs-check-links` passes - [x] `docs-check-index` passes - [x] `docs-check-filenames` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/151	2026-02-11 11:47:23 -08:00
Erich Blume	430f2c6ec5	Add plans for Dagger CI/CD and upstream fork strategy (#150 ) ## Summary Two new plan documents in `docs/how-to/plans/`: - adopt-dagger-ci — Migrate CI/CD build logic from Forgejo Actions YAML to Dagger (Python SDK). Forgejo Actions stays as a thin trigger layer. Covers: - Container builds with local iteration (`dagger call build ... terminal`) - Docs builds with Forgejo packages migration (replacing Forgejo releases) - Runner simplification (only Docker + dagger CLI needed) - Secrets handling via Dagger's `Secret` type - Future: forked project builds, Python packages, pre-merge validation - upstream-fork-strategy — Stacked-branch pattern for maintaining forks of upstream projects. Covers: - Daily automated rebase with conflict detection and issue creation - Branch model: `upstream/main` → `blumeops` → `feature/*` - Quartz fork as first instance, enabling `last-reviewed` frontmatter rendering in docs - Upstream PR path for contributing changes back ## Context These plans emerged from evaluating alternatives to the GHA ecosystem (BuildKite, Concourse, Earthly) for CI/CD. Dagger was chosen for its local iteration story, Python-native pipelines, and zero-infrastructure requirements. The fork strategy is a prerequisite for customizing Quartz and other upstream tools. Neither plan is ready for execution yet — they are design documents for future work. Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/150	2026-02-11 10:20:14 -08:00
Forgejo Actions	a106f92c38	Update docs release to v1.5.3 - Built changelog from towncrier fragments [skip ci]	2026-02-11 15:53:49 +00:00
Erich Blume	aab19c97fe	Restore docker buildx build (#149 ) All checks were successful Build Container / build (push) Successful in 40s Details nettest-v0.12.0 v1.5.3 ## Summary - Switch build action back to `docker buildx build` now that runner v2.5.1 (with `docker-buildx-plugin`) is deployed ## Test plan - [ ] Merge and tag `nettest-v0.12.0` to verify buildx works end-to-end Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/149	2026-02-10 21:21:19 -08:00
Erich Blume	f0ac04fb8a	Bootstrap buildx: revert to docker build, bump runner to v2.5.1 (#148 ) All checks were successful Build Container / build (push) Successful in 1m56s Details forgejo-runner-v2.5.1 ## Summary - Temporarily revert composite action to `docker build` so we can build the runner image (chicken-and-egg: current runner v2.5.0 doesn't have buildx) - Bump runner label to `v2.5.1` so after sync the new runner image (with buildx) gets used ## Deployment plan 1. Merge this PR 2. Tag `forgejo-runner-v2.5.1` — builds with legacy `docker build` (one last time) 3. Sync forgejo-runner in ArgoCD to pick up the v2.5.1 label 4. Follow-up PR: switch action back to `docker buildx build` 5. Tag `nettest-v0.12.0` to verify buildx works end-to-end Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/148	2026-02-10 21:17:14 -08:00
Erich Blume	2fc5aa82b1	Add docker-buildx-plugin to forgejo-runner (#147 ) Some checks failed Build Container / build (push) Failing after 3s Details ## Summary - Install `docker-buildx-plugin` alongside `docker-ce-cli` in the forgejo-runner image - Fixes `docker buildx build` failing with "unknown flag: --tag" from #146 ## Test plan - [ ] Merge and release `forgejo-runner-v2.5.1` - [ ] Update runner configmap/labels if needed to use new image - [ ] Re-tag `nettest-v0.11.1` (or `v0.12.0`) to verify build-container workflow succeeds Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/147	2026-02-10 21:14:29 -08:00
Erich Blume	cb36f1784f	Switch CI builds to docker buildx (#146 ) Some checks failed Build Container / build (push) Failing after 4s Details nettest-v0.11.1 ## Summary - Replace deprecated `docker build` with `docker buildx build` in the build-push-image composite action - Remove redundant build/run comments from nettest Dockerfile ## Test plan - [ ] Merge and tag `nettest-v1.1.0` (or similar) to trigger the build-container workflow - [ ] Verify the build succeeds without the deprecation warning Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/146	2026-02-10 21:03:41 -08:00
Erich Blume	19741021d6	Doc review for explanation/explanation.md	2026-02-10 16:00:01 -08:00
Erich Blume	0dce806107	Add plan and reference card for UniFi Express 7 Pulumi stack (#145 ) ## Summary - Rewrites the UniFi Pulumi plan doc to use filipowm/unifi Terraform provider via `pulumi package add terraform-provider` (replaces pulumiverse_unifi approach) - Adds network segmentation goals (main/guest/IoT WiFi zones) and API key auth - Creates UniFi reference card (`docs/reference/infrastructure/unifi.md`) with topology diagram - Updates all documentation indexes (plans.md, how-to.md, hosts.md, reference.md) ## What's Deferred Actual stack scaffolding (`pulumi/unifi/`), mise tasks, and `pulumi import` are blocked on switch purchase and cabling. The plan doc captures everything needed for a future execution session. ## Verification - `docs-check-links` passes (all wiki-links resolve) - `docs-check-index` passes (unifi.md referenced in reference.md) - Pre-commit hooks pass Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/145	2026-02-10 15:36:13 -08:00
Erich Blume	f65d11d55b	Update BorgBase repo ID after recreation (#144 ) ## Summary - Previous BorgBase repo (k04ljcd7) had corrupted segments from interrupted backup attempts - Recreated as u3ugi1x1 (same US region, same SSH key, same append-only settings) - Updates repo path in Ansible defaults and known_hosts hostname in tasks ## Post-merge 1. `mise run provision-indri -- --tags borgmatic` 2. `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'` 3. `mise x -- borgmatic create --repository borgbase-offsite --verbosity 1 --progress` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/144	2026-02-10 13:19:15 -08:00
Erich Blume	0d5f48e2c2	Document op read vs op item get convention (#143 ) ## Summary - Adds guidance to CLAUDE.md: use `op read` for secret values, `op item get` only for metadata - Fixes the argocd login example which used `op item get --fields` - `op item get --fields` wraps multi-line values in quotes, which corrupts keys and other secrets Discovered while verifying the sifaka borg repokey in 1Password — hashes didn't match until we switched to `op read`. Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/143	2026-02-10 13:09:55 -08:00
Erich Blume	d045a5d76a	Add BorgBase offsite backup repository (#142 ) ## Summary - Adds BorgBase as a second borgmatic repository for offsite backups (US region, append-only) - SSH key managed via 1Password, deployed to indri by Ansible - Borgmatic `ssh_command` configured to use the dedicated BorgBase key - BorgBase host key pinned in known_hosts via Ansible ## Post-merge deployment steps 1. Provision borgmatic: `mise run provision-indri -- --tags borgmatic` 2. Initialize the BorgBase repo: `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'` 3. Export and store the borg repokey: `ssh indri 'borg key export ssh://k04ljcd7@k04ljcd7.repo.borgbase.com/./repo'` → save to 1Password 4. Verify first backup: `ssh indri 'mise x -- borgmatic create --repository borgbase-offsite --verbosity 1'` ## BorgBase setup (already done) - Account created, API token in 1Password (`borgbase` item in blumeops vault) - SSH keypair generated, stored in 1Password, public key uploaded to BorgBase (ID: 200815) - Repository `indri-borgmatic` created (ID: k04ljcd7, US region, append-only, 2-day alert) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/142	2026-02-10 12:47:02 -08:00
Erich Blume	54afa0750b	Add how-to guide for restoring 1Password backup from borgmatic (#141 ) ## Summary - New how-to guide at `docs/how-to/restore-1password-backup.md` with step-by-step procedure for extracting and decrypting a 1Password `.1pux` export from borgmatic backup - End-to-end verified: extracted from today's borg archive, decrypted age key with openssl, decrypted .1pux with age → valid 31MB zip with vault data - Cross-links added from: disaster-recovery, 1password, borgmatic, backups policy, and how-to index - Updated disaster-recovery.md from TBD stub to include a procedures table ## Deployment and Testing - [x] Verified full extraction + decryption flow against live borgmatic archive - [x] `docs-check-links` passes — all wiki-links valid - [ ] Review guide for clarity and completeness Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/141	2026-02-10 10:55:00 -08:00
Erich Blume	b5746e62c2	Add migration plan for Forgejo brew-to-source transition (#140 ) ## Summary - Add `docs/how-to/plans/migrate-forgejo-from-brew.md` — full Diataxis-style plan covering background, one-time migration steps, Ansible role changes (with exact code), verification checklist, and future considerations - Add `docs/how-to/plans/plans.md` — new plans subdirectory index for upcoming migration/transition plans - Update `docs/how-to/how-to.md` with a Plans section - Update `docs/tutorials/exploring-the-docs.md` to mention plans in the doc structure table and quick-path sections for Owner and AI audiences ## Test plan - [x] `docs-check-links` passes - [x] `docs-check-index` passes - [x] All pre-commit hooks pass Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/140	2026-02-10 10:18:53 -08:00
Erich Blume	41dfae1f80	Add CNI conflict troubleshooting to restart-indri how-to (#139 ) ## Summary - Documents a troubleshooting procedure for broken pod networking after unclean shutdown - During minikube recovery, a stale `1-k8s.conflist` CNI config can override kindnet's `10-kindnet.conflist`, causing new pods to use bridge+firewall networking instead of kindnet's ptp — breaking pod-to-pod communication - Covers symptoms (DNS failures, liveness probe timeouts), diagnosis steps, and the fix ## Context Encountered this during the 2026-02-10 power outage. Immich, kiwix, and transmission were all crash-looping for ~8 hours due to the CNI conflict. The minikube ansible role's clean boot detection has been improved (#137) so this may not recur, but the troubleshooting guide is valuable if it does. ## Test plan - [x] Documentation only — no code changes - [x] Pre-commit hooks pass Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/139	2026-02-10 07:24:42 -08:00
Erich Blume	e5d1e979b8	Add power infrastructure reference card (#138 ) ## Summary - New `power.md` reference card documenting the full power chain: AC grid → Anker SOLIX F2000 battery → CyberPower CP1000PFCLCD UPS → homelab - Lists all devices on the UPS (indri, sifaka, UniFi Express 7, Starlink) - Replaced inline UPS entry on indri card with link to the new power card - Added power card to reference index Context: power chain was upgraded — the Anker battery station now sits between grid power and the UPS, providing extended runtime for grid outages. ## Test plan - [x] docs-check-links, docs-check-index, docs-check-filenames all pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/138	2026-02-09 23:03:13 -08:00
Erich Blume	d76d675b29	Fix minikube role skipping start when kubelet/apiserver are stopped (#137 ) ## Summary - After a power loss, minikube's Docker container (host) restarts but kubelet/apiserver remain stopped - The ansible role's status check used `--format='{{.Host}}'` which only examined the host VM state - When host=Running but kubelet/apiserver=Stopped, the role skipped `minikube start` - Fixed to use full `minikube status` exit code (returns non-zero when any component is unhealthy) - Simplified all downstream conditions to use exit code instead of string matching ## Test plan - [x] Verified the fix correctly skips `minikube start` when cluster is already fully running - [x] Pre-commit hooks pass (ansible-lint, yamllint, etc.) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/137	2026-02-09 23:03:01 -08:00
Erich Blume	a5765f9cf2	Add op-backup mise task for encrypted 1Password disaster recovery (#136 ) ## Summary - Adds `mise run op-backup` task that encrypts a 1Password .1pux export with `age` using the master password + secret key as passphrase, SCPs to indri for borgmatic pickup, then deletes the plaintext - Adds `age` to the Brewfile - Borgmatic already backs up `/Users/erichblume/Documents` on indri, which covers the `1password-backup/` subdirectory — no config change needed ## Disaster recovery 1. Restore borgmatic archive to retrieve the `.age` file 2. Open Emergency Kit from safety deposit box 3. `age --decrypt <file>.age > export.1pux` (passphrase: `{master_password}:{secret_key}`) 4. Open `.1pux` with 1Password or unzip to inspect ## Usage ``` # Export all vaults from 1Password desktop app as .1pux, then: mise run op-backup ~/Documents/1Password-export.1pux # Or run without args for interactive prompt: mise run op-backup ``` ## Test plan - [ ] `brew install age` - [ ] Export a test vault from 1Password as .1pux - [ ] Run `mise run op-backup` with the export path - [ ] Verify encrypted file appears on indri at `~/Documents/1password-backup/` - [ ] Verify plaintext .1pux is deleted from gilbert - [ ] Test decryption: `age --decrypt <file>.age > test.1pux` with password:secret_key - [ ] Verify decrypted .1pux can be opened/unzipped 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/136	2026-02-09 20:37:39 -08:00
Erich Blume	85e36cd807	Operations and observability for sifaka NAS (#135 ) ## Summary - Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring - Formalize existing `node_exporter` container under Ansible management - Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus - Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime) - Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS - Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles ## Prerequisites before deploy - [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP) - [ ] Verify `ssh eblume@sifaka 'docker ps'` works - [ ] Run `mise run provision-sifaka` to deploy containers - [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes - [ ] `argocd app sync prometheus` + `argocd app sync grafana-config` ## Test plan - [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics` - [ ] Verify Prometheus targets page shows both sifaka jobs as UP - [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135	2026-02-09 17:44:05 -08:00
Erich Blume	4ee643a81d	Serve friendly error page when Fly.io proxy upstreams are unreachable (#133 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m50s Details ## Summary - Adds a branded 503 error page served when upstreams are unreachable (indri offline, Tailscale tunnel down, emergency shutoff, etc.) - Stale cache is still served first when available (`proxy_cache_use_stale` takes priority) - Test endpoint at `docs.eblu.me/_error` to preview the page without killing upstreams - `proxy_intercept_errors on` also catches error responses returned by the upstream itself ## Files Changed - `fly/error.html` — Self-contained error page (dark theme, links to BlumeOps repo) - `fly/nginx.conf` — `error_page`, `internal` location, `/_error` test location, `proxy_intercept_errors` - `fly/Dockerfile` — COPY error.html into image ## Test Plan - [ ] Deploy to Fly.io - [ ] Visit `docs.eblu.me/_error` to verify the page renders - [ ] Optionally stop indri/Tailscale to confirm the page shows on real 502/503/504 Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/133	2026-02-09 12:01:24 -08:00
Erich Blume	959b6842bc	Zero-downtime Fly.io deploys (#132 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m40s Details ## Summary - Start nginx after Tailscale connects (community best practice for Tailscale sidecars) - Switch to `bluegreen` deploy strategy — old machine serves until new one is healthy - Replace top-level `[checks]` with `[[http_service.checks]]` — only service-level checks gate traffic routing ([confirmed by Fly.io staff](https://community.fly.io/t/clarifying-the-types-of-health-checks/20379)) - Remove sentinel file and nginx if-check (no longer needed) Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable. ## Deployment and Testing - [ ] Merge and `fly deploy` from `fly/` directory - [ ] Verify deploy completes with zero 502s (watch `fly logs` and Grafana docs-apm) - [ ] Confirm `fly checks list` shows the new service-level check passing Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/132	2026-02-09 11:34:19 -08:00
Erich Blume	bd61da4f85	Fix 502 errors during Fly.io proxy deploys (#131 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 1m20s Details ## Summary - Health check (`/healthz`) now returns 503 until Tailscale is connected - `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds - Fly.io keeps the old machine serving traffic during the ~7s startup window Previously, nginx passed the health check immediately, Fly.io routed traffic to the new machine, but MagicDNS wasn't available yet — causing upstream DNS timeouts and 502s on every request until Tailscale connected. ## Deployment and Testing - [ ] Merge and `fly deploy` from `fly/` directory - [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard) - [ ] Confirm health check transitions from 503 → 200 in `fly logs` Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131	2026-02-09 11:07:36 -08:00
Erich Blume	3415cad38c	Log real client IPs via Fly-Client-IP header (#130 ) All checks were successful Deploy Fly.io Proxy / deploy (push) Successful in 59s Details ## Summary - Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header - Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki - Keeps `remote_addr` (the internal proxy IP) for debugging Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs. ## Deployment and Testing - [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory - [ ] Verify in Grafana that new log lines include `client_ip` with real IPs - [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130	2026-02-09 11:02:06 -08:00
Forgejo Actions	92a1081302	Update docs release to v1.5.2 - Built changelog from towncrier fragments [skip ci]	2026-02-09 15:30:21 +00:00
Erich Blume	9e361cf38f	Add docs-review task with last-reviewed frontmatter tracking (#129) v1.5.2 ## Summary - New `docs-review` mise task replaces `docs-review-random` — sorts docs by `last-reviewed` frontmatter field (never-reviewed first, then oldest) - Updated review-documentation how-to to explain the new workflow and how to mark cards as reviewed - Updated ai-assistance-guide task table to reference `docs-review` ## Test plan - [x] `mise run docs-review` runs and shows staleness table + most stale doc - [x] `mise run docs-review -- --limit 5` respects the limit flag - [x] All pre-commit checks pass (links, index, filenames) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/129	2026-02-09 07:29:45 -08:00
Erich Blume	c6f8fcd346	Fix fly-deploy WARNING by starting nginx before Tailscale (#128) ## Summary - Start nginx before Tailscale in `start.sh` so port 8080 is bound immediately, eliminating the "app is not listening on the expected address" WARNING during `fly deploy` - Switch `proxy_pass` to use a variable with `resolver 100.100.100.100 valid=30s` so nginx can start without resolving MagicDNS names at config load time - DNS results cached 30s per worker — no per-request lookup overhead ## Context The WARNING was a race condition: Fly checks for listeners right after the machine starts, but `start.sh` ran ~5-10s of Tailscale setup before starting nginx. The health check always passed later, but the warning was noisy. ## Test plan - [ ] Merge and let the deploy-fly workflow trigger - [ ] Check runner logs for absence of the WARNING - [ ] Verify `docs.eblu.me` still serves correctly - [ ] Verify `/healthz` still passes Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/128	2026-02-09 07:01:58 -08:00
Erich Blume	a0b076172f	Fix Immich/Homepage Ingress host matching, add missing service checks (#127) ## Summary - Fix Immich Ingress `host: photos` causing 404 with ProxyGroup (same FQDN mismatch as Prometheus/Loki) - Migrate Homepage from old per-service Tailscale proxy to shared ProxyGroup (was the last holdout) - Add Immich and Navidrome to `services-check` HTTP endpoints ## Deployment Notes - Already tested on branch: Immich and Homepage both return 200 via Caddy - Homepage's old Helm-managed Ingress was deleted manually; ArgoCD may recreate it on sync — prune with `argocd app sync homepage --prune` after merge - Old per-service `ts-homepage-*` pod in tailscale namespace can be cleaned up after confirming ProxyGroup works ## Test Plan - [x] `curl https://photos.ops.eblu.me/` returns 200 - [x] `curl https://go.ops.eblu.me/` returns 200 - [ ] `mise run services-check` fully passes after merge Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/127	2026-02-08 22:12:50 -08:00
Erich Blume	e6cf7e47e0	Restrict flyio-proxy ACLs to dedicated tag:flyio-target endpoints (#126) ## Summary - Introduce `tag:flyio-target` so services must explicitly opt in to be reachable by the fly.io proxy - Replace broad `tag:k8s` and `tag:homelab` grants with the new tag in the ACL rule and test - Add `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to docs, loki, and prometheus Ingresses - Switch Alloy push endpoints from `.ops.eblu.me` (Caddy) to `.tail8d86e.ts.net` (Tailscale Ingress) - Update docs: flyio-proxy, caddy, tailscale, forgejo (future public access + security checklist), expose-service-publicly ## Manual step (not in PR) Update the k8s operator OAuth client in the Tailscale admin console to include `tag:flyio-target` in its scope. Without this, the operator cannot assign the new tag to Ingress proxy nodes. ## Deployment order 1. Pulumi ACLs — `mise run tailnet-preview && mise run tailnet-up` 2. OAuth client — Manual update in Tailscale admin console 3. K8s Ingresses — `argocd app sync apps && argocd app sync docs loki prometheus` 4. Fly.io proxy — `mise run fly-deploy` 5. Verify — `mise run services-check`, check Grafana dashboards ## Test plan - [ ] `mise run tailnet-preview` shows clean diff - [ ] `argocd app diff docs`, `argocd app diff loki`, `argocd app diff prometheus` show only annotation additions - [ ] After deploy: Grafana dashboards show continued log/metric flow - [ ] `curl -sf https://docs.eblu.me` returns 200 - [ ] `mise run services-check` passes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/126	2026-02-08 21:54:18 -08:00
Erich Blume	7f41621c7f	Migrate Ansible op calls to op read URI syntax (#125) ## Summary - Convert all 12 `op item get ... --fields ... --reveal` calls in Ansible to the newer `op read "op://vault/item/field"` syntax - Remove the `regex_replace` workaround on the Fly deploy token (no longer needed since `op read` returns clean unquoted values) - Covers `ansible/playbooks/indri.yml`, `ansible/roles/caddy/tasks/main.yml`, `ansible/roles/jellyfin_metrics/tasks/main.yml`, and `ansible/roles/alloy/tasks/main.yml` ## Test plan - [x] `mise run provision-indri -- --check --diff` dry run passes (ok=67, failed=0) - [x] No `op item get` calls remain in `ansible/` directory - [x] All pre-commit hooks pass (yaml, ansible-lint, TruffleHog, etc.) - [ ] Full provision run after merge to confirm secrets resolve correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/125	2026-02-08 10:52:43 -08:00
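
The `op item get` to `op read` migration described in #125 might look roughly like the sketch below. The vault, item, and field names (`example-vault`, `example-item`, `token`) are hypothetical placeholders, not values from the repository, and the `op_ref` helper is an illustration only; the actual Ansible tasks are not shown in this log.

```shell
#!/bin/sh
# Hypothetical helper: build a secret reference of the form op://vault/item/field.
op_ref() {
    printf 'op://%s/%s/%s' "$1" "$2" "$3"
}

# Old style (per the commit summary): select a field from an item and reveal it.
#   op item get "example-item" --vault "example-vault" --fields token --reveal
#
# New style: read the same field through a secret reference URI.
#   op read "$(op_ref example-vault example-item token)"

# Print the reference that would be passed to `op read` (placeholder names).
op_ref example-vault example-item token
echo
```

Keeping the reference in one string means a task only needs the URI, rather than separate item, vault, and field arguments.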


312 commits