Commit graph

286 commits

2fc5aa82b1 Add docker-buildx-plugin to forgejo-runner (#147)
Some checks failed
Build Container / build (push) Failing after 3s
## Summary
- Install `docker-buildx-plugin` alongside `docker-ce-cli` in the forgejo-runner image
- Fixes `docker buildx build` failing with "unknown flag: --tag" from #146

## Test plan
- [ ] Merge and release `forgejo-runner-v2.5.1`
- [ ] Update runner configmap/labels if needed to use new image
- [ ] Re-tag `nettest-v0.11.1` (or `v0.12.0`) to verify build-container workflow succeeds

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/147
2026-02-10 21:14:29 -08:00
cb36f1784f Switch CI builds to docker buildx (#146)
Some checks failed
Build Container / build (push) Failing after 4s
nettest-v0.11.1
## Summary
- Replace deprecated `docker build` with `docker buildx build` in the build-push-image composite action
- Remove redundant build/run comments from nettest Dockerfile

## Test plan
- [ ] Merge and tag `nettest-v1.1.0` (or similar) to trigger the build-container workflow
- [ ] Verify the build succeeds without the deprecation warning

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/146
2026-02-10 21:03:41 -08:00
19741021d6 Doc review for explanation/explanation.md 2026-02-10 16:00:01 -08:00
0dce806107 Add plan and reference card for UniFi Express 7 Pulumi stack (#145)
## Summary
- Rewrites the UniFi Pulumi plan doc to use the filipowm/unifi Terraform provider via `pulumi package add terraform-provider` (replaces the pulumiverse_unifi approach)
- Adds network segmentation goals (main/guest/IoT WiFi zones) and API key auth
- Creates UniFi reference card (`docs/reference/infrastructure/unifi.md`) with topology diagram
- Updates all documentation indexes (plans.md, how-to.md, hosts.md, reference.md)

## What's Deferred
Actual stack scaffolding (`pulumi/unifi/`), mise tasks, and `pulumi import` are blocked on switch purchase and cabling. The plan doc captures everything needed for a future execution session.

## Verification
- `docs-check-links` passes (all wiki-links resolve)
- `docs-check-index` passes (unifi.md referenced in reference.md)
- Pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/145
2026-02-10 15:36:13 -08:00
f65d11d55b Update BorgBase repo ID after recreation (#144)
## Summary
- Previous BorgBase repo (k04ljcd7) had corrupted segments from interrupted backup attempts
- Recreated as u3ugi1x1 (same US region, same SSH key, same append-only settings)
- Updates repo path in Ansible defaults and known_hosts hostname in tasks

## Post-merge
1. `mise run provision-indri -- --tags borgmatic`
2. `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'`
3. `mise x -- borgmatic create --repository borgbase-offsite --verbosity 1 --progress`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/144
2026-02-10 13:19:15 -08:00
0d5f48e2c2 Document op read vs op item get convention (#143)
## Summary
- Adds guidance to CLAUDE.md: use `op read` for secret values, `op item get` only for metadata
- Fixes the argocd login example which used `op item get --fields`
- `op item get --fields` wraps multi-line values in quotes, which corrupts keys and other secrets

Discovered while verifying the sifaka borg repokey in 1Password — hashes didn't match until we switched to `op read`.

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/143
2026-02-10 13:09:55 -08:00
d045a5d76a Add BorgBase offsite backup repository (#142)
## Summary
- Adds BorgBase as a second borgmatic repository for offsite backups (US region, append-only)
- SSH key managed via 1Password, deployed to indri by Ansible
- Borgmatic `ssh_command` configured to use the dedicated BorgBase key
- BorgBase host key pinned in known_hosts via Ansible

## Post-merge deployment steps
1. Provision borgmatic: `mise run provision-indri -- --tags borgmatic`
2. Initialize the BorgBase repo: `ssh indri 'mise x -- borgmatic init --encryption repokey --repository borgbase-offsite'`
3. Export and store the borg repokey: `ssh indri 'borg key export ssh://k04ljcd7@k04ljcd7.repo.borgbase.com/./repo'` → save to 1Password
4. Verify first backup: `ssh indri 'mise x -- borgmatic create --repository borgbase-offsite --verbosity 1'`

## BorgBase setup (already done)
- Account created, API token in 1Password (`borgbase` item in blumeops vault)
- SSH keypair generated, stored in 1Password, public key uploaded to BorgBase (ID: 200815)
- Repository `indri-borgmatic` created (ID: k04ljcd7, US region, append-only, 2-day alert)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/142
2026-02-10 12:47:02 -08:00
54afa0750b Add how-to guide for restoring 1Password backup from borgmatic (#141)
## Summary
- New how-to guide at `docs/how-to/restore-1password-backup.md` with step-by-step procedure for extracting and decrypting a 1Password `.1pux` export from borgmatic backup
- **End-to-end verified**: extracted from today's borg archive, decrypted age key with openssl, decrypted .1pux with age → valid 31MB zip with vault data
- Cross-links added from: disaster-recovery, 1password, borgmatic, backups policy, and how-to index
- Updated disaster-recovery.md from TBD stub to include a procedures table

## Deployment and Testing
- [x] Verified full extraction + decryption flow against live borgmatic archive
- [x] `docs-check-links` passes — all wiki-links valid
- [ ] Review guide for clarity and completeness

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/141
2026-02-10 10:55:00 -08:00
b5746e62c2 Add migration plan for Forgejo brew-to-source transition (#140)
## Summary
- Add `docs/how-to/plans/migrate-forgejo-from-brew.md` — full Diataxis-style plan covering background, one-time migration steps, Ansible role changes (with exact code), verification checklist, and future considerations
- Add `docs/how-to/plans/plans.md` — new plans subdirectory index for upcoming migration/transition plans
- Update `docs/how-to/how-to.md` with a Plans section
- Update `docs/tutorials/exploring-the-docs.md` to mention plans in the doc structure table and quick-path sections for Owner and AI audiences

## Test plan
- [x] `docs-check-links` passes
- [x] `docs-check-index` passes
- [x] All pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/140
2026-02-10 10:18:53 -08:00
41dfae1f80 Add CNI conflict troubleshooting to restart-indri how-to (#139)
## Summary
- Documents a troubleshooting procedure for broken pod networking after unclean shutdown
- During minikube recovery, a stale `1-k8s.conflist` CNI config can override kindnet's `10-kindnet.conflist`, causing new pods to use bridge+firewall networking instead of kindnet's ptp — breaking pod-to-pod communication
- Covers symptoms (DNS failures, liveness probe timeouts), diagnosis steps, and the fix
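
The override is consistent with a runtime that picks the first CNI config file in lexicographic order. The sketch below is a minimal Python model of that ordering, assuming that selection rule applies here; the helper name `active_cni_config` is hypothetical and not part of minikube or kindnet.

```python
def active_cni_config(filenames):
    """Return the config a runtime would pick if it takes the first
    file in lexicographic order (assumed selection rule)."""
    suffixes = (".conf", ".conflist", ".json")
    candidates = sorted(f for f in filenames if f.endswith(suffixes))
    return candidates[0] if candidates else None


# "-" sorts before "0", so the stale file is ordered ahead of kindnet's config.
print(active_cni_config(["10-kindnet.conflist", "1-k8s.conflist"]))
```

Under that assumption, removing the stale file leaves `10-kindnet.conflist` as the only candidate.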

## Context
Encountered this during the 2026-02-10 power outage. Immich, kiwix, and transmission were all crash-looping for ~8 hours due to the CNI conflict. The minikube Ansible role's clean boot detection has been improved (#137), so this may not recur, but the troubleshooting guide is valuable if it does.

## Test plan
- [x] Documentation only — no code changes
- [x] Pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/139
2026-02-10 07:24:42 -08:00
e5d1e979b8 Add power infrastructure reference card (#138)
## Summary
- New `power.md` reference card documenting the full power chain: AC grid → Anker SOLIX F2000 battery → CyberPower CP1000PFCLCD UPS → homelab
- Lists all devices on the UPS (indri, sifaka, UniFi Express 7, Starlink)
- Replaced inline UPS entry on indri card with link to the new power card
- Added power card to reference index

Context: power chain was upgraded — the Anker battery station now sits between grid power and the UPS, providing extended runtime for grid outages.

## Test plan
- [x] docs-check-links, docs-check-index, docs-check-filenames all pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/138
2026-02-09 23:03:13 -08:00
d76d675b29 Fix minikube role skipping start when kubelet/apiserver are stopped (#137)
## Summary
- After a power loss, minikube's Docker container (host) restarts but kubelet/apiserver remain stopped
- The Ansible role's status check used `--format='{{.Host}}'`, which examined only the host VM state
- When host=Running but kubelet/apiserver=Stopped, the role skipped `minikube start`
- Fixed to use the full `minikube status` exit code (returns non-zero when any component is unhealthy)
- Simplified all downstream conditions to use exit code instead of string matching
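
The difference between the two checks can be modelled in a few lines of Python. This is a sketch of the decision logic only, not the Ansible task itself; `simulated_exit_code` is a hypothetical stand-in for running `minikube status`, and it collapses the real exit code to 0 or 1.

```python
def needs_start_old(status):
    """Old check: only the host state was examined."""
    return status["host"] != "Running"


def needs_start_new(exit_code):
    """New check: any non-zero exit code means the cluster needs a start."""
    return exit_code != 0


def simulated_exit_code(status):
    """Hypothetical stand-in: 0 only when every component is Running."""
    return 0 if all(state == "Running" for state in status.values()) else 1


# State after a power loss: the container is up, Kubernetes components are not.
after_outage = {"host": "Running", "kubelet": "Stopped", "apiserver": "Stopped"}
print(needs_start_old(after_outage))                       # old check skips the start
print(needs_start_new(simulated_exit_code(after_outage)))  # new check triggers it
```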

## Test plan
- [x] Verified the fix correctly skips `minikube start` when cluster is already fully running
- [x] Pre-commit hooks pass (ansible-lint, yamllint, etc.)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/137
2026-02-09 23:03:01 -08:00
a5765f9cf2 Add op-backup mise task for encrypted 1Password disaster recovery (#136)
## Summary
- Adds `mise run op-backup` task that encrypts a 1Password .1pux export with `age` using the master password + secret key as passphrase, SCPs to indri for borgmatic pickup, then deletes the plaintext
- Adds `age` to the Brewfile
- Borgmatic already backs up `/Users/erichblume/Documents` on indri, which covers the `1password-backup/` subdirectory — no config change needed

## Disaster recovery
1. Restore borgmatic archive to retrieve the `.age` file
2. Open Emergency Kit from safety deposit box
3. `age --decrypt <file>.age > export.1pux` (passphrase: `{master_password}:{secret_key}`)
4. Open `.1pux` with 1Password or unzip to inspect

## Usage
```
# Export all vaults from 1Password desktop app as .1pux, then:
mise run op-backup ~/Documents/1Password-export.1pux

# Or run without args for interactive prompt:
mise run op-backup
```

## Test plan
- [ ] `brew install age`
- [ ] Export a test vault from 1Password as .1pux
- [ ] Run `mise run op-backup` with the export path
- [ ] Verify encrypted file appears on indri at `~/Documents/1password-backup/`
- [ ] Verify plaintext .1pux is deleted from gilbert
- [ ] Test decryption: `age --decrypt <file>.age > test.1pux` with password:secret_key
- [ ] Verify decrypted .1pux can be opened/unzipped

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/136
2026-02-09 20:37:39 -08:00
85e36cd807 Operations and observability for sifaka NAS (#135)
## Summary
- Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring
- Formalize existing `node_exporter` container under Ansible management
- Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus
- Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime)
- Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS
- Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles

## Prerequisites before deploy
- [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP)
- [ ] Verify `ssh eblume@sifaka 'docker ps'` works
- [ ] Run `mise run provision-sifaka` to deploy containers
- [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes
- [ ] `argocd app sync prometheus` + `argocd app sync grafana-config`

## Test plan
- [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics`
- [ ] Verify Prometheus targets page shows both sifaka jobs as UP
- [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135
2026-02-09 17:44:05 -08:00
4ee643a81d Serve friendly error page when Fly.io proxy upstreams are unreachable (#133)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m50s
## Summary
- Adds a branded 503 error page served when upstreams are unreachable (indri offline, Tailscale tunnel down, emergency shutoff, etc.)
- Stale cache is still served first when available (`proxy_cache_use_stale` takes priority)
- Test endpoint at `docs.eblu.me/_error` to preview the page without killing upstreams
- `proxy_intercept_errors on` also catches error responses returned by the upstream itself

## Files Changed
- `fly/error.html` — Self-contained error page (dark theme, links to BlumeOps repo)
- `fly/nginx.conf` — `error_page`, `internal` location, `/_error` test location, `proxy_intercept_errors`
- `fly/Dockerfile` — COPY error.html into image

## Test Plan
- [ ] Deploy to Fly.io
- [ ] Visit `docs.eblu.me/_error` to verify the page renders
- [ ] Optionally stop indri/Tailscale to confirm the page shows on real 502/503/504

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/133
2026-02-09 12:01:24 -08:00
959b6842bc Zero-downtime Fly.io deploys (#132)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m40s
## Summary
- Start nginx after Tailscale connects (community best practice for Tailscale sidecars)
- Switch to `bluegreen` deploy strategy — old machine serves until new one is healthy
- Replace top-level `[checks]` with `[[http_service.checks]]` — only service-level checks gate traffic routing ([confirmed by Fly.io staff](https://community.fly.io/t/clarifying-the-types-of-health-checks/20379))
- Remove sentinel file and nginx if-check (no longer needed)

Supersedes the approach in #131 — that helped (502 window dropped from ~30s to ~3s) but couldn't fully eliminate it because top-level checks don't gate routing and Fly.io's proxy sends traffic as soon as the port is reachable.

## Deployment and Testing
- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (watch `fly logs` and Grafana docs-apm)
- [ ] Confirm `fly checks list` shows the new service-level check passing

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/132
2026-02-09 11:34:19 -08:00
bd61da4f85 Fix 502 errors during Fly.io proxy deploys (#131)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m20s
## Summary
- Health check (`/healthz`) now returns 503 until Tailscale is connected
- `start.sh` creates `/tmp/tailscale-ready` sentinel after `tailscale up` succeeds
- Fly.io keeps the old machine serving traffic during the ~7s startup window
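
The readiness gate amounts to "report unhealthy until the sentinel file exists". The real check lives in the nginx config; the Python below only models that decision, and the function name is hypothetical.

```python
from pathlib import Path

SENTINEL = Path("/tmp/tailscale-ready")  # path taken from the summary above


def healthz_status(sentinel=SENTINEL):
    """Return 200 once the sentinel exists, 503 before that."""
    return 200 if sentinel.exists() else 503
```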

Previously, nginx passed the health check immediately, Fly.io routed traffic to the new machine, but MagicDNS wasn't available yet — causing upstream DNS timeouts and 502s on every request until Tailscale connected.

## Deployment and Testing
- [ ] Merge and `fly deploy` from `fly/` directory
- [ ] Verify deploy completes with zero 502s (check Grafana docs-apm dashboard)
- [ ] Confirm health check transitions from 503 → 200 in `fly logs`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/131
2026-02-09 11:07:36 -08:00
3415cad38c Log real client IPs via Fly-Client-IP header (#130)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 59s
## Summary
- Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header
- Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki
- Keeps `remote_addr` (the internal proxy IP) for debugging
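
As a rough illustration of the two address fields, the sketch below builds them for one request in Python. The actual change is an nginx log format plus an Alloy pipeline stage; the function name and the empty-string default are assumptions made for this example.

```python
def access_log_fields(headers, remote_addr):
    """Build the two address fields for one log record."""
    # Header names are case-insensitive, so normalise before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {
        "client_ip": lowered.get("fly-client-ip", ""),  # real visitor IP
        "remote_addr": remote_addr,                      # internal proxy IP
    }
```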

Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs.

## Deployment and Testing
- [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory
- [ ] Verify in Grafana that new log lines include `client_ip` with real IPs
- [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130
2026-02-09 11:02:06 -08:00
Forgejo Actions
92a1081302 Update docs release to v1.5.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-09 15:30:21 +00:00
9e361cf38f Add docs-review task with last-reviewed frontmatter tracking (#129) v1.5.2
## Summary
- New `docs-review` mise task replaces `docs-review-random` — sorts docs by `last-reviewed` frontmatter field (never-reviewed first, then oldest)
- Updated review-documentation how-to to explain the new workflow and how to mark cards as reviewed
- Updated ai-assistance-guide task table to reference `docs-review`
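
The ordering rule (never-reviewed first, then oldest) can be expressed as a single sort key. This is a minimal sketch that assumes the frontmatter has already been parsed into a mapping of doc name to date; it is not the task's actual code, and `review_order` is a hypothetical name.

```python
from datetime import date


def review_order(docs):
    """Sort doc names: never-reviewed first, then oldest review date first.

    `docs` maps a doc name to a `date`, or to None when the doc has no
    `last-reviewed` field (hypothetical input shape).
    """
    return sorted(
        docs,
        key=lambda name: (docs[name] is not None, docs[name] or date.min),
    )
```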

## Test plan
- [x] `mise run docs-review` runs and shows staleness table + most stale doc
- [x] `mise run docs-review -- --limit 5` respects the limit flag
- [x] All pre-commit checks pass (links, index, filenames)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/129
2026-02-09 07:29:45 -08:00
c6f8fcd346 Fix fly-deploy WARNING by starting nginx before Tailscale (#128)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m3s
## Summary
- Start nginx before Tailscale in `start.sh` so port 8080 is bound immediately, eliminating the "app is not listening on the expected address" WARNING during `fly deploy`
- Switch `proxy_pass` to use a variable with `resolver 100.100.100.100 valid=30s` so nginx can start without resolving MagicDNS names at config load time
- DNS results cached 30s per worker — no per-request lookup overhead

## Context
The WARNING was a race condition: Fly checks for listeners right after the machine starts, but `start.sh` ran ~5-10s of Tailscale setup before starting nginx. The health check always passed later, but the warning was noisy.

## Test plan
- [ ] Merge and let the deploy-fly workflow trigger
- [ ] Check runner logs for absence of the WARNING
- [ ] Verify `docs.eblu.me` still serves correctly
- [ ] Verify `/healthz` still passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/128
2026-02-09 07:01:58 -08:00
a0b076172f Fix Immich/Homepage Ingress host matching, add missing service checks (#127)
## Summary

- Fix Immich Ingress `host: photos` causing 404 with ProxyGroup (same FQDN mismatch as Prometheus/Loki)
- Migrate Homepage from old per-service Tailscale proxy to shared ProxyGroup (was the last holdout)
- Add Immich and Navidrome to `services-check` HTTP endpoints

## Deployment Notes

- Already tested on branch: Immich and Homepage both return 200 via Caddy
- Homepage's old Helm-managed Ingress was deleted manually; ArgoCD may recreate it on sync — prune with `argocd app sync homepage --prune` after merge
- Old per-service `ts-homepage-*` pod in tailscale namespace can be cleaned up after confirming ProxyGroup works

## Test Plan

- [x] `curl https://photos.ops.eblu.me/` returns 200
- [x] `curl https://go.ops.eblu.me/` returns 200
- [ ] `mise run services-check` fully passes after merge

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/127
2026-02-08 22:12:50 -08:00
e6cf7e47e0 Restrict flyio-proxy ACLs to dedicated tag:flyio-target endpoints (#126)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m8s
## Summary
- Introduce `tag:flyio-target` so services must explicitly opt in to be reachable by the fly.io proxy
- Replace broad `tag:k8s` and `tag:homelab` grants with the new tag in the ACL rule and test
- Add `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to docs, loki, and prometheus Ingresses
- Switch Alloy push endpoints from `*.ops.eblu.me` (Caddy) to `*.tail8d86e.ts.net` (Tailscale Ingress)
- Update docs: flyio-proxy, caddy, tailscale, forgejo (future public access + security checklist), expose-service-publicly

## Manual step (not in PR)
Update the k8s operator OAuth client in the Tailscale admin console to include `tag:flyio-target` in its scope. Without this, the operator cannot assign the new tag to Ingress proxy nodes.

## Deployment order
1. **Pulumi ACLs** — `mise run tailnet-preview && mise run tailnet-up`
2. **OAuth client** — Manual update in Tailscale admin console
3. **K8s Ingresses** — `argocd app sync apps && argocd app sync docs loki prometheus`
4. **Fly.io proxy** — `mise run fly-deploy`
5. **Verify** — `mise run services-check`, check Grafana dashboards

## Test plan
- [ ] `mise run tailnet-preview` shows clean diff
- [ ] `argocd app diff docs`, `argocd app diff loki`, `argocd app diff prometheus` show only annotation additions
- [ ] After deploy: Grafana dashboards show continued log/metric flow
- [ ] `curl -sf https://docs.eblu.me` returns 200
- [ ] `mise run services-check` passes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/126
2026-02-08 21:54:18 -08:00
7f41621c7f Migrate Ansible op calls to op read URI syntax (#125)
## Summary
- Convert all 12 `op item get ... --fields ... --reveal` calls in Ansible to the newer `op read "op://vault/item/field"` syntax
- Remove the `regex_replace` workaround on the Fly deploy token (no longer needed since `op read` returns clean unquoted values)
- Covers `ansible/playbooks/indri.yml`, `ansible/roles/caddy/tasks/main.yml`, `ansible/roles/jellyfin_metrics/tasks/main.yml`, and `ansible/roles/alloy/tasks/main.yml`
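
For callers outside Ansible, the same URI form is easy to build programmatically. The helper below is a hypothetical Python sketch that only assembles the argument list for `op read`; it does not invoke the CLI.

```python
def op_read_args(vault, item, field):
    """Build the argument list for `op read` with an op:// secret reference."""
    return ["op", "read", f"op://{vault}/{item}/{field}"]


print(op_read_args("vault", "item", "field"))
```

The resulting list can be passed to `subprocess.run(..., capture_output=True, text=True, check=True)` where the `op` CLI is installed and signed in.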

## Test plan
- [x] `mise run provision-indri -- --check --diff` dry run passes (ok=67, failed=0)
- [x] No `op item get` calls remain in `ansible/` directory
- [x] All pre-commit hooks pass (yaml, ansible-lint, TruffleHog, etc.)
- [ ] Full provision run after merge to confirm secrets resolve correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/125
2026-02-08 10:52:43 -08:00
234c46c302 Filter blumeops-tasks to hide future-dated tasks (#124)
## Summary
- Tasks with a due date are now only shown when due today or earlier
- Recurring tasks stay hidden until their next occurrence is actionable
- Tasks without a due date continue to always display
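
The visibility rule reduces to one comparison. A minimal Python sketch, assuming due dates are already parsed into `date` objects (the function name is hypothetical):

```python
from datetime import date


def is_visible(due, today):
    """Undated tasks always show; dated tasks show once due today or earlier."""
    return due is None or due <= today
```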

## Test plan
- [x] Ran `mise run blumeops-tasks` — verified 18 undated tasks display correctly
- [x] Confirmed "BlumeOps doc review" (due tomorrow) is correctly hidden

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/124
2026-02-08 10:38:44 -08:00
Forgejo Actions
c8d0af6644 Update docs release to v1.5.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 18:06:46 +00:00
cc54b4f565 Add Fly.io proxy observability via embedded Alloy (#123)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m16s
v1.5.1
## Summary

- Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
- Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting)
- Create two Grafana dashboards: **Docs APM** (per-service view filtered by `host="docs.eblu.me"`) and **Fly.io Proxy Health** (aggregate proxy health across all upstream services)

## Changed Files

| File | Change |
|------|--------|
| `fly/nginx.conf` | Add JSON `log_format` + `access_log`, add `stub_status` endpoint |
| `fly/Dockerfile` | COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config |
| `fly/alloy.river` | **New** — Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Start Alloy after Tailscale, before nginx |
| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` | **New** — Docs APM dashboard |
| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` | **New** — Fly.io Proxy Health dashboard |
| `argocd/manifests/grafana-config/kustomization.yaml` | Register new dashboard configmaps |
| `docs/reference/services/flyio-proxy.md` | Document observability setup |

## Deployment and Testing

- [ ] `mise run fly-deploy` — rebuild container with Alloy
- [ ] `curl https://docs.eblu.me/` — generate traffic
- [ ] `fly logs -a blumeops-proxy` — verify Alloy startup
- [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}`
- [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}`
- [ ] `argocd app sync grafana-config` — deploy dashboards
- [ ] Verify dashboards show data in Grafana
- [ ] `mise run services-check` — no regressions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123
2026-02-08 10:05:38 -08:00
61862b44c0 Add docs.eblu.me and Fly.io to services-check (#122)
## Summary

- Add public docs (`docs.eblu.me`) and Fly.io healthz checks to `mise run services-check`

## Test plan

- [ ] `mise run services-check` includes new "Public services (via Fly.io)" section

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/122
2026-02-08 02:48:15 -08:00
6b1167a345 Fix Fly.io deploy token quoting (#121)
## Summary

- Strip literal quotes from Fly.io deploy token when syncing to Forgejo Actions secrets
- The `op` CLI wraps values containing spaces in quotes; the Fly.io token format is `FlyV1 <key>` (contains a space)
- This caused the CI deploy workflow to fail with a 401 auth error
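
The stripping step itself is small. The sketch below shows the idea in Python; it illustrates the behaviour rather than the expression used in the Ansible role, and the token value in the example is made up.

```python
def strip_wrapping_quotes(value):
    """Remove one pair of matching double quotes around the value, if present."""
    if len(value) >= 2 and value[0] == '"' and value[-1] == '"':
        return value[1:-1]
    return value


print(strip_wrapping_quotes('"FlyV1 example-key"'))
```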

## Test plan

- [x] `mise run provision-indri -- --tags forgejo_actions_secrets` succeeds
- [ ] Re-trigger deploy-fly workflow after merge — should authenticate successfully

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/121
2026-02-08 02:43:45 -08:00
Forgejo Actions
c46d55060d Update docs release to v1.5.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 10:37:30 +00:00
64a78422b1 Add Fly.io public reverse proxy for docs.eblu.me (#120)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 9s
v1.5.0
## Summary

- Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale
- First service exposed: `docs.eblu.me` — the Quartz static docs site
- Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME
- Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow

## Key details

- Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed
- Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts
- nginx caches aggressively for the static site; health check is on the default_server block
- ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only
- DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev`

## Test plan

- [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok`
- [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status`
- [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert
- [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected)
- [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120
2026-02-08 02:36:19 -08:00
fbedaf2833 docs/expose-service-publicly pt2 - fly.io (#119)
Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/119
2026-02-08 00:38:27 -08:00
ae2445c99a Add how-to guide for public service exposure via Cloudflare (#118)
## Summary
- Adds `docs/how-to/expose-service-publicly.md` documenting the full plan for exposing `docs.eblu.me` to the public internet
- Covers Cloudflare Tunnel + CDN architecture, DNS migration from Gandi, Caddy TLS changes, Pulumi IaC, k8s cloudflared deployment, and verification steps
- Pattern is reusable for future public services
- Marked as "Plan — not yet implemented" status

## Test plan
- [x] `docs-check-links` passes
- [x] `docs-check-index` passes
- [x] All pre-commit hooks pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/118
2026-02-07 22:09:52 -08:00
Forgejo Actions
11c76d4768 Update docs release to v1.4.2
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 05:45:40 +00:00
dc46eb7def Update all docs titles to human-readable (#117) v1.4.2
## Summary
- Updated frontmatter `title:` in all 63 doc cards from slug-case to human-readable (e.g. `borgmatic` → `Borgmatic`, `ai-assistance-guide` → `AI Assistance Guide`)
- Titles now closely match file stems so `[[wiki-links]]` render naturally without alternate anchor text
- Corrected titles that diverged from stems (e.g. `host-inventory` → `Hosts`, `grafana-alloy` → `Alloy`, `argocd-applications` → `Apps`)
- Deleted `title-test-alpha.md` and `title-test-beta.md` test cards and removed their reference index entry
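
For the common case, the slug-to-title mapping can be sketched as below. This is an illustration only: the `ACRONYMS` table is a hypothetical override list, and hand-picked titles such as `host-inventory` → `Hosts` are not covered by it.

```python
ACRONYMS = {"ai": "AI"}  # hypothetical override table for words that are not title-cased


def humanize(slug):
    """Turn a slug-case title into a human-readable one."""
    return " ".join(ACRONYMS.get(word, word.capitalize()) for word in slug.split("-"))
```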

## Deployment and Testing
- [x] `docs-check-links` passes — all wiki-links valid
- [x] `docs-check-index` passes
- [x] `docs-check-filenames` passes
- [ ] Verify titles render correctly on docs site after deploy

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/117
2026-02-07 21:44:57 -08:00
Forgejo Actions
ab7efd8c1c Update docs release to v1.4.1
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 05:27:23 +00:00
a7d6d44d3d Remove title slug check and test duplicate titles (#116) v1.4.1
## Summary
- Remove `docs-check-titles` pre-commit hook and mise task — wiki-links resolve by filename stem, not frontmatter title, so slug-format titles and uniqueness aren't needed
- Add two test cards (`title-test-alpha`, `title-test-beta`) with identical `title: Title Test Card` to verify duplicate titles don't break Quartz or obsidian.nvim
- Retitle `index.md` from `blumeops-documentation` to `BlumeOps`
- Add GitHub and Forgejo repo links to homepage intro

## Test plan
- [ ] Deploy to docs site and verify both test cards render and cross-link correctly
- [ ] Verify homepage title renders as "BlumeOps"
- [ ] Verify repo links on homepage work
- [ ] After confirming, remove test cards in a follow-up

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/116
2026-02-07 21:26:18 -08:00
Forgejo Actions
3f5017f732 Update docs release to v1.4.0
- Built changelog from towncrier fragments

[skip ci]
2026-02-08 05:03:34 +00:00
8e4afe77e0 Add Gandi DNS docs and rewrite homepage intro (#115) v1.4.0
## Summary
- New reference card (`docs/reference/infrastructure/gandi.md`) covering DNS records, Pulumi config, TLS integration
- New how-to guide (`docs/how-to/gandi-operations.md`) for DNS deployment and PAT cycling with `pbpaste` shortcut
- Rewritten homepage intro for wider audience ahead of public docs.eblu.me
- Cross-linked from reference index, routing, caddy, and how-to index
- Fixed PAT expiration inaccuracy in `pulumi/gandi/README.md` (max is 90 days, not 30)

## Test plan
- [ ] Verify wiki-links resolve in Quartz build
- [ ] Review gandi reference card for accuracy
- [ ] Review gandi-operations how-to for accuracy
- [ ] Check homepage reads well for external visitors

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/115
2026-02-07 21:02:10 -08:00
2cb76de5a2 Fix doc tag inconsistencies and add missing ai changelog type (#114)
## Summary
- Add missing `ai` changelog fragment type to the update-documentation how-to guide (comment and table)
- Consolidate `cicd` → `ci-cd` tag on forgejo.md
- Consolidate `network` → `networking` tag on routing.md and tailscale.md

Found during random doc review via `docs-review-tags`.

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/114
2026-02-06 18:52:36 -08:00
cb343a2e35 Rename doc-* mise tasks to docs-check-* / docs-review-* (#113)
## Summary
- Rename 4 automated-check tasks: `doc-titles` → `docs-check-titles`, `doc-filenames` → `docs-check-filenames`, `doc-links` → `docs-check-links`, `doc-index` → `docs-check-index`
- Rename 3 interactive-review tasks: `doc-random` → `docs-review-random`, `doc-tags` → `docs-review-tags`, `doc-stale` → `docs-review-stale`
- Update all references in `.pre-commit-config.yaml`, `ai-assistance-guide.md`, and `review-documentation.md`
- Historical changelog entries left as-is

## Test plan
- [x] `mise run docs-check-titles` exits 0
- [x] `mise run docs-check-links` exits 0
- [x] `mise run docs-review-tags` exits 0
- [x] `mise run doc-titles` fails with "no task found"
- [x] All pre-commit hooks pass (including renamed hook IDs)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/113
2026-02-06 07:08:46 -08:00
060c7a24e3 Review exploring-the-docs and add doc consistency checks (#112)
## Summary
- Reviewed and cleaned up exploring-the-docs tutorial: simplified wiki-links, fixed broken replication/ reference, added Related section, corrected zk-docs flags to match CLAUDE.md
- Added orphan detection to doc-links (finds docs not linked from any other doc)
- Added new doc tooling: `doc-index` (checks category index coverage), `doc-stale` (staleness report), `doc-tags` (tag inventory)
- Added `doc-index` as a pre-commit hook
- Updated use-pypi-proxy to document env-var-based proxy toggle for pip/uv
- Updated ai-assistance-guide with new doc task descriptions
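
The orphan detection mentioned above (docs not linked from any other doc) can be sketched roughly as follows. This is a minimal illustration, not the actual `doc-links` task: the function name `find_orphans`, the `roots` parameter, and the in-memory `docs` mapping are all hypothetical, and it assumes links use the `[[name]]` / `[[name|label]]` wiki-link form.

```python
import re

# Captures the target of [[target]], [[target|label]] or [[target#heading]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")


def find_orphans(docs: dict[str, str], roots: frozenset[str] = frozenset({"index"})) -> list[str]:
    """Return doc names that no *other* doc links to.

    `docs` maps a filename stem to its markdown text (hypothetical layout).
    `roots` lists entry points that are allowed to have no inbound links.
    """
    linked: set[str] = set()
    for name, text in docs.items():
        for target in WIKI_LINK.findall(text):
            target = target.strip()
            if target != name:  # a self-link does not rescue a doc from being an orphan
                linked.add(target)
    return sorted(name for name in docs if name not in linked and name not in roots)
```

A real task would presumably read the markdown files from disk first; the mapping is used here only to keep the sketch self-contained.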

## Test plan
- [ ] Run `mise run doc-links` — passes
- [ ] Run `mise run doc-index` — passes
- [ ] Run `mise run doc-stale` — informational output
- [ ] Run `mise run doc-tags` — informational output
- [ ] Pre-commit hooks pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/112
2026-02-05 21:12:06 -08:00
61c5328ec2 Update restart-indri docs after power outage recovery (#111)
## Summary
- Simplified restart-indri startup procedure to match reality (most services autostart via mcquack LaunchAgents and brew services)
- Added minikube tailscale serve port fix step (`mise run provision-indri -- --tags minikube`)
- Added Anker SOLIX F2000 GaNPrime UPS to indri reference card

## Context
After a power outage, discovered that the restart-indri docs overstated what needs manual intervention. Docker Desktop, Forgejo, Caddy, and all mcquack services autostart. Only Amphetamine, AutoMounter, and minikube need manual action.

## Test plan
- [ ] Verify restart-indri doc reads clearly
- [ ] Verify indri ref card UPS entry renders correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/111
2026-02-05 21:05:35 -08:00
3b4ff91469 Fix homepage Admin bookmark icons (#110)
## Summary
- Fix broken Pulumi icon: changed `pulumi` to `si-pulumi` (Simple Icons prefix required)
- Fix broken ArgoCD icon: changed `argocd` to `argo-cd` (Dashboard Icons uses hyphenated name)

## Deployment and Testing
- [ ] Sync homepage app in ArgoCD
- [ ] Verify icons appear on go.ops.eblu.me Admin bookmarks section

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/110
2026-02-05 06:29:39 -08:00
ab1386b015 Fix zk-docs
2026-02-04 17:31:21 -08:00
Forgejo Actions
808bc507d8 Update docs release to v1.3.4
- Built changelog from towncrier fragments

[skip ci]
2026-02-05 01:22:10 +00:00
3da455e49c Enforce unique doc filenames and simple wiki-links (#109) v1.3.4
## Summary
- Rename section index files to match their titles (tutorials.md, reference.md, how-to.md, explanation.md) so all filenames are unique
- Convert all ~47 path-based wiki-links to simple filename format across 15 files
- Update doc-filenames task to no longer skip index.md files
- Update doc-links task to reject path-based links containing '/'

This ensures all wiki-links work correctly in obsidian.nvim by making links resolvable by filename alone.
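
As a rough illustration of the two checks described above, the sketch below flags path-based wiki-links (any target containing `/`) and duplicate filenames. It is an assumption about how such checks could work, not the real `doc-links` / `doc-filenames` implementation; the helper names are hypothetical.

```python
import re
from collections import Counter

# Captures the target of [[target]], [[target|label]] or [[target#heading]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")


def path_based_links(text: str) -> list[str]:
    """Return wiki-link targets that contain '/', i.e. are not simple filenames."""
    return [target.strip() for target in WIKI_LINK.findall(text) if "/" in target]


def duplicate_filenames(paths: list[str]) -> list[str]:
    """Return filenames that occur more than once across the given paths."""
    counts = Counter(path.rsplit("/", 1)[-1] for path in paths)
    return sorted(name for name, count in counts.items() if count > 1)
```

With both checks passing, every link target is a bare filename and every filename is unique, which is what lets an editor resolve a link by filename alone.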

## Testing
- `mise run doc-filenames` - all unique
- `mise run doc-links` - no broken or path-based links
- `mise run doc-titles` - no duplicates

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/109
2026-02-04 17:21:34 -08:00
Forgejo Actions
a03a9faaad Update docs release to v1.3.3
- Built changelog from towncrier fragments

[skip ci]
2026-02-04 22:40:18 +00:00
7aa0e60b27 Add how-to guide for restarting indri (#108) v1.3.3
## Summary
- Add `docs/how-to/restart-indri.md` with safe shutdown and startup procedures
- Add `docs/reference/services/automounter.md` documenting the SMB share automounter app
- Update indri reference card with GUI applications section
- Update how-to and reference indexes

## Test plan
- [ ] Review restart-indri.md for accuracy
- [ ] Verify AutoMounter details are correct
- [ ] Perform the actual restart using this guide

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/108
2026-02-04 14:39:48 -08:00
74bd5abe54 Add IaC for Forgejo Actions secrets via Ansible (#107)
## Summary
- New `forgejo_actions_secrets` Ansible role syncs repository-level Actions secrets from 1Password to Forgejo via the Forgejo API
- Replaces manual process of copying secrets from 1Password to Forgejo UI
- Documents the one-time PAT setup requirement in forgejo.md

## Manual Setup Required
Before this role can run, a Forgejo PAT must be created:
1. Go to https://forge.ops.eblu.me/user/settings/applications
2. Create a new token with `write:repository` scope
3. Store it in 1Password → "Forgejo Secrets" item → `api-token` field

This has already been done.

## Test Plan
- [x] Ran `mise run provision-indri -- --tags forgejo_actions_secrets` successfully
- [x] Verified secret synced (API returned 204 = updated existing)
- [x] Ansible-lint passes
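
For readers unfamiliar with the sync, here is a hedged Python sketch of the kind of request the role might issue. It is not the Ansible role itself. The endpoint path, the `{"data": ...}` body, the `token` authorization scheme, and the meaning of 201 are assumptions based on the Forgejo/Gitea REST API as commonly documented; only "204 = updated existing" comes from the test plan above. All function names and the example values are hypothetical.

```python
import json
import urllib.request


def build_secret_request(base_url: str, owner: str, repo: str,
                         name: str, value: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request that creates or updates one Actions secret."""
    # Assumed endpoint shape; verify against the Forgejo API docs for your version.
    url = f"{base_url}/api/v1/repos/{owner}/{repo}/actions/secrets/{name}"
    body = json.dumps({"data": value}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Authorization": f"token {token}",  # PAT created in the manual setup step
            "Content-Type": "application/json",
        },
    )


def describe_status(status: int) -> str:
    """Translate the response status into a human-readable outcome."""
    # 204 is confirmed by the test plan; 201 for "created" is an assumption.
    outcomes = {201: "created", 204: "updated existing"}
    return outcomes.get(status, f"unexpected ({status})")
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` and feeding the response status to `describe_status`; that step is left out so the sketch runs without network access.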

🤖 Generated with [Claude Code](https://claude.ai/code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/107
2026-02-04 09:11:01 -08:00