Commit graph

6 commits

Author SHA1 Message Date
fedfdb1228 Remove Alpine's default SSH jails for fail2ban
Alpine ships alpine-ssh.conf with sshd and sshd-ddos jails enabled.
These fail on startup because there's no SSH server or /var/log/messages
in the container. Remove the file after install instead of trying to
override via [DEFAULT] (per-jail enabled=true beats DEFAULT).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 08:34:21 -08:00
80c33ccaf1 Add forge.eblu.me to Fly.io proxy with rate limiting and fail2ban
nginx configuration:
- forge.eblu.me server block with WebSocket support, 512m body limit
- Rate limit login/signup/forgot-password at 3r/s per real client IP
  (keyed on Fly-Client-IP header, not Fly's internal remote_addr)
- Static asset caching (7d), no blanket caching for dynamic content
- Security headers (HSTS, X-Frame-Options, X-Content-Type-Options)
- Block /swagger (API docs only available via tailnet)
- X-Real-IP set to real client IP for Forgejo audit logs
- geo-based deny list for fail2ban integration

fail2ban configuration:
- Custom filter matching 401/403 on login paths in nginx JSON log
- Ban after 5 failures in 10 minutes, ban duration 1 hour
- Custom nginx-deny action: writes IPs to deny file and reloads nginx
  (iptables won't work in Fly.io — remote_addr is Fly's proxy IP)
- Ban lists ephemeral across deploys (nginx rate limiting is persistent)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 07:52:58 -08:00
cb9a06bb75 Update tooling dependencies (Feb 2026 cycle) (#254)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m30s
## Summary

Monthly tooling dependency update cycle:

- **Pre-commit hooks**: trufflehog v3.92.5→v3.93.4, ruff v0.14.13→v0.15.2, shellcheck v0.10.0.1→v0.11.0.1, prettier v3.8.0→v3.8.1, actionlint v1.7.10→v1.7.11
- **Fly.io Dockerfile**: pin nginx to 1.28.2-alpine (was unpinned), bump alloy v1.5.1→v1.13.1
- **Mise tasks**: normalize httpx lower bound to >=0.28.0 and typer to >=0.15.0 across all scripts
- **Forgejo workflows**: actions/checkout@v4 is current, no changes needed
- **New how-to doc**: [[update-tooling-dependencies]] documenting this monthly cycle

## No changes needed

- pre-commit-hooks v6.0.0, yamllint v1.38.0, shfmt v3.12.0-2, taplo v0.9.3, ansible-lint 26.1.1 — all already at latest

## Test plan

- [x] `uvx pre-commit run --all-files` — all 24 hooks pass
- [ ] Fly.io deploy (triggered automatically on merge to main via deploy-fly workflow)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/254
2026-02-23 13:08:41 -08:00
4ee643a81d Serve friendly error page when Fly.io proxy upstreams are unreachable (#133)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m50s
## Summary
- Adds a branded 503 error page served when upstreams are unreachable (indri offline, Tailscale tunnel down, emergency shutoff, etc.)
- Stale cache is still served first when available (`proxy_cache_use_stale` takes priority)
- Test endpoint at `docs.eblu.me/_error` to preview the page without killing upstreams
- `proxy_intercept_errors on` also catches error responses returned by the upstream itself

## Files Changed
- `fly/error.html` — Self-contained error page (dark theme, links to BlumeOps repo)
- `fly/nginx.conf` — `error_page`, `internal` location, `/_error` test location, `proxy_intercept_errors`
- `fly/Dockerfile` — COPY error.html into image

## Test Plan
- [ ] Deploy to Fly.io
- [ ] Visit `docs.eblu.me/_error` to verify the page renders
- [ ] Optionally stop indri/Tailscale to confirm the page shows on real 502/503/504

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/133
2026-02-09 12:01:24 -08:00
cc54b4f565 Add Fly.io proxy observability via embedded Alloy (#123)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m16s
## Summary

- Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
- Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting)
- Create two Grafana dashboards: **Docs APM** (per-service view filtered by `host="docs.eblu.me"`) and **Fly.io Proxy Health** (aggregate proxy health across all upstream services)

## Changed Files

| File | Change |
|------|--------|
| `fly/nginx.conf` | Add JSON `log_format` + `access_log`, add `stub_status` endpoint |
| `fly/Dockerfile` | COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config |
| `fly/alloy.river` | **New** — Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Start Alloy after Tailscale, before nginx |
| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` | **New** — Docs APM dashboard |
| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` | **New** — Fly.io Proxy Health dashboard |
| `argocd/manifests/grafana-config/kustomization.yaml` | Register new dashboard configmaps |
| `docs/reference/services/flyio-proxy.md` | Document observability setup |

## Deployment and Testing

- [ ] `mise run fly-deploy` — rebuild container with Alloy
- [ ] `curl https://docs.eblu.me/` — generate traffic
- [ ] `fly logs -a blumeops-proxy` — verify Alloy startup
- [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}`
- [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}`
- [ ] `argocd app sync grafana-config` — deploy dashboards
- [ ] Verify dashboards show data in Grafana
- [ ] `mise run services-check` — no regressions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123
2026-02-08 10:05:38 -08:00
64a78422b1 Add Fly.io public reverse proxy for docs.eblu.me (#120)
Some checks failed
Deploy Fly.io Proxy / deploy (push) Failing after 9s
## Summary

- Adds a Fly.io reverse proxy (`blumeops-proxy`) that tunnels public traffic to homelab services over Tailscale
- First service exposed: `docs.eblu.me` — the Quartz static docs site
- Includes Pulumi IaC for Tailscale auth key/ACLs and Gandi DNS CNAME
- Adds mise tasks (`fly-deploy`, `fly-setup`, `fly-shutoff`) and Forgejo CI workflow

## Key details

- Fly.io Firecracker VMs support TUN devices natively — no userspace networking needed
- Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts
- nginx caches aggressively for the static site; health check is on the default_server block
- ACLs restrict `tag:flyio-proxy` to `tag:k8s` on port 443 only
- DNS CNAME deployed and verified: `docs.eblu.me` → `blumeops-proxy.fly.dev`

## Test plan

- [x] `curl -sf https://blumeops-proxy.fly.dev/healthz` returns `ok`
- [x] `curl -I -H "Host: docs.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 with `X-Cache-Status`
- [x] `curl -I https://docs.eblu.me/` returns 200 with valid Let's Encrypt cert
- [x] `dig forge.ops.eblu.me` still resolves to 100.98.163.89 (private services unaffected)
- [x] Set `FLY_DEPLOY_TOKEN` Forgejo Actions secret for CI auto-deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/120
2026-02-08 02:36:19 -08:00