diff --git a/docs/changelog.d/plan-backlog-to-plans.doc.md b/docs/changelog.d/plan-backlog-to-plans.doc.md new file mode 100644 index 0000000..d1cda08 --- /dev/null +++ b/docs/changelog.d/plan-backlog-to-plans.doc.md @@ -0,0 +1 @@ +Add plan documents for OIDC provider adoption, zot registry hardening, and expanded network segmentation details. diff --git a/docs/how-to/how-to.md b/docs/how-to/how-to.md index a81e205..c9b3230 100644 --- a/docs/how-to/how-to.md +++ b/docs/how-to/how-to.md @@ -56,3 +56,7 @@ Migration and transition plans for upcoming infrastructure changes. | [[add-unifi-pulumi-stack]] | Add Pulumi IaC for UniFi Express 7 | | [[adopt-dagger-ci]] | Adopt Dagger as CI/CD build engine | | [[upstream-fork-strategy]] | Stacked-branch forking strategy for upstream projects | +| [[adopt-oidc-provider]] | Deploy OIDC identity provider for SSO across services | +| [[harden-zot-registry]] | Add authentication and tag immutability to zot registry | +| [[forgejo-actions-dashboard]] | Grafana dashboard for Forgejo Actions CI metrics | +| [[operationalize-reolink-camera]] | Cloud-free NVR with Frigate and ring buffer recording | diff --git a/docs/how-to/plans/add-unifi-pulumi-stack.md b/docs/how-to/plans/add-unifi-pulumi-stack.md index 75746af..d7f4730 100644 --- a/docs/how-to/plans/add-unifi-pulumi-stack.md +++ b/docs/how-to/plans/add-unifi-pulumi-stack.md @@ -75,17 +75,44 @@ The `pulumiverse_unifi` PyPI package bridges from `paultyng/terraform-provider-u Once the stack is operational, we plan to configure these network zones: -| Network | VLAN | Subnet | Purpose | -|---------|------|--------|---------| -| Default LAN | 1 | `192.168.1.0/24` | Main network (indri, gilbert, ringtail, sifaka) | -| Guest | TBD | `192.168.2.0/24` | Guest WiFi, internet-only | -| IoT | TBD | `192.168.3.0/24` | Smart devices, isolated from LAN | +| Network | VLAN | Subnet | Purpose | Devices | +|---------|------|--------|---------|---------| +| BlumeOps Services | TBD | `192.168.10.0/24` 
| Infrastructure and services | indri, sifaka, k8s pods | +| User Devices | 1 | `192.168.1.0/24` | Trusted personal devices | gilbert, ringtail | +| Guest | TBD | `192.168.2.0/24` | Guest WiFi, internet-only | Visitors | +| IoT / Appliances | TBD | `192.168.3.0/24` | Smart devices, isolated | Frame TV, dishwasher, etc. | -Zone-based firewall rules will enforce: +### Motivation: NFS Share Exposure -- Guest → Internet only (no LAN, no IoT) -- IoT → Internet + limited LAN access (e.g., mDNS) -- LAN → full access +The immediate security driver for segmentation is NFS. Currently, sifaka's NFS exports (`/volume1/torrents`, `/volume1/music`, `/volume1/photos`) whitelist `192.168.1.0/24` and `100.64.0.0/10` (Docker NAT). This means **any device on the WiFi** — including IoT appliances, guest devices, or a compromised smart TV — can mount and write to these shares. + +After segmentation, NFS exports will be restricted to the BlumeOps Services subnet (`192.168.10.0/24`) and the Docker NAT range (`100.64.0.0/10`). Only indri, sifaka, and k8s pods will have NFS access. 
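As a rough illustration of the restriction described above, the following sketch uses Python's standard `ipaddress` module to check which client addresses would fall inside the planned export allow-list. The subnets are taken from the plan; the `can_mount` helper and the sample client addresses are hypothetical, and nothing here talks to sifaka.

```python
import ipaddress

# Planned post-segmentation allow-list for sifaka's NFS exports
# (subnets copied from the plan; this is only a local membership check).
ALLOWED_AFTER = [
    ipaddress.ip_network("192.168.10.0/24"),  # BlumeOps Services subnet
    ipaddress.ip_network("100.64.0.0/10"),    # second range listed in the plan
]


def can_mount(client_ip: str, allowed=ALLOWED_AFTER) -> bool:
    """Return True if client_ip falls inside any allowed export subnet."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in allowed)


# Hypothetical example addresses, one per zone.
print(can_mount("192.168.10.5"))  # host on the services subnet
print(can_mount("192.168.3.20"))  # host on the IoT subnet
print(can_mount("192.168.1.42"))  # host on the user-device subnet
```

A check like this could be run against a device inventory before changing the export rules, to confirm that no host that still needs NFS would be locked out.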
+ +### Zone-Based Firewall Rules + +| Source | Destination | Policy | +|--------|-------------|--------| +| BlumeOps Services | Internet | Allow | +| BlumeOps Services | User Devices | Allow (for management, e.g., SSH from ringtail) | +| User Devices | BlumeOps Services | Allow (trusted users need access to services) | +| User Devices | Internet | Allow | +| Guest | Internet | Allow | +| Guest | All other zones | **Block** | +| IoT / Appliances | Internet | Allow | +| IoT / Appliances | User Devices | **Block** (except mDNS for AirPlay/casting) | +| IoT / Appliances | BlumeOps Services | **Allow specific ports** (Jellyfin, Navidrome for streaming) | + +### NFS Export Changes + +After the network migration, update sifaka's NFS export rules: + +| Share | Before | After | +|-------|--------|-------| +| `/volume1/torrents` | `192.168.1.0/24`, `100.64.0.0/10` | `192.168.10.0/24`, `100.64.0.0/10` | +| `/volume1/music` | `192.168.1.0/24`, `100.64.0.0/10` | `192.168.10.0/24`, `100.64.0.0/10` | +| `/volume1/photos` | `192.168.1.0/24`, `100.64.0.0/10` | `192.168.10.0/24`, `100.64.0.0/10` | + +This is a manual change in the Synology DSM NFS settings (not managed by Pulumi — sifaka's NFS config is outside the UniFi provider's scope). The k8s PersistentVolume definitions (`argocd/manifests/*/pv-nfs.yaml`) resolve sifaka by hostname and don't need subnet changes. These will be declared after the initial import is stable. 
diff --git a/docs/how-to/plans/adopt-oidc-provider.md b/docs/how-to/plans/adopt-oidc-provider.md new file mode 100644 index 0000000..e49dfba --- /dev/null +++ b/docs/how-to/plans/adopt-oidc-provider.md @@ -0,0 +1,207 @@ +--- +title: "Plan: Adopt OIDC Identity Provider" +tags: + - how-to + - plans + - security + - oidc +--- + +# Plan: Adopt OIDC Identity Provider + +> **Status:** Planning (design sketch — not yet ready to execute) + +## Background + +BlumeOps services currently handle authentication independently — ArgoCD has its own admin password, Grafana has its own login, Forgejo has local accounts, and zot has no auth at all. There is no single sign-on, no centralized user management, and no way to issue scoped API keys or service tokens from a shared identity. + +Adding an OpenID Connect (OIDC) identity provider gives BlumeOps a central authentication layer. Services delegate login to the IdP, and the IdP issues tokens that carry identity and group claims. This unlocks: + +- **SSO across services** — one login for Grafana, ArgoCD, Forgejo, zot, and future services +- **API keys derived from identity** — zot's API key feature requires OIDC; CI service accounts get scoped, expirable tokens tied to a real identity +- **Group-based authorization** — services can make access decisions based on IdP group claims rather than per-service user lists +- **Audit trail** — authentication events flow through one system + +### Goals + +- Deploy a lightweight OIDC provider on the BlumeOps infrastructure +- Configure at least one service (zot) as a relying party to validate the setup +- Establish patterns for adding future OIDC clients (Grafana, ArgoCD, Forgejo) +- Keep complexity appropriate for a single-user homelab + +## Provider Comparison + +| Provider | Language | Resources | UI | OIDC Maturity | Zot Integration | Notes | +|----------|----------|-----------|-----|---------------|-----------------|-------| +| **Dex** | Go | ~20-50MB RAM | None (config-driven) | Mature, 
purpose-built | Explicitly documented in zot examples | CNCF Sandbox; `staticPasswords` connector for single-user | +| **Authentik** | Python | ~200-300MB RAM, needs PostgreSQL + Redis | Full web UI, visual flow builder | Mature | [Proven community guide](https://integrations.goauthentik.io/infrastructure/zot/) | Best for small teams; heavier than needed for one user | +| **Authelia** | Go | ~30MB RAM | None (YAML config) | Maturing (OIDC provider still on roadmap) | [Unresolved integration issues](https://github.com/authelia/authelia/discussions/7615) | Primarily a forward-auth proxy; OIDC is secondary | +| **Keycloak** | Java | ~500MB+ RAM | Enterprise admin console | Battle-tested | Works via generic OIDC | Massive overkill for homelab | + +### Recommendation: Investigate Dex First + +Dex is the strongest candidate for BlumeOps: + +- **Lightest footprint** — single Go binary, no database dependencies (in-memory or SQLite storage) +- **Designed for exactly this** — Dex is an OIDC provider that federates identity; it's not a full IAM suite bolted onto other things +- **Zot uses Dex in its own examples** — lowest integration risk +- **`staticPasswords` connector** — define the single `eblume` user directly in YAML config, no external user store needed +- **Future flexibility** — if SSO via GitHub or Google is ever wanted, add a connector without changing the architecture +- **CNCF project** — actively maintained, well-documented + +The main trade-off is no web UI for user management — but for a single-user setup, that's a non-issue. Config changes go through the normal PR workflow. + +If Dex proves insufficient during execution (e.g., missing features for a specific service integration), Authentik is the fallback — heavier but more capable. 
+
+## Architecture
+
+```
+                  Caddy (TLS termination)
+                       |
+        +--------------+--------------+
+        |              |              |
+   Browser SSO      CLI / CI     k8s services
+        |              |              |
+        v              v              v
+  Dex (OIDC IdP)    API Keys     OIDC tokens
+  issuer:           (generated   (validated by
+  dex.ops.eblu.me   after OIDC   each service)
+        |           login)
+        v
+  staticPasswords
+  connector (eblume)
+```
+
+### Deployment Options
+
+Dex can run as:
+
+1. **k8s pod** (via ArgoCD): follows the pattern of other BlumeOps services, gets automatic restarts, lives alongside its consumers
+2. **Native on indri** (via Ansible/LaunchAgent): follows the zot/Forgejo pattern, simpler networking
+
+The k8s option is preferred since most OIDC consumers (Grafana, ArgoCD) are already in k8s. Evaluate during execution.
+
+### Endpoints
+
+| Endpoint | URL | Purpose |
+|----------|-----|---------|
+| Issuer | `https://dex.ops.eblu.me` | OIDC discovery (`/.well-known/openid-configuration`) |
+| Auth | `https://dex.ops.eblu.me/auth` | Browser login redirect |
+| Token | `https://dex.ops.eblu.me/token` | Token exchange |
+| Callback | Per-client (e.g., `https://registry.ops.eblu.me/zot/auth/callback/oidc`) | OAuth2 redirect URI |
+
+## Dex Configuration Sketch
+
+```yaml
+issuer: https://dex.ops.eblu.me
+
+storage:
+  type: sqlite3
+  config:
+    file: /var/dex/dex.db
+
+web:
+  http: 0.0.0.0:5556
+
+# staticPasswords is served by Dex's built-in password database, which is
+# switched on with enablePasswordDB rather than declared under `connectors:`.
+enablePasswordDB: true
+
+staticPasswords:
+  - email: eblume@eblume.net
+    hash: "" # generated at deploy time
+    username: eblume
+    userID: ""
+
+staticClients:
+  - id: zot-registry
+    name: Zot Registry
+    secret: ""
+    redirectURIs:
+      - https://registry.ops.eblu.me/zot/auth/callback/oidc
+
+  # Future clients:
+  # - id: grafana
+  #   ...
+  # - id: argocd
+  #   ...
+  # - id: forgejo
+  #   ...
+```
+
+Secrets (static password hash, client secrets) are stored in 1Password and injected at deploy time; they are never committed to the repo.
+
+## Planned OIDC Clients
+
+Initial rollout targets zot only. 
Future services to integrate: + +| Service | OIDC Support | Priority | Notes | +|---------|-------------|----------|-------| +| **Zot** | Native (`openid.providers.oidc`) | First (validates IdP) | See [[harden-zot-registry]] | +| **Grafana** | Native (`auth.generic_oauth`) | High | Currently uses default admin password | +| **ArgoCD** | Native (`oidc.config` in `argocd-cm`) | High | Currently uses local admin password | +| **Forgejo** | Native (OAuth2 provider in admin settings) | Medium | Currently uses local accounts | + +## Execution Steps + +1. **Choose deployment method** (k8s vs native) and set up the service + - If k8s: create `argocd/manifests/dex/` with Deployment, Service, ConfigMap + - If native: create `ansible/roles/dex/` following the zot pattern + - Add Caddy reverse proxy entry for `dex.ops.eblu.me` + +2. **Configure Dex** + - Generate static password hash and client secrets + - Store all secrets in 1Password + - Deploy initial config with `staticPasswords` connector and zot as the first client + +3. **Verify OIDC discovery** + - `curl https://dex.ops.eblu.me/.well-known/openid-configuration` returns valid JSON + - Issuer URL matches config + +4. **Integrate first client (zot)** + - This is covered by [[harden-zot-registry]] — configure zot's `openid.providers.oidc` to point at Dex + - Test browser login → API key generation → CLI push flow + +5. 
**Documentation** + - Create `docs/reference/services/dex.md` reference card + - Update service indexes + - Add changelog fragment + +## Verification Checklist + +- [ ] Dex is running and healthy +- [ ] OIDC discovery endpoint returns valid configuration +- [ ] Browser login flow works (redirect → Dex login → redirect back) +- [ ] At least one client (zot) successfully authenticates via Dex +- [ ] Caddy proxies `dex.ops.eblu.me` correctly +- [ ] `mise run services-check` passes (if health check is added) + +## Open Questions + +- **Service dependency and recovery:** If Dex runs in k8s and k8s goes down, services that depend on Dex for authentication may become inaccessible — potentially including tools needed to bring k8s back up. This circular dependency **must be resolved** before execution. Options include: running Dex natively on indri (outside k8s), ensuring all critical recovery paths have break-glass credentials that bypass OIDC, or designing the system so that OIDC is additive (services fall back to local auth when the IdP is unreachable). This needs its own design pass during implementation planning. +- **Dex vs Authentik:** Dex is the starting recommendation, but evaluate during execution. If multiple services need dynamic user management or a web UI for client registration, Authentik may be worth the extra weight. +- **Storage backend:** SQLite is simplest for single-node. If Dex runs in k8s, it needs a PersistentVolume or could use the k8s CRD storage backend instead. +- **Tailscale ACL interaction:** Should the Dex endpoint be tailnet-only, or accessible from the public internet (for potential external SSO)? Start with tailnet-only. +- **Token lifetime and refresh:** Dex defaults are reasonable, but may need tuning for long-running CI jobs. 
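The discovery check in step 3 of the execution steps could also be scripted rather than eyeballed. The sketch below is one possible shape for that check, assuming the response body has already been fetched and parsed; `check_discovery` is a hypothetical helper, the field list is only a subset of the metadata defined by OpenID Connect Discovery, and the sample document (including its `jwks_uri`) is made up for illustration.

```python
import json

# A subset of the metadata fields defined by OpenID Connect Discovery 1.0;
# a stricter check could require more of them.
REQUIRED_FIELDS = (
    "issuer",
    "authorization_endpoint",
    "token_endpoint",
    "jwks_uri",
)


def check_discovery(doc: dict, expected_issuer: str) -> list[str]:
    """Return a list of problems found; an empty list means the doc looks sane."""
    problems = [
        f"missing field: {name}" for name in REQUIRED_FIELDS if name not in doc
    ]
    # The issuer must match the configured value exactly, because relying
    # parties compare it as a plain string.
    if doc.get("issuer") != expected_issuer:
        problems.append(f"issuer mismatch: {doc.get('issuer')!r}")
    return problems


# Hypothetical response body standing in for the output of the curl command.
sample = json.loads("""
{
  "issuer": "https://dex.ops.eblu.me",
  "authorization_endpoint": "https://dex.ops.eblu.me/auth",
  "token_endpoint": "https://dex.ops.eblu.me/token",
  "jwks_uri": "https://dex.ops.eblu.me/keys"
}
""")
print(check_discovery(sample, "https://dex.ops.eblu.me"))
```

Returning a list of problems instead of raising on the first one keeps the check usable as a checklist item: every mismatch shows up in a single run.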
+ +## Future Considerations + +- **Additional connectors** — add GitHub or Google as upstream identity sources for SSO convenience +- **Group claims** — define groups in Dex config (e.g., `admin`, `ci`) and use them for authorization across services +- **Mutual TLS** — Dex supports mTLS for service-to-service token exchange, which could harden the CI credential path + +## Reference Pattern Files + +| File | Purpose | +|------|---------| +| `argocd/manifests/grafana-config/` | Example k8s service with ConfigMap-based config | +| `ansible/roles/zot/` | Example native service deployment pattern | +| `pulumi/tailscale/` | Example of secrets injection from 1Password | + +## Related + +- [[harden-zot-registry]] — first OIDC client (execute after this plan) +- [[zot]] — container registry reference +- [[cluster]] — k8s cluster (potential Dex host) +- [[indri]] — native service host (alternative Dex host) diff --git a/docs/how-to/plans/forgejo-actions-dashboard.md b/docs/how-to/plans/forgejo-actions-dashboard.md new file mode 100644 index 0000000..00d6c7d --- /dev/null +++ b/docs/how-to/plans/forgejo-actions-dashboard.md @@ -0,0 +1,198 @@ +--- +title: "Plan: Forgejo Actions Dashboard" +tags: + - how-to + - plans + - forgejo + - monitoring + - grafana +--- + +# Plan: Forgejo Actions Dashboard + +> **Status:** Planned (not yet executed) + +## Background + +BlumeOps CI/CD runs on Forgejo Actions. There is currently no visibility into CI health — no metrics on job success/failure rates, durations, queue depth, or runner status. When a build fails silently or takes longer than expected, the only way to notice is to check the Forgejo web UI manually. 
+ +### Goals + +- **Grafana dashboard** showing CI health at a glance: recent runs, pass/fail rates, durations, queue depth +- **Prometheus metrics** for Forgejo Actions data, following the established textfile exporter pattern +- **Alerting foundation** — once metrics exist, alerts can be added later (e.g., "no successful build in 24h") + +## Current State + +### What Forgejo Exposes + +**Built-in `/metrics` endpoint:** No Actions data. The Prometheus endpoint (currently disabled in `app.ini`) only exposes platform-level counters (`gitea_repositories`, `gitea_issues`, etc.). There is an [open feature request](https://codeberg.org/forgejo/forgejo/issues/4803) to add Actions metrics, but it is not yet implemented. + +**API (v11+):** Rich Actions data is available via REST API. BlumeOps runs Forgejo v14.0.2, so all endpoints are available: + +| Endpoint | Data | +|----------|------| +| `GET /api/v1/repos/{owner}/{repo}/actions/runs` | Workflow runs: status, duration, timestamps, workflow ID, event, commit SHA | +| `GET /api/v1/repos/{owner}/{repo}/actions/tasks` | Tasks: status, timestamps, workflow ID, run number | +| `GET /api/v1/admin/actions/runners/jobs` | Global job search: status, runner labels, dependencies | +| `GET /api/v1/repos/{owner}/{repo}/actions/runners/jobs` | Per-repo job search | + +### Existing Metrics Pattern + +Custom exporters on indri follow a consistent pattern: + +1. **Bash script** polls a local API and writes `.prom` files +2. **LaunchAgent** runs the script on a schedule (e.g., every 60s) +3. **node_exporter textfile collector** picks up `.prom` files from `/opt/homebrew/var/node_exporter/textfile/` +4. **Alloy** scrapes node_exporter and remote-writes to Prometheus +5. 
**Grafana dashboard** in a ConfigMap auto-discovered by the sidecar + +Examples: `ansible/roles/zot_metrics/`, `ansible/roles/borgmatic_metrics/`, `ansible/roles/jellyfin_metrics/` + +### Grafana Dashboard Pattern + +Dashboards are stored as ConfigMaps in `argocd/manifests/grafana-config/dashboards/` with label `grafana_dashboard: "1"`. The Grafana sidecar auto-discovers and provisions them. See `configmap-zot.yaml` or `configmap-services.yaml` for examples. + +## Plan + +### 1. Create `forgejo_actions_metrics` Ansible Role + +A new role following the established pattern: + +``` +ansible/roles/forgejo_actions_metrics/ +├── defaults/main.yml # API URL, token var, output dir, repos list +├── tasks/main.yml # Deploy script + LaunchAgent +└── templates/ + ├── forgejo-actions-metrics.sh.j2 # Collection script + └── forgejo-actions-metrics.plist.j2 # LaunchAgent +``` + +**The collection script** polls the Forgejo API and writes Prometheus-format metrics: + +``` +# HELP forgejo_actions_runs_total Total workflow runs by status +# TYPE forgejo_actions_runs_total gauge +forgejo_actions_runs_total{repo="blumeops",status="success"} 42 +forgejo_actions_runs_total{repo="blumeops",status="failure"} 3 +forgejo_actions_runs_total{repo="blumeops",status="running"} 1 + +# HELP forgejo_actions_run_duration_seconds Duration of recent workflow runs +# TYPE forgejo_actions_run_duration_seconds gauge +forgejo_actions_run_duration_seconds{repo="blumeops",workflow="build-blumeops",status="success"} 127 + +# HELP forgejo_actions_jobs_waiting Number of jobs waiting in queue +# TYPE forgejo_actions_jobs_waiting gauge +forgejo_actions_jobs_waiting 0 + +# HELP forgejo_actions_jobs_running Number of jobs currently running +# TYPE forgejo_actions_jobs_running gauge +forgejo_actions_jobs_running 1 + +# HELP forgejo_actions_last_success_timestamp_seconds Unix timestamp of last successful run +# TYPE forgejo_actions_last_success_timestamp_seconds gauge 
+forgejo_actions_last_success_timestamp_seconds{repo="blumeops",workflow="build-blumeops"} 1707600000 + +# HELP forgejo_actions_up Forgejo Actions API is reachable +# TYPE forgejo_actions_up gauge +forgejo_actions_up 1 +``` + +**Metrics to expose** (refine during implementation): + +| Metric | Type | Labels | Source | +|--------|------|--------|--------| +| `forgejo_actions_up` | gauge | — | API reachability check | +| `forgejo_actions_runs_total` | gauge | `repo`, `status` | `/actions/runs` filtered by status | +| `forgejo_actions_run_duration_seconds` | gauge | `repo`, `workflow`, `status` | Most recent run per workflow | +| `forgejo_actions_jobs_waiting` | gauge | — | `/actions/runners/jobs` filtered by status | +| `forgejo_actions_jobs_running` | gauge | — | `/actions/runners/jobs` filtered by status | +| `forgejo_actions_last_success_timestamp_seconds` | gauge | `repo`, `workflow` | Most recent successful run timestamp | +| `forgejo_actions_last_run_status` | gauge | `repo`, `workflow` | 1=success, 0=failure (last run per workflow) | + +**Authentication:** The script needs a Forgejo API token. The existing `_forgejo_api_token` pattern from the playbook's `pre_tasks` can be reused, or a dedicated read-only token can be created and stored in 1Password. + +**Repos to monitor:** Start with `eblume/blumeops` (the only repo with active workflows). The role should accept a list of repos so more can be added later. + +**Collection interval:** 60 seconds (same as zot_metrics, jellyfin_metrics). + +### 2. 
Create Grafana Dashboard ConfigMap + +Add `argocd/manifests/grafana-config/dashboards/configmap-forgejo-actions.yaml` with a dashboard showing: + +- **Overview row:** jobs running, jobs waiting, last build status +- **Success/failure trend:** runs by status over time +- **Duration trend:** run duration over time, per workflow +- **Staleness:** time since last successful build per workflow +- **Table:** recent runs with status, duration, commit + +The specific dashboard layout will be designed during implementation — this plan focuses on the data pipeline. + +### 3. Wire Into Ansible Playbook + +Add the new role to `ansible/playbooks/indri.yml` alongside the other metrics roles: + +```yaml +- role: forgejo_actions_metrics + tags: forgejo_actions_metrics +``` + +No changes needed to Alloy — it already picks up all `.prom` files from the textfile directory. + +## Execution Steps + +1. **Create the Ansible role** (`ansible/roles/forgejo_actions_metrics/`) + - Write collection script that queries the Forgejo API + - Write LaunchAgent plist + - Add to `indri.yml` playbook + +2. **Create or reuse API token** + - Check if existing Forgejo API token has sufficient permissions + - If not, create a dedicated read-only token and store in 1Password + +3. **Deploy and verify metrics collection** + - `mise run provision-indri -- --tags forgejo_actions_metrics` + - Verify `.prom` file appears in textfile directory + - Verify metrics appear in Prometheus: `curl 'https://prometheus.ops.eblu.me/api/v1/query?query=forgejo_actions_up'` + +4. **Create Grafana dashboard ConfigMap** + - Build dashboard JSON (can use Grafana UI, then export) + - Wrap in ConfigMap with `grafana_dashboard: "1"` label + - Sync via ArgoCD + +5. 
**Update documentation** + - Add changelog fragment + - Update `docs/reference/services/forgejo.md` if it exists, or note in the plan's reference card + +## Verification Checklist + +- [ ] Collection script runs without errors on indri +- [ ] `.prom` file in `/opt/homebrew/var/node_exporter/textfile/` has expected metrics +- [ ] Metrics queryable in Prometheus +- [ ] Grafana dashboard loads and shows data +- [ ] LaunchAgent survives indri restart +- [ ] `mise run services-check` passes + +## Open Questions + +- **Scope of repos:** Start with `eblume/blumeops` only, or also monitor mirrored repos that have workflows? +- **Historical depth:** How far back should the script query? The API paginates — querying the last N runs (e.g., 50) per repo is likely sufficient rather than scanning all history. +- **Runner health:** The Forgejo API does not expose a runner list endpoint. Runner health could be inferred (if jobs stay in "waiting" too long, the runner is likely down), but direct runner metrics aren't available without querying the Forgejo database directly. 
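To make the "last N runs" idea concrete, here is a minimal sketch of how a batch of runs might be turned into textfile-collector output. The plan calls for a Bash script, so this Python version only illustrates the data shaping. The `render_metrics` function and the `status`, `workflow`, and `stopped` keys are assumptions for illustration, and the real Forgejo API field names may differ.

```python
from collections import Counter


def render_metrics(repo: str, runs: list[dict]) -> str:
    """Render two of the planned metrics in Prometheus text format.

    `runs` is assumed to be a list of dicts with `status`, `workflow`, and
    `stopped` (Unix seconds) keys; these names are illustrative only.
    """
    lines = [
        "# HELP forgejo_actions_runs_total Total workflow runs by status",
        "# TYPE forgejo_actions_runs_total gauge",
    ]
    # Count runs per status within the sampled window.
    by_status = Counter(run["status"] for run in runs)
    for status, count in sorted(by_status.items()):
        lines.append(
            f'forgejo_actions_runs_total{{repo="{repo}",status="{status}"}} {count}'
        )

    # Track the latest successful completion time per workflow.
    last_success: dict[str, int] = {}
    for run in runs:
        if run["status"] == "success":
            wf = run["workflow"]
            last_success[wf] = max(last_success.get(wf, 0), run["stopped"])
    lines += [
        "# HELP forgejo_actions_last_success_timestamp_seconds "
        "Unix timestamp of last successful run",
        "# TYPE forgejo_actions_last_success_timestamp_seconds gauge",
    ]
    for wf, ts in sorted(last_success.items()):
        lines.append(
            "forgejo_actions_last_success_timestamp_seconds"
            f'{{repo="{repo}",workflow="{wf}"}} {ts}'
        )
    return "\n".join(lines) + "\n"


# Hypothetical sample data standing in for the last N runs of one repo.
sample_runs = [
    {"status": "success", "workflow": "build-blumeops", "stopped": 1707600000},
    {"status": "failure", "workflow": "build-blumeops", "stopped": 1707603600},
    {"status": "success", "workflow": "build-blumeops", "stopped": 1707500000},
]
print(render_metrics("blumeops", sample_runs), end="")
```

Because the counts only cover the sampled window, they behave like gauges (they can go down as old runs age out), which matches the `gauge` type used in the plan's example output.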
+ +## Reference Pattern Files + +| File | Purpose | +|------|---------| +| `ansible/roles/zot_metrics/` | Textfile exporter role pattern (simplest example) | +| `ansible/roles/borgmatic_metrics/` | More complex exporter with multiple metrics | +| `ansible/roles/jellyfin_metrics/` | Exporter with API key authentication | +| `argocd/manifests/grafana-config/dashboards/configmap-zot.yaml` | Dashboard ConfigMap pattern | +| `argocd/manifests/grafana-config/dashboards/configmap-services.yaml` | Multi-panel dashboard example | +| `ansible/roles/forgejo/templates/app.ini.j2` | Forgejo configuration | +| `ansible/roles/alloy/templates/config.alloy.j2` | Alloy config (textfile collector) | + +## Related + +- [[forgejo]] — Forgejo service reference +- [[cluster]] — Grafana and Prometheus run here +- [[grafana]] — Dashboard host diff --git a/docs/how-to/plans/harden-zot-registry.md b/docs/how-to/plans/harden-zot-registry.md new file mode 100644 index 0000000..a8f9618 --- /dev/null +++ b/docs/how-to/plans/harden-zot-registry.md @@ -0,0 +1,210 @@ +--- +title: "Plan: Harden Zot Registry" +tags: + - how-to + - plans + - zot + - registry + - security +--- + +# Plan: Harden Zot Registry + +> **Status:** Planned (not yet executed) +> **Sequence:** Execute after [[adopt-dagger-ci]] and [[adopt-oidc-provider]] — the Dagger migration will change how images are built and pushed, and the OIDC provider supplies the identity layer that zot's auth and API key features depend on. + +## Background + +Zot is the BlumeOps OCI container registry, running natively on [[indri]]. It serves two roles: a pull-through cache for upstream registries (Docker Hub, GHCR, Quay) and the private image store for `blumeops/*` images. + +Currently, zot has **no authentication** — the security boundary is the Tailscale ACL. This was an acceptable starting point, but has two gaps: + +1. 
**Any tailnet client can push images** — there's no distinction between pull (which k8s pods need) and push (which only CI should do). A compromised service or misconfigured pod could overwrite production images. +2. **Tags are mutable** — pushing the same tag twice silently overwrites the previous image. There's no protection against accidental or malicious tag clobbering. + +### Goals + +- **Authenticated push** — only CI (Forgejo Actions / Dagger) can push images; all other clients are pull-only +- **Tag immutability** — once a version tag is pushed, it cannot be overwritten +- **No disruption to pulls** — k8s pods and pull-through caching continue to work without authentication +- **Minimal complexity** — use zot's built-in OIDC and API key features with the BlumeOps identity provider + +## Current State + +### Push Mechanism + +Images are currently pushed via the composite action at `.forgejo/actions/build-push-image/action.yaml`: + +1. `docker buildx build` creates the image +2. `docker save` exports to a tarball +3. `skopeo copy` pushes to `registry.ops.eblu.me` (no credentials needed) + +The action pushes two tags per build: a version tag (e.g., `v1.2.0`) and the git commit SHA. + +### Zot Configuration + +The config template (`ansible/roles/zot/templates/config.json.j2`) has no `accessControl` or `http.auth` section. The HTTP listener binds to `0.0.0.0:5050` with no TLS (Caddy terminates TLS at `registry.ops.eblu.me`). + +## Plan + +### 1. Add Authentication for Push (OIDC + API Keys) + +Zot supports native OIDC authentication with a built-in API key feature designed for exactly this use case. The approach: + +1. **OIDC for browser login** — zot delegates authentication to the BlumeOps OIDC provider (see [[adopt-oidc-provider]]). Human users log in via browser redirect. +2. **API keys for CI** — after logging in via OIDC, generate a scoped API key for Forgejo CI / Dagger. 
API keys are zot-native tokens (`zak_...`) that work with `docker login`, `skopeo`, and Dagger's `with_registry_auth()`. They can be scoped to specific repositories and given expiration dates. +3. **Access control** — `anonymousPolicy` allows unauthenticated pull; push requires authentication. + +```json +{ + "http": { + "auth": { + "openid": { + "providers": { + "oidc": { + "name": "BlumeOps", + "credentialsFile": "/Users/erichblume/.config/zot/oidc-credentials.json", + "issuer": "https://dex.ops.eblu.me", + "scopes": ["openid", "profile", "email"] + } + } + }, + "apikey": true + }, + "accessControl": { + "repositories": { + "**": { + "anonymousPolicy": ["read"], + "defaultPolicy": ["read", "create", "update"], + "policies": [ + { + "users": ["eblume"], + "actions": ["read", "create", "update", "delete"] + } + ] + } + }, + "adminPolicy": { + "users": ["eblume"], + "actions": ["read", "create", "update", "delete"] + } + } + } +} +``` + +The OIDC credentials file (client ID and secret) is deployed by Ansible from 1Password — never committed to the repo. + +**CI push flow after setup:** +1. Log in to zot UI via browser (OIDC redirect to Dex) +2. Generate an API key: `POST /zot/auth/apikey` with label `forgejo-ci`, scoped to `blumeops/**` +3. Store the key in 1Password (`op://blumeops/zot-ci-apikey/credential`) +4. CI uses the key: `docker login -u eblume -p zak_... registry.ops.eblu.me` + +This ensures: +- k8s pods, minikube containerd, and pull-through caching all continue to work anonymously (read-only) +- Push requires a valid API key tied to an OIDC identity +- No standalone password files (htpasswd) to manage — identity flows from the central IdP + +### 2. Enforce Tag Immutability + +Zot does not have a built-in tag immutability feature at the registry level. Options to consider during execution: + +- **Registry-side:** Check if newer zot versions (post-2.1) have added immutability policies. If so, configure in `config.json`. 
+- **Push-side enforcement:** The simpler approach is to check whether a tag already exists before pushing. The current build-push-image action (and its eventual Dagger replacement) should query the registry API (`GET /v2/<name>/tags/list`) and **fail the build** if the version tag already exists. Commit SHA tags are inherently unique and don't need this check.
+
+The push-side approach is pragmatic: it prevents accidental overwrites in the normal CI flow. Combined with authenticated push, a tag can only be overwritten by someone with CI credentials who deliberately bypasses the check.
+
+> **See:** `.forgejo/actions/build-push-image/action.yaml`, which is where the pre-push tag check would be added in the current workflow. After [[adopt-dagger-ci]], the equivalent check goes in the Dagger `Container.publish()` wrapper.
+
+### 3. Update Ansible Role
+
+The `ansible/roles/zot/` role needs:
+
+- **New template:** `oidc-credentials.json.j2` (client ID and secret for the Dex OIDC client)
+- **Updated config template:** `config.json.j2` gains `http.auth` (openid + apikey) and `accessControl` sections
+- **Updated config template:** `config.json.j2` gains `externalUrl` set to `https://registry.ops.eblu.me` (required for OIDC callback redirects behind Caddy)
+- **New variables:** `zot_oidc_client_id` and `zot_oidc_client_secret` sourced from 1Password in the playbook's `pre_tasks`
+- **Handler:** restart zot LaunchAgent after config changes (already exists)
+
+### 4. Update CI Push Credentials
+
+After [[adopt-dagger-ci]], the Dagger module will use the zot API key for registry auth:
+
+```python
+import os
+from dagger import dag
+
+api_key = dag.set_secret("registry-api-key", os.environ["ZOT_CI_API_KEY"])
+# with_registry_auth() returns a new Container, so the calls must be chained.
+await container.with_registry_auth(
+    "registry.ops.eblu.me", "eblume", api_key
+).publish("registry.ops.eblu.me/blumeops/image:tag")
+```
+
+### 5. Update Minikube Containerd Config
+
+The minikube containerd config (`ansible/roles/minikube/tasks/main.yml`) currently talks to zot without credentials. 
Since anonymous pull remains allowed, **no changes are needed** for containerd. + +## Execution Steps + +1. **Prerequisite: OIDC provider is running** (see [[adopt-oidc-provider]]) + - Dex (or chosen provider) is deployed and serving `https://dex.ops.eblu.me` + - A zot OIDC client is registered with the provider + +2. **Update Ansible role** + - Add OIDC credentials template + - Update `config.json.j2` with auth (openid + apikey) and access control + - Store OIDC client credentials in 1Password + - Test with `mise run provision-indri -- --tags zot --check --diff` + +3. **Deploy and verify pulls still work** + - `mise run provision-indri -- --tags zot` + - Verify anonymous pull: `curl -sf https://registry.ops.eblu.me/v2/_catalog` + - Verify unauthenticated push fails: `skopeo copy ... docker://registry.ops.eblu.me/blumeops/test:fail` (should get 401) + +4. **Set up OIDC login and generate CI API key** + - Log in to zot UI via browser (OIDC flow through Dex) + - Generate an API key for CI use, store in 1Password + - Verify authenticated push works: `docker login -u eblume -p zak_... registry.ops.eblu.me` + +5. **Add tag immutability check to push workflow** + - Add pre-push tag existence check to Dagger module (or build-push-image action) + - Test by attempting to push an existing tag + +6. 
**Update documentation** + - Update `docs/reference/services/zot.md` security model section + - Add changelog fragment + +## Verification Checklist + +- [ ] Anonymous pull works (k8s pods, containerd, curl) +- [ ] Pull-through caching still works (pull an uncached image from docker.io) +- [ ] Unauthenticated push is rejected (401) +- [ ] OIDC browser login works (redirect to Dex and back) +- [ ] API key generation works from zot UI +- [ ] Authenticated push with API key succeeds +- [ ] Pushing a duplicate version tag fails (immutability check) +- [ ] Pushing a new commit SHA tag succeeds +- [ ] Grafana dashboard still shows zot metrics +- [ ] `mise run services-check` passes + +## Open Questions + +- **Immutability granularity:** Should immutability apply only to semver tags (`v*`) or also to commit SHA tags? SHA tags are unique by nature, so immutability is only meaningful for version tags. +- **API key rotation:** API keys can have expiration dates. Decide on a rotation policy — e.g., annual expiry with a reminder, or no expiry with manual rotation. 
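The push-side immutability check described in this plan could look roughly like the following. This is a minimal sketch, not the actual action or Dagger code: the function names, the `opener` parameter, and the rule that version tags start with `v` are all assumptions made for illustration. It relies only on the standard tags-list endpoint of the registry API, which returns a JSON object with a `tags` array.

```python
import json
import urllib.error
import urllib.request


def list_tags(registry: str, repo: str, opener=urllib.request.urlopen) -> list[str]:
    """Return the tags the registry reports for repo (empty if the repo is unknown)."""
    url = f"https://{registry}/v2/{repo}/tags/list"
    try:
        with opener(url, timeout=10) as resp:
            body = json.load(resp)
    except urllib.error.HTTPError as err:
        if err.code == 404:  # repository has never been pushed
            return []
        raise
    # Some registries report "tags": null for an empty repository.
    return body.get("tags") or []


def is_version_tag(tag: str) -> bool:
    """Assumption: version tags look like v1.2.3; commit SHA tags are exempt."""
    return tag.startswith("v")


def check_push_allowed(tag: str, existing: list[str]) -> None:
    """Exit with an error if pushing tag would overwrite an existing version tag."""
    if is_version_tag(tag) and tag in existing:
        raise SystemExit(f"refusing to overwrite existing tag {tag!r}")
```

In a CI step this would run before the push, for example `check_push_allowed(new_tag, list_tags("registry.ops.eblu.me", "blumeops/image"))`. Because anonymous pull stays enabled in this plan, the lookup itself should not need credentials.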
+ +## Reference Pattern Files + +| File | Purpose | +|------|---------| +| `ansible/roles/zot/templates/config.json.j2` | Current zot config (no auth) | +| `ansible/roles/zot/tasks/main.yml` | Zot deployment tasks | +| `ansible/roles/zot/defaults/main.yml` | Zot default variables | +| `.forgejo/actions/build-push-image/action.yaml` | Current image push workflow (skopeo) | +| `ansible/roles/minikube/tasks/main.yml` | Containerd registry mirror config | +| `docs/reference/services/zot.md` | Zot reference documentation | + +## Related + +- [[adopt-oidc-provider]] — OIDC identity provider (execute first) +- [[adopt-dagger-ci]] — CI/CD engine migration (execute first) +- [[zot]] — Zot reference card +- [[forgejo]] — CI platform that pushes images +- [[cluster]] — Registry consumer diff --git a/docs/how-to/plans/operationalize-reolink-camera.md b/docs/how-to/plans/operationalize-reolink-camera.md new file mode 100644 index 0000000..a235168 --- /dev/null +++ b/docs/how-to/plans/operationalize-reolink-camera.md @@ -0,0 +1,283 @@ +--- +title: "Plan: Operationalize ReoLink Camera" +tags: + - how-to + - plans + - security + - surveillance + - frigate +--- + +# Plan: Operationalize ReoLink Camera + +> **Status:** Planned (not yet executed) +> **Depends on:** [[add-unifi-pulumi-stack]] — the camera must be on the IoT VLAN, isolated from the rest of the network. + +## Background + +A ReoLink Elite Floodlight WiFi outdoor camera has been purchased. The goal is to operate it in a fully **cloud-free, privacy-first** configuration — no ReoLink cloud account, no Ring-style surveillance state participation. All video stays on local infrastructure. 
+ +### Goals + +- **NVR recording to sifaka** — continuous and event-based recording stored on the Synology NAS via NFS, not on any cloud service +- **No SD card** — the camera does not need one when recording externally; avoid relying on on-device storage +- **Cloud-free** — disable UID/P2P, block internet access at the firewall, operate as a pure LAN device +- **Object detection and alerting** — detect people, vehicles, animals and send notifications without relying on ReoLink's cloud AI features +- **Ring buffer retention** — automatic storage management so recordings don't fill the NAS +- **IoT VLAN isolation** — camera lives on the isolated IoT/Appliances network with only the required ports open to the services subnet + +## ReoLink Elite Floodlight WiFi + +### Capabilities + +| Feature | Details | +|---------|---------| +| Resolution | 4K/8MP (5120x1552 stitched dual-lens panoramic, 180°) | +| Codec | H.265 (HEVC) main stream, H.264 sub stream | +| Connectivity | WiFi 6 (802.11ax) dual-band | +| RTSP | Yes (disabled by default, enable in settings) — `rtsp://admin:@:554/Preview_01_main` | +| ONVIF | Yes (port 8000, disabled by default) | +| HTTP API | Yes — `https:///cgi-bin/api.cgi?cmd=&user=admin&password=` | +| Floodlight control | Via HTTP API (`SetWhiteLed`) — brightness, mode (off/smart/always/timer) | +| On-device AI | Person/vehicle/pet detection (runs locally on camera, fires ONVIF events) | + +### Cloud-Free Operation + +The camera operates fully without internet: + +1. **Disable UID (P2P):** Settings > Network > Advanced > Enable UID → Off +2. **Block internet at firewall:** IoT VLAN rule denies all outbound to WAN +3. **No ReoLink cloud account needed** — initial setup via app on local network, skip account prompts + +What works without internet: RTSP, ONVIF, HTTP API, on-device AI detection, floodlight control, live view. 
+ +What is lost: remote app access (use VPN/Tailscale instead), push notifications (use Frigate alerting), OTA firmware updates (manual firmware files instead). + +### SD Card: Not Required + +Confirmed: the camera streams RTSP and fires ONVIF events without an SD card. On-device recording/playback and local AI video search require an SD card, but both are unnecessary when using an external NVR. + +### Required Network Ports + +| Port | Protocol | Purpose | Who connects | +|------|----------|---------|-------------| +| 554 | TCP (RTSP) | Video streaming | Frigate (services subnet) | +| 443 | TCP (HTTPS) | API control | Home Assistant / scripts (services subnet) | +| 8000 | TCP (ONVIF) | Event subscriptions | Home Assistant (services subnet) | + +These ports need to be allowed from the BlumeOps Services subnet (`192.168.10.0/24`) to the camera's IP on the IoT VLAN (`192.168.3.0/24`). All other traffic to/from the camera is blocked. + +## Frigate NVR + +Frigate is the clear choice for homelab NVR — open-source, container-native, sophisticated retention, native Prometheus metrics, and first-class ReoLink support. + +### Architecture + +Frigate runs as a container in the k8s cluster on indri. It consumes the camera's RTSP streams via go2rtc (an embedded RTSP restreaming proxy that handles connection reliability), performs object detection on the sub stream, and writes recordings to sifaka via NFS. + +``` +ReoLink Camera (IoT VLAN) + │ + │ RTSP (port 554) + ▼ +Frigate (k8s pod on indri) + ├── go2rtc — RTSP restream proxy + ├── FFmpeg — stream decoding + ├── ONNX detector — object detection (CPU) + ├── /media/frigate — NFS mount to sifaka + └── /db — local SQLite (emptyDir or PV) + │ + ├──→ Prometheus (/api/metrics endpoint) + └──→ MQTT (detection events) +``` + +### Object Detection on M1 + +Indri is an Apple M1 Mac mini. Inside minikube's Linux VM, the Apple Neural Engine is not accessible. Detection options: + +- **ONNX (CPU):** Works on ARM64. 
For a single camera, M1's performance cores handle detection comfortably. This is the starting point. +- **Hailo-8L (future):** If more cameras are added, a USB-attached Hailo-8L accelerator (~$30) could be passed through to the VM. Evaluate only if CPU detection proves insufficient. + +### Recording and Retention (Ring Buffer) + +Frigate's retention system is the most sophisticated of any homelab NVR: + +```yaml +record: + enabled: true + retain: + days: 3 # Keep ALL continuous recordings for 3 days + mode: all + alerts: + retain: + days: 30 # Keep alert clips (person/vehicle) for 30 days + mode: active_objects + detections: + retain: + days: 14 # Keep all detection clips for 14 days + mode: motion +``` + +**Safety mechanism:** When less than 1 hour of storage remains, the oldest 2 hours of recordings are deleted automatically (checked every 5 minutes). + +**Recordings are written directly from the camera stream — no re-encoding.** This means zero CPU cost for recording; CPU is only used for detection on the sub stream. + +### Storage Estimates + +For a single 4K H.265 camera at moderate quality: + +| Strategy | Per Day | 30 Days | Notes | +|----------|---------|---------|-------| +| 24/7 continuous | ~80-130 GB | 2.4-3.9 TB | Upper bound | +| Motion-only | ~8-26 GB | 240-780 GB | Depends on scene activity | +| Detection-only (active objects) | ~2-13 GB | 60-390 GB | Most efficient | +| Hybrid: 3d continuous + 30d events | — | ~600 GB-1 TB | Recommended starting point | + +A dedicated 2 TB NFS share on sifaka gives comfortable headroom for the hybrid approach with one camera. 
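The per-day figures in the table above follow from simple bitrate arithmetic. The sketch below shows that calculation; the 8 and 12 Mbps values are assumed average main-stream bitrates chosen for illustration, not measurements from this camera, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def gb_per_day(bitrate_mbps: float) -> float:
    """Decimal gigabytes written per day by a stream at a constant average bitrate."""
    bits_per_day = bitrate_mbps * 1_000_000 * SECONDS_PER_DAY
    return bits_per_day / 8 / 1_000_000_000


def retained_gb(per_day_gb: float, days: int) -> float:
    """Steady-state storage needed to keep `days` worth of recordings."""
    return per_day_gb * days


if __name__ == "__main__":
    for mbps in (8, 12):  # assumed average bitrates, not measured values
        daily = gb_per_day(mbps)
        total_tb = retained_gb(daily, 30) / 1000
        print(f"{mbps} Mbps: {daily:.1f} GB/day, {total_tb:.2f} TB over 30 days")
```

Both assumed bitrates land inside the table's 24/7 continuous range. Once the camera is streaming, the real average bitrate reported by Frigate or the camera UI can replace the assumed values to refine the share size.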
+ +### NFS Storage Setup + +Mount an NFS share from sifaka into the Frigate pod: + +- **Recordings:** NFS PersistentVolume (e.g., `sifaka:/volume1/frigate`) mounted at `/media/frigate` +- **Database:** Local storage (emptyDir or a hostPath PV) mounted at `/db` — SQLite performs poorly over NFS + +This follows the same pattern as Navidrome (`argocd/manifests/navidrome/pv-nfs.yaml`) and Immich (`argocd/manifests/immich/pv-nfs.yaml`). + +**NFS export on sifaka:** Add `/volume1/frigate` with access restricted to the BlumeOps Services subnet (`192.168.10.0/24`) and Docker NAT range (`100.64.0.0/10`). + +### Prometheus Metrics + +Frigate exposes a native `/api/metrics` Prometheus endpoint with: + +- `frigate_cpu_usage_percent`, `frigate_mem_usage_percent` +- `frigate_camera_fps`, `frigate_detection_fps`, `frigate_process_fps` +- `frigate_skipped_fps`, `frigate_detection_enabled` + +A pre-built [Grafana dashboard](https://grafana.com/grafana/dashboards/18226-frigate/) exists. Add a Prometheus scrape target and a Grafana dashboard ConfigMap. + +### Alerting + +Options for detection-based notifications (no Home Assistant required): + +- **[frigate-notify](https://github.com/0x2142/frigate-notify):** Standalone notification service supporting Telegram, Ntfy, Pushover, Discord, webhooks, and more. Runs as a separate container, subscribes to Frigate's MQTT events. +- **MQTT events:** Frigate publishes to MQTT on every detection — can be consumed by any MQTT subscriber. +- **Home Assistant automations:** If HA is added later, full integration with notification channels. + +### ReoLink-Specific Configuration + +ReoLink cameras need go2rtc as an intermediary (direct RTSP from Frigate can drop connections). 
Frigate config sketch: + +```yaml +go2rtc: + streams: + front_floodlight: + - "ffmpeg:http://admin:password@192.168.3.X/flv?port=1935&app=bcs&stream=channel0_main.bcs#video=copy#audio=copy#audio=opus" + front_floodlight_sub: + - "ffmpeg:http://admin:password@192.168.3.X/flv?port=1935&app=bcs&stream=channel0_sub.bcs" + +cameras: + front_floodlight: + enabled: true + ffmpeg: + inputs: + - path: rtsp://127.0.0.1:8554/front_floodlight + input_args: preset-rtsp-restream + roles: [record] + - path: rtsp://127.0.0.1:8554/front_floodlight_sub + input_args: preset-rtsp-restream + roles: [detect] + detect: + enabled: true + width: 640 + height: 480 + objects: + track: [person, car, dog, cat] +``` + +Camera settings to apply: enable RTSP and ONVIF, set "fluency first" encoding mode, set interframe space to 1x. + +## Execution Steps + +1. **Prerequisite: Network segmentation** (see [[add-unifi-pulumi-stack]]) + - Camera on IoT VLAN (`192.168.3.0/24`) + - Firewall rules allowing ports 554, 443, 8000 from services subnet + +2. **Camera initial setup** + - Connect to WiFi (IoT SSID) + - Set static IP or DHCP reservation + - Enable RTSP, ONVIF in camera settings + - Disable UID/P2P + - Set admin password, store in 1Password + - Block internet access at firewall + +3. **Create NFS share on sifaka** + - Create `/volume1/frigate` shared folder in Synology DSM + - Set NFS permissions: `192.168.10.0/24` and `100.64.0.0/10` + +4. **Deploy Frigate to k8s** + - Create `argocd/manifests/frigate/` with Deployment, Service, ConfigMap, PV/PVC + - NFS PV for recordings, local storage for database + - Configure go2rtc + camera streams + - Start with CPU detection (ONNX) + +5. **Deploy MQTT broker** (if not already present) + - Frigate needs MQTT for event publishing + - Evaluate lightweight options: Mosquitto as a k8s pod + +6. **Set up alerting** + - Deploy frigate-notify (or equivalent) as a sidecar or separate pod + - Configure notification channel (Ntfy, Telegram, or similar) + +7. 
**Add Prometheus scrape target and Grafana dashboard** + - Add Frigate to `argocd/manifests/prometheus/configmap.yaml` + - Add `configmap-frigate.yaml` dashboard to `argocd/manifests/grafana-config/dashboards/` + +8. **Update documentation** + - Create reference card for the camera and Frigate + - Add changelog fragment + - Update sifaka NFS export documentation + +## Verification Checklist + +- [ ] Camera streams accessible via RTSP from services subnet +- [ ] Camera has no internet access (blocked at firewall) +- [ ] Frigate pod is running and showing live camera feed in web UI +- [ ] Recordings appearing in NFS share on sifaka +- [ ] Object detection working (person/vehicle detected in Frigate UI) +- [ ] Retention policy active (old recordings cleaned up automatically) +- [ ] Alerts firing on detection events +- [ ] Prometheus metrics visible in Grafana dashboard +- [ ] `mise run services-check` passes + +## Open Questions + +- **MQTT broker:** Is there an existing MQTT broker in the cluster, or does one need to be deployed? Mosquitto is lightweight and standard. +- **Home Assistant:** Frigate works standalone, but HA adds richer automation (e.g., turn on floodlight when person detected, arm/disarm by time of day). Evaluate whether to add HA as a future plan. +- **Sifaka NFS share sizing:** How much space to allocate on the NAS? Start with 2 TB and monitor. The hybrid retention strategy keeps this manageable. +- **Additional cameras:** If more cameras are added later, CPU detection may become a bottleneck. At that point, evaluate a Hailo-8L USB accelerator or a dedicated Frigate host (e.g., RPi5). +- **Floodlight automation:** The ReoLink HTTP API supports floodlight control. Could be automated to turn on when Frigate detects a person at night — but this requires either HA or a custom script listening to MQTT events. 
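For the floodlight automation question, the "custom script" route mostly comes down to a small decision function applied to each message on Frigate's `frigate/events` MQTT topic. The sketch below covers only that decision. The night-hours window and the function name are assumptions, and both the MQTT client wiring and the actual `SetWhiteLed` API call are deliberately left out.

```python
import json

# Assumption: 20:00 through 05:59 local time counts as night.
NIGHT_HOURS = set(range(0, 6)) | set(range(20, 24))


def should_turn_on_floodlight(
    payload: bytes, hour: int, camera: str = "front_floodlight"
) -> bool:
    """Return True for a newly started person event on the given camera at night."""
    event = json.loads(payload)
    after = event.get("after") or {}
    return (
        event.get("type") == "new"
        and after.get("camera") == camera
        and after.get("label") == "person"
        and hour in NIGHT_HOURS
    )
```

A subscriber would call this with each message body and the current local hour, then issue the floodlight request when it returns `True`. Reacting only to `new` events avoids re-triggering on every `update` message for the same tracked object.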
+ +## Future Considerations + +- **Home Assistant** — adds powerful automation for camera + floodlight + notifications +- **License plate recognition** — Frigate supports LPR with appropriate models +- **Multiple cameras** — the pattern scales; add more cameras to the same Frigate instance +- **Frigate+** ($50/yr) — improved detection models trained on community data, fewer false positives + +## Reference Pattern Files + +| File | Purpose | +|------|---------| +| `argocd/manifests/navidrome/pv-nfs.yaml` | NFS PersistentVolume pattern | +| `argocd/manifests/immich/pv-nfs.yaml` | NFS PV with ReadWriteMany | +| `argocd/manifests/grafana-config/dashboards/configmap-zot.yaml` | Grafana dashboard ConfigMap pattern | +| `argocd/manifests/prometheus/configmap.yaml` | Prometheus scrape target config | +| `docs/reference/storage/sifaka.md` | NFS export documentation | + +## Related + +- [[add-unifi-pulumi-stack]] — network segmentation (IoT VLAN for camera) +- [[sifaka]] — NAS storage for recordings +- [[cluster]] — k8s cluster hosting Frigate +- [[grafana]] — monitoring dashboards diff --git a/docs/how-to/plans/plans.md b/docs/how-to/plans/plans.md index cff2c38..1e34bab 100644 --- a/docs/how-to/plans/plans.md +++ b/docs/how-to/plans/plans.md @@ -17,3 +17,7 @@ Plans differ from regular how-to guides in that they describe work that has been | [[add-unifi-pulumi-stack]] | Planned | Add Pulumi IaC for UniFi Express 7 home network | | [[adopt-dagger-ci]] | Planned | Adopt Dagger as CI/CD build engine, migrate docs artifacts to Forgejo packages | | [[upstream-fork-strategy]] | Planned | Stacked-branch forking strategy for tracking upstream projects | +| [[adopt-oidc-provider]] | Planning | Deploy OIDC identity provider for SSO across services | +| [[harden-zot-registry]] | Planned | Add authentication and tag immutability to zot registry | +| [[forgejo-actions-dashboard]] | Planned | Grafana dashboard and custom Prometheus exporter for Forgejo Actions CI metrics | +| 
[[operationalize-reolink-camera]] | Planned | Cloud-free NVR with Frigate, object detection, and ring buffer recording to sifaka |