diff --git a/docs/README.md b/docs/README.md
index d7c425e..81c9e5a 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -111,15 +111,17 @@ Task-oriented instructions for specific operations.
 
 **How-to URL:** https://docs.ops.eblu.me/how-to/
 
-### Phase 5: Explanation
+### Phase 5: Explanation (Complete)
 Understanding-oriented discussion of concepts and decisions.
 
-- [ ] Create `explanation/` directory
-- [ ] "Why GitOps?" - Philosophy and approach
-- [ ] "Architecture Overview" - How everything fits together
-- [ ] "Security Model" - Tailscale, secrets management, etc.
-- [ ] "Decision Log" - ADRs (Architecture Decision Records)
-- [ ] Update `exploring-the-docs` with Explanation section
+- [x] Create `explanation/` directory
+- [x] "Why GitOps?" - Philosophy and approach
+- [x] "Architecture Overview" - How everything fits together
+- [x] "Security Model" - Tailscale, secrets management, etc.
+- [ ] "Decision Log" - ADRs (Architecture Decision Records) - deferred
+- [x] Update `exploring-the-docs` with Explanation section
+
+**Explanation URL:** https://docs.ops.eblu.me/explanation/
 
 ### Phase 6: Integration & Cleanup
 - [ ] Migrate remaining useful content from `docs/zk/`
diff --git a/docs/changelog.d/phase5-explanation.doc.md b/docs/changelog.d/phase5-explanation.doc.md
new file mode 100644
index 0000000..6b4d3cb
--- /dev/null
+++ b/docs/changelog.d/phase5-explanation.doc.md
@@ -0,0 +1 @@
+Add Phase 5 explanation docs: why GitOps, architecture overview, and security model
diff --git a/docs/explanation/architecture.md b/docs/explanation/architecture.md
new file mode 100644
index 0000000..74593f2
--- /dev/null
+++ b/docs/explanation/architecture.md
@@ -0,0 +1,149 @@
+---
+title: architecture
+tags:
+  - explanation
+  - architecture
+---
+
+# Architecture Overview
+
+> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
+
+How all the BlumeOps pieces fit together.
+
+## Physical Layer
+
+Two always-on devices form the infrastructure backbone:
+
+```
+┌─────────────────┐     ┌─────────────────┐
+│     Indri       │     │     Sifaka      │
+│   Mac Mini M1   │────▶│  Synology NAS   │
+│   (compute)     │     │   (storage)     │
+└─────────────────┘     └─────────────────┘
+         │
+         │ Tailscale
+         ▼
+┌─────────────────┐
+│    Gilbert      │
+│  MacBook Air    │
+│  (workstation)  │
+└─────────────────┘
+```
+
+- **[[indri]]** runs all services (native and containerized)
+- **[[sifaka]]** provides bulk storage and backup targets
+- **[[gilbert]]** is the development workstation
+
+## Network Layer
+
+[[tailscale]] provides the network fabric:
+
+- All devices on tailnet `tail8d86e.ts.net`
+- ACLs control access between devices and services
+- MagicDNS provides `*.tail8d86e.ts.net` hostnames
+- No port forwarding or public IPs needed
+
+## Service Routing
+
+Two DNS domains route to services:
+
+| Domain | Mechanism | Reachable from |
+|--------|-----------|----------------|
+| `*.ops.eblu.me` | Caddy reverse proxy on indri | Everywhere (k8s pods, containers, tailnet) |
+| `*.tail8d86e.ts.net` | Tailscale MagicDNS | Tailnet clients only |
+
+See [[routing]] for details on when to use which.
+
+## Compute Layer
+
+Services run in two places:
+
+### Native on Indri (Ansible)
+
+Some services run directly on macOS:
+- [[forgejo]] - Git forge (needs filesystem access)
+- [[zot]] - Container registry (k8s depends on it)
+- [[jellyfin]] - Media server (needs VideoToolbox hardware transcoding)
+- [[borgmatic]] - Backups (needs host filesystem access)
+
+Managed via Ansible roles in `ansible/roles/`.
+
+### Kubernetes (ArgoCD)
+
+Most services run in minikube on indri:
+- [[grafana]], [[prometheus]], [[loki]] - Observability
+- [[miniflux]], [[navidrome]], [[kiwix]] - Applications
+- [[postgresql]] - Shared database (CloudNativePG)
+
+Managed via ArgoCD from `argocd/manifests/`.
+
+## Data Flow
+
+```
+┌──────────────┐
+│   Git Repo   │
+│  (Forgejo)   │
+└──────┬───────┘
+       │ push
+       ▼
+┌──────────────┐     ┌──────────────┐
+│   ArgoCD     │────▶│  Kubernetes  │
+│  (watches)   │sync │   (runs)     │
+└──────────────┘     └──────────────┘
+                            │
+       ┌────────────────────┼────────────────────┐
+       ▼                    ▼                    ▼
+┌──────────────┐     ┌──────────────┐     ┌──────────────┐
+│   Service    │     │   Service    │     │   Service    │
+└──────────────┘     └──────────────┘     └──────────────┘
+```
+
+1. Code pushed to [[forgejo]]
+2. [[argocd]] detects changes (or manual sync triggered)
+3. ArgoCD applies manifests to cluster
+4. Services start/update in Kubernetes
+
+## Observability
+
+```
+┌─────────────┐     ┌─────────────┐     ┌─────────────┐
+│    Alloy    │────▶│ Prometheus  │────▶│   Grafana   │
+│ (collector) │     │  (metrics)  │     │ (dashboards)│
+└─────────────┘     └─────────────┘     └─────────────┘
+       │                                       ▲
+       │            ┌─────────────┐            │
+       └───────────▶│    Loki     │────────────┘
+                    │   (logs)    │
+                    └─────────────┘
+```
+
+[[alloy]] runs in two places:
+- On indri: collects host metrics and logs
+- In k8s: collects pod logs and service probes
+
+See [[observability]] for details.
+
+## Secrets Flow
+
+```
+┌─────────────┐     ┌─────────────┐     ┌─────────────┐
+│  1Password  │────▶│  1Password  │────▶│  External   │
+│   (vault)   │     │   Connect   │     │   Secrets   │
+└─────────────┘     └─────────────┘     └─────────────┘
+                                               │
+                                               ▼
+                                        ┌─────────────┐
+                                        │ K8s Secret  │
+                                        └─────────────┘
+```
+
+Secrets live in 1Password and flow to Kubernetes via [[external-secrets]].
+
+For Ansible, secrets are fetched via `op` CLI in playbook pre_tasks.
+
+## Related
+
+- [[why-gitops]] - Philosophy behind this approach
+- [[security-model]] - Access control and secrets
+- [[routing]] - Service routing details
diff --git a/docs/explanation/index.md b/docs/explanation/index.md
new file mode 100644
index 0000000..4690290
--- /dev/null
+++ b/docs/explanation/index.md
@@ -0,0 +1,22 @@
+---
+title: explanation
+tags:
+  - explanation
+---
+
+# Explanation
+
+Understanding-oriented content explaining the "why" behind BlumeOps design decisions.
+
+## Philosophy
+
+| Article | Description |
+|---------|-------------|
+| [[why-gitops]] | Why infrastructure-as-code and GitOps for a homelab |
+
+## Design
+
+| Article | Description |
+|---------|-------------|
+| [[architecture]] | How all the pieces fit together |
+| [[security-model]] | Network security, secrets, and access control |
diff --git a/docs/explanation/security-model.md b/docs/explanation/security-model.md
new file mode 100644
index 0000000..e819218
--- /dev/null
+++ b/docs/explanation/security-model.md
@@ -0,0 +1,139 @@
+---
+title: security-model
+tags:
+  - explanation
+  - security
+---
+
+# Security Model
+
+> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
+
+How BlumeOps handles network security, secrets, and access control.
+
+## Network Security: Tailscale
+
+The foundational security decision is using [[tailscale]] as the network layer.
+
+### Zero Trust Networking
+
+BlumeOps has no public IP addresses or port forwarding. All services are only accessible via Tailscale:
+
+- **No attack surface** from the public internet
+- **Encrypted by default** - WireGuard encryption for all traffic
+- **Identity-based access** - ACLs based on user/device identity, not IP addresses
+
+### Defense in Depth
+
+Even within the tailnet, access is restricted:
+
+```
+Internet ──X──▶ Services (no public access)
+
+Tailnet:
+  Admin ────────▶ All services
+  Member ───────▶ User-facing services only
+  Homelab tag ──▶ NAS (for backups)
+```
+
+See [[tailscale]] for the full ACL matrix.
+
+## Secrets Management
+
+Secrets follow a hierarchy:
+
+### Source of Truth: 1Password
+
+All secrets originate in 1Password's `blumeops` vault:
+- API keys, tokens, passwords
+- SSH keys and certificates
+- OAuth credentials
+
+### Kubernetes: External Secrets Operator
+
+[[external-secrets]] syncs secrets from 1Password to Kubernetes:
+
+```
+1Password ──▶ 1Password Connect ──▶ ExternalSecret ──▶ K8s Secret
+```
+
+Services reference native Kubernetes Secrets; they don't know about 1Password.
+
+### Ansible: op CLI
+
+Ansible playbooks fetch secrets at runtime via `op` CLI:
+
+```yaml
+- name: Fetch secret
+  command: op item get --fields password --reveal
+  delegate_to: localhost
+```
+
+Secrets are held in memory as Ansible facts, never written to disk.
+
+### Git Repository
+
+The repository is public. Secrets must never be committed:
+- `.gitignore` excludes sensitive patterns
+- Pre-commit hooks scan for potential secrets (TruffleHog)
+- All config files use references to secrets, not values
+
+## Access Control Philosophy
+
+### Principle of Least Privilege
+
+Services and devices get minimum necessary access:
+
+| Entity | Access |
+|--------|--------|
+| Admin users | Everything |
+| Member users | User-facing services only |
+| Homelab servers | Only what they need (NAS for backups) |
+| K8s pods | No Tailscale access (use Caddy proxy) |
+
+### Tagged Devices vs User Devices
+
+Important Tailscale concept:
+- **User devices** (like gilbert) have user identity and inherit user ACLs
+- **Tagged devices** (like indri with `tag:homelab`) lose user identity
+
+Don't tag user devices - it breaks user-based access rules.
+
+## Authentication Patterns
+
+### Service-to-Service
+
+Internal services use:
+- Kubernetes service discovery (no auth needed within cluster)
+- Tailscale identity for cross-host communication
+
+### User-to-Service
+
+Users authenticate via:
+- Service-specific credentials (stored in 1Password)
+- Some services support Tailscale identity (future)
+
+### AI/Automation Access
+
+Claude Code and automation use:
+- SSH keys for git operations
+- ArgoCD tokens for deployments
+- 1Password CLI for secret retrieval (requires user approval)
+
+## What's Not Protected
+
+Honest assessment of security boundaries:
+
+- **Local network attacks** - If someone is on your home WiFi, they could potentially access the NAS directly
+- **Physical access** - No disk encryption on servers (trade-off for reliability)
+- **Supply chain** - Container images from upstream registries
+- **Operator error** - Misconfigured ACLs or leaked credentials
+
+The model assumes a trusted home network and focuses on protecting against internet-based attacks.
+
+## Related
+
+- [[tailscale]] - ACL configuration
+- [[1password]] - Secrets management
+- [[external-secrets]] - Kubernetes secrets
+- [[architecture]] - Overall system design
diff --git a/docs/explanation/why-gitops.md b/docs/explanation/why-gitops.md
new file mode 100644
index 0000000..6a98d24
--- /dev/null
+++ b/docs/explanation/why-gitops.md
@@ -0,0 +1,70 @@
+---
+title: why-gitops
+tags:
+  - explanation
+  - philosophy
+---
+
+# Why GitOps?
+
+> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
+
+BlumeOps uses GitOps principles for managing personal infrastructure. This might seem like overkill for a homelab, but there are good reasons.
+
+## The Problem with Manual Infrastructure
+
+Traditional server management involves SSHing into machines and running commands. This works, but creates problems:
+
+- **Drift**: The actual state diverges from what you think it is
+- **Amnesia**: You forget what you changed and why
+- **Fragility**: One bad command can break things with no easy rollback
+- **Bus factor**: Only you know how it works (even AI assistants struggle without context)
+
+## Git as the Source of Truth
+
+GitOps inverts the model: instead of pushing changes to servers, you commit desired state to Git, and automation pulls it into reality.
+
+**Benefits:**
+- Every change is tracked with commit history
+- Pull requests enable review before deployment
+- Rollback is just `git revert`
+- The repo *is* the documentation
+
+## Why This Matters for a Homelab
+
+A personal homelab isn't a production environment, but it shares the same challenges:
+
+1. **Memory is unreliable** - Six months from now, you won't remember why you configured Caddy that way
+2. **Experimentation is constant** - You try things, break things, want to undo things
+3. **AI assistance needs context** - Claude can help much more effectively when it can read your infrastructure as code
+
+## The BlumeOps Approach
+
+BlumeOps uses layered GitOps:
+
+| Layer | Tool | What it manages |
+|-------|------|-----------------|
+| **Tailnet** | [[reference/infrastructure/tailscale|Pulumi]] | ACLs, tags, DNS |
+| **Host config** | [[reference/ansible/roles|Ansible]] | Services on [[indri]] |
+| **Kubernetes** | [[argocd|ArgoCD]] | Containerized workloads |
+
+Each layer has its own reconciliation loop:
+- Pulumi applies on `mise run tailnet-up`
+- Ansible applies on `mise run provision-indri`
+- ArgoCD watches Git and syncs manually or automatically
+
+## Trade-offs
+
+GitOps isn't free:
+
+- **Learning curve** - You need to understand Ansible, ArgoCD, Pulumi
+- **Indirection** - Can't just `apt install` something; need to add it to config
+- **Complexity** - More moving parts than a simple server
+
+But for BlumeOps, the trade-off is worth it. The infrastructure is complex enough that managing it imperatively would be error-prone, and the GitOps approach enables effective AI-assisted operations.
+
+## Related
+
+- [[architecture]] - How the pieces fit together
+- [[argocd]] - Kubernetes GitOps
+- [[reference/ansible/roles|Ansible roles]] - Host configuration
diff --git a/docs/index.md b/docs/index.md
index ce01b1f..9b66ba2 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -9,7 +9,9 @@ Welcome to the BlumeOps documentation.
 ## Sections
 
 - [[tutorials/index | Tutorials]] - Learning-oriented guides for getting started
-- [[reference/index | Reference]] - Technical reference cards for services, infrastructure, and operations
+- [[reference/index | Reference]] - Technical specifications and service details
+- [[how-to/index | How-to]] - Task-oriented instructions for common operations
+- [[explanation/index | Explanation]] - Understanding the "why" behind BlumeOps
 
 ## About
 
diff --git a/docs/tutorials/exploring-the-docs.md b/docs/tutorials/exploring-the-docs.md
index b0e49b9..113c26b 100644
--- a/docs/tutorials/exploring-the-docs.md
+++ b/docs/tutorials/exploring-the-docs.md
@@ -20,7 +20,7 @@ The docs follow the [Diataxis](https://diataxis.fr/) framework:
 | **[[tutorials/index | Tutorials]]** | Learning-oriented | "I'm new and want to understand" |
 | **[[reference/index | Reference]]** | Information-oriented | "I need specific technical details" |
 | **[[how-to/index | How-to]]** | Task-oriented | "I need to do X" |
-| **Explanation** (planned) | Understanding-oriented | "I want to understand why" |
+| **[[explanation/index | Explanation]]** | Understanding-oriented | "I want to understand why" |
 
 ## Quick Paths by Audience
 
@@ -42,9 +42,9 @@ Context for effective assistance:
 ### For External Readers
 
 Understanding what this is:
+- [[explanation/index|Explanation]] covers the "why" behind design decisions
 - [[reference/index|Reference]] shows what's actually running
 - Browse service pages to see specific implementations
-- The repo's README has project context
 
 ### For Contributors
 
@@ -58,6 +58,7 @@ Getting started with changes:
 Replicators are people who want to build their own similar homelab GitOps setup, using BlumeOps as inspiration.
 
 - [[replicating-blumeops]] provides the overview
+- [[explanation/index|Explanation]] covers architecture and design rationale
 - The `replication/` tutorials go deep on components
 - Reference pages show specific configuration choices