Add Phase 5: explanation documentation (#96)

## Summary
- Create `docs/explanation/` directory with index and three explanation articles
- why-gitops: Philosophy of GitOps for homelabs (memory, rollback, AI context)
- architecture: How pieces fit together (ASCII diagrams of hosts, data flow, secrets)
- security-model: Tailscale zero-trust, 1Password secrets, access control philosophy
- Update docs/index.md with How-to and Explanation section links
- Update exploring-the-docs to link Explanation section

Decision log deferred to future work.

## Deployment and Testing
- [x] Pre-commit hooks pass (including doc-links validator)
- [ ] Build and deploy to docs.ops.eblu.me to verify rendering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/96
Erich Blume 2026-02-03 20:33:39 -08:00
commit 0a28622751
8 changed files with 396 additions and 10 deletions


@@ -0,0 +1,149 @@
---
title: architecture
tags:
- explanation
- architecture
---
# Architecture Overview
> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
How all the BlumeOps pieces fit together.
## Physical Layer
Two always-on devices form the infrastructure backbone:
```
┌─────────────────┐     ┌─────────────────┐
│      Indri      │     │     Sifaka      │
│   Mac Mini M1   │────▶│  Synology NAS   │
│    (compute)    │     │    (storage)    │
└─────────────────┘     └─────────────────┘
         │ Tailscale
┌─────────────────┐
│     Gilbert     │
│   MacBook Air   │
│  (workstation)  │
└─────────────────┘
```
- **[[indri]]** runs all services (native and containerized)
- **[[sifaka]]** provides bulk storage and backup targets
- **[[gilbert]]** is the development workstation
## Network Layer
[[tailscale]] provides the network fabric:
- All devices on tailnet `tail8d86e.ts.net`
- ACLs control access between devices and services
- MagicDNS provides `*.tail8d86e.ts.net` hostnames
- No port forwarding or public IPs needed
## Service Routing
Two DNS domains route to services:
| Domain | Mechanism | Reachable from |
|--------|-----------|----------------|
| `*.ops.eblu.me` | Caddy reverse proxy on indri | All internal clients (k8s pods, containers, tailnet) |
| `*.tail8d86e.ts.net` | Tailscale MagicDNS | Tailnet clients only |
See [[routing]] for details on when to use which.
## Compute Layer
Services run in two places:
### Native on Indri (Ansible)
Some services run directly on macOS:
- [[forgejo]] - Git forge (needs filesystem access)
- [[zot]] - Container registry (k8s depends on it)
- [[jellyfin]] - Media server (needs VideoToolbox hardware transcoding)
- [[borgmatic]] - Backups (needs host filesystem access)
Managed via Ansible roles in `ansible/roles/`.
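As a rough illustration of how roles like these are typically applied, here is a minimal playbook sketch. The role names and the `indri` inventory host are assumptions that simply mirror the service names listed above; the actual layout under `ansible/roles/` may differ.

```yaml
# Illustrative sketch only: role names are assumed to match the services above.
- name: Configure native services on indri
  hosts: indri            # assumed inventory host name
  roles:
    - forgejo             # Git forge
    - zot                 # Container registry
    - jellyfin            # Media server
    - borgmatic           # Backups
```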
### Kubernetes (ArgoCD)
Most services run in minikube on indri:
- [[grafana]], [[prometheus]], [[loki]] - Observability
- [[miniflux]], [[navidrome]], [[kiwix]] - Applications
- [[postgresql]] - Shared database (CloudNativePG)
Managed via ArgoCD from `argocd/manifests/`.
## Data Flow
```
┌──────────────┐
│   Git Repo   │
│  (Forgejo)   │
└──────┬───────┘
       │ push
┌──────────────┐     ┌──────────────┐
│    ArgoCD    │────▶│  Kubernetes  │
│  (watches)   │sync │    (runs)    │
└──────────────┘     └──────────────┘
       ┌────────────────────┼────────────────────┐
       ▼                    ▼                    ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Service    │     │   Service    │     │   Service    │
└──────────────┘     └──────────────┘     └──────────────┘
```
1. Code pushed to [[forgejo]]
2. [[argocd]] detects changes (or manual sync triggered)
3. ArgoCD applies manifests to cluster
4. Services start/update in Kubernetes
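The steps above hinge on an ArgoCD `Application` resource that tells ArgoCD which repository path to watch and where to apply it. The following is a minimal sketch, not the actual BlumeOps manifest: the application name, clone URL, branch, subdirectory, and namespaces are all assumptions made for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: miniflux                  # hypothetical application name
  namespace: argocd               # assumes ArgoCD runs in its default namespace
spec:
  project: default
  source:
    repoURL: https://forge.ops.eblu.me/eblume/blumeops.git   # assumed clone URL
    targetRevision: main          # assumed branch name
    path: argocd/manifests/miniflux   # assumed subdirectory layout
  destination:
    server: https://kubernetes.default.svc   # the cluster ArgoCD itself runs in
    namespace: miniflux           # hypothetical target namespace
```

With no `syncPolicy` set, an Application like this only syncs when triggered manually, which matches the "manual sync" path in step 2.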
## Observability
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Alloy    │────▶│ Prometheus  │────▶│   Grafana   │
│ (collector) │     │  (metrics)  │     │ (dashboards)│
└─────────────┘     └─────────────┘     └─────────────┘
       │                                       ▲
       │            ┌─────────────┐            │
       └───────────▶│    Loki     │────────────┘
                    │   (logs)    │
                    └─────────────┘
```
[[alloy]] runs in two places:
- On indri: collects host metrics and logs
- In k8s: collects pod logs and service probes
See [[observability]] for details.
## Secrets Flow
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  1Password  │────▶│  1Password  │────▶│  External   │
│   (vault)   │     │   Connect   │     │   Secrets   │
└─────────────┘     └─────────────┘     └─────────────┘
                                        ┌─────────────┐
                                        │ K8s Secret  │
                                        └─────────────┘
```
Secrets live in 1Password and flow to Kubernetes via [[external-secrets]].
For Ansible, secrets are fetched via the `op` CLI in playbook `pre_tasks`.
## Related
- [[why-gitops]] - Philosophy behind this approach
- [[security-model]] - Access control and secrets
- [[routing]] - Service routing details

docs/explanation/index.md

@@ -0,0 +1,22 @@
---
title: explanation
tags:
- explanation
---
# Explanation
Understanding-oriented content explaining the "why" behind BlumeOps design decisions.
## Philosophy
| Article | Description |
|---------|-------------|
| [[why-gitops]] | Why infrastructure-as-code and GitOps for a homelab |
## Design
| Article | Description |
|---------|-------------|
| [[architecture]] | How all the pieces fit together |
| [[security-model]] | Network security, secrets, and access control |


@@ -0,0 +1,139 @@
---
title: security-model
tags:
- explanation
- security
---
# Security Model
> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
How BlumeOps handles network security, secrets, and access control.
## Network Security: Tailscale
The foundational security decision is using [[tailscale]] as the network layer.
### Zero Trust Networking
BlumeOps has no public IP addresses or port forwarding. All services are reachable only via Tailscale:
- **No inbound attack surface** from the public internet
- **Encrypted by default** - WireGuard encryption for all traffic
- **Identity-based access** - ACLs based on user/device identity, not IP addresses
### Defense in Depth
Even within the tailnet, access is restricted:
```
Internet ──X──▶ Services (no public access)
Tailnet:
Admin ────────▶ All services
Member ───────▶ User-facing services only
Homelab tag ──▶ NAS (for backups)
```
See [[tailscale]] for the full ACL matrix.
## Secrets Management
Secrets follow a hierarchy:
### Source of Truth: 1Password
All secrets originate in 1Password's `blumeops` vault:
- API keys, tokens, passwords
- SSH keys and certificates
- OAuth credentials
### Kubernetes: External Secrets Operator
[[external-secrets]] syncs secrets from 1Password to Kubernetes:
```
1Password ──▶ 1Password Connect ──▶ ExternalSecret ──▶ K8s Secret
```
Services reference native Kubernetes Secrets; they don't know about 1Password.
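To make the chain above concrete, here is a minimal `ExternalSecret` sketch that assumes the External Secrets Operator's 1Password Connect provider. The store name, namespace, item title, and field name are hypothetical, and the `apiVersion` may differ depending on the operator version installed.

```yaml
apiVersion: external-secrets.io/v1beta1   # may be v1 on newer operator releases
kind: ExternalSecret
metadata:
  name: miniflux-db               # hypothetical resource name
  namespace: miniflux             # hypothetical namespace
spec:
  refreshInterval: 1h             # how often to re-read the value from 1Password
  secretStoreRef:
    kind: ClusterSecretStore
    name: onepassword-blumeops    # hypothetical store backed by 1Password Connect
  target:
    name: miniflux-db             # name of the native K8s Secret to create
  data:
    - secretKey: password         # key inside the resulting K8s Secret
      remoteRef:
        key: miniflux-db          # 1Password item title (hypothetical)
        property: password        # field within that item
```

The workload then mounts or references the `miniflux-db` Secret like any other Kubernetes Secret.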
### Ansible: op CLI
Ansible playbooks fetch secrets at runtime via the `op` CLI:
```yaml
# Run op on the control machine rather than on the managed host
- name: Fetch secret
  command: op item get <id> --fields password --reveal
  delegate_to: localhost
```
Secrets are held in memory as Ansible facts, never written to disk.
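A slightly fuller sketch of that pattern is shown below, assuming the fetch happens in `pre_tasks` and the value is promoted to a fact. The play name, host, and the `op_result` and `service_password` variable names are hypothetical; `<id>` is left as a placeholder.

```yaml
- name: Provision indri (illustrative sketch)
  hosts: indri                    # assumed inventory host name
  pre_tasks:
    - name: Fetch a service password from 1Password
      ansible.builtin.command: op item get <id> --fields password --reveal
      delegate_to: localhost
      register: op_result         # hypothetical variable name
      changed_when: false         # reading a secret changes nothing
      no_log: true                # keep the value out of task output

    - name: Hold the secret as an in-memory fact
      ansible.builtin.set_fact:
        service_password: "{{ op_result.stdout }}"   # hypothetical fact name
      no_log: true
```

Setting `no_log: true` on both tasks is a defensive choice so the secret does not appear in console output or logs.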
### Git Repository
The repository is public. Secrets must never be committed:
- `.gitignore` excludes sensitive patterns
- Pre-commit hooks scan for potential secrets (TruffleHog)
- All config files use references to secrets, not values
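For reference, a TruffleHog pre-commit hook is commonly wired up as a local hook along the following lines. This is a sketch that assumes `trufflehog` is installed on the developer machine; the actual hook configuration in this repository may differ.

```yaml
# .pre-commit-config.yaml (illustrative fragment)
repos:
  - repo: local
    hooks:
      - id: trufflehog
        name: TruffleHog
        # Scan only what is new since HEAD and exit non-zero on any finding
        entry: bash -c 'trufflehog git file://. --since-commit HEAD --fail'
        language: system          # assumes trufflehog is already on PATH
```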
## Access Control Philosophy
### Principle of Least Privilege
Services and devices get minimum necessary access:
| Entity | Access |
|--------|--------|
| Admin users | Everything |
| Member users | User-facing services only |
| Homelab servers | Only what they need (NAS for backups) |
| K8s pods | No Tailscale access (use Caddy proxy) |
### Tagged Devices vs User Devices
An important Tailscale distinction:
- **User devices** (like gilbert) have user identity and inherit user ACLs
- **Tagged devices** (like indri with `tag:homelab`) lose user identity and are matched by tag-based rules instead
Don't tag user devices; doing so breaks user-based access rules.
## Authentication Patterns
### Service-to-Service
Internal services use:
- Kubernetes service discovery (no auth needed within cluster)
- Tailscale identity for cross-host communication
### User-to-Service
Users authenticate via:
- Service-specific credentials (stored in 1Password)
- Some services support Tailscale identity (future)
### AI/Automation Access
Claude Code and automation use:
- SSH keys for git operations
- ArgoCD tokens for deployments
- 1Password CLI for secret retrieval (requires user approval)
## What's Not Protected
An honest assessment of the security boundaries:
- **Local network attacks** - Anyone on the home WiFi could potentially access the NAS directly
- **Physical access** - No disk encryption on servers (trade-off for reliability)
- **Supply chain** - Container images from upstream registries
- **Operator error** - Misconfigured ACLs or leaked credentials
The model assumes a trusted home network and focuses on protecting against internet-based attacks.
## Related
- [[tailscale]] - ACL configuration
- [[1password]] - Secrets management
- [[external-secrets]] - Kubernetes secrets
- [[architecture]] - Overall system design


@@ -0,0 +1,70 @@
---
title: why-gitops
tags:
- explanation
- philosophy
---
# Why GitOps?
> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
BlumeOps uses GitOps principles for managing personal infrastructure. This might seem like overkill for a homelab, but there are good reasons.
## The Problem with Manual Infrastructure
Traditional server management involves SSHing into machines and running commands. This works, but creates problems:
- **Drift**: The actual state diverges from what you think it is
- **Amnesia**: You forget what you changed and why
- **Fragility**: One bad command can break things with no easy rollback
- **Bus factor**: Only you know how it works (even AI assistants struggle without context)
## Git as the Source of Truth
GitOps inverts the model: instead of pushing changes to servers, you commit desired state to Git, and automation pulls it into reality.
**Benefits:**
- Every change is tracked with commit history
- Pull requests enable review before deployment
- Rollback is just `git revert`
- The repo *is* the documentation
## Why This Matters for a Homelab
A personal homelab isn't a production environment, but it shares the same challenges:
1. **Memory is unreliable** - Six months from now, you won't remember why you configured Caddy that way
2. **Experimentation is constant** - You try things, break things, want to undo things
3. **AI assistance needs context** - Claude can help much more effectively when it can read your infrastructure as code
## The BlumeOps Approach
BlumeOps uses layered GitOps:
| Layer | Tool | What it manages |
|-------|------|-----------------|
| **Tailnet** | [[reference/infrastructure/tailscale\|Pulumi]] | ACLs, tags, DNS |
| **Host config** | [[reference/ansible/roles\|Ansible]] | Services on [[indri]] |
| **Kubernetes** | [[argocd\|ArgoCD]] | Containerized workloads |
Each layer has its own reconciliation loop:
- Pulumi applies on `mise run tailnet-up`
- Ansible applies on `mise run provision-indri`
- ArgoCD watches Git and syncs manually or automatically
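The manual-versus-automatic distinction for ArgoCD comes down to the `syncPolicy` on each Application. The fragment below is a generic sketch of that setting, not a copy of any BlumeOps manifest.

```yaml
# Fragment of an ArgoCD Application spec (not a complete manifest)
spec:
  syncPolicy:
    automated:          # omit this block entirely to require manual syncs
      prune: true       # delete resources that were removed from Git
      selfHeal: true    # revert changes made directly on the cluster
```

Leaving `automated` out keeps a human in the loop for each deployment, at the cost of the cluster lagging behind Git until someone triggers a sync.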
## Trade-offs
GitOps isn't free:
- **Learning curve** - You need to understand Ansible, ArgoCD, Pulumi
- **Indirection** - Can't just `apt install` something; need to add it to config
- **Complexity** - More moving parts than a simple server
But for BlumeOps, the trade-off is worth it. The infrastructure is complex enough that managing it imperatively would be error-prone, and the GitOps approach enables effective AI-assisted operations.
## Related
- [[architecture]] - How the pieces fit together
- [[argocd]] - Kubernetes GitOps
- [[reference/ansible/roles|Ansible roles]] - Host configuration