Add Phase 5: explanation documentation (#96)

## Summary
- Create `docs/explanation/` directory with index and three explanation articles
- why-gitops: Philosophy of GitOps for homelabs (memory, rollback, AI context)
- architecture: How pieces fit together (ASCII diagrams of hosts, data flow, secrets)
- security-model: Tailscale zero-trust, 1Password secrets, access control philosophy
- Update docs/index.md with How-to and Explanation section links
- Update exploring-the-docs to link Explanation section

Decision log deferred to future work.

## Deployment and Testing
- [x] Pre-commit hooks pass (including doc-links validator)
- [ ] Build and deploy to docs.ops.eblu.me to verify rendering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/96
Erich Blume 2026-02-03 20:33:39 -08:00
commit 0a28622751
8 changed files with 396 additions and 10 deletions

@@ -111,15 +111,17 @@ Task-oriented instructions for specific operations.
 **How-to URL:** https://docs.ops.eblu.me/how-to/
-### Phase 5: Explanation
+### Phase 5: Explanation (Complete)
 Understanding-oriented discussion of concepts and decisions.
-- [ ] Create `explanation/` directory
-- [ ] "Why GitOps?" - Philosophy and approach
-- [ ] "Architecture Overview" - How everything fits together
-- [ ] "Security Model" - Tailscale, secrets management, etc.
-- [ ] "Decision Log" - ADRs (Architecture Decision Records)
-- [ ] Update `exploring-the-docs` with Explanation section
+- [x] Create `explanation/` directory
+- [x] "Why GitOps?" - Philosophy and approach
+- [x] "Architecture Overview" - How everything fits together
+- [x] "Security Model" - Tailscale, secrets management, etc.
+- [ ] "Decision Log" - ADRs (Architecture Decision Records) - deferred
+- [x] Update `exploring-the-docs` with Explanation section
+**Explanation URL:** https://docs.ops.eblu.me/explanation/
 ### Phase 6: Integration & Cleanup
 - [ ] Migrate remaining useful content from `docs/zk/`

@@ -0,0 +1 @@
Add Phase 5 explanation docs: why GitOps, architecture overview, and security model

@@ -0,0 +1,149 @@
---
title: architecture
tags:
- explanation
- architecture
---
# Architecture Overview
> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
How all the BlumeOps pieces fit together.
## Physical Layer
Two always-on devices form the infrastructure backbone:
```
┌─────────────────┐     ┌─────────────────┐
│      Indri      │     │     Sifaka      │
│   Mac Mini M1   │────▶│  Synology NAS   │
│    (compute)    │     │    (storage)    │
└─────────────────┘     └─────────────────┘
         │ Tailscale
┌─────────────────┐
│     Gilbert     │
│   MacBook Air   │
│  (workstation)  │
└─────────────────┘
```
- **[[indri]]** runs all services (native and containerized)
- **[[sifaka]]** provides bulk storage and backup targets
- **[[gilbert]]** is the development workstation
## Network Layer
[[tailscale]] provides the network fabric:
- All devices on tailnet `tail8d86e.ts.net`
- ACLs control access between devices and services
- MagicDNS provides `*.tail8d86e.ts.net` hostnames
- No port forwarding or public IPs needed
## Service Routing
Two DNS domains route to services:
| Domain | Mechanism | Reachable from |
|--------|-----------|----------------|
| `*.ops.eblu.me` | Caddy reverse proxy on indri | Everywhere (k8s pods, containers, tailnet) |
| `*.tail8d86e.ts.net` | Tailscale MagicDNS | Tailnet clients only |
See [[routing]] for details on when to use which.
## Compute Layer
Services run in two places:
### Native on Indri (Ansible)
Some services run directly on macOS:
- [[forgejo]] - Git forge (needs filesystem access)
- [[zot]] - Container registry (k8s depends on it)
- [[jellyfin]] - Media server (needs VideoToolbox hardware transcoding)
- [[borgmatic]] - Backups (needs host filesystem access)
Managed via Ansible roles in `ansible/roles/`.
### Kubernetes (ArgoCD)
Most services run in minikube on indri:
- [[grafana]], [[prometheus]], [[loki]] - Observability
- [[miniflux]], [[navidrome]], [[kiwix]] - Applications
- [[postgresql]] - Shared database (CloudNativePG)
Managed via ArgoCD from `argocd/manifests/`.
## Data Flow
```
┌──────────────┐
│   Git Repo   │
│  (Forgejo)   │
└──────┬───────┘
       │ push
┌──────────────┐     ┌──────────────┐
│    ArgoCD    │────▶│  Kubernetes  │
│  (watches)   │sync │    (runs)    │
└──────────────┘     └──────────────┘
       ┌────────────────────┼────────────────────┐
       ▼                    ▼                    ▼
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│   Service    │     │   Service    │     │   Service    │
└──────────────┘     └──────────────┘     └──────────────┘
```
1. Code pushed to [[forgejo]]
2. [[argocd]] detects changes (or manual sync triggered)
3. ArgoCD applies manifests to cluster
4. Services start/update in Kubernetes
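
As a rough illustration of steps 2 through 4, an ArgoCD `Application` pointing at this repository might look like the sketch below. The application name, namespaces, target branch, and the `miniflux` subdirectory are assumptions made for the example; only the repository location and the `argocd/manifests/` root come from this document.

```yaml
# Hypothetical Application manifest; names and paths are illustrative.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: miniflux                 # assumed application name
  namespace: argocd              # assumed ArgoCD install namespace
spec:
  project: default
  source:
    repoURL: https://forge.ops.eblu.me/eblume/blumeops
    targetRevision: main         # assumed branch
    path: argocd/manifests/miniflux   # assumed subdirectory under argocd/manifests/
  destination:
    server: https://kubernetes.default.svc   # the in-cluster API server
    namespace: miniflux          # assumed target namespace
  syncPolicy:
    automated:                   # omit this block to require manual syncs
      prune: true                # delete resources removed from Git
      selfHeal: true             # revert manual drift in the cluster
```

With `syncPolicy.automated` present, ArgoCD applies new commits on its own; without it, the application only reports that it is out of sync until someone triggers a sync.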
## Observability
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Alloy    │────▶│ Prometheus  │────▶│   Grafana   │
│ (collector) │     │  (metrics)  │     │ (dashboards)│
└─────────────┘     └─────────────┘     └─────────────┘
       │                                       ▲
       │            ┌─────────────┐            │
       └───────────▶│    Loki     │────────────┘
                    │   (logs)    │
                    └─────────────┘
```
[[alloy]] runs in two places:
- On indri: collects host metrics and logs
- In k8s: collects pod logs and service probes
See [[observability]] for details.
## Secrets Flow
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  1Password  │────▶│  1Password  │────▶│  External   │
│   (vault)   │     │   Connect   │     │   Secrets   │
└─────────────┘     └─────────────┘     └─────────────┘
                                        ┌─────────────┐
                                        │ K8s Secret  │
                                        └─────────────┘
```
Secrets live in 1Password and flow to Kubernetes via [[external-secrets]].
For Ansible, secrets are fetched via `op` CLI in playbook pre_tasks.
## Related
- [[why-gitops]] - Philosophy behind this approach
- [[security-model]] - Access control and secrets
- [[routing]] - Service routing details

docs/explanation/index.md

@@ -0,0 +1,22 @@
---
title: explanation
tags:
- explanation
---
# Explanation
Understanding-oriented content explaining the "why" behind BlumeOps design decisions.
## Philosophy
| Article | Description |
|---------|-------------|
| [[why-gitops]] | Why infrastructure-as-code and GitOps for a homelab |
## Design
| Article | Description |
|---------|-------------|
| [[architecture]] | How all the pieces fit together |
| [[security-model]] | Network security, secrets, and access control |

@@ -0,0 +1,139 @@
---
title: security-model
tags:
- explanation
- security
---
# Security Model
> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
How BlumeOps handles network security, secrets, and access control.
## Network Security: Tailscale
The foundational security decision is using [[tailscale]] as the network layer.
### Zero Trust Networking
BlumeOps has no public IP addresses or port forwarding. Services are reachable only over Tailscale:
- **No services exposed** to the public internet
- **Encrypted by default** - WireGuard encryption for all traffic
- **Identity-based access** - ACLs based on user/device identity, not IP addresses
### Defense in Depth
Even within the tailnet, access is restricted:
```
Internet ──X──▶ Services (no public access)

Tailnet:
  Admin ────────▶ All services
  Member ───────▶ User-facing services only
  Homelab tag ──▶ NAS (for backups)
```
See [[tailscale]] for the full ACL matrix.
## Secrets Management
Secrets follow a hierarchy:
### Source of Truth: 1Password
All secrets originate in 1Password's `blumeops` vault:
- API keys, tokens, passwords
- SSH keys and certificates
- OAuth credentials
### Kubernetes: External Secrets Operator
[[external-secrets]] syncs secrets from 1Password to Kubernetes:
```
1Password ──▶ 1Password Connect ──▶ ExternalSecret ──▶ K8s Secret
```
Services reference native Kubernetes Secrets; they don't know about 1Password.
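
To make the sync step concrete, here is a minimal sketch of an `ExternalSecret` backed by 1Password Connect. The store name, item title, namespace, and resulting Secret name are all hypothetical, and the `apiVersion` depends on the installed External Secrets Operator release (newer releases serve `external-secrets.io/v1`).

```yaml
# Hypothetical ExternalSecret; every name below is illustrative.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: miniflux-db
  namespace: miniflux
spec:
  refreshInterval: 1h            # how often to re-read the 1Password item
  secretStoreRef:
    kind: ClusterSecretStore
    name: onepassword            # assumed store configured for 1Password Connect
  target:
    name: miniflux-db            # name of the Kubernetes Secret to create
  data:
    - secretKey: password        # key inside the Kubernetes Secret
      remoteRef:
        key: miniflux-db         # assumed 1Password item title
        property: password       # field on that item
```

A workload would then read `miniflux-db` through an ordinary `secretKeyRef` or volume mount, which is why the service itself never needs to know that 1Password is involved.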
### Ansible: op CLI
Ansible playbooks fetch secrets at runtime via `op` CLI:
```yaml
- name: Fetch secret
  command: op item get <id> --fields password --reveal
  delegate_to: localhost
```
Secrets are held in memory as Ansible facts, never written to disk.
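
One way the in-memory handling could look in a playbook is sketched below. The inventory host name, variable names, and role name are hypothetical, and `<id>` remains a placeholder for the 1Password item; this is an illustration of the pattern, not the repository's actual playbook.

```yaml
# Hypothetical playbook sketch; variable and role names are illustrative.
- hosts: indri
  pre_tasks:
    - name: Fetch secret from 1Password on the control machine
      ansible.builtin.command: op item get <id> --fields password --reveal
      delegate_to: localhost
      register: op_result        # captures stdout in memory only
      changed_when: false        # a read-only lookup never changes the host
      no_log: true               # keep the value out of Ansible output

    - name: Expose the secret as a fact for later roles
      ansible.builtin.set_fact:
        service_password: "{{ op_result.stdout }}"
      no_log: true

  roles:
    - example_role               # assumed role that consumes service_password
```

Setting `no_log: true` on both tasks matters as much as avoiding files: without it, the secret would appear in verbose output and in any log that captures the run.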
### Git Repository
The repository is public. Secrets must never be committed:
- `.gitignore` excludes sensitive patterns
- Pre-commit hooks scan for potential secrets (TruffleHog)
- All config files use references to secrets, not values
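
A pre-commit hook for the TruffleHog scan might be configured roughly as follows. This is a sketch that assumes `trufflehog` is already installed on the workstation; the hook id and display name are arbitrary, and the repository's real configuration may differ.

```yaml
# Hypothetical .pre-commit-config.yaml fragment.
repos:
  - repo: local
    hooks:
      - id: trufflehog
        name: TruffleHog secret scan
        # Scan commits made since HEAD and exit non-zero if anything is found.
        entry: bash -c 'trufflehog git file://. --since-commit HEAD --fail'
        language: system         # use the trufflehog binary from PATH
        pass_filenames: false    # the scan runs over git history, not a file list
```

Because the hook exits non-zero on a finding, the commit is blocked before a secret can reach the public repository.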
## Access Control Philosophy
### Principle of Least Privilege
Services and devices get minimum necessary access:
| Entity | Access |
|--------|--------|
| Admin users | Everything |
| Member users | User-facing services only |
| Homelab servers | Only what they need (NAS for backups) |
| K8s pods | No Tailscale access (use Caddy proxy) |
### Tagged Devices vs User Devices
Important Tailscale concept:
- **User devices** (like gilbert) have user identity and inherit user ACLs
- **Tagged devices** (like indri with `tag:homelab`) lose user identity
Don't tag user devices - it breaks user-based access rules.
## Authentication Patterns
### Service-to-Service
Internal services use:
- Kubernetes service discovery (no auth needed within cluster)
- Tailscale identity for cross-host communication
### User-to-Service
Users authenticate via:
- Service-specific credentials (stored in 1Password)
- Tailscale identity, for services that support it (planned)
### AI/Automation Access
Claude Code and automation use:
- SSH keys for git operations
- ArgoCD tokens for deployments
- 1Password CLI for secret retrieval (requires user approval)
## What's Not Protected
Honest assessment of security boundaries:
- **Local network attacks** - If someone is on your home WiFi, they could potentially access the NAS directly
- **Physical access** - No disk encryption on servers (trade-off for reliability)
- **Supply chain** - Container images from upstream registries
- **Operator error** - Misconfigured ACLs or leaked credentials
The model assumes a trusted home network and focuses on protecting against internet-based attacks.
## Related
- [[tailscale]] - ACL configuration
- [[1password]] - Secrets management
- [[external-secrets]] - Kubernetes secrets
- [[architecture]] - Overall system design

@@ -0,0 +1,70 @@
---
title: why-gitops
tags:
- explanation
- philosophy
---
# Why GitOps?
> **Note:** This article was drafted by AI and reviewed by Erich. I plan to rewrite all explanatory content in my own words - these serve as placeholders to establish the documentation structure.
BlumeOps uses GitOps principles for managing personal infrastructure. This might seem like overkill for a homelab, but there are good reasons.
## The Problem with Manual Infrastructure
Traditional server management involves SSHing into machines and running commands. This works, but creates problems:
- **Drift**: The actual state diverges from what you think it is
- **Amnesia**: You forget what you changed and why
- **Fragility**: One bad command can break things with no easy rollback
- **Bus factor**: Only you know how it works (even AI assistants struggle without context)
## Git as the Source of Truth
GitOps inverts the model: instead of pushing changes to servers, you commit desired state to Git, and automation pulls it into reality.
**Benefits:**
- Every change is tracked with commit history
- Pull requests enable review before deployment
- Rollback is just `git revert`
- The repo *is* the documentation
## Why This Matters for a Homelab
A personal homelab isn't a production environment, but it shares the same challenges:
1. **Memory is unreliable** - Six months from now, you won't remember why you configured Caddy that way
2. **Experimentation is constant** - You try things, break things, want to undo things
3. **AI assistance needs context** - Claude can help much more effectively when it can read your infrastructure as code
## The BlumeOps Approach
BlumeOps uses layered GitOps:
| Layer | Tool | What it manages |
|-------|------|-----------------|
| **Tailnet** | [[reference/infrastructure/tailscale|Pulumi]] | ACLs, tags, DNS |
| **Host config** | [[reference/ansible/roles|Ansible]] | Services on [[indri]] |
| **Kubernetes** | [[argocd|ArgoCD]] | Containerized workloads |
Each layer has its own reconciliation loop:
- Pulumi applies on `mise run tailnet-up`
- Ansible applies on `mise run provision-indri`
- ArgoCD watches Git and syncs automatically or on a manual trigger
## Trade-offs
GitOps isn't free:
- **Learning curve** - You need to understand Ansible, ArgoCD, Pulumi
- **Indirection** - Can't just `apt install` something; need to add it to config
- **Complexity** - More moving parts than a simple server
But for BlumeOps, the trade-off is worth it. The infrastructure is complex enough that managing it imperatively would be error-prone, and the GitOps approach enables effective AI-assisted operations.
## Related
- [[architecture]] - How the pieces fit together
- [[argocd]] - Kubernetes GitOps
- [[reference/ansible/roles|Ansible roles]] - Host configuration

@@ -9,7 +9,9 @@ Welcome to the BlumeOps documentation.
 ## Sections
 - [[tutorials/index | Tutorials]] - Learning-oriented guides for getting started
-- [[reference/index | Reference]] - Technical reference cards for services, infrastructure, and operations
+- [[reference/index | Reference]] - Technical specifications and service details
+- [[how-to/index | How-to]] - Task-oriented instructions for common operations
+- [[explanation/index | Explanation]] - Understanding the "why" behind BlumeOps
 ## About

@@ -20,7 +20,7 @@ The docs follow the [Diataxis](https://diataxis.fr/) framework:
 | **[[tutorials/index | Tutorials]]** | Learning-oriented | "I'm new and want to understand" |
 | **[[reference/index | Reference]]** | Information-oriented | "I need specific technical details" |
 | **[[how-to/index | How-to]]** | Task-oriented | "I need to do X" |
-| **Explanation** (planned) | Understanding-oriented | "I want to understand why" |
+| **[[explanation/index | Explanation]]** | Understanding-oriented | "I want to understand why" |
 ## Quick Paths by Audience
@@ -42,9 +42,9 @@ Context for effective assistance:
 ### For External Readers
 Understanding what this is:
+- [[explanation/index|Explanation]] covers the "why" behind design decisions
 - [[reference/index|Reference]] shows what's actually running
 - Browse service pages to see specific implementations
-- The repo's README has project context
 ### For Contributors
@@ -58,6 +58,7 @@ Getting started with changes:
 Replicators are people who want to build their own similar homelab GitOps setup, using BlumeOps as inspiration.
 - [[replicating-blumeops]] provides the overview
+- [[explanation/index|Explanation]] covers architecture and design rationale
 - The `replication/` tutorials go deep on components
 - Reference pages show specific configuration choices