Add Phase 4: how-to guides documentation (#95)

## Summary
- Create `docs/how-to/` directory with index and four how-to guides
- deploy-k8s-service: Quick reference for Kubernetes deployments via ArgoCD
- add-ansible-role: Adding new Ansible roles for indri services
- update-tailscale-acls: Modifying Tailscale ACL policies via Pulumi
- troubleshooting: Diagnosing and fixing common issues
- Update exploring-the-docs to include How-to section links
- Update README.md to mark Phase 4 as complete

## Deployment and Testing
- [x] Pre-commit hooks pass (including doc-links validator)
- [ ] Build and deploy to docs.ops.eblu.me to verify rendering

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/95
Erich Blume 2026-02-03 20:17:24 -08:00
commit e311b36b3c
9 changed files with 801 additions and 9 deletions


@ -98,16 +98,18 @@ Learning-oriented content for getting started. Each tutorial explicitly identifi
**Tutorials URL:** https://docs.ops.eblu.me/tutorials/
### Phase 4: How-to Guides
### Phase 4: How-to Guides (Complete)
Task-oriented instructions for specific operations.
- [ ] Create `how-to/` directory
- [ ] Migrate operational content from zk cards
- [ ] "How to deploy a new Kubernetes service"
- [ ] "How to add a new Ansible role"
- [ ] "How to update Tailscale ACLs"
- [ ] "How to troubleshoot common issues"
- [ ] Update `exploring-the-docs` with How-to section
- [x] Create `how-to/` directory
- [x] Migrate operational content from zk cards
- [x] "How to deploy a new Kubernetes service"
- [x] "How to add a new Ansible role"
- [x] "How to update Tailscale ACLs"
- [x] "How to troubleshoot common issues"
- [x] Update `exploring-the-docs` with How-to section
**How-to URL:** https://docs.ops.eblu.me/how-to/
### Phase 5: Explanation
Understanding-oriented discussion of concepts and decisions.


@ -0,0 +1 @@
Add Phase 4 how-to guides: deploy k8s services, add ansible roles, update tailscale ACLs, and troubleshooting


@ -0,0 +1,141 @@
---
title: add-ansible-role
tags:
- how-to
- ansible
---
# Add an Ansible Role
Quick reference for adding a new Ansible role to provision services on [[indri]].
## Create Role Structure
```
ansible/roles/<role>/
├── defaults/main.yml # Default variables
├── tasks/main.yml # Task definitions
├── handlers/main.yml # Handlers (restarts, etc.)
├── templates/ # Jinja2 templates
└── files/ # Static files (optional)
```
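The layout above could be scaffolded with a small shell function. This is a minimal sketch, not something that exists in the repository: `scaffold_role` and its optional repo-root argument are hypothetical, and it assumes you want the three `main.yml` files seeded with a YAML document marker.

```bash
# Hypothetical helper (not part of blumeops): create the skeleton of a new role.
# Usage: scaffold_role <role> [repo_root]
scaffold_role() {
  local role="${1:-}"
  local root="${2:-.}"
  local dir sub

  if [ -z "$role" ]; then
    echo "usage: scaffold_role <role> [repo_root]" >&2
    return 1
  fi
  dir="$root/ansible/roles/$role"

  # Create every directory from the layout above.
  for sub in defaults tasks handlers templates files; do
    mkdir -p "$dir/$sub"
  done

  # Seed the YAML files with a document marker; never overwrite existing files.
  for sub in defaults tasks handlers; do
    [ -e "$dir/$sub/main.yml" ] || printf '%s\n' '---' > "$dir/$sub/main.yml"
  done
}
```

Run from the repository root, `scaffold_role myservice` would create `ansible/roles/myservice/` with the directories shown above. Existing files are left alone, so it should be safe to re-run.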
## Minimal Role Example
```yaml
# ansible/roles/<role>/defaults/main.yml
---
role_data_dir: ~/Library/Application Support/<service>
role_port: 8080
```
```yaml
# ansible/roles/<role>/tasks/main.yml
---
- name: Ensure data directory exists
  ansible.builtin.file:
    path: "{{ role_data_dir }}"
    state: directory
    mode: '0755'

- name: Deploy configuration
  ansible.builtin.template:
    src: config.j2
    dest: "{{ role_data_dir }}/config"
    mode: '0644'
  notify: Restart service

- name: Deploy LaunchAgent plist
  ansible.builtin.template:
    src: launchagent.plist.j2
    dest: ~/Library/LaunchAgents/mcquack.<service>.plist
    mode: '0644'
  notify: Restart service
```
```yaml
# ansible/roles/<role>/handlers/main.yml
---
- name: Restart service
  ansible.builtin.shell: |
    launchctl unload ~/Library/LaunchAgents/mcquack.<service>.plist 2>/dev/null || true
    launchctl load ~/Library/LaunchAgents/mcquack.<service>.plist
  listen: Restart service
```
## Add Role to Playbook
Edit `ansible/playbooks/indri.yml`:
```yaml
roles:
  # ... existing roles ...
  - role: <role>
    tags: [<role>]
```
## Add Secrets (if needed)
If the role needs secrets from 1Password, add `pre_tasks` to the playbook to fetch them:
```yaml
pre_tasks:
  # ... existing pre_tasks ...
  - name: Fetch <role> secret
    ansible.builtin.command:
      cmd: op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get <item-id> --fields <field> --reveal
    delegate_to: localhost
    register: _role_secret
    changed_when: false
    no_log: true
    check_mode: false
    tags: [<role>]

  - name: Set <role> secret fact
    ansible.builtin.set_fact:
      role_secret_var: "{{ _role_secret.stdout }}"
    no_log: true
    tags: [<role>]
```
Then use `role_secret_var` in your role with a guard:
```yaml
# In role's tasks, fetch if not already set (allows running with --tags)
- name: Fetch secret if not set
  ansible.builtin.command:
    cmd: op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get <item-id> --fields <field> --reveal
  delegate_to: localhost
  register: _role_secret
  changed_when: false
  no_log: true
  check_mode: false
  when: role_secret_var is not defined
```
## Test and Deploy
```bash
# Dry run
mise run provision-indri -- --tags <role> --check --diff
# Apply
mise run provision-indri -- --tags <role>
# Verify
ssh indri 'launchctl list | grep <service>'
```
## Add Observability (optional)
For metrics collection, create a companion `<role>_metrics` role that:
1. Writes metrics to `/opt/homebrew/var/node_exporter/textfile/`
2. Runs via a LaunchAgent (cronjob-style)
See [[alloy]] for how metrics are collected from textfiles.
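As a rough illustration of step 1, a metrics script might write a single gauge in the Prometheus text exposition format. This is a sketch under assumptions: `write_textfile_metric` is a hypothetical name, the metric is assumed to be a gauge, and the collector is assumed to pick up only files ending in `.prom` (as the node_exporter textfile collector does).

```bash
# Hypothetical helper: write one gauge in Prometheus text exposition format.
# Usage: write_textfile_metric <metric_name> <value> [textfile_dir]
write_textfile_metric() {
  local name="$1"
  local value="$2"
  local dir="${3:-/opt/homebrew/var/node_exporter/textfile}"
  # The temp name does not end in .prom, so a collector that only reads
  # *.prom files (an assumption here) should ignore it while it is written.
  local partial="$dir/.$name.prom.$$"

  {
    printf '# TYPE %s gauge\n' "$name"
    printf '%s %s\n' "$name" "$value"
  } > "$partial" && mv "$partial" "$dir/$name.prom"
}
```

Writing to a temporary file and renaming it means the collector should never read a half-written file. The LaunchAgent from step 2 would then call something like `write_textfile_metric myservice_up 1` on its schedule.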
## Related
- [[reference/ansible/roles|Roles]] - Available roles reference
- [[indri]] - Target host
- [[observability]] - Metrics collection


@ -0,0 +1,126 @@
---
title: deploy-k8s-service
tags:
- how-to
- kubernetes
- argocd
---
# Deploy a Kubernetes Service
Quick reference for deploying a new service to BlumeOps Kubernetes via ArgoCD. See [[adding-a-service|the tutorial]] for detailed explanations.
## Create Manifests
```
argocd/manifests/<service>/
├── deployment.yaml
├── service.yaml
└── ingress-tailscale.yaml
```
The namespace should match the service name. Pull images from `registry.ops.eblu.me`.
## Create ArgoCD Application
```yaml
# argocd/apps/<service>.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: <service>
  namespace: argocd
spec:
  project: default
  source:
    repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git
    targetRevision: main
    path: argocd/manifests/<service>
  destination:
    server: https://kubernetes.default.svc
    namespace: <service>
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
```
## Configure Ingress
Add [[tailscale-operator|Tailscale Ingress]] with Homepage annotations:
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: <service>
  namespace: <service>
  annotations:
    gethomepage.dev/enabled: "true"
    gethomepage.dev/name: "Service Name"
    gethomepage.dev/group: "Apps"
    gethomepage.dev/icon: "<service>.png"
    gethomepage.dev/href: "https://<service>.ops.eblu.me"
    gethomepage.dev/pod-selector: "app=<service>"
spec:
  ingressClassName: tailscale
  rules:
    - host: <service>
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: <service>
                port:
                  number: 80
```
## Add Caddy Route (if needed)
If other pods need to reach the service, add a route to `ansible/roles/caddy/defaults/main.yml`:
```yaml
caddy_services:
  - name: <service>
    upstream: "https://<service>.tail8d86e.ts.net"
```
Then apply the change: `mise run provision-indri -- --tags caddy`
See [[routing]] for when Caddy is needed.
## Deploy
```bash
# Sync apps to pick up new Application
argocd app sync apps
# Test on feature branch first
argocd app set <service> --revision <branch>
argocd app sync <service>
# Verify
kubectl --context=minikube-indri -n <service> get pods
kubectl --context=minikube-indri -n <service> logs -f deployment/<service>
# After PR merge, reset to main
argocd app set <service> --revision main
argocd app sync <service>
```
## Checklist
- [ ] Manifests in `argocd/manifests/<service>/`
- [ ] Application in `argocd/apps/<service>.yaml`
- [ ] Tailscale Ingress with Homepage annotations
- [ ] Caddy route (if pod-to-service access needed)
- [ ] Tested on feature branch
- [ ] PR reviewed and merged
- [ ] Reset to main branch
## Related
- [[adding-a-service]] - Full tutorial with explanations
- [[apps]] - ArgoCD application registry
- [[routing]] - Service routing options

docs/how-to/index.md Normal file

@ -0,0 +1,34 @@
---
title: how-to
tags:
- how-to
---
# How-To Guides
Task-oriented instructions for common BlumeOps operations. These guides assume you already understand the basic concepts; see [[tutorials/index|Tutorials]] if you're still learning.
## Deployment
| Guide | Description |
|-------|-------------|
| [[deploy-k8s-service]] | Deploy a new service to Kubernetes via ArgoCD |
| [[add-ansible-role]] | Add a new Ansible role for indri services |
## Configuration
| Guide | Description |
|-------|-------------|
| [[update-tailscale-acls]] | Update Tailscale access control policies |
## Documentation
| Guide | Description |
|-------|-------------|
| [[update-documentation]] | Publish docs via build-blumeops workflow |
## Operations
| Guide | Description |
|-------|-------------|
| [[troubleshooting]] | Diagnose and fix common issues |


@ -0,0 +1,228 @@
---
title: troubleshooting
tags:
- how-to
- operations
---
# Troubleshooting Common Issues
Quick reference for diagnosing and fixing common BlumeOps issues.
## General Health Check
Run the comprehensive service health check:
```bash
mise run indri-services-check
```
This checks all services on indri and in Kubernetes.
## Kubernetes Issues
### Pod not starting
```bash
# Check pod status
kubectl --context=minikube-indri -n <namespace> get pods
# Describe pod for events
kubectl --context=minikube-indri -n <namespace> describe pod <pod>
# Check logs
kubectl --context=minikube-indri -n <namespace> logs <pod>
# Previous container logs (if restarting)
kubectl --context=minikube-indri -n <namespace> logs <pod> --previous
```
Common causes:
- **ImagePullBackOff** - Image doesn't exist or registry unreachable
- **CrashLoopBackOff** - Application crashing; check logs
- **Pending** - Insufficient resources or node issues
- **ContainerCreating** - Waiting for volumes or secrets
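The causes above can be turned into a quick lookup when scanning a namespace. This is only a sketch: `pod_status_hint` is a hypothetical helper, and treating `ErrImagePull` like `ImagePullBackOff` is an assumption that goes beyond the list above.

```bash
# Hypothetical helper: map a pod STATUS value to the likely cause listed above.
pod_status_hint() {
  case "$1" in
    ImagePullBackOff|ErrImagePull) echo "image missing or registry unreachable" ;;
    CrashLoopBackOff)              echo "application crashing; check logs" ;;
    Pending)                       echo "insufficient resources or node issues" ;;
    ContainerCreating)             echo "waiting for volumes or secrets" ;;
    Running|Completed)             echo "ok" ;;
    *)                             echo "no hint; describe the pod for events" ;;
  esac
}

# Example use (assumes the default `kubectl get pods` column order,
# where STATUS is the third column):
#   kubectl --context=minikube-indri -n <namespace> get pods --no-headers |
#     while read -r name ready status rest; do
#       printf '%s: %s\n' "$name" "$(pod_status_hint "$status")"
#     done
```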
### ArgoCD sync issues
```bash
# Check app status
argocd app get <app>
# See what will change
argocd app diff <app>
# Force sync
argocd app sync <app> --force
# Sync with prune (removes deleted resources)
argocd app sync <app> --prune
```
**App stuck in "Syncing":**
Check if there are failed hooks or jobs:
```bash
kubectl --context=minikube-indri -n <namespace> get jobs
kubectl --context=minikube-indri -n <namespace> get pods --field-selector=status.phase=Failed
```
**ArgoCD login expired:**
```bash
argocd login argocd.ops.eblu.me --username admin --password "$(op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get srogeebssulhtb6tnqd7ls6qey --fields password --reveal)"
```
### kubectl connection refused
```bash
# Check if minikube is running (on indri)
ssh indri 'minikube status'
# Restart if needed
ssh indri 'minikube start'
# Verify tailscale is serving the API
ssh indri 'tailscale serve status --json'
```
## Indri Service Issues
### Service not responding
```bash
# Check LaunchAgent status
ssh indri 'launchctl list | grep mcquack'
# Restart a LaunchAgent
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.<service>.plist'
ssh indri 'launchctl load ~/Library/LaunchAgents/mcquack.<service>.plist'
# Check service logs
ssh indri 'tail -50 ~/Library/Logs/mcquack.<service>.err.log'
ssh indri 'tail -50 ~/Library/Logs/mcquack.<service>.out.log'
```
### Forgejo not accessible
```bash
# Check if forgejo is running
ssh indri 'lsof -nP -iTCP:3001 -sTCP:LISTEN'
# Check logs
ssh indri 'tail -50 ~/Library/Logs/mcquack.forgejo.err.log'
# Restart forgejo
ssh indri 'launchctl kickstart -k gui/$(id -u)/mcquack.forgejo'
```
### Registry (Zot) issues
```bash
# Test registry API
ssh indri 'curl -s http://localhost:5050/v2/_catalog | jq'
# Check if zot is running
ssh indri 'lsof -nP -iTCP:5050 -sTCP:LISTEN'
# Restart zot
ssh indri 'launchctl kickstart -k gui/$(id -u)/mcquack.zot'
```
## Network Issues
### Service unreachable via *.ops.eblu.me
Caddy handles routing for `*.ops.eblu.me`:
```bash
# Check if Caddy is running
ssh indri 'launchctl list | grep caddy'
# View Caddy logs
ssh indri 'tail -50 ~/Library/Logs/caddy/access.log'
ssh indri 'tail -50 ~/Library/Logs/caddy/error.log'
# Restart Caddy
ssh indri 'launchctl kickstart -k gui/$(id -u)/homebrew.mxcl.caddy'
```
### Tailscale MagicDNS not resolving
```bash
# Check tailscale serve status
ssh indri 'tailscale serve status --json'
# Restart tailscale if needed
ssh indri 'tailscale down && tailscale up'
```
## Observability
### Check metrics
```bash
# Open Grafana
open https://grafana.ops.eblu.me
# Check Prometheus directly
open https://prometheus.ops.eblu.me
```
### Check logs
```bash
# Open Grafana Explore
open https://grafana.ops.eblu.me/explore
# Query Loki directly
curl -G 'https://loki.ops.eblu.me/loki/api/v1/query_range' \
--data-urlencode 'query={service="<service>"}' \
--data-urlencode 'limit=100'
```
### Alloy (metrics/logs collector) issues
```bash
# Indri alloy (host metrics)
ssh indri 'launchctl list | grep alloy'
ssh indri 'tail -50 ~/Library/Logs/alloy/alloy.log'
# K8s alloy (pod logs)
kubectl --context=minikube-indri -n monitoring logs -l app=alloy
```
## Database Issues
### PostgreSQL connection failed
```bash
# Check CNPG cluster status
kubectl --context=minikube-indri -n databases get cluster
# Check PostgreSQL pods
kubectl --context=minikube-indri -n databases get pods -l cnpg.io/cluster=blumeops-pg
# Connect to database
kubectl --context=minikube-indri -n databases exec -it blumeops-pg-1 -- psql -U postgres
```
## Backup Issues
### Check backup status
```bash
# View latest backup info
ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom'
# Run backup manually
ssh indri 'borgmatic --verbosity 1'
# Check backup logs
ssh indri 'tail -100 /opt/homebrew/var/log/borgmatic/borgmatic.log'
```
## Related
- [[observability]] - Metrics and logs
- [[argocd]] - GitOps platform
- [[cluster]] - Kubernetes cluster
- [[routing]] - Service routing


@ -0,0 +1,130 @@
---
title: update-documentation
tags:
- how-to
- documentation
- ci-cd
---
# Update Documentation
How to publish documentation changes to https://docs.ops.eblu.me.
## Quick Release
After merging documentation changes to main:
1. Go to **Actions** > **Build BlumeOps** > **Run workflow**
2. Enter a version (e.g., `v1.2.0`) or leave empty to auto-increment
3. The workflow builds, releases, and deploys automatically
Direct link: https://forge.ops.eblu.me/eblume/blumeops/actions?workflow=build-blumeops.yaml
## What the Workflow Does
The `build-blumeops` workflow (`.forgejo/workflows/build-blumeops.yaml`):
1. **Resolves version** - Uses input or auto-increments from latest release
2. **Builds changelog** - Runs towncrier to collect changelog fragments
3. **Builds docs** - Clones Quartz, builds static site from `docs/`
4. **Creates release** - Uploads `docs-<version>.tar.gz` to Forgejo releases
5. **Updates deployment** - Edits `argocd/manifests/docs/deployment.yaml` with new URL
6. **Commits changes** - Pushes changelog and deployment updates to main
7. **Deploys** - Syncs the `docs` ArgoCD app
## Changelog Fragments (Towncrier)
When making changes, add a changelog fragment to `docs/changelog.d/`:
```bash
# Format: <identifier>.<type>.md
# Types: feature, bugfix, infra, doc, misc
# Using branch name (preferred)
echo "Add new feature X" > docs/changelog.d/my-feature.feature.md
# Orphan fragment (when no branch fits)
echo "Fix bug Y" > docs/changelog.d/+fix-bug.bugfix.md
```
Fragments are automatically collected into `docs/CHANGELOG.md` during release.
**Fragment types:**
| Type | Directory | Description |
|------|-----------|-------------|
| `feature` | `feature/` | New features |
| `bugfix` | `bugfix/` | Bug fixes |
| `infra` | `infra/` | Infrastructure changes |
| `doc` | `doc/` | Documentation updates |
| `misc` | `misc/` | Other (content hidden in changelog) |
## Runner Environment
The workflow runs on the `k8s` label, which uses the [[forgejo]]-runner in Kubernetes:
- **Runner deployment**: `argocd/manifests/forgejo-runner/`
- **Job image**: `registry.ops.eblu.me/blumeops/forgejo-runner:latest`
- **Includes**: Node.js 24, npm, git, jq, Docker CLI, uv/uvx, argocd CLI
The job image is built from `containers/forgejo-runner/Dockerfile`.
## Quartz Static Site Generator
[Quartz](https://quartz.jzhao.xyz/) builds the documentation into a static site with:
- Wiki-link support (`[[page]]` syntax)
- Backlinks panel showing what references each page
- Graph view of document connections
- Full-text search
**Configuration files** (in `docs/`):
- `quartz.config.ts` - Site metadata, plugins, theme
- `quartz.layout.ts` - Page layout components
Quartz is cloned fresh during each build (not vendored) to use the latest version.
## Manual Build (Local)
To test docs locally without triggering a release:
```bash
# Clone Quartz
git clone --depth 1 https://github.com/jackyzha0/quartz.git /tmp/quartz
cd /tmp/quartz
# Install dependencies
npm ci
# Copy config and content
cp /path/to/blumeops/docs/quartz.config.ts .
cp /path/to/blumeops/docs/quartz.layout.ts .
rm -rf content
cp -r /path/to/blumeops/docs content
# Build
npx quartz build
# Serve locally
npx quartz build --serve
```
## Troubleshooting
**Workflow fails on "Resolve version":**
- Check if the version already exists as a release
- Ensure version format is `vX.Y.Z`
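To check a version string before triggering the workflow, something like the following could be used. It is a sketch, not the workflow's actual logic: both function names are hypothetical, and bumping the patch component is an assumption about what "auto-increment" means.

```bash
# Hypothetical check for the vX.Y.Z format the workflow expects.
is_valid_version() {
  printf '%s\n' "$1" | grep -Eq '^v[0-9]+\.[0-9]+\.[0-9]+$'
}

# Assumption: auto-increment bumps the patch component (vX.Y.Z -> vX.Y.Z+1).
next_patch_version() {
  local rest major minor patch
  is_valid_version "$1" || return 1
  rest="${1#v}"          # strip the leading "v"
  major="${rest%%.*}"    # text before the first dot
  rest="${rest#*.}"
  minor="${rest%%.*}"
  patch="${rest#*.}"
  printf 'v%s.%s.%s\n' "$major" "$minor" "$((patch + 1))"
}
```

For example, `is_valid_version 1.2.0` fails because the leading `v` is missing, which is the kind of input the "Resolve version" step may reject.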
**Docs not updating after deploy:**
- Check ArgoCD sync status: `argocd app get docs`
- Verify the pod restarted: `kubectl --context=minikube-indri -n docs get pods`
- Check pod logs for download errors
**Towncrier not finding fragments:**
- Fragments must be in `docs/changelog.d/`
- Must have `.md` extension
- Must match pattern `<name>.<type>.md`
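The naming rules above can be checked mechanically. This is a minimal sketch with a hypothetical `is_valid_fragment` function; it assumes only the five types listed in this guide are configured and does not cover any other naming variants towncrier may accept.

```bash
# Hypothetical check: does a fragment file name match <name>.<type>.md
# with one of the types listed in this guide?
is_valid_fragment() {
  case "$(basename "$1")" in
    ?*.feature.md|?*.bugfix.md|?*.infra.md|?*.doc.md|?*.misc.md) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use: report fragments that would likely be skipped.
#   for f in docs/changelog.d/*; do
#     is_valid_fragment "$f" || echo "unexpected fragment name: $f"
#   done
```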
## Related
- [[docs]] - Documentation service reference
- [[forgejo]] - Git forge and CI/CD
- [[argocd]] - GitOps deployment


@ -0,0 +1,128 @@
---
title: update-tailscale-acls
tags:
- how-to
- tailscale
- pulumi
---
# Update Tailscale ACLs
How to modify Tailscale access control policies for the tailnet.
## Prerequisites
- Pulumi CLI installed (`brew install pulumi`)
- Access to 1Password blumeops vault (for OAuth credentials)
## Edit the Policy
The ACL policy lives in `pulumi/policy.hujson` (HuJSON format with comments).
Common changes:
### Add a new ACL rule
```json
{
  "acls": [
    // ... existing rules ...
    {
      "action": "accept",
      "src": ["autogroup:admin"],
      "dst": ["tag:newservice:*"]
    }
  ]
}
```
### Add a new tag
```json
{
  "tagOwners": {
    // ... existing tags ...
    "tag:newservice": ["autogroup:admin"]
  }
}
```
### Add a new group
```json
{
  "groups": {
    // ... existing groups ...
    "group:newgroup": ["user1@example.com", "user2@example.com"]
  }
}
```
## Preview and Apply
```bash
# Preview changes (always do this first)
mise run tailnet-preview
# Apply changes
mise run tailnet-up
# Skip confirmation prompt
mise run tailnet-up -- --yes
```
## Verify
Check the Tailscale admin console at https://login.tailscale.com/ to confirm changes.
## Common Patterns
### Service-specific access
Grant access to a specific service port:
```json
{
  "action": "accept",
  "src": ["group:users"],
  "dst": ["tag:homelab:8080"]
}
```
### SSH access
```json
{
  "ssh": [
    {
      "action": "check",
      "src": ["autogroup:admin"],
      "dst": ["tag:servers"],
      "users": ["autogroup:nonroot"]
    }
  ]
}
```
### All ports for admins
```json
{
  "action": "accept",
  "src": ["autogroup:admin"],
  "dst": ["*:*"]
}
```
## Troubleshooting
**"Credential expired" error:**
Re-authenticate Pulumi with Tailscale. The OAuth token may need refreshing.
**Changes not taking effect:**
ACL changes are applied immediately. If a device isn't following new rules, try `tailscale down && tailscale up` on that device.
## Related
- [[tailscale]] - ACL reference and current configuration
- [[routing]] - Service routing


@ -19,7 +19,7 @@ The docs follow the [Diataxis](https://diataxis.fr/) framework:
|---------|---------|-------------|
| **[[tutorials/index | Tutorials]]** | Learning-oriented | "I'm new and want to understand" |
| **[[reference/index | Reference]]** | Information-oriented | "I need specific technical details" |
| **How-to** (planned) | Task-oriented | "I need to do X" |
| **[[how-to/index | How-to]]** | Task-oriented | "I need to do X" |
| **Explanation** (planned) | Understanding-oriented | "I want to understand why" |
## Quick Paths by Audience
@ -27,6 +27,7 @@ The docs follow the [Diataxis](https://diataxis.fr/) framework:
### For Erich (Owner)
You probably want quick access to operational details:
- [[how-to/index|How-to guides]] for common operations (deploy, troubleshoot, update ACLs)
- [[reference/index|Reference]] has service URLs, commands, and config locations
- The `zk-docs` mise task still works for legacy zettelkasten access
- [[ai-assistance-guide]] explains how to work effectively with Claude
@ -49,6 +50,7 @@ Understanding what this is:
Getting started with changes:
- [[contributing]] walks through the workflow
- [[how-to/index|How-to guides]] for specific tasks (deploy services, add roles)
- [[reference/index|Reference]] tells you where things live
### For Replicators