diff --git a/docs/README.md b/docs/README.md
index 1416361..d7c425e 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -98,16 +98,18 @@ Learning-oriented content for getting started. Each tutorial explicitly identifi
 
 **Tutorials URL:** https://docs.ops.eblu.me/tutorials/
 
-### Phase 4: How-to Guides
+### Phase 4: How-to Guides (Complete)
 Task-oriented instructions for specific operations.
 
-- [ ] Create `how-to/` directory
-- [ ] Migrate operational content from zk cards
-- [ ] "How to deploy a new Kubernetes service"
-- [ ] "How to add a new Ansible role"
-- [ ] "How to update Tailscale ACLs"
-- [ ] "How to troubleshoot common issues"
-- [ ] Update `exploring-the-docs` with How-to section
+- [x] Create `how-to/` directory
+- [x] Migrate operational content from zk cards
+- [x] "How to deploy a new Kubernetes service"
+- [x] "How to add a new Ansible role"
+- [x] "How to update Tailscale ACLs"
+- [x] "How to troubleshoot common issues"
+- [x] Update `exploring-the-docs` with How-to section
+
+**How-to URL:** https://docs.ops.eblu.me/how-to/
 
 ### Phase 5: Explanation
 Understanding-oriented discussion of concepts and decisions.
diff --git a/docs/changelog.d/phase4-how-to-guides.doc.md b/docs/changelog.d/phase4-how-to-guides.doc.md
new file mode 100644
index 0000000..f428293
--- /dev/null
+++ b/docs/changelog.d/phase4-how-to-guides.doc.md
@@ -0,0 +1 @@
+Add Phase 4 how-to guides: deploy k8s services, add ansible roles, update tailscale ACLs, and troubleshooting
diff --git a/docs/how-to/add-ansible-role.md b/docs/how-to/add-ansible-role.md
new file mode 100644
index 0000000..a8ab8d5
--- /dev/null
+++ b/docs/how-to/add-ansible-role.md
@@ -0,0 +1,141 @@
+---
+title: add-ansible-role
+tags:
+  - how-to
+  - ansible
+---
+
+# Add an Ansible Role
+
+Quick reference for adding a new Ansible role to provision services on [[indri]].
+
+## Create Role Structure
+
+```
+ansible/roles//
+├── defaults/main.yml # Default variables
+├── tasks/main.yml # Task definitions
+├── handlers/main.yml # Handlers (restarts, etc.)
+├── templates/ # Jinja2 templates
+└── files/ # Static files (optional)
+```
+
+## Minimal Role Example
+
+```yaml
+# ansible/roles//defaults/main.yml
+---
+role_data_dir: ~/Library/Application Support/
+role_port: 8080
+```
+
+```yaml
+# ansible/roles//tasks/main.yml
+---
+- name: Ensure data directory exists
+  ansible.builtin.file:
+    path: "{{ role_data_dir }}"
+    state: directory
+    mode: '0755'
+
+- name: Deploy configuration
+  ansible.builtin.template:
+    src: config.j2
+    dest: "{{ role_data_dir }}/config"
+    mode: '0644'
+  notify: Restart service
+
+- name: Deploy LaunchAgent plist
+  ansible.builtin.template:
+    src: launchagent.plist.j2
+    dest: ~/Library/LaunchAgents/mcquack..plist
+    mode: '0644'
+  notify: Restart service
+```
+
+```yaml
+# ansible/roles//handlers/main.yml
+---
+- name: Restart service
+  ansible.builtin.shell: |
+    launchctl unload ~/Library/LaunchAgents/mcquack..plist 2>/dev/null || true
+    launchctl load ~/Library/LaunchAgents/mcquack..plist
+  listen: Restart service
+```
+
+## Add Role to Playbook
+
+Edit `ansible/playbooks/indri.yml`:
+
+```yaml
+  roles:
+    # ... existing roles ...
+    - role:
+      tags: []
+```
+
+## Add Secrets (if needed)
+
+If the role needs secrets from 1Password, add pre_tasks:
+
+```yaml
+  pre_tasks:
+    # ... existing pre_tasks ...
+    - name: Fetch secret
+      ansible.builtin.command:
+        cmd: op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get --fields --reveal
+      delegate_to: localhost
+      register: _role_secret
+      changed_when: false
+      no_log: true
+      check_mode: false
+      tags: []
+
+    - name: Set secret fact
+      ansible.builtin.set_fact:
+        role_secret_var: "{{ _role_secret.stdout }}"
+      no_log: true
+      tags: []
+```
+
+Then use `role_secret_var` in your role with a guard:
+
+```yaml
+# In role's tasks, fetch if not already set (allows running with --tags)
+- name: Fetch secret if not set
+  ansible.builtin.command:
+    cmd: op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get --fields --reveal
+  delegate_to: localhost
+  register: _role_secret
+  changed_when: false
+  no_log: true
+  check_mode: false
+  when: role_secret_var is not defined
+```
+
+## Test and Deploy
+
+```bash
+# Dry run
+mise run provision-indri -- --tags --check --diff
+
+# Apply
+mise run provision-indri -- --tags
+
+# Verify
+ssh indri 'launchctl list | grep '
+```
+
+## Add Observability (optional)
+
+For metrics collection, create a companion `_metrics` role that:
+1. Writes metrics to `/opt/homebrew/var/node_exporter/textfile/`
+2. Runs via a LaunchAgent (cronjob-style)
+
+See [[alloy]] for how metrics are collected from textfiles.
+
+## Related
+
+- [[reference/ansible/roles|Roles]] - Available roles reference
+- [[indri]] - Target host
+- [[observability]] - Metrics collection
diff --git a/docs/how-to/deploy-k8s-service.md b/docs/how-to/deploy-k8s-service.md
new file mode 100644
index 0000000..8dffdce
--- /dev/null
+++ b/docs/how-to/deploy-k8s-service.md
@@ -0,0 +1,126 @@
+---
+title: deploy-k8s-service
+tags:
+  - how-to
+  - kubernetes
+  - argocd
+---
+
+# Deploy a Kubernetes Service
+
+Quick reference for deploying a new service to BlumeOps Kubernetes via ArgoCD. See [[adding-a-service|the tutorial]] for detailed explanations.
+
+## Create Manifests
+
+```
+argocd/manifests//
+├── deployment.yaml
+├── service.yaml
+└── ingress-tailscale.yaml
+```
+
+Namespace should match service name. Use `registry.ops.eblu.me` for images.
+
+## Create ArgoCD Application
+
+```yaml
+# argocd/apps/.yaml
+apiVersion: argoproj.io/v1alpha1
+kind: Application
+metadata:
+  name:
+  namespace: argocd
+spec:
+  project: default
+  source:
+    repoURL: ssh://forgejo@indri.tail8d86e.ts.net:2200/eblume/blumeops.git
+    targetRevision: main
+    path: argocd/manifests/
+  destination:
+    server: https://kubernetes.default.svc
+    namespace:
+  syncPolicy:
+    syncOptions:
+      - CreateNamespace=true
+```
+
+## Configure Ingress
+
+Add [[tailscale-operator|Tailscale Ingress]] with Homepage annotations:
+
+```yaml
+apiVersion: networking.k8s.io/v1
+kind: Ingress
+metadata:
+  name:
+  namespace:
+  annotations:
+    gethomepage.dev/enabled: "true"
+    gethomepage.dev/name: "Service Name"
+    gethomepage.dev/group: "Apps"
+    gethomepage.dev/icon: ".png"
+    gethomepage.dev/href: "https://.ops.eblu.me"
+    gethomepage.dev/pod-selector: "app="
+spec:
+  ingressClassName: tailscale
+  rules:
+    - host:
+      http:
+        paths:
+          - path: /
+            pathType: Prefix
+            backend:
+              service:
+                name:
+                port:
+                  number: 80
+```
+
+## Add Caddy Route (if needed)
+
+If other pods need to access the service, add to `ansible/roles/caddy/defaults/main.yml`:
+
+```yaml
+caddy_services:
+  - name:
+    upstream: "https://.tail8d86e.ts.net"
+```
+
+Then: `mise run provision-indri -- --tags caddy`
+
+See [[routing]] for when Caddy is needed.
+
+## Deploy
+
+```bash
+# Sync apps to pick up new Application
+argocd app sync apps
+
+# Test on feature branch first
+argocd app set --revision
+argocd app sync
+
+# Verify
+kubectl --context=minikube-indri -n get pods
+kubectl --context=minikube-indri -n logs -f deployment/
+
+# After PR merge, reset to main
+argocd app set --revision main
+argocd app sync
+```
+
+## Checklist
+
+- [ ] Manifests in `argocd/manifests//`
+- [ ] Application in `argocd/apps/.yaml`
+- [ ] Tailscale Ingress with Homepage annotations
+- [ ] Caddy route (if pod-to-service access needed)
+- [ ] Tested on feature branch
+- [ ] PR reviewed and merged
+- [ ] Reset to main branch
+
+## Related
+
+- [[adding-a-service]] - Full tutorial with explanations
+- [[apps]] - ArgoCD application registry
+- [[routing]] - Service routing options
diff --git a/docs/how-to/index.md b/docs/how-to/index.md
new file mode 100644
index 0000000..6f1a29a
--- /dev/null
+++ b/docs/how-to/index.md
@@ -0,0 +1,34 @@
+---
+title: how-to
+tags:
+  - how-to
+---
+
+# How-To Guides
+
+Task-oriented instructions for common BlumeOps operations. These guides assume you already understand the basic concepts - see [[tutorials/index|Tutorials]] if you're learning.
+
+## Deployment
+
+| Guide | Description |
+|-------|-------------|
+| [[deploy-k8s-service]] | Deploy a new service to Kubernetes via ArgoCD |
+| [[add-ansible-role]] | Add a new Ansible role for indri services |
+
+## Configuration
+
+| Guide | Description |
+|-------|-------------|
+| [[update-tailscale-acls]] | Update Tailscale access control policies |
+
+## Documentation
+
+| Guide | Description |
+|-------|-------------|
+| [[update-documentation]] | Publish docs via build-blumeops workflow |
+
+## Operations
+
+| Guide | Description |
+|-------|-------------|
+| [[troubleshooting]] | Diagnose and fix common issues |
diff --git a/docs/how-to/troubleshooting.md b/docs/how-to/troubleshooting.md
new file mode 100644
index 0000000..b6c62fa
--- /dev/null
+++ b/docs/how-to/troubleshooting.md
@@ -0,0 +1,228 @@
+---
+title: troubleshooting
+tags:
+  - how-to
+  - operations
+---
+
+# Troubleshooting Common Issues
+
+Quick reference for diagnosing and fixing common BlumeOps issues.
+
+## General Health Check
+
+Run the comprehensive service health check:
+
+```bash
+mise run indri-services-check
+```
+
+This checks all services on indri and in Kubernetes.
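The implementation of `indri-services-check` is not part of this diff. As a rough sketch of the same idea, a per-service check could wrap any command from this guide and print a one-line verdict. The `run_check` helper below is hypothetical and is not claimed to match what the mise task actually does.

```shell
#!/bin/sh
# Hypothetical helper, not part of the blumeops repository.
# Runs one named check command quietly and prints a one-line verdict.
run_check() {
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "ok   $name"
  else
    echo "FAIL $name"
    return 1
  fi
}

# Example (not executed here): the Forgejo and Zot port checks from this guide.
# run_check forgejo ssh indri 'lsof -nP -iTCP:3001 -sTCP:LISTEN'
# run_check zot     ssh indri 'lsof -nP -iTCP:5050 -sTCP:LISTEN'
```

Because the helper returns non-zero on failure, several calls can be chained in a script that exits with an error if any service is down.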
+
+## Kubernetes Issues
+
+### Pod not starting
+
+```bash
+# Check pod status
+kubectl --context=minikube-indri -n get pods
+
+# Describe pod for events
+kubectl --context=minikube-indri -n describe pod
+
+# Check logs
+kubectl --context=minikube-indri -n logs
+
+# Previous container logs (if restarting)
+kubectl --context=minikube-indri -n logs --previous
+```
+
+Common causes:
+- **ImagePullBackOff** - Image doesn't exist or registry unreachable
+- **CrashLoopBackOff** - Application crashing; check logs
+- **Pending** - Insufficient resources or node issues
+- **ContainerCreating** - Waiting for volumes or secrets
+
+### ArgoCD sync issues
+
+```bash
+# Check app status
+argocd app get
+
+# See what will change
+argocd app diff
+
+# Force sync
+argocd app sync --force
+
+# Sync with prune (removes deleted resources)
+argocd app sync --prune
+```
+
+**App stuck in "Syncing":**
+Check if there are failed hooks or jobs:
+```bash
+kubectl --context=minikube-indri -n get jobs
+kubectl --context=minikube-indri -n get pods --field-selector=status.phase=Failed
+```
+
+**ArgoCD login expired:**
+```bash
+argocd login argocd.ops.eblu.me --username admin --password "$(op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get srogeebssulhtb6tnqd7ls6qey --fields password --reveal)"
+```
+
+### kubectl connection refused
+
+```bash
+# Check if minikube is running (on indri)
+ssh indri 'minikube status'
+
+# Restart if needed
+ssh indri 'minikube start'
+
+# Verify tailscale is serving the API
+ssh indri 'tailscale serve status --json'
+```
+
+## Indri Service Issues
+
+### Service not responding
+
+```bash
+# Check LaunchAgent status
+ssh indri 'launchctl list | grep mcquack'
+
+# Restart a LaunchAgent
+ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack..plist'
+ssh indri 'launchctl load ~/Library/LaunchAgents/mcquack..plist'
+
+# Check service logs
+ssh indri 'tail -50 ~/Library/Logs/mcquack..err.log'
+ssh indri 'tail -50 ~/Library/Logs/mcquack..out.log'
+```
+
+### Forgejo not accessible
+
+```bash
+# Check if forgejo is running
+ssh indri 'lsof -nP -iTCP:3001 -sTCP:LISTEN'
+
+# Check logs
+ssh indri 'tail -50 ~/Library/Logs/mcquack.forgejo.err.log'
+
+# Restart forgejo
+ssh indri 'launchctl kickstart -k gui/$(id -u)/mcquack.forgejo'
+```
+
+### Registry (Zot) issues
+
+```bash
+# Test registry API
+ssh indri 'curl -s http://localhost:5050/v2/_catalog | jq'
+
+# Check if zot is running
+ssh indri 'lsof -nP -iTCP:5050 -sTCP:LISTEN'
+
+# Restart zot
+ssh indri 'launchctl kickstart -k gui/$(id -u)/mcquack.zot'
+```
+
+## Network Issues
+
+### Service unreachable via *.ops.eblu.me
+
+Caddy handles routing for `*.ops.eblu.me`:
+
+```bash
+# Check if Caddy is running
+ssh indri 'launchctl list | grep caddy'
+
+# View Caddy logs
+ssh indri 'tail -50 ~/Library/Logs/caddy/access.log'
+ssh indri 'tail -50 ~/Library/Logs/caddy/error.log'
+
+# Restart Caddy
+ssh indri 'launchctl kickstart -k gui/$(id -u)/homebrew.mxcl.caddy'
+```
+
+### Tailscale MagicDNS not resolving
+
+```bash
+# Check tailscale serve status
+ssh indri 'tailscale serve status --json'
+
+# Restart tailscale if needed
+ssh indri 'tailscale down && tailscale up'
+```
+
+## Observability
+
+### Check metrics
+
+```bash
+# Open Grafana
+open https://grafana.ops.eblu.me
+
+# Check Prometheus directly
+open https://prometheus.ops.eblu.me
+```
+
+### Check logs
+
+```bash
+# Open Grafana Explore
+open https://grafana.ops.eblu.me/explore
+
+# Query Loki directly
+curl -G 'https://loki.ops.eblu.me/loki/api/v1/query_range' \
+  --data-urlencode 'query={service=""}' \
+  --data-urlencode 'limit=100'
+```
+
+### Alloy (metrics/logs collector) issues
+
+```bash
+# Indri alloy (host metrics)
+ssh indri 'launchctl list | grep alloy'
+ssh indri 'tail -50 ~/Library/Logs/alloy/alloy.log'
+
+# K8s alloy (pod logs)
+kubectl --context=minikube-indri -n monitoring logs -l app=alloy
+```
+
+## Database Issues
+
+### PostgreSQL connection failed
+
+```bash
+# Check CNPG cluster status
+kubectl --context=minikube-indri -n databases get cluster
+
+# Check PostgreSQL pods
+kubectl --context=minikube-indri -n databases get pods -l cnpg.io/cluster=blumeops-pg
+
+# Connect to database
+kubectl --context=minikube-indri -n databases exec -it blumeops-pg-1 -- psql -U postgres
+```
+
+## Backup Issues
+
+### Check backup status
+
+```bash
+# View latest backup info
+ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom'
+
+# Run backup manually
+ssh indri 'borgmatic --verbosity 1'
+
+# Check backup logs
+ssh indri 'tail -100 /opt/homebrew/var/log/borgmatic/borgmatic.log'
+```
+
+## Related
+
+- [[observability]] - Metrics and logs
+- [[argocd]] - GitOps platform
+- [[cluster]] - Kubernetes cluster
+- [[routing]] - Service routing
diff --git a/docs/how-to/update-documentation.md b/docs/how-to/update-documentation.md
new file mode 100644
index 0000000..be47f5e
--- /dev/null
+++ b/docs/how-to/update-documentation.md
@@ -0,0 +1,130 @@
+---
+title: update-documentation
+tags:
+  - how-to
+  - documentation
+  - ci-cd
+---
+
+# Update Documentation
+
+How to publish documentation changes to https://docs.ops.eblu.me.
+
+## Quick Release
+
+After merging documentation changes to main:
+
+1. Go to **Actions** > **Build BlumeOps** > **Run workflow**
+2. Enter a version (e.g., `v1.2.0`) or leave empty to auto-increment
+3. The workflow builds, releases, and deploys automatically
+
+Direct link: https://forge.ops.eblu.me/eblume/blumeops/actions?workflow=build-blumeops.yaml
+
+## What the Workflow Does
+
+The `build-blumeops` workflow (`.forgejo/workflows/build-blumeops.yaml`):
+
+1. **Resolves version** - Uses input or auto-increments from latest release
+2. **Builds changelog** - Runs towncrier to collect changelog fragments
+3. **Builds docs** - Clones Quartz, builds static site from `docs/`
+4. **Creates release** - Uploads `docs-.tar.gz` to Forgejo releases
+5. **Updates deployment** - Edits `argocd/manifests/docs/deployment.yaml` with new URL
+6. **Commits changes** - Pushes changelog and deployment updates to main
+7. **Deploys** - Syncs the `docs` ArgoCD app
+
+## Changelog Fragments (Towncrier)
+
+When making changes, add a changelog fragment to `docs/changelog.d/`:
+
+```bash
+# Format: ..md
+# Types: feature, bugfix, infra, doc, misc
+
+# Using branch name (preferred)
+echo "Add new feature X" > docs/changelog.d/my-feature.feature.md
+
+# Orphan fragment (when no branch fits)
+echo "Fix bug Y" > docs/changelog.d/+fix-bug.bugfix.md
+```
+
+Fragments are automatically collected into `docs/CHANGELOG.md` during release.
+
+**Fragment types:**
+| Type | Directory | Description |
+|------|-----------|-------------|
+| `feature` | `feature/` | New features |
+| `bugfix` | `bugfix/` | Bug fixes |
+| `infra` | `infra/` | Infrastructure changes |
+| `doc` | `doc/` | Documentation updates |
+| `misc` | `misc/` | Other (content hidden in changelog) |
+
+## Runner Environment
+
+The workflow runs on the `k8s` label, which uses the [[forgejo]]-runner in Kubernetes:
+
+- **Runner deployment**: `argocd/manifests/forgejo-runner/`
+- **Job image**: `registry.ops.eblu.me/blumeops/forgejo-runner:latest`
+- **Includes**: Node.js 24, npm, git, jq, Docker CLI, uv/uvx, argocd CLI
+
+The job image is built from `containers/forgejo-runner/Dockerfile`.
+
+## Quartz Static Site Generator
+
+[Quartz](https://quartz.jzhao.xyz/) builds the documentation into a static site with:
+- Wiki-link support (`[[page]]` syntax)
+- Backlinks panel showing what references each page
+- Graph view of document connections
+- Full-text search
+
+**Configuration files** (in `docs/`):
+- `quartz.config.ts` - Site metadata, plugins, theme
+- `quartz.layout.ts` - Page layout components
+
+Quartz is cloned fresh during each build (not vendored) to use the latest version.
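The fragment naming convention in the Changelog Fragments section can be checked before pushing. The sketch below assumes the name-then-type layout seen in the examples (`my-feature.feature.md`, `+fix-bug.bugfix.md`) and the five listed types. `check_fragment` is a hypothetical helper, not something the workflow or towncrier provides.

```shell
#!/bin/sh
# Hypothetical helper, not part of the blumeops repository.
# Checks that a fragment path looks like NAME.TYPE.md with a known TYPE.
check_fragment() {
  file="$(basename "$1")"
  case "$file" in
    *.*.md) ;;
    *)
      echo "invalid: $file (expected name.type.md)"
      return 1
      ;;
  esac
  kind="${file%.md}"   # drop the .md extension
  kind="${kind##*.}"   # keep what follows the last remaining dot
  case "$kind" in
    feature|bugfix|infra|doc|misc)
      echo "ok: $file ($kind)"
      ;;
    *)
      echo "invalid: $file (unknown type '$kind')"
      return 1
      ;;
  esac
}
```

For example, `check_fragment docs/changelog.d/phase4-how-to-guides.doc.md` accepts the fragment added in this change, while a file named `notes.md` is rejected for lacking a type segment.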
+
+## Manual Build (Local)
+
+To test docs locally without triggering a release:
+
+```bash
+# Clone Quartz
+git clone --depth 1 https://github.com/jackyzha0/quartz.git /tmp/quartz
+cd /tmp/quartz
+
+# Install dependencies
+npm ci
+
+# Copy config and content
+cp /path/to/blumeops/docs/quartz.config.ts .
+cp /path/to/blumeops/docs/quartz.layout.ts .
+rm -rf content
+cp -r /path/to/blumeops/docs content
+
+# Build
+npx quartz build
+
+# Serve locally
+npx quartz build --serve
+```
+
+## Troubleshooting
+
+**Workflow fails on "Resolve version":**
+- Check if the version already exists as a release
+- Ensure version format is `vX.Y.Z`
+
+**Docs not updating after deploy:**
+- Check ArgoCD sync status: `argocd app get docs`
+- Verify the pod restarted: `kubectl --context=minikube-indri -n docs get pods`
+- Check pod logs for download errors
+
+**Towncrier not finding fragments:**
+- Fragments must be in `docs/changelog.d/`
+- Must have `.md` extension
+- Must match pattern `..md`
+
+## Related
+
+- [[docs]] - Documentation service reference
+- [[forgejo]] - Git forge and CI/CD
+- [[argocd]] - GitOps deployment
diff --git a/docs/how-to/update-tailscale-acls.md b/docs/how-to/update-tailscale-acls.md
new file mode 100644
index 0000000..d6f654d
--- /dev/null
+++ b/docs/how-to/update-tailscale-acls.md
@@ -0,0 +1,128 @@
+---
+title: update-tailscale-acls
+tags:
+  - how-to
+  - tailscale
+  - pulumi
+---
+
+# Update Tailscale ACLs
+
+How to modify Tailscale access control policies for the tailnet.
+
+## Prerequisites
+
+- Pulumi CLI installed (`brew install pulumi`)
+- Access to 1Password blumeops vault (for OAuth credentials)
+
+## Edit the Policy
+
+The ACL policy lives in `pulumi/policy.hujson` (HuJSON format with comments).
+
+Common changes:
+
+### Add a new ACL rule
+
+```json
+{
+  "acls": [
+    // ... existing rules ...
+    {
+      "action": "accept",
+      "src": ["autogroup:admin"],
+      "dst": ["tag:newservice:*"]
+    }
+  ]
+}
+```
+
+### Add a new tag
+
+```json
+{
+  "tagOwners": {
+    // ... existing tags ...
+    "tag:newservice": ["autogroup:admin"]
+  }
+}
+```
+
+### Add a new group
+
+```json
+{
+  "groups": {
+    // ... existing groups ...
+    "group:newgroup": ["user1@example.com", "user2@example.com"]
+  }
+}
+```
+
+## Preview and Apply
+
+```bash
+# Preview changes (always do this first)
+mise run tailnet-preview
+
+# Apply changes
+mise run tailnet-up
+
+# Skip confirmation prompt
+mise run tailnet-up -- --yes
+```
+
+## Verify
+
+Check the Tailscale admin console at https://login.tailscale.com/ to confirm changes.
+
+## Common Patterns
+
+### Service-specific access
+
+Grant access to a specific service port:
+
+```json
+{
+  "action": "accept",
+  "src": ["group:users"],
+  "dst": ["tag:homelab:8080"]
+}
+```
+
+### SSH access
+
+```json
+{
+  "ssh": [
+    {
+      "action": "check",
+      "src": ["autogroup:admin"],
+      "dst": ["tag:servers"],
+      "users": ["autogroup:nonroot"]
+    }
+  ]
+}
+```
+
+### All ports for admins
+
+```json
+{
+  "action": "accept",
+  "src": ["autogroup:admin"],
+  "dst": ["*:*"]
+}
+```
+
+## Troubleshooting
+
+**"Credential expired" error:**
+Re-authenticate Pulumi with Tailscale. The OAuth token may need refreshing.
+
+**Changes not taking effect:**
+ACL changes are applied immediately. If a device isn't following new rules, try `tailscale down && tailscale up` on that device.
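A rule that references a tag usually goes together with a `tagOwners` entry for that tag, as in the "Add a new ACL rule" and "Add a new tag" examples. As a quick pre-flight before `mise run tailnet-preview`, a text-level heuristic can list tags that are used but never declared. The `undeclared_tags` helper below is hypothetical. It matches on plain text rather than parsing HuJSON, so a tag mentioned only in a comment is also reported.

```shell
#!/bin/sh
# Hypothetical helper, not part of the blumeops repository.
# Prints every tag:NAME referenced in a policy file that never appears
# as a quoted key ("tag:NAME":), which is how tagOwners entries look.
undeclared_tags() {
  policy="$1"
  for tag in $(grep -o 'tag:[A-Za-z0-9_-]*' "$policy" | sort -u); do
    if ! grep -q "\"$tag\"[[:space:]]*:" "$policy"; then
      echo "$tag"
    fi
  done
}
```

Running `undeclared_tags pulumi/policy.hujson` prints nothing when every referenced tag has an owner entry. The preview step remains the authoritative check.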
+
+## Related
+
+- [[tailscale]] - ACL reference and current configuration
+- [[routing]] - Service routing
diff --git a/docs/tutorials/exploring-the-docs.md b/docs/tutorials/exploring-the-docs.md
index 37b4c3d..b0e49b9 100644
--- a/docs/tutorials/exploring-the-docs.md
+++ b/docs/tutorials/exploring-the-docs.md
@@ -19,7 +19,7 @@ The docs follow the [Diataxis](https://diataxis.fr/) framework:
 |---------|---------|-------------|
 | **[[tutorials/index | Tutorials]]** | Learning-oriented | "I'm new and want to understand" |
 | **[[reference/index | Reference]]** | Information-oriented | "I need specific technical details" |
-| **How-to** (planned) | Task-oriented | "I need to do X" |
+| **[[how-to/index | How-to]]** | Task-oriented | "I need to do X" |
 | **Explanation** (planned) | Understanding-oriented | "I want to understand why" |
 
 ## Quick Paths by Audience
@@ -27,6 +27,7 @@ The docs follow the [Diataxis](https://diataxis.fr/) framework:
 ### For Erich (Owner)
 
 You probably want quick access to operational details:
+- [[how-to/index|How-to guides]] for common operations (deploy, troubleshoot, update ACLs)
 - [[reference/index|Reference]] has service URLs, commands, and config locations
 - The `zk-docs` mise task still works for legacy zettelkasten access
 - [[ai-assistance-guide]] explains how to work effectively with Claude
@@ -49,6 +50,7 @@ Understanding what this is:
 
 Getting started with changes:
 - [[contributing]] walks through the workflow
+- [[how-to/index|How-to guides]] for specific tasks (deploy services, add roles)
 - [[reference/index|Reference]] tells you where things live
 
 ### For Replicators