Commit graph

7 commits

Author SHA1 Message Date
16bfe06b7b Fix LaunchDaemon check to use become: true
LaunchDaemons run in the system domain and require sudo to query.
Without become: true, the check always fails and tries to reload.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-22 17:34:23 -08:00
57bf8512dc Log filtering cleanup and observability improvements (#45)
## Summary
- Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue)
- Disable thermal collector on indri Alloy (not supported on macOS M1)
- Add macOS power/thermal metrics collection via powermetrics LaunchDaemon
- Add Power & Thermal section to macOS Grafana dashboard
- Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.)
- Extract more fields from JSON logs (zot compatibility - uses "message" not "msg")
- Silence logfmt parse errors for non-logfmt logs
- Fix JSON escaping in devpi dashboard

## Deployment and Testing
- [x] Deployed Alloy config changes to indri via ansible
- [x] Synced alloy-k8s and grafana-config via ArgoCD
- [x] Verified power metrics appearing in Prometheus
- [x] Verified thermal collector errors stopped
- [x] Verified logfmt parse errors silenced
- [x] Verified devpi dashboard loads correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45
2026-01-22 17:30:08 -08:00
17023085cb Migrate observability stack to Kubernetes (#42)
Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack.

Summary

  - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal)
  - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses
  - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics
  - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net)
  - Add ACL rule for port 9187 (CNPG metrics)
  - Delete obsolete ansible roles for prometheus and loki

Changes

  - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress
  - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress
  - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications
  - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS
  - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint
  - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics
  - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints
  - pulumi/policy.hujson - ACL for port 9187
  - Deleted ansible/roles/prometheus/ and ansible/roles/loki/

Deployment and Testing

  - Stop prometheus and loki on indri
  - Sync ArgoCD apps (apps, prometheus, loki, grafana)
  - Run mise run provision-indri -- --tags alloy
  - Verify Grafana dashboards show data

🤖 Generated with https://claude.ai/claude-code

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42
2026-01-22 12:06:02 -08:00
9931829d03 Add pre-commit hooks for code quality (#19)
## Summary
- Add pre-commit framework with hooks for YAML, Ansible, Python, shell, TOML, JSON, and secret detection
- Fix all 91+ ansible-lint violations (variable naming, handler capitalization, changed_when)
- Fix shellcheck warnings in mise-tasks scripts
- Document pre-commit setup in README.md

## Deployment and Testing
- [x] All pre-commit hooks pass (`uvx pre-commit run --all-files`)
- [x] Test ansible playbook with `--check` mode
- [x] Run `mise run indri-services-check` after deploy

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/19
2026-01-16 19:33:02 -08:00
adf6f4fbe9 Add PostgreSQL and Miniflux services to tailnet (#16)
## Summary
- Add PostgreSQL 18 as a new service at `pg.tail8d86e.ts.net:5432`
- Add Miniflux RSS/Atom feed reader at `feed.tail8d86e.ts.net`
- Both services managed via homebrew/brew services
- Pulumi ACL tags added (tag:pg, tag:feed)
- Alloy log collection configured for both services
- Zettelkasten documentation updated

## Manual Setup Required

Before running ansible, the following steps are needed on indri:

### 1. Apply Pulumi tags
```bash
mise run tailnet-up
```
Then apply tags to indri in Tailscale admin console.

### 2. Create 1Password entries
- miniflux PostgreSQL user password
- miniflux admin password (for first run)

### 3. Set PostgreSQL user password (after ansible installs postgres)
```bash
ssh indri '/opt/homebrew/opt/postgresql@18/bin/psql -c "ALTER USER miniflux PASSWORD '\''your-password'\'';"'
```

### 4. Create password files on indri
```bash
ssh indri 'echo "your-db-password" > ~/.miniflux-db-password && chmod 600 ~/.miniflux-db-password'
ssh indri 'echo "your-admin-password" > ~/.miniflux-admin-password && chmod 600 ~/.miniflux-admin-password'
```

### 5. Create ~/.pgpass for borgmatic
```bash
ssh indri 'echo "localhost:5432:miniflux:miniflux:YOUR_PASSWORD" > ~/.pgpass && chmod 600 ~/.pgpass'
```

### 6. Run ansible with first-run admin creation
```bash
mise run provision-indri -- -e miniflux_create_admin=1
```

### 7. Update borgmatic config
Add to `~/.config/borgmatic/config.yaml` on indri:
```yaml
postgresql_databases:
    - name: miniflux
      hostname: localhost
      port: 5432
      username: miniflux
```

### 8. Cleanup after first run
```bash
ssh indri 'rm ~/.miniflux-admin-password'
```

## Test plan
- [ ] Run `mise run tailnet-up` and verify Pulumi changes
- [ ] Apply tags to indri in Tailscale admin
- [ ] Run `mise run provision-indri -- --check --diff` for dry run
- [ ] Run `mise run provision-indri -- -e miniflux_create_admin=1`
- [ ] Approve services in Tailscale admin
- [ ] Verify PostgreSQL: `ssh indri '/opt/homebrew/opt/postgresql@18/bin/pg_isready'`
- [ ] Verify Miniflux: `curl https://feed.tail8d86e.ts.net/healthcheck`
- [ ] Run `mise run indri-services-check`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/16
2026-01-16 12:30:20 -08:00
2a1359a3b6 Fix ansible handler timeouts for alloy and loki restarts (#12)
## Summary
- Use async with poll: 0 for alloy and loki restart handlers
- Fire-and-forget approach prevents ansible from hanging on graceful shutdown

## Test plan
- [x] Manually verified `brew services restart grafana-alloy` works
- [x] Run full ansible playbook and verify it completes without timeout

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/12
2026-01-15 13:56:11 -08:00
242c1880de Add Grafana Alloy and Loki for unified observability (#11)
## Summary
- Add Grafana Alloy to replace node_exporter for metrics collection
- Add Loki for log aggregation and storage
- Configure Alloy to collect logs from all services (grafana, forgejo, prometheus, tailscale, transmission, devpi, kiwix, borgmatic)
- Update Prometheus to accept metrics via remote_write
- Add Loki datasource to Grafana

## Test plan
- [ ] Run \`mise run provision-indri -- --check --diff\` to verify changes
- [ ] Apply with \`mise run provision-indri\`
- [ ] Verify services: \`mise run indri-services-check\`
- [ ] Check Grafana Explore with Loki datasource
- [ ] Query logs: \`{service="grafana"}\`
- [ ] Verify metrics still flowing to Prometheus dashboards

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/11
2026-01-15 12:24:13 -08:00