Observability cleanup and k8s service monitoring (#43)

Merged
eblume merged 9 commits from feature/observability-cleanup into main 2026-01-22 13:51:02 -08:00

Summary

  • Remove stale /opt/homebrew/var/loki from borgmatic backup (Loki migrated to k8s)
  • Add Alloy k8s DaemonSet for automatic pod log collection with auto-discovery
  • Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd
  • Add transmission-exporter sidecar for full metrics (speed, torrent counts, ratios)
  • Replace stale devpi dashboard with probe-based metrics (status, response time, uptime)
  • Add unified "K8s Services Health" dashboard for service uptime/response monitoring
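One way to check the blackbox probes listed above is to ask Prometheus for `probe_success`, the standard blackbox exporter result metric. This is only a sketch: `PROM_URL` is a placeholder because this PR does not give the Prometheus address, and `prom_query` is a hypothetical helper. It prints the `curl` command by default; set `DRY_RUN=0` to actually run it.

```shell
#!/bin/sh
set -eu

# Assumption: placeholder address, replace with the real Prometheus URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"
# DRY_RUN=1 (default) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: run an instant query against the Prometheus HTTP API.
prom_query() {
  query="$1"
  url="${PROM_URL}/api/v1/query"
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "curl -sG --data-urlencode 'query=${query}' '${url}'"
  else
    curl -sG --data-urlencode "query=${query}" "$url"
  fi
}

# All probe results (1 = probe succeeded, 0 = probe failed).
prom_query 'probe_success'
# Only the probes that are currently failing.
prom_query 'probe_success == 0'
```

An empty result for the second query, together with a non-empty result for the first, would suggest that every probed service is answering.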

Manual cleanup already performed

  • Deleted stale textfile metrics on indri: devpi.prom, transmission.prom
  • Deleted stale data directories on indri: /opt/homebrew/var/loki/, /opt/homebrew/var/prometheus/

Deployment and Testing

  • [x] Sync `apps` application to pick up new alloy-k8s app
  • [x] Deploy alloy-k8s on feature branch: `argocd app set alloy-k8s --revision feature/observability-cleanup && argocd app sync alloy-k8s`
  • [x] Deploy torrent on feature branch (for transmission exporter): `argocd app set torrent --revision feature/observability-cleanup && argocd app sync torrent`
  • [x] Deploy prometheus on feature branch (for new scrape config): `argocd app set prometheus --revision feature/observability-cleanup && argocd app sync prometheus`
  • [x] Deploy grafana-config on feature branch (for dashboards): `argocd app set grafana-config --revision feature/observability-cleanup && argocd app sync grafana-config`
  • [x] Verify pod logs appear in Loki/Grafana
  • [x] Verify transmission metrics appear in Prometheus
  • [x] Verify service probe metrics appear in Prometheus
  • [x] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic config
  • [x] After merge, reset apps to main and resync
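The four per-app deploy steps above repeat the same two `argocd` commands, so they could be looped. The following is a minimal sketch, not the script used for this PR. The app names and branch come from the checklist; `run` and `deploy_apps` are hypothetical helpers. It prints the commands by default; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
set -eu

# Branch name taken from the checklist; override with REVISION=main after merge.
REVISION="${REVISION:-feature/observability-cleanup}"
# DRY_RUN=1 (default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Point each named Argo CD app at the given revision, then sync it.
# With `set -e`, a failed `app set` stops the script before the sync,
# which matches the `&&` chaining in the checklist.
deploy_apps() {
  revision="$1"
  shift
  for app in "$@"; do
    run argocd app set "$app" --revision "$revision"
    run argocd app sync "$app"
  done
}

deploy_apps "$REVISION" alloy-k8s torrent prometheus grafana-config
```

If "reset apps to main" means pointing the same apps back at `main` (an assumption), the final step would be the same call with `REVISION=main`.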

🤖 Generated with Claude Code

- Remove stale /opt/homebrew/var/loki from borgmatic backup (Loki migrated to k8s)
- Add Alloy k8s DaemonSet for automatic pod log collection to Loki
- Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd
- Replace stale devpi/transmission dashboards with unified services health dashboard
- The new Alloy k8s deployment auto-discovers all pods including new ones

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add transmission-exporter sidecar for full metrics (speed, torrents, etc.)
- Add Prometheus scrape config for transmission metrics
- Update devpi dashboard to use blackbox probe metrics (status, response time, uptime)
- Restore transmission dashboard (will use exporter metrics)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Add namespace dropdown variable for filtering all panels
- Add kube-state-metrics for k8s resource tracking
- Show pods, deployments, statefulsets counts
- Show memory/CPU requests by namespace
- Show namespace resource summary table
- Filter pod logs by selected namespace

The metalmatze/transmission-exporter is unmaintained and has JSON parsing
issues with Transmission 4's API changes. Removing:
- Exporter sidecar from transmission deployment
- Transmission dashboard from Grafana
- Prometheus scrape config for transmission
- Metrics port from transmission service

TODO: Write custom transmission exporter

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Use node_memory_wired_bytes (macOS-specific metric) instead of
node_uname_info to populate the instance dropdown. This prevents
Linux hosts like sifaka from appearing in the macOS dashboard.
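
In Grafana, the dropdown query is presumably of the form `label_values(node_memory_wired_bytes, instance)`. The dashboard JSON is not shown here, so that is an assumption. To preview which hosts such a variable would list, the Prometheus label-values API can be restricted to one metric. A sketch, with `PROM_URL` as a placeholder and `instances_for_metric` as a hypothetical helper that only prints the `curl` command unless `DRY_RUN=0`:

```shell
#!/bin/sh
set -eu

# Assumption: placeholder address, replace with the real Prometheus URL.
PROM_URL="${PROM_URL:-http://localhost:9090}"
# DRY_RUN=1 (default) prints the command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: list `instance` label values for series of one metric.
instances_for_metric() {
  metric="$1"
  url="${PROM_URL}/api/v1/label/instance/values"
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "curl -sG --data-urlencode 'match[]=${metric}' '${url}'"
  else
    curl -sG --data-urlencode "match[]=${metric}" "$url"
  fi
}

# Per the commit message, only macOS hosts should expose this metric.
instances_for_metric node_memory_wired_bytes
# The previous source, which also matched Linux hosts such as sifaka.
instances_for_metric node_uname_info
```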

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
eblume merged commit e4a8405de7 into main 2026-01-22 13:51:02 -08:00
Reference: eblume/blumeops!43