blumeops/docs/how-to/troubleshooting.md
Erich Blume e720b524d3 Rename indri-services-check to services-check (#103)
2026-02-04 07:49:15 -08:00


title: troubleshooting
tags: how-to, operations
Troubleshooting Common Issues

Quick reference for diagnosing and fixing common BlumeOps issues.

General Health Check

Run the comprehensive service health check:

mise run services-check

This checks native services on indri, workloads in Kubernetes, and HTTP endpoints.

Kubernetes Issues

Pod not starting

# Check pod status
kubectl --context=minikube-indri -n <namespace> get pods

# Describe pod for events
kubectl --context=minikube-indri -n <namespace> describe pod <pod>

# Check logs
kubectl --context=minikube-indri -n <namespace> logs <pod>

# Previous container logs (if restarting)
kubectl --context=minikube-indri -n <namespace> logs <pod> --previous

Common causes:

  • ImagePullBackOff - Image doesn't exist or registry unreachable
  • CrashLoopBackOff - Application crashing; check logs
  • Pending - Insufficient resources or node issues
  • ContainerCreating - Waiting for volumes or secrets
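
The status-to-cause mapping above can be sketched as a small shell helper that annotates `kubectl get pods` output. This is an illustration only: `pod_status_hint` and `triage_pods` are hypothetical names, not part of BlumeOps, and the sketch assumes the default column order of `kubectl get pods --no-headers` (name, ready, status, ...).

```shell
# Hypothetical helper: map a pod STATUS value to the likely cause listed above.
pod_status_hint() {
  case "$1" in
    ImagePullBackOff)  echo "image doesn't exist or registry unreachable" ;;
    CrashLoopBackOff)  echo "application crashing; check logs" ;;
    Pending)           echo "insufficient resources or node issues" ;;
    ContainerCreating) echo "waiting for volumes or secrets" ;;
    *)                 echo "no hint; describe the pod for events" ;;
  esac
}

# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin and
# print "<pod> <status> <hint>" (tab-separated) for pods that are not
# Running or Completed.
triage_pods() {
  while read -r name _ready status _rest; do
    [ -n "$name" ] || continue
    case "$status" in
      Running|Completed) continue ;;
    esac
    printf '%s\t%s\t%s\n' "$name" "$status" "$(pod_status_hint "$status")"
  done
}

# Usage (substitute a real namespace for the placeholder):
#   kubectl --context=minikube-indri -n <namespace> get pods --no-headers | triage_pods
```

Pods that report Running but have containers that are not ready are skipped by this sketch; `describe pod` remains the place to look for those.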

ArgoCD sync issues

# Check app status
argocd app get <app>

# See what will change
argocd app diff <app>

# Force sync
argocd app sync <app> --force

# Sync with prune (removes deleted resources)
argocd app sync <app> --prune

App stuck in "Syncing": Check if there are failed hooks or jobs:

kubectl --context=minikube-indri -n <namespace> get jobs
kubectl --context=minikube-indri -n <namespace> get pods --field-selector=status.phase=Failed

ArgoCD login expired:

argocd login argocd.ops.eblu.me --username admin --password "$(op --vault vg6xf6vvfmoh5hqjjhlhbeoaie item get srogeebssulhtb6tnqd7ls6qey --fields password --reveal)"

kubectl connection refused

# Check if minikube is running (on indri)
ssh indri 'minikube status'

# Restart if needed
ssh indri 'minikube start'

# Verify tailscale is serving the API
ssh indri 'tailscale serve status --json'

Indri Service Issues

Service not responding

# Check LaunchAgent status
ssh indri 'launchctl list | grep mcquack'

# Restart a LaunchAgent
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.<service>.plist'
ssh indri 'launchctl load ~/Library/LaunchAgents/mcquack.<service>.plist'

# Check service logs
ssh indri 'tail -50 ~/Library/Logs/mcquack.<service>.err.log'
ssh indri 'tail -50 ~/Library/Logs/mcquack.<service>.out.log'

Forgejo not accessible

# Check if forgejo is running
ssh indri 'lsof -nP -iTCP:3001 -sTCP:LISTEN'

# Check logs
ssh indri 'tail -50 ~/Library/Logs/mcquack.forgejo.err.log'

# Restart forgejo
ssh indri 'launchctl kickstart -k gui/$(id -u)/mcquack.forgejo'

Registry (Zot) issues

# Test registry API
ssh indri 'curl -s http://localhost:5050/v2/_catalog | jq'

# Check if zot is running
ssh indri 'lsof -nP -iTCP:5050 -sTCP:LISTEN'

# Restart zot
ssh indri 'launchctl kickstart -k gui/$(id -u)/mcquack.zot'
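
The Forgejo and Zot checks above follow the same pattern: test whether the port is listening, and restart the LaunchAgent if it is not. A minimal sketch of that pattern follows, assuming the `mcquack.<service>` label convention and the ports shown above; the function names (`run_on_host`, `port_listening`, `kickstart_cmd`, `ensure_listening`) are hypothetical.

```shell
# Run a command string on the host. Redefine this function to test locally.
run_on_host() { ssh indri "$1"; }

# Succeeds if something is listening on the given TCP port on the host
# (lsof exits non-zero when it finds no matching socket).
port_listening() {
  run_on_host "lsof -nP -iTCP:$1 -sTCP:LISTEN" >/dev/null 2>&1
}

# Print the restart command for a mcquack LaunchAgent. The single-quoted
# format keeps $(id -u) literal so it is evaluated on the host, not locally.
kickstart_cmd() {
  printf 'launchctl kickstart -k gui/$(id -u)/mcquack.%s\n' "$1"
}

# Restart the service only when its port is not listening.
ensure_listening() {
  service=$1
  port=$2
  if port_listening "$port"; then
    echo "$service: listening on $port"
  else
    echo "$service: not listening on $port, restarting"
    run_on_host "$(kickstart_cmd "$service")"
  fi
}

# Usage:
#   ensure_listening forgejo 3001
#   ensure_listening zot 5050
```

A listening port does not prove the service is healthy, so the log and API checks above still apply when the port is open but requests fail.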

Network Issues

Service unreachable via *.ops.eblu.me

Caddy handles routing for *.ops.eblu.me:

# Check if Caddy is running
ssh indri 'launchctl list | grep caddy'

# View Caddy logs
ssh indri 'tail -50 ~/Library/Logs/caddy/access.log'
ssh indri 'tail -50 ~/Library/Logs/caddy/error.log'

# Restart Caddy
ssh indri 'launchctl kickstart -k gui/$(id -u)/homebrew.mxcl.caddy'

Tailscale MagicDNS not resolving

# Check tailscale serve status
ssh indri 'tailscale serve status --json'

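# Restart tailscale if needed. If this SSH session itself runs over Tailscale,
# `tailscale down` can drop it before `tailscale up` runs; use a local console instead.
ssh indri 'tailscale down && tailscale up'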

Observability

Check metrics

# Open Grafana
open https://grafana.ops.eblu.me

# Check Prometheus directly
open https://prometheus.ops.eblu.me

Check logs

# Open Grafana Explore
open https://grafana.ops.eblu.me/explore

# Query Loki directly
curl -G 'https://loki.ops.eblu.me/loki/api/v1/query_range' \
  --data-urlencode 'query={service="<service>"}' \
  --data-urlencode 'limit=100'
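
When the default time window is too narrow or too wide, the same query can be bounded explicitly with the `start` and `end` parameters of `query_range`. The sketch below assumes the `service` label used above; `loki_selector` and `loki_recent` are hypothetical helper names, and `forgejo` as a label value is only an example.

```shell
# Build a LogQL stream selector for one value of the `service` label.
loki_selector() {
  printf '{service="%s"}\n' "$1"
}

# Query the last N minutes (default 15) of logs for a service.
# start/end are sent as Unix epoch timestamps in nanoseconds.
loki_recent() {
  service=$1
  minutes=${2:-15}
  end_s=$(date +%s)
  start_s=$((end_s - minutes * 60))
  curl -sG 'https://loki.ops.eblu.me/loki/api/v1/query_range' \
    --data-urlencode "query=$(loki_selector "$service")" \
    --data-urlencode "start=${start_s}000000000" \
    --data-urlencode "end=${end_s}000000000" \
    --data-urlencode 'limit=100'
}

# Usage: print only the log lines (assumes jq is installed locally)
#   loki_recent forgejo 30 | jq -r '.data.result[].values[][1]'
```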

Alloy (metrics/logs collector) issues

# Indri alloy (host metrics)
ssh indri 'launchctl list | grep alloy'
ssh indri 'tail -50 ~/Library/Logs/alloy/alloy.log'

# K8s alloy (pod logs)
kubectl --context=minikube-indri -n monitoring logs -l app=alloy

Database Issues

PostgreSQL connection failed

# Check CNPG cluster status
kubectl --context=minikube-indri -n databases get cluster

# Check PostgreSQL pods
kubectl --context=minikube-indri -n databases get pods -l cnpg.io/cluster=blumeops-pg

# Connect to database
kubectl --context=minikube-indri -n databases exec -it blumeops-pg-1 -- psql -U postgres

Backup Issues

Check backup status

# View latest backup info
ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom'

# Run backup manually
ssh indri 'borgmatic --verbosity 1'

# Check backup logs
ssh indri 'tail -100 /opt/homebrew/var/log/borgmatic/borgmatic.log'
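
The `borgmatic.prom` file is in the Prometheus text format, so a single metric can be pulled out of it with a short filter. The sketch below is generic; the metric name in the usage comment (`borgmatic_last_success_timestamp`) is a hypothetical example, so check the file for the names it actually exports. `prom_metric_value` and `prom_metric_age` are likewise hypothetical helper names.

```shell
# Print the value of a metric from Prometheus text-format input on stdin.
# Matches both bare metrics ("name 1") and labelled ones ("name{...} 1").
prom_metric_value() {
  awk -v name="$1" '
    /^#/ { next }
    $1 == name || index($1, name "{") == 1 { print $NF }
  '
}

# Print the age in seconds of a metric whose value is a Unix timestamp.
# Returns non-zero if the metric is not present in the input.
prom_metric_age() {
  ts=$(prom_metric_value "$1" | head -n 1)
  [ -n "$ts" ] || return 1
  awk -v ts="$ts" -v now="$(date +%s)" 'BEGIN { printf "%d\n", now - ts }'
}

# Usage (the metric name is an assumption; substitute one from the file):
#   ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom' \
#     | prom_metric_age borgmatic_last_success_timestamp
```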