Now that ArgoCD's Authentik OAuth2 client is public, `argocd login --sso` works for day-to-day use. Promote it to the default in AGENTS.md, the argocd-cli reference, and troubleshooting; keep the admin/password flow documented as a break-glass fallback for when Authentik is unavailable.

Also drop `--grpc-web` from every interactive login command; it is confirmed extraneous (login succeeds without it). The flag is left untouched in CI workflows and `argocd cluster add`; those are different contexts that I didn't re-test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| title | modified | last-reviewed | tags |
|---|---|---|---|
| Troubleshooting | 2026-03-16 | 2026-03-16 | |
Troubleshooting Common Issues
Quick reference for diagnosing and fixing common BlumeOps issues.
General Health Check
Run the comprehensive service health check:
mise run services-check
This checks all services on indri and in Kubernetes.
Kubernetes Issues (Indri / Minikube)
Most services run on indri's minikube. For ringtail (k3s) services, see the ringtail section below.
Pod not starting
# Check pod status
kubectl --context=minikube-indri -n <namespace> get pods
# Describe pod for events
kubectl --context=minikube-indri -n <namespace> describe pod <pod>
# Check logs
kubectl --context=minikube-indri -n <namespace> logs <pod>
# Previous container logs (if restarting)
kubectl --context=minikube-indri -n <namespace> logs <pod> --previous
Common causes:
- ImagePullBackOff - Image doesn't exist or registry unreachable
- CrashLoopBackOff - Application crashing; check logs
- Pending - Insufficient resources or node issues
- ContainerCreating - Waiting for volumes or secrets
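The causes above can be turned into a quick triage pass over a namespace. This is a minimal sketch, assuming the default column layout of `kubectl get pods --no-headers` (NAME, READY, STATUS, ...). `triage_pods` is a hypothetical helper name, not something that exists in BlumeOps.

```shell
# triage_pods: read `kubectl get pods --no-headers` output on stdin and
# print one hint per unhealthy pod, using the common causes listed above.
# Hypothetical helper; assumes STATUS is the third whitespace-separated column.
triage_pods() {
  while read -r name _ready status _rest; do
    [ -n "$name" ] || continue            # skip blank lines
    case "$status" in
      Running|Completed) continue ;;      # healthy, nothing to report
      ImagePullBackOff|ErrImagePull) hint="image doesn't exist or registry unreachable" ;;
      CrashLoopBackOff) hint="application crashing; check logs" ;;
      Pending) hint="insufficient resources or node issues" ;;
      ContainerCreating) hint="waiting for volumes or secrets" ;;
      *) hint="see 'describe pod' events" ;;
    esac
    printf '%s: %s (%s)\n' "$name" "$status" "$hint"
  done
}

# Usage (replace <namespace> first):
#   kubectl --context=minikube-indri -n <namespace> get pods --no-headers | triage_pods
```

The hint is only a starting point; the `describe pod` events remain the authoritative source for why a pod is stuck.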
ArgoCD sync issues
# Check app status
argocd app get <app>
# See what will change
argocd app diff <app>
# Force sync
argocd app sync <app> --force
# Sync with prune (removes deleted resources)
argocd app sync <app> --prune
App stuck in "Syncing": Check if there are failed hooks or jobs:
kubectl --context=minikube-indri -n <namespace> get jobs
kubectl --context=minikube-indri -n <namespace> get pods --field-selector=status.phase=Failed
ArgoCD login expired:
argocd login argocd.ops.eblu.me --sso
If Authentik itself is down, fall back to admin:
argocd login argocd.ops.eblu.me --username admin --password "$(op read 'op://vg6xf6vvfmoh5hqjjhlhbeoaie/srogeebssulhtb6tnqd7ls6qey/password')"
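The two login paths can be combined into one helper that prefers SSO and only reaches for the admin credentials when SSO fails. This is a sketch under assumptions: `argocd_relogin` is a hypothetical name, and falling back automatically on any SSO failure is a convenience choice here. The documented intent is to use the admin flow only when Authentik itself is down, so check why SSO failed before relying on it.

```shell
ARGOCD_HOST="argocd.ops.eblu.me"
# 1Password reference copied from the fallback command above.
ARGOCD_ADMIN_REF='op://vg6xf6vvfmoh5hqjjhlhbeoaie/srogeebssulhtb6tnqd7ls6qey/password'

# argocd_relogin: try SSO first; if that fails, use the admin
# break-glass credentials read from 1Password. Hypothetical helper.
argocd_relogin() {
  if argocd login "$ARGOCD_HOST" --sso; then
    return 0
  fi
  echo "SSO login failed; falling back to admin" >&2
  argocd login "$ARGOCD_HOST" --username admin \
    --password "$(op read "$ARGOCD_ADMIN_REF")"
}
```

Call `argocd_relogin` with no arguments. The message on stderr makes it visible when the break-glass path was taken.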
kubectl connection refused
# Check if minikube is running (on indri)
ssh indri 'minikube status'
# Restart if needed
ssh indri 'minikube start'
# Verify tailscale is serving the API
ssh indri 'tailscale serve status --json'
Indri Service Issues
Service not responding
# Check LaunchAgent status
ssh indri 'launchctl list | grep mcquack'
# Restart a LaunchAgent
ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.<service>.plist'
ssh indri 'launchctl load ~/Library/LaunchAgents/mcquack.<service>.plist'
# Check service logs
ssh indri 'tail -50 ~/Library/Logs/mcquack.<service>.err.log'
ssh indri 'tail -50 ~/Library/Logs/mcquack.<service>.out.log'
Forgejo not accessible
# Check if forgejo is running
ssh indri 'lsof -nP -iTCP:3001 -sTCP:LISTEN'
# Check logs
ssh indri 'tail -50 ~/Library/Logs/mcquack.forgejo.err.log'
# Restart forgejo
ssh indri 'launchctl kickstart -k gui/$(id -u)/mcquack.forgejo'
Registry (zot) issues
# Test registry API
ssh indri 'curl -s http://localhost:5050/v2/_catalog | jq'
# Check if zot is running
ssh indri 'lsof -nP -iTCP:5050 -sTCP:LISTEN'
# Restart zot
ssh indri 'launchctl kickstart -k gui/$(id -u)/mcquack.zot'
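The listener checks for Forgejo (port 3001) and zot (port 5050) follow the same pattern, so they can share one function. This is a minimal sketch; `check_listener` is a hypothetical helper name, and it relies on `lsof` exiting non-zero when nothing matches.

```shell
# check_listener: report whether a process on indri is listening on a TCP port.
# Hypothetical helper; returns 1 when nothing is listening.
check_listener() {
  name=$1 port=$2
  if ssh indri "lsof -nP -iTCP:${port} -sTCP:LISTEN" >/dev/null 2>&1; then
    echo "$name: listening on $port"
  else
    echo "$name: NOT listening on $port"
    return 1
  fi
}

# Usage:
#   check_listener forgejo 3001
#   check_listener zot 5050
```

A failed SSH connection also reports "NOT listening", so confirm indri is reachable before restarting a service on that basis.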
Network Issues
Service unreachable via *.ops.eblu.me
Caddy handles routing for *.ops.eblu.me:
# Check if Caddy is running
ssh indri 'launchctl list | grep caddy'
# View Caddy logs
ssh indri 'tail -50 ~/Library/Logs/caddy/access.log'
ssh indri 'tail -50 ~/Library/Logs/caddy/error.log'
# Restart Caddy
ssh indri 'launchctl kickstart -k gui/$(id -u)/homebrew.mxcl.caddy'
Tailscale MagicDNS not resolving
# Check tailscale serve status
ssh indri 'tailscale serve status --json'
# Restart tailscale if needed
ssh indri 'tailscale down && tailscale up'
Observability
Check metrics
# Open Grafana
open https://grafana.ops.eblu.me
# Check Prometheus directly
open https://prometheus.ops.eblu.me
Check logs
# Open Grafana Explore
open https://grafana.ops.eblu.me/explore
# Query Loki directly
curl -G 'https://loki.ops.eblu.me/loki/api/v1/query_range' \
--data-urlencode 'query={service="<service>"}' \
--data-urlencode 'limit=100'
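The direct Loki query can be wrapped so that the service, time window, and limit are parameters. This is a sketch, assuming the Loki version in use accepts the `since` parameter on `query_range` (it sets the start of the window relative to now). `loki_recent` is a hypothetical helper name.

```shell
LOKI_URL="https://loki.ops.eblu.me"

# loki_recent: fetch recent log entries for one service label.
#   $1 = value of the `service` label
#   $2 = look-back window as a duration (default 1h)
#   $3 = maximum number of entries (default 100)
# Hypothetical helper; prints the raw JSON response from Loki.
loki_recent() {
  service=$1 since=${2:-1h} limit=${3:-100}
  curl -sG "$LOKI_URL/loki/api/v1/query_range" \
    --data-urlencode "query={service=\"${service}\"}" \
    --data-urlencode "since=${since}" \
    --data-urlencode "limit=${limit}"
}

# Usage:
#   loki_recent forgejo 6h 20
```

Piping the output through `jq` makes the response easier to read, as with the registry check elsewhere in this page.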
Alloy (metrics/logs collector) issues
# Indri alloy (host metrics)
ssh indri 'launchctl list | grep alloy'
ssh indri 'tail -50 ~/Library/Logs/alloy/alloy.log'
# K8s alloy (pod logs)
kubectl --context=minikube-indri -n monitoring logs -l app=alloy
Database Issues
PostgreSQL connection failed
# Check CNPG cluster status
kubectl --context=minikube-indri -n databases get cluster
# Check PostgreSQL pods
kubectl --context=minikube-indri -n databases get pods -l cnpg.io/cluster=blumeops-pg
# Connect to database
kubectl --context=minikube-indri -n databases exec -it blumeops-pg-1 -- psql -U postgres
Backup Issues
Check borgmatic status
# View latest backup info
ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom'
# Run backup manually
ssh indri 'borgmatic --verbosity 1'
# Check backup logs
ssh indri 'tail -100 /opt/homebrew/var/log/borgmatic/borgmatic.log'
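A quick staleness check on the metrics file can flag a backup that has silently stopped running. This is a sketch under assumptions: it treats the file's modification time as a proxy for the last run, which only holds if `borgmatic.prom` is rewritten on each run (the page does not say when it is updated). `is_fresh` is a hypothetical helper name, and the 26-hour threshold in the usage line assumes a daily schedule.

```shell
# is_fresh: succeed if a file exists and was modified within the last N minutes.
#   $1 = path to the file
#   $2 = maximum age in minutes
# Hypothetical helper; uses `find -mmin`, available in both BSD and GNU find.
is_fresh() {
  file=$1 max_age_min=$2
  [ -n "$(find "$file" -mmin "-${max_age_min}" 2>/dev/null)" ]
}

# Usage (run on indri; 1560 minutes = 26 hours, an assumed threshold):
#   is_fresh /opt/homebrew/var/node_exporter/textfile/borgmatic.prom 1560 \
#     || echo "borgmatic metrics look stale"
```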
Kubernetes Issues (Ringtail / k3s)
ringtail runs GPU workloads (frigate, ntfy) and Authentik on a single-node k3s cluster. The same debugging patterns apply, but use `--context=k3s-ringtail`:
# Check pod status
kubectl --context=k3s-ringtail -n <namespace> get pods
# Describe pod for events
kubectl --context=k3s-ringtail -n <namespace> describe pod <pod>
# Check logs
kubectl --context=k3s-ringtail -n <namespace> logs <pod>
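Because only the context differs between the two clusters, the status, describe, and logs sequence can be parameterized. This is a minimal sketch; `pod_debug` is a hypothetical helper name, and the `--tail=50` limit is an arbitrary choice to keep the output short.

```shell
# pod_debug: run the get/describe/logs sequence against any cluster context.
#   $1 = kubectl context (minikube-indri or k3s-ringtail)
#   $2 = namespace
#   $3 = pod name
# Hypothetical helper wrapping the commands shown above.
pod_debug() {
  ctx=$1 ns=$2 pod=$3
  kubectl --context="$ctx" -n "$ns" get pods
  kubectl --context="$ctx" -n "$ns" describe pod "$pod"
  kubectl --context="$ctx" -n "$ns" logs "$pod" --tail=50
}

# Usage (replace the placeholders first):
#   pod_debug k3s-ringtail <namespace> <pod>
```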
Ringtail unreachable
# Check if ringtail is on the tailnet
tailscale ping ringtail
# SSH in directly
ssh ringtail
If ringtail is unreachable, it may need a physical power cycle. See ringtail for details.
Related
- observability - Metrics and logs
- argocd - GitOps platform
- cluster - Kubernetes cluster
- routing - Service routing
- restart-indri - Shutdown/startup procedure and CNI conflict fix