Add pod state observability to minikube dashboard #83

Merged
eblume merged 2 commits from feature/pod-state-dashboard into main 2026-02-03 07:20:06 -08:00

Summary

  • Add "Unhealthy Pods" stat panel showing count of pods in error states (ImagePullBackOff, CrashLoopBackOff, etc.) with red background when > 0
  • Add "Pods by Waiting Reason" time series chart showing container waiting states over time
  • Provide visibility into stuck pods that ArgoCD doesn't track (it manages CronJobs, not the Jobs and Pods they spawn)
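
As a rough illustration of the two panels described above, the sketch below builds Grafana-style panel definitions in Python. It is not the dashboard JSON from this PR. It assumes the metrics come from kube-state-metrics (`kube_pod_container_status_waiting_reason`), and it lists only the two waiting reasons the summary names, since the rest of the set ("etc.") is not given here. The function names are hypothetical.

```python
import json

# Only the two reasons named in the PR summary; the full set is not stated there.
UNHEALTHY_REASONS = ["ImagePullBackOff", "CrashLoopBackOff"]


def unhealthy_pods_expr(reasons):
    """PromQL counting distinct pods with a container waiting for one of `reasons`.

    Assumes kube-state-metrics is scraped; `or vector(0)` makes the stat show 0
    instead of "No data" when nothing matches.
    """
    pattern = "|".join(reasons)
    return (
        "count(count by (namespace, pod) ("
        f'kube_pod_container_status_waiting_reason{{reason=~"{pattern}"}} == 1'
        ")) or vector(0)"
    )


def waiting_reason_expr():
    """PromQL for a time series with one line per container waiting reason."""
    return "sum by (reason) (kube_pod_container_status_waiting_reason)"


def unhealthy_pods_panel(reasons):
    """Stat panel that is green at 0 and gets a red background at 1 or more."""
    return {
        "type": "stat",
        "title": "Unhealthy Pods",
        "targets": [{"expr": unhealthy_pods_expr(reasons), "refId": "A"}],
        "options": {"colorMode": "background"},
        "fieldConfig": {
            "defaults": {
                "thresholds": {
                    "mode": "absolute",
                    "steps": [
                        {"color": "green", "value": None},  # base step
                        {"color": "red", "value": 1},
                    ],
                }
            }
        },
    }


def waiting_reason_panel():
    """Time series panel for container waiting states over time."""
    return {
        "type": "timeseries",
        "title": "Pods by Waiting Reason",
        "targets": [
            {"expr": waiting_reason_expr(), "legendFormat": "{{reason}}", "refId": "A"}
        ],
    }


if __name__ == "__main__":
    print(json.dumps(unhealthy_pods_panel(UNHEALTHY_REASONS), indent=2))
```

Counting by `(namespace, pod)` before the outer `count` keeps a pod with several stuck containers from being counted more than once.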

Context

This addresses an incident in which a zim-watcher CronJob pod was stuck in ImagePullBackOff for 11 days without triggering any alert. ArgoCD showed the CronJob as "Synced, Healthy" because it only manages the CronJob resource, not the Jobs and Pods it spawns.

Deployment and Testing

  • Sync grafana-config app to test branch
  • Verify dashboard renders correctly
  • Confirm "Unhealthy Pods" shows 0 (green) when no issues
  • Reset to main after merge

🤖 Generated with Claude Code

- Add "Unhealthy Pods" stat panel that shows count of pods in error states
  (ImagePullBackOff, CrashLoopBackOff, etc.) with red background when > 0
- Add "Pods by Waiting Reason" time series showing container waiting states
- This provides visibility into stuck pods that ArgoCD doesn't track

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Jobs created by the CronJob will now auto-delete 4 days after completion.
This prevents zombie Jobs from accumulating when the CronJob spec changes
(e.g., image updates), since ArgoCD only tracks the CronJob resource itself,
not the Jobs it spawns.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
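
The commit message above does not say how the 4-day cleanup is configured. One standard Kubernetes mechanism is the Job field `ttlSecondsAfterFinished`, set under a CronJob's `spec.jobTemplate.spec`. The sketch below assumes that mechanism and works through the day-to-second conversion. The helper names and the sample manifest are hypothetical and do not come from this repository.

```python
import copy

SECONDS_PER_DAY = 24 * 60 * 60


def job_ttl_seconds(days):
    """Convert a retention period in days to a ttlSecondsAfterFinished value."""
    return days * SECONDS_PER_DAY


def with_job_ttl(cronjob, days):
    """Return a copy of a CronJob manifest (as a dict) whose spawned Jobs
    are deleted `days` days after they finish.

    Assumes the cleanup is done with ttlSecondsAfterFinished on the Job template.
    """
    patched = copy.deepcopy(cronjob)
    job_spec = (
        patched.setdefault("spec", {})
        .setdefault("jobTemplate", {})
        .setdefault("spec", {})
    )
    job_spec["ttlSecondsAfterFinished"] = job_ttl_seconds(days)
    return patched


if __name__ == "__main__":
    # Hypothetical minimal manifest, for illustration only.
    sample = {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": "example"},
        "spec": {"schedule": "0 * * * *", "jobTemplate": {"spec": {}}},
    }
    print(with_job_ttl(sample, 4)["spec"]["jobTemplate"]["spec"])
```

With a 4-day retention the value works out to 4 × 86,400 = 345,600 seconds.
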
eblume merged commit 737371ab59 into main 2026-02-03 07:20:06 -08:00
Reference
eblume/blumeops!83