Fix cache hit rate on APM and Fly.io dashboards #177

Merged
eblume merged 1 commit from fix/cache-hit-rate-dashboards into main 2026-02-12 18:40:48 -08:00

Summary

  • Remove match_all = true from flyio_nginx_cache_requests_total in Alloy so the metric only counts requests that go through the proxy cache (excludes health checks with empty cache_status)
  • Change dashboard queries from rate(...[5m]) to increase(...[$__range]) — aggregates over the full dashboard time window instead of a 5-minute sliding window, giving meaningful ratios for low-traffic static sites
  • Add null/NaN value mapping to show "No traffic" in neutral color instead of blank/red

Root cause

Health check requests from Fly.io hit the default nginx server block (no proxy_cache), producing log entries with an empty upstream_cache_status. With match_all = true, these entries were counted in the cache metric, diluting the Fly.io dashboard ratio. For APM dashboards, rate(...[5m]) on low-traffic sites with 24h cache validity almost always returns either all HITs (100%), because a 5-minute window rarely contains a MISS, or no data (blank → red background), because the window contains no requests at all.
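
The arithmetic behind the dilution and the empty-window case can be sketched in a few lines of Python. This is an illustration only: the request counts are made up, and `hit_ratio` and `render` are hypothetical helpers rather than anything in the dashboards or the Alloy config.

```python
import math
from typing import Optional


def hit_ratio(hits: float, total: float) -> Optional[float]:
    """Return hits/total, or None when the window saw no traffic."""
    if total == 0 or math.isnan(total):
        return None
    return hits / total


def render(ratio: Optional[float]) -> str:
    """Mimic the dashboard value mapping: a missing ratio becomes 'No traffic'."""
    return "No traffic" if ratio is None else f"{ratio:.0%}"


# Hypothetical counts over one dashboard time range (not measured data).
hits, misses, health_checks = 30, 10, 60

# With match_all = true, health checks inflate the denominator.
diluted = hit_ratio(hits, hits + misses + health_checks)

# Without match_all, only requests that carry a cache status are counted.
actual = hit_ratio(hits, hits + misses)

print(render(diluted))          # 30%
print(render(actual))           # 75%
print(render(hit_ratio(0, 0)))  # No traffic
```

Returning `None` for an empty window, instead of letting the division produce NaN, mirrors the intent of the new value mapping: an idle site is reported as idle, not as a failing one.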

Deployment

  • Fly.io proxy redeploy needed for Alloy config change
  • ArgoCD sync for dashboard ConfigMap changes

Test plan

  • Redeploy Fly.io proxy
  • Sync grafana-config in ArgoCD
  • Verify CV APM cache hit ratio shows a real percentage (not 100%)
  • Verify Docs APM shows "No traffic" in neutral color when idle, real ratio when visited
  • Verify Fly.io proxy dashboard cache ratio excludes health checks
The cache_requests_total metric used match_all=true, counting health
check requests (no cache_status) alongside real traffic. Dashboard
queries used rate()[5m] which produced blank or 100% on low-traffic
static sites. Switch to increase()[$__range] for meaningful aggregation
and add null/NaN value mapping to show "No traffic" instead of red.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eblume merged commit 9c789a1868 into main 2026-02-12 18:40:48 -08:00
Reference
eblume/blumeops!177