Fix cache hit rate on APM and Fly.io dashboards #177

Merged

eblume merged 1 commit from fix/cache-hit-rate-dashboards into main

2026-02-12 18:40:48 -08:00

eblume commented

2026-02-12 18:40:11 -08:00

Owner

Summary

- Remove `match_all = true` from `flyio_nginx_cache_requests_total` in Alloy so the metric only counts requests that go through the proxy cache (excludes health checks with empty `cache_status`)
- Change dashboard queries from `rate(...[5m])` to `increase(...[$__range])`, which aggregates over the full dashboard time window instead of a 5-minute sliding window and gives meaningful ratios for low-traffic static sites
- Add null/NaN value mapping to show "No traffic" in neutral color instead of blank/red
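
To illustrate why the window size matters, here is a minimal Python sketch of a hit ratio computed over a short window versus the full dashboard range. It is a plain counting model, not how Prometheus evaluates `rate()` or `increase()` (both extrapolate from counter samples), and the event list, timestamps, and function names are hypothetical.

```python
import math

def hit_ratio(events, start, end):
    """Fraction of HITs among requests with start <= ts < end.

    Returns NaN when the window holds no requests, mirroring the
    0/0 result a ratio query produces on an idle site.
    """
    window = [status for ts, status in events if start <= ts < end]
    if not window:
        return math.nan
    hits = sum(1 for status in window if status == "HIT")
    return hits / len(window)

def display(ratio):
    """Map NaN to a neutral label, like the dashboard value mapping."""
    if math.isnan(ratio):
        return "No traffic"
    return f"{ratio:.0%}"

# Hypothetical request log for a low-traffic site: (seconds, cache status).
events = [(100, "MISS"), (200, "HIT"), (40000, "HIT"), (80000, "HIT")]
now = 86400  # end of a 24h dashboard range

last_5m = hit_ratio(events, now - 300, now)   # short sliding window
full_range = hit_ratio(events, 0, now)        # whole dashboard range
mid_window = hit_ratio(events, 30000, 50000)  # window that saw only a HIT

print(display(last_5m))
print(display(full_range))
print(display(mid_window))
```

In this model the short window is usually empty or contains only HITs, while the full range includes the MISS that populated the cache.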

Root cause

Health check requests from Fly.io hit the default nginx server block (no `proxy_cache`), producing log entries with an empty `upstream_cache_status`. With `match_all = true`, these were counted in the cache metric, diluting the Fly.io dashboard ratio. On the APM dashboards, `rate(...[5m])` on low-traffic sites with 24h cache validity almost always returns either all HITs (100%) or no data (blank, rendered with a red background).
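
The dilution can be shown with a small Python sketch. It assumes a simplified model in which `match_all` counts every log entry and the non-`match_all` path skips entries whose cache status is empty; the entry layout, counts, and function names are hypothetical and do not reproduce Alloy's implementation.

```python
def count_cache_requests(log_entries, match_all):
    """Count log entries per cache status.

    With match_all=True every entry is counted, including health
    checks whose cache status is empty. With match_all=False entries
    without a cache status are skipped (a simplification).
    """
    counts = {}
    for entry in log_entries:
        status = entry.get("cache_status", "")
        if not status and not match_all:
            continue
        counts[status] = counts.get(status, 0) + 1
    return counts

def hit_ratio(counts):
    """HITs divided by all counted requests, or None with no requests."""
    total = sum(counts.values())
    if total == 0:
        return None
    return counts.get("HIT", 0) / total

# Hypothetical traffic: 6 HITs, 2 MISSes, 12 health checks with no status.
entries = (
    [{"cache_status": "HIT"}] * 6
    + [{"cache_status": "MISS"}] * 2
    + [{"cache_status": ""}] * 12
)

diluted = hit_ratio(count_cache_requests(entries, match_all=True))
corrected = hit_ratio(count_cache_requests(entries, match_all=False))

print(diluted)
print(corrected)
```

With these made-up numbers the health checks inflate the denominator from 8 to 20, so the same 6 HITs yield a much lower ratio.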

Deployment

- Fly.io proxy redeploy needed for Alloy config change
- ArgoCD sync for dashboard ConfigMap changes

Test plan

- [ ] Redeploy Fly.io proxy
- [ ] Sync grafana-config in ArgoCD
- [ ] Verify CV APM cache hit ratio shows a real percentage (not 100%)
- [ ] Verify Docs APM shows "No traffic" in neutral color when idle, real ratio when visited
- [ ] Verify Fly.io proxy dashboard cache ratio excludes health checks

eblume added 1 commit

2026-02-12 18:40:11 -08:00

Fix cache hit rate on APM and Fly.io dashboards 1a2e512b4b

The cache_requests_total metric used match_all=true, counting health
check requests (no cache_status) alongside real traffic. Dashboard
queries used rate()[5m] which produced blank or 100% on low-traffic
static sites. Switch to increase()[$__range] for meaningful aggregation
and add null/NaN value mapping to show "No traffic" instead of red.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>