Fix cache hit rate on APM and Fly.io dashboards (#177)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m19s

## Summary
- Remove `match_all = true` from `flyio_nginx_cache_requests_total` in Alloy so the metric only counts requests that go through the proxy cache (excludes health checks with empty `cache_status`)
- Change dashboard queries from `rate(...[5m])` to `increase(...[$__range])` — aggregates over the full dashboard time window instead of a 5-minute sliding window, giving meaningful ratios for low-traffic static sites
- Add null/NaN value mapping to show "No traffic" in neutral color instead of blank/red

## Root cause
Health check requests from Fly.io hit the default nginx server block (no `proxy_cache`), producing entries with empty `upstream_cache_status`. With `match_all = true`, these were counted in the cache metric, diluting the Fly.io dashboard ratio. For APM dashboards, `rate()[5m]` on low-traffic sites with 24h cache validity almost always returns either all-HITs (100%) or no data (blank → red background).

## Deployment
- Fly.io proxy redeploy needed for Alloy config change
- ArgoCD sync for dashboard ConfigMap changes

## Test plan
- [ ] Redeploy Fly.io proxy
- [ ] Sync grafana-config in ArgoCD
- [ ] Verify CV APM cache hit ratio shows a real percentage (not 100%)
- [ ] Verify Docs APM shows "No traffic" in neutral color when idle, real ratio when visited
- [ ] Verify Fly.io proxy dashboard cache ratio excludes health checks

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/177
This commit is contained in:
Erich Blume 2026-02-12 18:40:48 -08:00
commit 9c789a1868
5 changed files with 7 additions and 4 deletions

View file

@ -79,7 +79,6 @@ loki.process "nginx" {
name = "flyio_nginx_cache_requests_total"
description = "Total cache lookups by cache status."
source = "cache_status"
match_all = true
action = "inc"
}
}