blumeops/docs/reference/services/flyio-proxy.md
Erich Blume fe0e913963
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m37s
Switch Fly proxy to upstream keepalive pools (#337)
## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuses TLS connections through the Tailscale tunnel instead of handshaking per request
- Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS)

## Trade-off

DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #337
2026-04-17 16:39:52 -07:00

6.8 KiB

title modified tags
Fly.io Proxy 2026-04-17
service
networking
fly-io

Fly.io Proxy

Public reverse proxy on Fly.io that exposes selected BlumeOps services to the internet via a Tailscale tunnel back to the homelab.

Quick Reference

Property Value
App blumeops-proxy
Region sjc (San Jose)
Fly.io URL blumeops-proxy.fly.dev
Config fly/ directory in repo
IaC fly/fly.toml (app), Pulumi (DNS + auth key)

Exposed Services

Public domain Backend Service
docs.eblu.me docs.tail8d86e.ts.net docs
cv.eblu.me cv.tail8d86e.ts.net cv
forge.eblu.me forge.tail8d86e.ts.net forgejo

Architecture

Internet traffic hits Fly.io's Anycast edge, terminates TLS with a Let's Encrypt certificate, and is proxied by nginx to the backend service over a Tailscale WireGuard tunnel. See expose-service-publicly for the full architecture diagram.

Upstream Keepalive

Nginx uses upstream blocks with keepalive connection pools to reuse TLS connections through the WireGuard tunnel. This avoids a per-request TLS handshake, which was previously the dominant source of latency (35s+ p50 before keepalive, sub-second after).

Trade-off: DNS for upstream hostnames is resolved once at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule, minikube restart), run mise run fly-reload to re-resolve without a full redeploy. A Grafana alert fires when upstreams are unreachable.

Each upstream requires proxy_ssl_name set to the actual Tailscale hostname — nginx sends the upstream block name as SNI by default, which the Tailscale Ingress proxy won't recognize.

Key Files

File Purpose
fly/fly.toml App configuration
fly/Dockerfile nginx + Tailscale + Alloy container
fly/nginx.conf Reverse proxy, caching, rate limiting, JSON logging
fly/alloy.river Alloy config: log tailing, metric extraction, remote_write
fly/start.sh Entrypoint: start Tailscale, wait for MagicDNS, then nginx + Alloy
pulumi/tailscale/__main__.py Auth key (tag:flyio-proxy)
pulumi/tailscale/policy.hujson ACL grants for proxy
pulumi/gandi/__main__.py DNS CNAMEs

Networking

Fly.io runs Firecracker microVMs which support TUN devices natively. Tailscale runs with a real TUN interface (not userspace networking), so MagicDNS and direct Tailscale IP routing work normally.

The Tailscale auth key is preauthorized=True to avoid device approval hangs on container restarts.

Observability

alloy runs inside the container alongside nginx and Tailscale, providing:

  • Logs: nginx JSON access logs tailed and pushed to loki ({instance="flyio-proxy", job="flyio-nginx"})
  • Metrics: Derived from access logs, pushed to prometheus via remote_write
    • flyio_nginx_http_requests_total — request rate by status/method/host
    • flyio_nginx_http_request_duration_seconds — total request latency histogram (includes proxy overhead)
    • flyio_nginx_upstream_response_time_seconds — backend response time histogram (Forgejo processing only)
    • flyio_nginx_http_response_bytes_total — response bandwidth
    • flyio_nginx_cache_requests_total — cache HIT/MISS/EXPIRED counts

Dashboards

Dashboard Purpose
Docs APM Per-service view for docs.eblu.me: request rate, latency percentiles, cache hit ratio, error rate, bandwidth, access logs
Fly.io Proxy Health Aggregate proxy health: connections, total request rate by host, cache performance, upstream latency, Alloy health

Alloy listens on 127.0.0.1:12345 for self-scraping its /metrics endpoint. All metrics carry instance="flyio-proxy".

Security Considerations

The tag:flyio-proxy ACL grants access only to tag:flyio-target:443. Services must explicitly opt in by adding a tailscale.com/tags: "tag:k8s,tag:flyio-target" annotation to their Tailscale Ingress. This means the proxy can only reach endpoints that have been individually tagged — a compromised nginx config cannot route to arbitrary services on the tailnet.

Currently tagged as tag:flyio-target: docs, cv, forgejo, loki, prometheus. Loki and Prometheus are tagged so that alloy (running inside the container) can push logs and metrics directly via their Tailscale Ingress endpoints — the restricted ACL means Caddy on indri (tag:homelab) is not reachable from the proxy.

Crawler Mitigation

The proxy serves a robots.txt blocking crawlers from expensive endpoints:

  • /mirrors/ — large mirrored repos
  • /user/ — auth endpoints (crawlers follow redirect loops)
  • /users/ — user profile pages
  • /*/archive/ — git bundle generation (DoS vector, see below)
  • /*/releases/download/ — release artifacts

Archive requests (/<owner>/<repo>/archive/*) are 302-redirected to forge.ops.eblu.me (tailnet-only), preventing unauthenticated archive generation. This mitigates a known Forgejo DoS vector where crawlers requesting unique commit SHAs trigger unbounded git bundle generation.

Release downloads are cached at the proxy layer (7-day TTL, keyed by URI) to absorb repeated downloads of the same artifact.

To expose an additional service through the proxy, add the tag:flyio-target annotation to its Tailscale Ingress. See expose-service-publicly for the full workflow.

Spider Trap Mitigation

The SPA fallback (try_files ... /index.html) serves index.html with a 200 for any URI, including non-existent paths. Quartz's relative links (../path) compound when resolved from phantom URLs, creating an infinite tree of unique URIs that crawlers follow indefinitely. In March 2026, Meta's crawler (meta-externalagent/1.1) hit ~49,000 unique URIs over 7 hours this way.

Two nginx location guards in containers/quartz/default.conf mitigate the trap:

  1. /tags/ depth limit/tags/<name> is always flat; anything deeper returns 404.
  2. Global depth-5 cutoff — real content never exceeds depth 4; paths with 5+ segments return 404.

These are applied in the Quartz container's nginx config, not the Fly.io proxy. The proper fix is switching Quartz to root-absolute links (planned for the fork).

Secrets

Secret Source Description
TS_AUTHKEY Pulumi state → fly secrets Tailscale auth key for joining tailnet
FLY_DEPLOY_TOKEN Fly.io → 1Password Deploy token for CI