Switch Fly proxy to upstream keepalive pools (#337)

## Summary

- Replace per-request DNS resolution (variable-based `proxy_pass`) with static `upstream` blocks and `keepalive` connection pools
- Reuse TLS connections through the Tailscale tunnel instead of handshaking per request
- Add `mise run fly-reload` for nginx config reload without full redeploy (re-resolves upstream DNS)

## Trade-off

DNS is resolved at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule), `mise run fly-reload` is needed. A Grafana alert will be added to detect this.

## Still TODO on this branch

- [ ] Grafana alert for upstream unreachable (triggers fly-reload reminder)
- [ ] Docs pass
- [ ] Deploy from branch and verify latency improvement
- [ ] Changelog fragment

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #337

2026-04-17 16:39:52 -07:00

Fly.io Proxy

Public reverse proxy on Fly.io that exposes selected BlumeOps services to the internet via a Tailscale tunnel back to the homelab.

Quick Reference

| Property | Value |
| --- | --- |
| App | `blumeops-proxy` |
| Region | `sjc` (San Jose) |
| Fly.io URL | `blumeops-proxy.fly.dev` |
| Config | `fly/` directory in repo |
| IaC | `fly/fly.toml` (app), Pulumi (DNS + auth key) |

Exposed Services

| Public domain | Backend | Service |
| --- | --- | --- |
| `docs.eblu.me` | `docs.tail8d86e.ts.net` | docs |
| `cv.eblu.me` | `cv.tail8d86e.ts.net` | cv |
| `forge.eblu.me` | `forge.tail8d86e.ts.net` | forgejo |

Architecture

Internet traffic hits Fly.io's Anycast edge, terminates TLS with a Let's Encrypt certificate, and is proxied by nginx to the backend service over a Tailscale WireGuard tunnel. See expose-service-publicly for the full architecture diagram.

Upstream Keepalive

Nginx uses `upstream` blocks with `keepalive` connection pools to reuse TLS connections through the WireGuard tunnel. This avoids a per-request TLS handshake, which was previously the dominant source of latency (35s+ p50 before keepalive, sub-second after).

Trade-off: DNS for upstream hostnames is resolved once at config load, not per request. If Tailscale Ingress pods get new IPs (restart, reschedule, minikube restart), run `mise run fly-reload` to re-resolve without a full redeploy. A Grafana alert fires when upstreams are unreachable.

Each upstream requires `proxy_ssl_name` set to the actual Tailscale hostname. By default nginx uses the host from `proxy_pass` (here, the upstream block name) as the SNI name, which the Tailscale Ingress proxy won't recognize.
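
As a rough illustration of the setup described above, a single upstream with a keepalive pool might look like the following. The upstream name `docs_backend`, the pool size, and the listen port are assumptions made for this sketch; they are not taken from `fly/nginx.conf`.

```nginx
# Sketch only: "docs_backend", the pool size, and port 8080 are assumed values.
upstream docs_backend {
    # The hostname is resolved once, when nginx loads (or reloads) the config.
    server docs.tail8d86e.ts.net:443;

    # Keep up to 8 idle connections per worker process open for reuse.
    keepalive 8;
}

server {
    listen 8080;
    server_name docs.eblu.me;

    location / {
        proxy_pass https://docs_backend;

        # Upstream keepalive needs HTTP/1.1 and a cleared Connection header.
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        # Send the real Tailscale hostname as SNI instead of "docs_backend".
        proxy_ssl_server_name on;
        proxy_ssl_name docs.tail8d86e.ts.net;
    }
}
```

Because the `server` line inside `upstream` is resolved only at load time, a reload is what picks up a new backend IP.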

Key Files

| File | Purpose |
| --- | --- |
| `fly/fly.toml` | App configuration |
| `fly/Dockerfile` | nginx + Tailscale + Alloy container |
| `fly/nginx.conf` | Reverse proxy, caching, rate limiting, JSON logging |
| `fly/alloy.river` | Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Entrypoint: start Tailscale, wait for MagicDNS, then nginx + Alloy |
| `pulumi/tailscale/__main__.py` | Auth key (`tag:flyio-proxy`) |
| `pulumi/tailscale/policy.hujson` | ACL grants for proxy |
| `pulumi/gandi/__main__.py` | DNS CNAMEs |

Networking

Fly.io runs Firecracker microVMs which support TUN devices natively. Tailscale runs with a real TUN interface (not userspace networking), so MagicDNS and direct Tailscale IP routing work normally.

The Tailscale auth key is created with `preauthorized=True` to avoid device-approval hangs on container restarts.

Observability

Alloy runs inside the container alongside nginx and Tailscale, providing:

- Logs: nginx JSON access logs tailed and pushed to Loki (`{instance="flyio-proxy", job="flyio-nginx"}`)
- Metrics: derived from access logs, pushed to Prometheus via `remote_write`
  - `flyio_nginx_http_requests_total`: request rate by status/method/host
  - `flyio_nginx_http_request_duration_seconds`: total request latency histogram (includes proxy overhead)
  - `flyio_nginx_upstream_response_time_seconds`: backend response time histogram (Forgejo processing only)
  - `flyio_nginx_http_response_bytes_total`: response bandwidth
  - `flyio_nginx_cache_requests_total`: cache HIT/MISS/EXPIRED counts
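
To make the log-derived metrics above more concrete, here is a minimal sketch of a JSON access log format that carries the fields such metrics would need. The format name, field names, and log path are assumptions for illustration; the real definitions live in `fly/nginx.conf` and `fly/alloy.river`.

```nginx
# Sketch only: format name, JSON keys, and log path are assumed.
# Place inside the http {} block.
log_format flyio_json escape=json
    '{'
        '"time":"$time_iso8601",'
        '"host":"$host",'
        '"method":"$request_method",'
        '"uri":"$request_uri",'
        '"status":$status,'
        '"body_bytes_sent":$body_bytes_sent,'
        '"request_time":$request_time,'
        # May be "-" (no upstream contacted) or a comma-separated list,
        # so it is logged as a string rather than a number.
        '"upstream_response_time":"$upstream_response_time",'
        '"cache_status":"$upstream_cache_status"'
    '}';

access_log /var/log/nginx/access.json flyio_json;
```

In this sketch, `$request_time` would feed the request duration histogram, `$upstream_response_time` the upstream histogram, and `$upstream_cache_status` the cache counters.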

Dashboards

| Dashboard | Purpose |
| --- | --- |
| Docs APM | Per-service view for `docs.eblu.me`: request rate, latency percentiles, cache hit ratio, error rate, bandwidth, access logs |
| Fly.io Proxy Health | Aggregate proxy health: connections, total request rate by host, cache performance, upstream latency, Alloy health |

Alloy listens on `127.0.0.1:12345` for self-scraping its `/metrics` endpoint. All metrics carry `instance="flyio-proxy"`.

Security Considerations

The `tag:flyio-proxy` ACL grants access only to `tag:flyio-target:443`. Services must explicitly opt in by adding a `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to their Tailscale Ingress. The proxy can therefore reach only endpoints that have been individually tagged, so a compromised nginx config cannot route to arbitrary services on the tailnet.

Currently tagged as `tag:flyio-target`: docs, cv, forgejo, loki, prometheus. Loki and Prometheus are tagged so that Alloy (running inside the container) can push logs and metrics directly via their Tailscale Ingress endpoints; under the restricted ACL, Caddy on indri (`tag:homelab`) is not reachable from the proxy.

Crawler Mitigation

The proxy serves a `robots.txt` blocking crawlers from expensive endpoints:

- `/mirrors/`: large mirrored repos
- `/user/`: auth endpoints (crawlers follow redirect loops)
- `/users/`: user profile pages
- `/*/archive/`: git bundle generation (DoS vector, see below)
- `/*/releases/download/`: release artifacts
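
One way to serve such a file straight from nginx, without a file on disk, is sketched below. Only the five paths come from the list above; the `User-agent: *` scope and the inline `return` approach are assumptions.

```nginx
# Sketch only: goes inside the server {} block for the public Forgejo host.
location = /robots.txt {
    default_type text/plain;
    # nginx expands \n inside quoted strings, giving one directive per line.
    return 200 "User-agent: *\nDisallow: /mirrors/\nDisallow: /user/\nDisallow: /users/\nDisallow: /*/archive/\nDisallow: /*/releases/download/\n";
}
```

Since `robots.txt` is advisory, badly behaved crawlers ignore it, which is why the archive path also needs a server-side guard.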

Archive requests (`/<owner>/<repo>/archive/*`) are 302-redirected to `forge.ops.eblu.me` (tailnet-only), preventing unauthenticated archive generation. This mitigates a known Forgejo DoS vector where crawlers requesting unique commit SHAs trigger unbounded git bundle generation.
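
A redirect of this kind could be expressed roughly as follows. The regex and the `https` scheme are assumptions; the actual rule in `fly/nginx.conf` may differ.

```nginx
# Sketch only: matches /<owner>/<repo>/archive/<anything>.
location ~ ^/[^/]+/[^/]+/archive/ {
    # Send the client to the tailnet-only host, preserving path and query.
    return 302 https://forge.ops.eblu.me$request_uri;
}
```

Clients outside the tailnet cannot reach the redirect target, so the request never reaches Forgejo's archive generator.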

Release downloads are cached at the proxy layer (7-day TTL, keyed by URI) to absorb repeated downloads of the same artifact.
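
The release cache described above might be configured along these lines. The cache path, zone name, size limits, and the `forge_backend` upstream are hypothetical; only the 7-day TTL and the URI cache key come from the text.

```nginx
# Sketch only. In the http {} block: an on-disk cache zone (names and sizes assumed).
proxy_cache_path /var/cache/nginx/releases levels=1:2
                 keys_zone=releases:10m max_size=1g inactive=7d;

# In the server {} block for forge.eblu.me; "forge_backend" is a hypothetical
# upstream that would be defined elsewhere in the config.
location ~ ^/[^/]+/[^/]+/releases/download/ {
    proxy_pass https://forge_backend;

    proxy_cache releases;
    # Key by URI only, so query strings do not create separate cache entries.
    proxy_cache_key $uri;
    # Cache successful responses for 7 days.
    proxy_cache_valid 200 7d;
}
```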

To expose an additional service through the proxy, add the `tag:flyio-target` annotation to its Tailscale Ingress. See expose-service-publicly for the full workflow.

Spider Trap Mitigation

The SPA fallback (`try_files ... /index.html`) serves `index.html` with a 200 for any URI, including non-existent paths. Quartz's relative links (`../path`) compound when resolved from phantom URLs, creating an infinite tree of unique URIs that crawlers follow indefinitely. In March 2026, Meta's crawler (`meta-externalagent/1.1`) hit ~49,000 unique URIs over 7 hours this way.

Two nginx location guards in `containers/quartz/default.conf` mitigate the trap:

- `/tags/` depth limit: `/tags/<name>` is always flat; anything deeper returns 404.
- Global depth-5 cutoff: real content never exceeds depth 4; paths with 5+ segments return 404.
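
The two guards could be written roughly as below. These regexes are an illustrative guess at the technique, not a copy of `containers/quartz/default.conf`.

```nginx
# Sketch only: regexes are assumed, not taken from the real config.

# /tags/<name> is flat: anything with a further segment is a phantom URL.
location ~ ^/tags/[^/]+/[^/] {
    return 404;
}

# Paths with five or more segments never correspond to real content.
# The regex is quoted because it contains curly braces.
location ~ "^/([^/]+/){4}[^/]" {
    return 404;
}
```

With these in place, `/a/b/c/d` and `/a/b/c/d/` still reach the SPA fallback, while `/a/b/c/d/e` and `/tags/x/y` return 404.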

These are applied in the Quartz container's nginx config, not the Fly.io proxy. The proper fix is switching Quartz to root-absolute links (planned for the fork).

Secrets

| Secret | Source | Description |
| --- | --- | --- |
| `TS_AUTHKEY` | Pulumi state → `fly secrets` | Tailscale auth key for joining tailnet |
| `FLY_DEPLOY_TOKEN` | Fly.io → 1Password | Deploy token for CI |

Related

- expose-service-publicly: Setup guide for adding new public services
- manage-flyio-proxy: Operational tasks (deploy, shutoff, troubleshoot)
- caddy: Private reverse proxy for `*.ops.eblu.me` (separate system)
- tailscale: WireGuard mesh network
- gandi: DNS hosting
