Update docs for Caddy routing and direct WireGuard peering

Comprehensive docs pass reflecting the new Fly proxy architecture:
- Fly proxy routes through Caddy on indri (not per-service TS Ingress)
- Direct WireGuard peering via --port=41641 pinning
- DERP relay performance lesson in Tailscale docs
- Caddy now in public traffic path
- indri tagged as flyio-target
- Removed fly-reload references
- Updated architecture diagrams and per-service setup guide
- Added changelog fragment

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Erich Blume 2026-04-18 09:57:30 -07:00
commit d26a6ae3b2
8 changed files with 81 additions and 108 deletions

@@ -1,6 +1,6 @@
---
title: Routing
-modified: 2026-04-17
+modified: 2026-04-18
tags:
- infrastructure
- networking
@@ -46,7 +46,7 @@ DNS points to [[indri]]'s Tailscale IP. TLS via Let's Encrypt (ACME DNS-01 with
## Public Services (`*.eblu.me`)
-DNS CNAMEs point to `blumeops-proxy.fly.dev`. TLS via Fly.io-managed Let's Encrypt. Traffic tunnels back to the homelab over Tailscale. Only services tagged `tag:flyio-target` are reachable by the proxy — see [[flyio-proxy]] for details.
+DNS CNAMEs point to `blumeops-proxy.fly.dev`. TLS via Fly.io-managed Let's Encrypt. Traffic tunnels back to [[caddy]] on [[indri]] over a direct Tailscale WireGuard connection, then Caddy routes to the service. See [[flyio-proxy]] for details.
| Service | URL | Description |
|---------|-----|-------------|

@@ -1,7 +1,7 @@
---
title: Tailscale
-modified: 2026-03-22
-last-reviewed: 2026-03-22
+modified: 2026-04-18
+last-reviewed: 2026-04-18
tags:
- infrastructure
- networking
@@ -36,7 +36,7 @@ ACLs managed via Pulumi in `pulumi/tailscale/policy.hujson`.
| `tag:k8s` | (Ingress proxy pods) | Kubernetes Tailscale Ingress nodes; each also carries a per-service tag (`tag:grafana`, `tag:kiwix`, `tag:devpi`, `tag:feed`, `tag:pg`) |
| `tag:ci-gateway` | (ephemeral CI containers) | CI containers pushing images to registry |
| `tag:flyio-proxy` | (Fly.io proxy container) | Public reverse proxy |
-| `tag:flyio-target` | (designated Ingress endpoints) | Endpoints reachable by the Fly.io proxy |
+| `tag:flyio-target` | indri, designated Ingress endpoints | Endpoints reachable by the Fly.io proxy (indri for Caddy routing, Ingress pods for Alloy metrics/logs) |
**Important:** Don't tag user-owned devices (like gilbert) via Pulumi. Tagging converts them to "tagged devices" which lose user identity and break user-based SSH rules. Gilbert is referenced as `tag:workstation` in tagOwners for ownership purposes but remains user-owned so `blume.erich@gmail.com` identity is preserved.
@@ -81,6 +81,19 @@ Pulumi uses OAuth client from 1Password (blumeops vault):
- Scopes: acl, dns, devices, services
- Auto-applies `tag:blumeops` to IaC-managed resources
+## Direct Peering vs DERP Relay
+Just because Tailscale can route traffic does not mean it routes it efficiently. DERP relay servers are a fallback for when direct WireGuard connections cannot be established — they add significant latency (20+ seconds observed under load) because every packet bounces through a relay server.
+**Direct peering is critical for any production-like traffic path.** Check with `tailscale ping <host>` — it should say `via <ip>:<port>`, not `via DERP(<region>)`.
+Common reasons direct peering fails:
+- **k8s pods**: Tailscale Ingress pods behind pod-network NAT cannot hole-punch. Route through a host-level Tailscale node (e.g., Caddy on indri) instead.
+- **Cloud VMs**: Some cloud providers block incoming UDP. Pin the WireGuard port (`tailscaled --port=41641`) and expose it as a UDP service if possible.
+- **Double NAT / CGNAT**: Multiple NAT layers make hole punching unreliable.
+The [[flyio-proxy]] uses `--port=41641` pinning to enable direct peering with indri, and routes through [[caddy]] (host-level Tailscale) to avoid the DERP bottleneck of k8s-hosted Tailscale Ingress pods.
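The `tailscale ping` check described in this section can be automated. The following is a minimal sketch, not part of the commit, that classifies a single line of `tailscale ping` output as direct or relayed. It assumes the pong line contains `via <ip>:<port>` or `via DERP(<region>)` as described in the section; the function name and the sample addresses in the usage note are hypothetical.

```python
import re

# Captures what follows "via" in a pong line: either DERP(<region>) or <ip>:<port>.
# The line format is an assumption based on the description in the Tailscale docs page.
_VIA = re.compile(r"\bvia\s+(DERP\([^)]*\)|\S+:\d+)")


def classify_ping_line(line: str) -> str:
    """Return 'direct', 'derp', or 'unknown' for one line of `tailscale ping` output."""
    match = _VIA.search(line)
    if match is None:
        return "unknown"  # no "via ..." segment, e.g. a timeout message
    if match.group(1).startswith("DERP("):
        return "derp"  # traffic is bouncing through a relay server
    return "direct"  # peer-to-peer WireGuard path
```

For example, `classify_ping_line("pong from indri (100.64.0.1) via 203.0.113.7:41641 in 45ms")` would report a direct path, while a line containing `via DERP(sea)` would be reported as relayed. A monitoring script could alert on anything other than `direct`.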
## Related
- [[routing|Routing]] - Service URLs

@@ -1,6 +1,6 @@
---
title: Caddy
-modified: 2026-03-15
+modified: 2026-04-18
tags:
- service
- networking
@@ -83,7 +83,9 @@ The token is written to `~/.config/caddy/gandi-token` (chmod 0600) and sourced b
## Security Considerations
-Caddy has no authentication layer — it is a plain reverse proxy. Access control relies entirely on Tailscale ACLs restricting which devices can reach indri on port 443. Currently `tag:homelab` and `autogroup:admin` can reach Caddy. The [[flyio-proxy]] no longer routes through Caddy — it pushes logs and metrics directly to [[loki]] and [[prometheus]] via their Tailscale Ingress endpoints.
+Caddy has no authentication layer — it is a plain reverse proxy. Access control relies entirely on Tailscale ACLs restricting which devices can reach indri on port 443. Currently `tag:homelab`, `autogroup:admin`, and `tag:flyio-proxy` (via `tag:flyio-target` on indri) can reach Caddy.
+The [[flyio-proxy]] routes all public traffic through Caddy. This is the path for `*.eblu.me` requests from the public internet. Caddy sees these as requests from the Fly VM with `Host: *.ops.eblu.me` headers — the same routes used by tailnet clients.
## Custom Build

@@ -1,6 +1,6 @@
---
title: Fly.io Proxy
-modified: 2026-04-17
+modified: 2026-04-18
tags:
- service
- networking
@@ -23,23 +23,27 @@ Public reverse proxy on [Fly.io](https://fly.io) that exposes selected BlumeOps
## Exposed Services
-| Public domain | Backend | Service |
-|---------------|---------|---------|
-| `docs.eblu.me` | `docs.tail8d86e.ts.net` | [[docs]] |
-| `cv.eblu.me` | `cv.tail8d86e.ts.net` | [[cv]] |
-| `forge.eblu.me` | `forge.tail8d86e.ts.net` | [[forgejo]] |
+| Public domain | Backend (via Caddy) | Service |
+|---------------|---------------------|---------|
+| `docs.eblu.me` | `docs.ops.eblu.me` | [[docs]] |
+| `cv.eblu.me` | `cv.ops.eblu.me` | [[cv]] |
+| `forge.eblu.me` | `forge.ops.eblu.me` | [[forgejo]] |
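As an illustration of the updated table, the sketch below models the public-to-Caddy hostname mapping as a plain lookup. It is not the proxy's actual nginx configuration; the dictionary entries are copied from the table, and the function name is hypothetical.

```python
from typing import Optional

# Public hostname -> hostname presented to Caddy, copied from the updated table.
PUBLIC_TO_CADDY_HOST = {
    "docs.eblu.me": "docs.ops.eblu.me",
    "cv.eblu.me": "cv.ops.eblu.me",
    "forge.eblu.me": "forge.ops.eblu.me",
}


def caddy_host_for(public_host: str) -> Optional[str]:
    """Return the Caddy-facing hostname for a public hostname, or None if it is not exposed."""
    # Hostnames are case-insensitive and may carry a trailing root dot.
    normalized = public_host.lower().rstrip(".")
    return PUBLIC_TO_CADDY_HOST.get(normalized)
```

Any hostname missing from the mapping returns `None`, mirroring the idea that only explicitly listed services are exposed publicly.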
## Architecture
-Internet traffic hits Fly.io's Anycast edge, terminates TLS with a Let's Encrypt certificate, and is proxied by nginx to the backend service over a Tailscale WireGuard tunnel. See [[expose-service-publicly]] for the full architecture diagram.
+Internet traffic hits Fly.io's Anycast edge, terminates TLS with a Let's Encrypt certificate, and is proxied by nginx to [[caddy]] on [[indri]] over a direct Tailscale WireGuard tunnel. Caddy then routes to the actual service. See [[expose-service-publicly]] for the full architecture diagram.
-### Upstream Keepalive
+### Why Caddy, not per-service Tailscale Ingress?
-Nginx uses `upstream` blocks with `keepalive` connection pools to reuse TLS connections through the WireGuard tunnel. This avoids a per-request TLS handshake, which was previously the dominant source of latency (35s+ p50 before keepalive, sub-second after).
+Previously, nginx connected directly to each service's `*.tail8d86e.ts.net` Tailscale Ingress endpoint. This caused **20+ second latency** because the Tailscale Ingress pods (running inside k8s) are behind pod-network NAT and can only reach the Fly VM via Tailscale DERP relay servers — not direct WireGuard peering.
-**Trade-off:** DNS for upstream hostnames is resolved once at config load, not per-request. If Tailscale Ingress pods get new IPs (restart, reschedule, minikube restart), run `mise run fly-reload` to re-resolve without a full redeploy. A Grafana alert fires when upstreams are unreachable.
+Routing through Caddy on indri solves this because indri's host-level Tailscale can establish direct WireGuard connections with the Fly VM (45ms round trip). This generalizes to all services regardless of where they run (native on indri, minikube, or ringtail k3s), since Caddy already routes to everything.
-Each upstream requires `proxy_ssl_name` set to the actual Tailscale hostname — nginx sends the upstream block name as SNI by default, which the Tailscale Ingress proxy won't recognize.
+### Direct WireGuard Peering
+The Fly VM pins its Tailscale WireGuard listener to port 41641 (`tailscaled --port=41641`). Combined with well-behaved NAT on both sides (`MappingVariesByDestIP: false`), this allows Tailscale to establish direct peer-to-peer connections via UDP hole punching — no dedicated IPv4 required.
+If direct peering fails (observable via `tailscale ping indri` showing "via DERP"), allocate a dedicated IPv4 ($2/month) with `fly ips allocate-v4` to provide a guaranteed inbound UDP path.
## Key Files
@@ -58,6 +62,8 @@ Each upstream requires `proxy_ssl_name` set to the actual Tailscale hostname —
Fly.io runs Firecracker microVMs which support TUN devices natively. Tailscale runs with a real TUN interface (not userspace networking), so MagicDNS and direct Tailscale IP routing work normally.
+The `tailscaled` process is started with `--port=41641` to pin the WireGuard listener to a fixed port. This is critical for direct peering — without it, hole punching is unreliable. A `[[services]]` block in `fly.toml` exposes this port as UDP, though it is only active when a dedicated IPv4 is allocated.
The Tailscale auth key is `preauthorized=True` to avoid device approval hangs on container restarts.
## Observability
@@ -83,9 +89,7 @@ Alloy listens on `127.0.0.1:12345` for self-scraping its `/metrics` endpoint. Al
## Security Considerations
-The `tag:flyio-proxy` ACL grants access only to `tag:flyio-target:443`. Services must explicitly opt in by adding a `tailscale.com/tags: "tag:k8s,tag:flyio-target"` annotation to their Tailscale Ingress. This means the proxy can only reach endpoints that have been individually tagged — a compromised nginx config cannot route to arbitrary services on the tailnet.
-Currently tagged as `tag:flyio-target`: [[docs]], [[cv]], [[forgejo]], [[loki]], [[prometheus]]. Loki and Prometheus are tagged so that [[alloy|Alloy]] (running inside the container) can push logs and metrics directly via their Tailscale Ingress endpoints — the restricted ACL means Caddy on indri (`tag:homelab`) is not reachable from the proxy.
+The `tag:flyio-proxy` ACL grants access only to `tag:flyio-target:443`. Indri carries this tag (for Caddy), and the k8s Tailscale Ingress pods for Loki and Prometheus also carry it so [[alloy|Alloy]] can push logs and metrics directly. A compromised proxy cannot route to arbitrary services on the tailnet — only `tag:flyio-target` endpoints on port 443.
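The access rule in the updated paragraph can be illustrated with a small default-deny model. This is a simplified sketch, not Tailscale's policy engine and not the contents of `policy.hujson`. The `Grant` class, `can_reach` function, and tag sets are hypothetical names; the tag memberships are assumptions drawn from the tag table and the text in this commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One allow rule: nodes carrying src_tag may reach nodes carrying dst_tag on port."""
    src_tag: str
    dst_tag: str
    port: int


# The single grant described in the text: tag:flyio-proxy -> tag:flyio-target:443.
GRANTS = [Grant("tag:flyio-proxy", "tag:flyio-target", 443)]


def can_reach(src_tags, dst_tags, port, grants=GRANTS):
    """Default deny: access is allowed only if some grant matches source, destination, and port."""
    return any(
        g.src_tag in src_tags and g.dst_tag in dst_tags and g.port == port
        for g in grants
    )


# Tag sets assumed from the docs in this commit (illustrative only).
PROXY = {"tag:flyio-proxy"}
INDRI = {"tag:homelab", "tag:flyio-target"}
GRAFANA_INGRESS = {"tag:k8s", "tag:grafana"}
```

Under this model, `can_reach(PROXY, INDRI, 443)` is allowed, while `can_reach(PROXY, INDRI, 22)` and `can_reach(PROXY, GRAFANA_INGRESS, 443)` are denied, which matches the claim that a compromised proxy is limited to `tag:flyio-target` endpoints on port 443.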
### Crawler Mitigation
@@ -101,7 +105,7 @@ Archive requests (`/<owner>/<repo>/archive/*`) are 302-redirected to `forge.ops.
Release downloads are cached at the proxy layer (7-day TTL, keyed by URI) to absorb repeated downloads of the same artifact.
-To expose an additional service through the proxy, add the `tag:flyio-target` annotation to its Tailscale Ingress. See [[expose-service-publicly]] for the full workflow.
+To expose an additional service through the proxy, add a Caddy route for it and an nginx `server` block. See [[expose-service-publicly]] for the full workflow.
## Spider Trap Mitigation