Review expose-service-publicly doc (#195)

## Summary

- Replace stale inline code listings (fly.toml, Dockerfile, start.sh, nginx.conf, mise tasks, CI workflow) with brief descriptions pointing readers to the actual `fly/` and `mise-tasks/` files (prevents future drift)
- Add observability sidecar section documenting the Alloy integration (logs → Loki, metrics → Prometheus)
- Fix broken internal wiki-link (`[[#7. Update Tailscale ACLs if needed]]` → correct heading)
- Update per-service nginx templates to current patterns (deferred DNS resolution via `set $upstream` variable, `proxy_intercept_errors`, error pages)
- Add `cv.eblu.me` to verification steps (live service not previously documented)
- Add `last-reviewed: 2026-02-16` frontmatter
- Net -187 lines (58 added, 245 removed)

## Deployment and Testing

- [x] All pre-commit hooks pass (link validation, frontmatter, filenames)
- [ ] Docs site renders correctly after merge

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/195
2026-02-16 15:49:55 -08:00 · 2c55c2316e
commit 2c55c2316e
parent 74294094e3
2 changed files with 58 additions and 245 deletions
--- a/docs/changelog.d/docs-review-expose-service-publicly.doc.md
+++ b/docs/changelog.d/docs-review-expose-service-publicly.doc.md
@@ -0,0 +1 @@
+Review expose-service-publicly doc: replace stale inline code with references to actual files, add observability sidecar section, fix broken internal link, update templates to current patterns.
--- a/docs/how-to/expose-service-publicly.md
+++ b/docs/how-to/expose-service-publicly.md
@@ -1,6 +1,7 @@
 ---
 title: Expose a Service Publicly
-modified: 2026-02-08
+modified: 2026-02-16
+last-reviewed: 2026-02-16
 tags:
  - how-to
  - fly-io
@@ -103,150 +104,30 @@ Create the `fly/` directory at the repository root. This is separate from `conta

 ```
 fly/
-├── README.md           # Setup notes and context
 ├── fly.toml            # Fly.io app configuration
-├── Dockerfile          # nginx + tailscale
+├── Dockerfile          # nginx + tailscale + alloy
 ├── nginx.conf          # Reverse proxy + cache config
-└── start.sh            # Entrypoint: start tailscale, then nginx
+├── start.sh            # Entrypoint: start tailscale, nginx, alloy
+├── alloy.river         # Observability: logs → Loki, metrics → Prometheus
+└── error.html          # Friendly 503 page for upstream failures
 ```

-**`fly/fly.toml`** — app configuration:
+See the actual files in `fly/` for current configuration. Key design points:

-```toml
-app = "blumeops-proxy"
-primary_region = "sjc"
+- **`fly.toml`** — uses bluegreen deploys so the old machine serves traffic until the new one passes health checks. `auto_stop_machines = "off"` keeps the proxy always-on.
+- **`Dockerfile`** — multi-stage build pulling nginx, Tailscale, and [[alloy]] binaries. Alloy runs as a sidecar inside the container for observability (see below).
+- **`start.sh`** — starts `tailscaled` first (MagicDNS must be available before nginx resolves upstreams), then nginx in the background, then Alloy, and blocks on the nginx process.
+- **`nginx.conf`** — uses a `resolver 100.100.100.100` directive so upstream DNS resolution is deferred to request time (not config load time). Each service gets a `server` block with a `set $upstream` variable pattern. Includes a JSON access log format that Alloy tails for log collection and metric extraction. A catch-all server block serves `/healthz` and rejects unknown hosts.
+- **`error.html`** — shown via `proxy_intercept_errors` when upstreams are unreachable (indri offline, tunnel down, etc.). Cached responses still take priority via `proxy_cache_use_stale`.

-[build]
+#### Observability sidecar

-[http_service]
-  internal_port = 8080
-  force_https = true
-  auto_stop_machines = false
-  auto_start_machines = true
-  min_machines_running = 1
+The Fly.io container includes [[alloy]] baked in (`fly/alloy.river`). Alloy tails the nginx JSON access log and:

-[checks]
-  [checks.health]
-    port = 8080
-    type = "http"
-    interval = "30s"
-    timeout = "5s"
-    path = "/healthz"
-```
+- Forwards log lines to [[loki]] via the Tailscale Ingress endpoint
+- Derives Prometheus metrics (`flyio_nginx_http_requests_total`, `flyio_nginx_http_request_duration_seconds`, `flyio_nginx_cache_requests_total`, etc.) and remote-writes them to [[prometheus]]

-**`fly/Dockerfile`** — nginx + tailscale:
-
-```dockerfile
-FROM nginx:alpine
-
-# Copy tailscale binaries from official image
-COPY --from=docker.io/tailscale/tailscale:stable \
-    /usr/local/bin/tailscaled /usr/local/bin/tailscaled
-COPY --from=docker.io/tailscale/tailscale:stable \
-    /usr/local/bin/tailscale /usr/local/bin/tailscale
-
-RUN mkdir -p /var/run/tailscale /var/lib/tailscale \
-    && apk add --no-cache iptables ip6tables
-
-COPY nginx.conf /etc/nginx/nginx.conf
-COPY start.sh /start.sh
-RUN chmod +x /start.sh
-
-EXPOSE 8080
-
-CMD ["/start.sh"]
-```
-
-**`fly/start.sh`** — entrypoint:
-
-```bash
-#!/bin/sh
-set -e
-
-# Start tailscale daemon. Fly.io runs Firecracker microVMs which support
-# TUN devices natively — no need for --tun=userspace-networking.
-tailscaled --statedir=/var/lib/tailscale &
-sleep 2
-
-# Authenticate and join tailnet
-tailscale up --authkey="${TS_AUTHKEY}" --hostname=flyio-proxy
-
-# Wait for tailscale to be ready
-until tailscale status > /dev/null 2>&1; do sleep 1; done
-echo "Tailscale connected"
-
-# Start nginx — MagicDNS resolves *.tail8d86e.ts.net hostnames
-nginx -g "daemon off;"
-```
-
-**`fly/nginx.conf`** — reverse proxy with caching and rate limiting:
-
-> The example below shows a **static site** configuration (docs.eblu.me).
-> For dynamic services, see [[#Considerations for dynamic services]].
-
-```nginx
-worker_processes auto;
-
-events {
-    worker_connections 1024;
-}
-
-http {
-    include /etc/nginx/mime.types;
-    default_type application/octet-stream;
-
-    # Rate limiting zones — define per-service zones as needed
-    limit_req_zone $binary_remote_addr zone=general:10m rate=10r/s;
-
-    # Proxy cache: 200MB, evict after 24h of no access
-    proxy_cache_path /tmp/cache levels=1:2 keys_zone=services:10m
-                     max_size=200m inactive=24h;
-
-    # --- docs.eblu.me (static site) ---
-    server {
-        listen 8080;
-        server_name docs.eblu.me;
-
-        limit_req zone=general burst=20 nodelay;
-
-        location / {
-            proxy_pass https://docs.tail8d86e.ts.net;
-            proxy_ssl_verify off;
-            proxy_ssl_server_name on;
-
-            # Cache aggressively — static site only.
-            # Do NOT use these settings for dynamic services.
-            proxy_cache services;
-            proxy_cache_valid 200 1d;
-            proxy_cache_valid 404 1m;
-            proxy_cache_use_stale error timeout updating;
-            proxy_cache_lock on;
-
-            # Prevent cache-busting: ignore query strings and
-            # client cache-control headers.
-            # Safe for static sites; breaks dynamic services.
-            proxy_cache_key $host$uri;
-            proxy_ignore_headers Cache-Control Set-Cookie;
-
-            add_header X-Cache-Status $upstream_cache_status;
-            add_header X-Clacks-Overhead "GNU Terry Pratchett" always;
-        }
-    }
-
-    # Catch-all: reject unknown hosts, but serve health check
-    server {
-        listen 8080 default_server;
-
-        location /healthz {
-            return 200 "ok\n";
-        }
-
-        location / {
-            return 444;
-        }
-    }
-}
-```
+Both Loki and Prometheus are reached directly via their `*.tail8d86e.ts.net` Tailscale Ingress endpoints (not via [[caddy]]), since the proxy's ACLs only allow `tag:flyio-target`.

 ### Step 3: Tailscale auth key and ACLs (Pulumi)

@@ -297,7 +178,7 @@ ACL test:
 },
 ```

-Each service's Tailscale Ingress must be annotated with `tag:flyio-target` to be reachable by the proxy — see [[#7. Update Tailscale ACLs if needed]].
+Each service's Tailscale Ingress must be annotated with `tag:flyio-target` to be reachable by the proxy — see [[#7. Tag the Tailscale Ingress with tag:flyio-target]].

 Deploy: `mise run tailnet-preview` then `mise run tailnet-up`.

@@ -315,109 +196,15 @@ Store the auth key in 1Password as well for the `fly-setup` mise task.

 ### Step 4: Mise tasks

-**`mise-tasks/fly-deploy`:**
+Three mise tasks manage the proxy lifecycle. See the actual scripts in `mise-tasks/` for current implementation:

-```bash
-#!/usr/bin/env bash
-#MISE description="Deploy the Fly.io public proxy"
-
-set -euo pipefail
-
-cd "$(dirname "$0")/../fly"
-fly deploy "$@"
-```
-
-**`mise-tasks/fly-setup`:**
-
-```bash
-#!/usr/bin/env bash
-#MISE description="One-time setup: configure Fly.io secrets and certs (idempotent)"
-
-set -euo pipefail
-
-APP="blumeops-proxy"
-
-# Fetch Tailscale auth key from Pulumi state
-echo "Fetching Tailscale auth key from Pulumi..."
-TS_AUTHKEY=$(cd "$(dirname "$0")/../pulumi/tailscale" && pulumi stack select tail8d86e && pulumi stack output flyio_authkey --show-secrets)
-fly secrets set TS_AUTHKEY="$TS_AUTHKEY" --stage -a "$APP"
-echo "Tailscale auth key staged (will take effect on next deploy)"
-
-# Allocate IPs (idempotent — fly errors if already allocated)
-# Shared IPv4 is free and sufficient for HTTP/HTTPS services.
-# Use 'fly ips allocate-v4' (no --shared) for dedicated IPv4 ($2/mo)
-# if the service needs non-HTTP protocols.
-fly ips allocate-v4 --shared -a "$APP" 2>/dev/null || true
-fly ips allocate-v6 -a "$APP" 2>/dev/null || true
-echo "IPs allocated"
-
-# Add certs for all public domains (idempotent — fly ignores duplicates)
-fly certs add docs.eblu.me -a "$APP" 2>/dev/null || true
-# fly certs add wiki.eblu.me -a "$APP" 2>/dev/null || true  # future services
-echo "Certificates configured"
-
-echo "Done. Run 'mise run fly-deploy' to deploy."
-```
-
-**`mise-tasks/fly-shutoff`:**
-
-```bash
-#!/usr/bin/env bash
-#MISE description="Emergency shutoff: stop all Fly.io proxy machines"
-
-set -euo pipefail
-
-APP="blumeops-proxy"
-
-echo "EMERGENCY SHUTOFF: Stopping all machines for $APP"
-fly scale count 0 -a "$APP" --yes
-echo "All machines stopped. Public services are offline."
-echo "To restore: fly scale count 1 -a $APP"
-```
+- **`mise run fly-deploy`** — runs `fly deploy` from the `fly/` directory
+- **`mise run fly-setup`** — one-time, idempotent setup: fetches the Tailscale auth key from Pulumi state, stages it as a Fly.io secret, allocates IPs, and adds TLS certs for all public domains (currently `docs.eblu.me` and `cv.eblu.me`)
+- **`mise run fly-shutoff`** — emergency shutoff: scales machines to zero, immediately stopping all public traffic

 ### Step 5: Forgejo CI workflow

-**`.forgejo/workflows/deploy-fly.yaml`:**
-
-```yaml
-name: Deploy Fly.io Proxy
-
-on:
-  workflow_dispatch:
-  push:
-    branches: [main]
-    paths:
-      - 'fly/**'
-
-jobs:
-  deploy:
-    runs-on: k8s
-    steps:
-      - name: Checkout
-        uses: actions/checkout@v4
-
-      - name: Install flyctl
-        run: |
-          curl -L https://fly.io/install.sh | sh
-          echo "/root/.fly/bin" >> "$GITHUB_PATH"
-
-      - name: Deploy to Fly.io
-        env:
-          FLY_API_TOKEN: ${{ secrets.FLY_DEPLOY_TOKEN }}
-        run: |
-          cd fly
-          fly deploy
-
-      - name: Verify health
-        env:
-          FLY_API_TOKEN: ${{ secrets.FLY_DEPLOY_TOKEN }}
-        run: |
-          fly status -a blumeops-proxy
-          echo ""
-          echo "Health check:"
-          sleep 10
-          curl -sf https://blumeops-proxy.fly.dev/healthz || echo "Warning: health check failed (may need DNS propagation)"
-```
+A Forgejo Actions workflow (`.forgejo/workflows/deploy-fly.yaml`) auto-deploys on pushes to `main` that touch `fly/**`. It installs `flyctl`, runs `fly deploy`, and verifies health. It can also be triggered manually via `workflow_dispatch`.

 The `FLY_DEPLOY_TOKEN` Forgejo Actions secret must be set via the [[forgejo]] API or UI, following the pattern in the `forgejo_actions_secrets` Ansible role.

@@ -430,9 +217,12 @@ To expose an additional service (example: `wiki.eblu.me`):
 ### 1. Add nginx server block

 Edit `fly/nginx.conf` — add a new `server` block. The configuration
-differs significantly between static and dynamic services.
+differs significantly between static and dynamic services. See the
+existing `docs.eblu.me` and `cv.eblu.me` blocks in `fly/nginx.conf`
+for the current pattern (uses `set $upstream` variable for deferred
+DNS resolution, `proxy_intercept_errors` for error pages, etc.).

-**Static site example** (same pattern as docs):
+**Static site template** (simplified — adapt from existing blocks):

 ```nginx
 # --- wiki.eblu.me (static) ---
@@ -442,9 +232,18 @@ server {

    limit_req zone=general burst=20 nodelay;

+    error_page 502 503 504 /error.html;
+    location = /error.html {
+        root /usr/share/nginx/html;
+        internal;
+    }
+
    location / {
-        proxy_pass https://wiki.tail8d86e.ts.net;
+        set $upstream_wiki https://wiki.tail8d86e.ts.net;
+        proxy_pass $upstream_wiki$request_uri;
        proxy_ssl_verify off;
+        proxy_ssl_server_name on;
+        proxy_intercept_errors on;

        proxy_cache services;
        proxy_cache_valid 200 1d;
@@ -460,7 +259,7 @@ server {
 }
 ```

-**Dynamic service example** (e.g., Forgejo):
+**Dynamic service template** (e.g., Forgejo — hypothetical, not currently deployed):

 ```nginx
 # --- forge.eblu.me (dynamic, authenticated) ---
@@ -476,9 +275,18 @@ server {
    # Git LFS and repo uploads can be large
    client_max_body_size 512m;

+    error_page 502 503 504 /error.html;
+    location = /error.html {
+        root /usr/share/nginx/html;
+        internal;
+    }
+
    location / {
-        proxy_pass https://forge.tail8d86e.ts.net;
+        set $upstream_forge https://forge.tail8d86e.ts.net;
+        proxy_pass $upstream_forge$request_uri;
        proxy_ssl_verify off;
+        proxy_ssl_server_name on;
+        proxy_intercept_errors on;

        # NO proxy_cache — dynamic content with sessions.
        # Caching would serve stale pages and break authentication.
@@ -497,8 +305,10 @@ server {

    # Selectively cache static assets only
    location ~* \.(css|js|png|jpg|svg|woff2?)$ {
-        proxy_pass https://forge.tail8d86e.ts.net;
+        set $upstream_forge_static https://forge.tail8d86e.ts.net;
+        proxy_pass $upstream_forge_static$request_uri;
        proxy_ssl_verify off;
+        proxy_ssl_server_name on;

        proxy_cache services;
        proxy_cache_valid 200 7d;
@@ -709,6 +519,7 @@ dynamic, authenticated service like [[forgejo]].
 | Tailscale ACLs | Pulumi (`pulumi/tailscale/policy.hujson`) | yes |
 | DNS CNAMEs | Pulumi (`pulumi/gandi/`) | yes |
 | Container + app config | `fly/Dockerfile` + `fly/fly.toml` in repo | yes |
+| Observability | `fly/alloy.river` in repo | yes |
 | Deployment | Forgejo CI on push to `fly/`, or `mise run fly-deploy` | yes |
 | Fly.io secrets + certs | `mise run fly-setup` (one-time, idempotent) | semi |

@@ -735,6 +546,7 @@ If anything fails here, debug without public DNS impact.
 After deploying DNS (`mise run dns-up`):

 1. `curl -I https://docs.eblu.me` — returns 200 with `X-Cache-Status` header
-2. `dig docs.eblu.me` — resolves to Fly.io IPs (not Tailscale IP)
-3. `dig forge.ops.eblu.me` — still resolves to indri's Tailscale IP (unchanged)
-4. Second request to same URL shows `X-Cache-Status: HIT`
+2. `curl -I https://cv.eblu.me` — same for each public service
+3. `dig docs.eblu.me` — resolves to Fly.io IPs (not Tailscale IP)
+4. `dig forge.ops.eblu.me` — still resolves to indri's Tailscale IP (unchanged)
+5. Second request to same URL shows `X-Cache-Status: HIT`
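
The header checks in the verification steps above (`curl -I` for each public hostname, then a second request showing `X-Cache-Status: HIT`) can be scripted. The following is a minimal sketch under assumptions, not part of this commit: `cache_status` is a hypothetical helper name, and it assumes only that the proxy returns an `X-Cache-Status` header as the verification steps describe.

```shell
#!/bin/sh
# Hypothetical helper (not in the repo): print the X-Cache-Status value
# from HTTP response headers read on stdin. Header names are compared
# case-insensitively because HTTP/2 responses use lowercase names.
cache_status() {
    # Strip carriage returns, then print the value of the first matching header.
    tr -d '\r' | awk -F': *' 'tolower($1) == "x-cache-status" { print $2; exit }'
}

# Usage against the live proxy (requires network access):
#   curl -sI https://docs.eblu.me | cache_status
#   curl -sI https://cv.eblu.me   | cache_status
```

Running the same `curl` pipeline twice against one URL should print `HIT` on the second run if caching behaves as step 5 describes. An empty result means the header was absent from the response.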