From 005e2a03ed4b26008eafede91c0c0b548472d2f2 Mon Sep 17 00:00:00 2001 From: Erich Blume Date: Mon, 27 Apr 2026 09:48:46 -0700 Subject: [PATCH] C0: split gandi-operations docs; add dns-acme-cleanup mise task MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Splits the nebulous gandi-operations how-to into two single-topic cards (manage-eblu-me-dns, rotate-gandi-pat) and adds a mise task for the recurring _acme-challenge TXT cleanup needed due to a value-comparison bug in libdns/gandi v1.1.0 that prevents certmagic's cleanup phase from removing presented TXT values. The gandi reference card is updated to drop the false "different credential from Pulumi PAT" claim — verified during the 2026-04-27 incident that Caddy and Pulumi share a single PAT. Co-Authored-By: Claude Opus 4.7 (1M context) --- docs/how-to/configuration/gandi-operations.md | 90 ------------- .../configuration/manage-eblu-me-dns.md | 52 ++++++++ .../configuration/manage-forgejo-mirrors.md | 2 +- docs/how-to/configuration/rotate-gandi-pat.md | 125 ++++++++++++++++++ docs/reference/infrastructure/gandi.md | 48 +++---- docs/reference/tools/mise-tasks.md | 1 + docs/reference/tools/pulumi.md | 3 +- mise-tasks/dns-acme-cleanup | 112 ++++++++++++++++ pulumi/gandi/README.md | 39 +----- pulumi/gandi/__main__.py | 2 +- 10 files changed, 315 insertions(+), 159 deletions(-) delete mode 100644 docs/how-to/configuration/gandi-operations.md create mode 100644 docs/how-to/configuration/manage-eblu-me-dns.md create mode 100644 docs/how-to/configuration/rotate-gandi-pat.md create mode 100755 mise-tasks/dns-acme-cleanup diff --git a/docs/how-to/configuration/gandi-operations.md b/docs/how-to/configuration/gandi-operations.md deleted file mode 100644 index 0be00dc..0000000 --- a/docs/how-to/configuration/gandi-operations.md +++ /dev/null @@ -1,90 +0,0 @@ ---- -title: Gandi Operations -modified: 2026-02-17 -last-reviewed: 2026-02-17 -tags: - - how-to - - dns - - pulumi ---- - -# 
Gandi Operations - -How to manage DNS records and cycle the Gandi API token. - -## Prerequisites - -- Pulumi CLI installed (`brew install pulumi`) -- Access to 1Password blumeops vault (for PAT) -- On the tailnet (Pulumi resolves indri's IP via MagicDNS) - -## Preview and Apply DNS Changes - -```bash -# Preview changes (always do this first) -mise run dns-preview - -# Apply changes -mise run dns-up -``` - -Both tasks fetch the Gandi PAT from 1Password automatically. - -To run Pulumi directly: - -```bash -export GANDI_PERSONAL_ACCESS_TOKEN=$(op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/mco6ka3dc3rmw7zkg2dhia5d2m/pat") -cd pulumi/gandi -pulumi preview -pulumi up --yes -``` - -## Cycle the Gandi PAT - -The Gandi Personal Access Token has a maximum lifetime of 90 days. Currently set to 30 days as a security compromise, though shorter may be appropriate given infrequent use. - -### 1. Create a new PAT - -Go to the [Gandi admin console](https://admin.gandi.net/organizations/1db8d76a-f729-11ed-b8d1-00163e94b645/account/pat) and create a new token: - -- **Name:** `blumeops-pulumi` (or similar) -- **Expiration:** 30 days (max 90; shorter is fine if you run this rarely) -- **Required permission:** Manage domain name technical configurations -- **Also enable:** See and renew domain names - -Copy the new PAT to your clipboard. - -### 2. Update 1Password - -With the new PAT on your clipboard: - -```bash -op item edit mco6ka3dc3rmw7zkg2dhia5d2m pat="$(pbpaste)" --vault vg6xf6vvfmoh5hqjjhlhbeoaie -``` - -### 3. Delete the old PAT - -Return to the Gandi admin console and delete the previous token. - -### 4. Verify - -```bash -mise run dns-preview -``` - -A successful preview confirms the new PAT is working. - -## Break-Glass Override - -If MagicDNS is unavailable and Pulumi can't resolve indri's IP, set the target IP manually. 
Find indri's current Tailscale IP via `tailscale status` or the admin console: - -```bash -export BLUMEOPS_REVERSE_PROXY_IP= -mise run dns-up -``` - -## Related - -- [[gandi]] - DNS configuration reference -- [[caddy]] - Reverse proxy (also uses a Gandi token for TLS) -- [[update-tailscale-acls]] - Similar Pulumi workflow for Tailscale diff --git a/docs/how-to/configuration/manage-eblu-me-dns.md b/docs/how-to/configuration/manage-eblu-me-dns.md new file mode 100644 index 0000000..4c37d4c --- /dev/null +++ b/docs/how-to/configuration/manage-eblu-me-dns.md @@ -0,0 +1,52 @@ +--- +title: Manage eblu.me DNS Records +modified: 2026-04-27 +last-reviewed: 2026-04-27 +tags: + - how-to + - dns + - pulumi +--- + +# Manage eblu.me DNS Records + +How to add, change, and apply DNS records for `eblu.me` via [[pulumi]]. + +## Prerequisites + +- Pulumi CLI installed (`brew install pulumi`) +- 1Password access (`blumeops` vault) — Pulumi reads the Gandi PAT from there +- On the tailnet — Pulumi resolves [[indri]]'s IP via MagicDNS at apply time + +## Preview and apply + +```bash +mise run dns-preview # always do this first +mise run dns-up # apply +``` + +Both fetch the PAT from 1Password automatically. The Pulumi program is in `pulumi/gandi/`; stack is `eblu-me`. + +## Adding a record + +Edit `pulumi/gandi/__main__.py` and add a `gandi.livedns.Record(...)`. The stack config (`Pulumi.eblu-me.yaml`) only holds `domain` and `subdomain`; everything else is in the program. + +After editing, preview, then apply. + +## Break-glass: override the indri target IP + +The wildcard `*.ops.eblu.me` is computed from `indri.tail8d86e.ts.net` via MagicDNS at apply time. If MagicDNS is unavailable: + +```bash +export BLUMEOPS_REVERSE_PROXY_IP= +mise run dns-up +``` + +Find the IP via `tailscale status` or the Tailscale admin console. 
+ +## Related + +- [[gandi]] — Gandi reference card +- [[rotate-gandi-pat]] — Rotate the PAT shared with [[caddy]] +- [[pulumi]] — Pulumi tooling reference +- [[routing]] — Service URLs and routing architecture diff --git a/docs/how-to/configuration/manage-forgejo-mirrors.md b/docs/how-to/configuration/manage-forgejo-mirrors.md index 7f98549..9c0e113 100644 --- a/docs/how-to/configuration/manage-forgejo-mirrors.md +++ b/docs/how-to/configuration/manage-forgejo-mirrors.md @@ -144,6 +144,6 @@ Trigger a manual sync on one mirror to confirm the new PAT works: ## Related - [[forgejo]] — Forgejo service reference -- [[gandi-operations]] — Similar PAT rotation workflow for Gandi DNS +- [[rotate-gandi-pat]] — Similar PAT rotation workflow for Gandi DNS - [[spork-strategy]] — floating-branch soft-fork strategy explanation - [[create-a-spork]] — create a spork on top of a mirror diff --git a/docs/how-to/configuration/rotate-gandi-pat.md b/docs/how-to/configuration/rotate-gandi-pat.md new file mode 100644 index 0000000..94a0b4e --- /dev/null +++ b/docs/how-to/configuration/rotate-gandi-pat.md @@ -0,0 +1,125 @@ +--- +title: Rotate the Gandi PAT +modified: 2026-04-27 +last-reviewed: 2026-04-27 +tags: + - how-to + - dns + - secrets +--- + +# Rotate the Gandi PAT + +How to rotate the Gandi Personal Access Token. **One PAT** is shared by [[caddy]] (TLS via ACME DNS-01) and Pulumi (DNS records). It lives in 1Password at `op://blumeops/gandi - blumeops/pat`. + +## When to rotate + +- Every 60 days (Todoist recurring task) +- After any compromise / accidental disclosure +- Whenever Gandi starts rejecting the PAT (see [Debugging](#debugging)) + +Gandi caps PAT lifetime at 90 days; rotating at 60 leaves a 30-day buffer. 
+ +## Prerequisites + +- Access to the [Gandi PAT admin console](https://admin.gandi.net/organizations/1db8d76a-f729-11ed-b8d1-00163e94b645/account/pat) +- 1Password (`blumeops` vault) +- Ability to run `mise run provision-indri` (ssh to [[indri]] + 1Password biometric) + +## Procedure + +### 1. Create a new PAT in Gandi + +In the [Gandi PAT console](https://admin.gandi.net/organizations/1db8d76a-f729-11ed-b8d1-00163e94b645/account/pat), create a token: + +- **Name:** `blumeops` +- **Expiration:** **90 days** (the max — paired with the 60-day rotation cadence) +- **Permissions:** + - Manage domain name technical configurations *(required — DNS records and ACME TXT writes)* + - See and renew domain names + +Other permissions are not used. + +Copy the new PAT to your clipboard. + +### 2. Update 1Password + +```bash +op item edit mco6ka3dc3rmw7zkg2dhia5d2m pat="$(pbpaste)" --vault vg6xf6vvfmoh5hqjjhlhbeoaie +``` + +### 3. Push to indri + +The PAT lives in two places: 1Password (read by Pulumi at runtime) and `~/.config/caddy/gandi-token` on indri (read by Caddy at startup). The 1Password edit only updates the first. + +```bash +mise run provision-indri --tags caddy +``` + +This re-fetches the PAT from 1Password, writes it to indri, and restarts Caddy. Caddy will renew any due certificates within minutes. + +### 4. Verify + +```bash +mise run dns-preview +``` + +A successful preview confirms Pulumi can use the PAT. + +```bash +ssh indri 'tail -50 ~/Library/Logs/mcquack.caddy.err.log' \ + | grep -E "obtained|renew|error" +``` + +Expect to see no `LiveDNS returned a 403` lines, and either no renewal activity (if no certs were due) or `certificate obtained successfully`. + +### 5. Delete the old PAT in Gandi + +Return to the Gandi PAT console and delete the previous token. + +### 6. Clean up orphan ACME records + +Each successful Caddy renewal leaves orphan `_acme-challenge.ops` TXT records in the zone (a bug in `libdns/gandi` v1.1.0 — see the script docstring). 
Cadence aligns with rotation: + +```bash +mise run dns-acme-cleanup --dry-run +mise run dns-acme-cleanup +``` + +## Debugging + +### Caddy logs `LiveDNS returned a 403` + +The PAT is invalid (expired, revoked, or insufficient scope). **Gandi returns 403 — not 401 — for an expired PAT**, which can read as a permissions issue. The most common cause is plain expiry. Rotate. + +### `mise run dns-preview` returns 403 + +Same root cause — Pulumi and Caddy share this PAT. + +### After a fresh PAT, Caddy still fails + +Check that the value on indri matches 1Password: + +```bash +diff <(ssh indri 'cat ~/.config/caddy/gandi-token') \ + <(op read 'op://blumeops/gandi - blumeops/pat') +``` + +If they differ, `mise run provision-indri --tags caddy` was skipped or failed. + +Confirm the new PAT works against Gandi directly: + +```bash +curl -s -o /dev/null -w "HTTP %{http_code}\n" \ + -H "Authorization: Bearer $(op read 'op://blumeops/gandi - blumeops/pat')" \ + https://api.gandi.net/v5/livedns/domains/eblu.me +``` + +`200` = healthy. `403` = scope or expiry. `401` = malformed token. + +## Related + +- [[gandi]] — Gandi reference card +- [[manage-eblu-me-dns]] — DNS records workflow (separate operation, same PAT) +- [[caddy]] — Reverse proxy that uses the PAT for TLS +- [[mise-tasks]] — `dns-acme-cleanup`, `provision-indri`, `dns-preview` reference diff --git a/docs/reference/infrastructure/gandi.md b/docs/reference/infrastructure/gandi.md index ae1fe56..763bae3 100644 --- a/docs/reference/infrastructure/gandi.md +++ b/docs/reference/infrastructure/gandi.md @@ -1,7 +1,7 @@ --- title: Gandi -modified: 2026-04-09 -last-reviewed: 2026-04-09 +modified: 2026-04-27 +last-reviewed: 2026-04-27 tags: - infrastructure - networking @@ -20,12 +20,11 @@ DNS hosting provider for the `eblu.me` domain, managed via Pulumi IaC. 
| **Provider** | Gandi LiveDNS | | **IaC** | `pulumi/gandi/` | | **Stack** | `eblu-me` | +| **PAT** | `op://blumeops/gandi - blumeops/pat` | ## What It Does -Gandi hosts the DNS records that make `*.ops.eblu.me` resolve to [[indri]]'s Tailscale IP (`indri.tail8d86e.ts.net`). Since Tailscale IPs are not publicly routable, this gives services real DNS names while keeping them private to the tailnet. - -The target IP is resolved dynamically from `indri.tail8d86e.ts.net` at deploy time, so if indri's Tailscale IP changes, re-running the deployment is sufficient. +Gandi hosts the DNS records that make `*.ops.eblu.me` resolve to [[indri]]'s Tailscale IP. Since Tailscale IPs are not publicly routable, this gives services real DNS names while keeping them private to the tailnet. The target IP is resolved dynamically from `indri.tail8d86e.ts.net` at deploy time. ## DNS Records @@ -46,38 +45,25 @@ Both records point to [[indri]], which runs [[caddy]] as the reverse proxy for a | `cv.eblu.me` | CNAME | `blumeops-proxy.fly.dev` | 300s | | `forge.eblu.me` | CNAME | `blumeops-proxy.fly.dev` | 300s | -Public CNAMEs point to [[flyio-proxy]] on Fly.io. See [[expose-service-publicly]] for adding new public services. - -See [[routing]] for the full service URL map. - -## Pulumi Configuration - -The Pulumi program lives in `pulumi/gandi/`: - -- `__main__.py` - Creates A and CNAME records via `pulumiverse_gandi` -- `Pulumi.eblu-me.yaml` - Stack config (domain, subdomain) - -Stack config values: - -| Key | Value | -|-----|-------| -| `blumeops-dns:domain` | `eblu.me` | -| `blumeops-dns:subdomain` | `ops` | - -A break-glass override is available via the `BLUMEOPS_REVERSE_PROXY_IP` environment variable, which bypasses dynamic IP resolution. +Public CNAMEs point to [[flyio-proxy]] on Fly.io. See [[expose-service-publicly]] for adding new public services. See [[routing]] for the full service URL map. 
## TLS Integration -[[caddy]] uses Gandi's API separately (via `GANDI_BEARER_TOKEN`) for ACME DNS-01 challenges to obtain a wildcard Let's Encrypt certificate for `*.ops.eblu.me`. This is a different credential from the Pulumi PAT. +[[caddy]] uses this same Gandi PAT for ACME DNS-01 challenges to obtain a wildcard Let's Encrypt certificate for `*.ops.eblu.me`. Caddy reads the PAT from `~/.config/caddy/gandi-token` on [[indri]], populated by ansible from 1Password. ## Authentication -Gandi requires a Personal Access Token (PAT) for API access. PATs have a maximum lifetime of 90 days (currently set to 30). See [[gandi-operations]] for deployment and PAT cycling instructions. +One Gandi Personal Access Token, shared by Pulumi and Caddy. Gandi caps PATs at 90 days; rotate every 60 days via [[rotate-gandi-pat]]. + +## ACME Challenge Cleanup + +Caddy's renewal flow leaves `_acme-challenge.ops` TXT orphans in the zone — a value-comparison bug in `libdns/gandi` v1.1.0 makes the cleanup phase a no-op. Run `mise run dns-acme-cleanup` periodically (alongside PAT rotation works well). ## Related -- [[gandi-operations]] - PAT cycling and deployment how-to -- [[routing]] - Service URLs and routing architecture -- [[caddy]] - Reverse proxy using Gandi for TLS -- [[tailscale]] - Tailnet networking -- [[indri]] - Server hosting Caddy (DNS target) +- [[manage-eblu-me-dns]] — Add/change DNS records via Pulumi +- [[rotate-gandi-pat]] — Rotate the shared Gandi PAT +- [[routing]] — Service URLs and routing architecture +- [[caddy]] — Reverse proxy using this PAT for TLS +- [[tailscale]] — Tailnet networking +- [[indri]] — Server hosting Caddy (DNS target) diff --git a/docs/reference/tools/mise-tasks.md b/docs/reference/tools/mise-tasks.md index fefb30f..4ec3438 100644 --- a/docs/reference/tools/mise-tasks.md +++ b/docs/reference/tools/mise-tasks.md @@ -39,6 +39,7 @@ Run `mise tasks --sort name` for the live list with descriptions. 
| `fly-shutoff` | Emergency shutoff: stop all Fly.io proxy machines | | `dns-preview` | Preview DNS changes with [[pulumi]] | | `dns-up` | Apply DNS changes with [[pulumi]] | +| `dns-acme-cleanup` | Delete orphaned `_acme-challenge.ops` TXT records (libdns/gandi v1.1.0 workaround) | | `tailnet-preview` | Preview Tailscale ACL changes with [[pulumi]] | | `tailnet-up` | Apply Tailscale ACL changes with [[pulumi]] | diff --git a/docs/reference/tools/pulumi.md b/docs/reference/tools/pulumi.md index bdc7e8f..a716bb9 100644 --- a/docs/reference/tools/pulumi.md +++ b/docs/reference/tools/pulumi.md @@ -49,7 +49,8 @@ mise run tailnet-up # Apply ACL/tag changes ## Related -- [[gandi-operations]] — DNS PAT rotation and Pulumi workflow +- [[manage-eblu-me-dns]] — DNS records workflow +- [[rotate-gandi-pat]] — Rotate the Gandi PAT - [[update-tailscale-acls]] — ACL editing and Pulumi workflow - [[gandi]] — DNS hosting - [[tailscale]] — Tailnet configuration diff --git a/mise-tasks/dns-acme-cleanup b/mise-tasks/dns-acme-cleanup new file mode 100755 index 0000000..5152ae2 --- /dev/null +++ b/mise-tasks/dns-acme-cleanup @@ -0,0 +1,112 @@ +#!/usr/bin/env -S uv run --script +# /// script +# requires-python = ">=3.12" +# dependencies = ["httpx>=0.28.1", "rich>=14.0.0", "typer>=0.24.0"] +# /// +#MISE description="Delete orphaned ACME challenge TXT records in eblu.me" +#USAGE flag "--dry-run" help="List orphans without deleting" +"""Clean up orphaned _acme-challenge TXT records in the eblu.me zone. + +Workaround for libdns/gandi v1.1.0: its DeleteRecords compares unquoted +certmagic values to Gandi-quoted stored values, so cleanup is a silent +no-op. Without this script, the rrset grows by ~2 values per successful +Caddy renewal cycle. + +In healthy steady state these records should be absent. Run alongside +PAT rotation, or any time after Caddy ACME activity. 
+""" + +import os +import subprocess +from typing import Annotated + +import httpx +import typer +from rich.console import Console +from rich.table import Table + +DOMAIN = "eblu.me" +RRSET = "_acme-challenge.ops" +GANDI_API = "https://api.gandi.net/v5/livedns" +OP_PAT_REF = "op://blumeops/gandi - blumeops/pat" + + +def resolve_token(console: Console) -> str: + env_token = os.environ.get("GANDI_PERSONAL_ACCESS_TOKEN", "").strip() + if env_token: + return env_token + console.print("[dim]Reading Gandi PAT from 1Password...[/dim]") + try: + result = subprocess.run( + ["op", "read", OP_PAT_REF], + capture_output=True, + text=True, + check=True, + ) + return result.stdout.strip() + except (subprocess.CalledProcessError, FileNotFoundError) as e: + console.print(f"[red]Failed to read PAT from 1Password:[/red] {e}") + raise typer.Exit(1) + + +app = typer.Typer(add_completion=False) + + +@app.command() +def main( + dry_run: Annotated[ + bool, + typer.Option("--dry-run", help="List orphans without deleting"), + ] = False, +) -> None: + """Delete orphan _acme-challenge TXT records in eblu.me.""" + console = Console() + token = resolve_token(console) + + url = f"{GANDI_API}/domains/{DOMAIN}/records/{RRSET}/TXT" + headers = {"Authorization": f"Bearer {token}"} + + with httpx.Client(timeout=15, headers=headers) as client: + resp = client.get(url) + if resp.status_code == 404: + console.print( + f"[green]Clean — {RRSET}.{DOMAIN} TXT rrset is absent.[/green]" + ) + raise typer.Exit(0) + resp.raise_for_status() + values = resp.json().get("rrset_values", []) + + if not values: + console.print( + f"[green]Clean — {RRSET}.{DOMAIN} TXT rrset is empty.[/green]" + ) + raise typer.Exit(0) + + table = Table(title=f"Orphan ACME challenge values: {RRSET}.{DOMAIN}") + table.add_column("#", justify="right") + table.add_column("Value") + for i, v in enumerate(values, 1): + table.add_row(str(i), v) + console.print(table) + console.print(f"\n[bold]{len(values)}[/bold] orphan(s).") + + if dry_run: 
+ console.print("\n[dim]Dry run — no records deleted.[/dim]") + raise typer.Exit(0) + + del_resp = client.delete(url) + if del_resp.status_code == 204: + console.print( + f"[green]Deleted {RRSET}.{DOMAIN} TXT " + f"({len(values)} values).[/green]" + ) + else: + console.print( + f"[red]Delete failed: HTTP {del_resp.status_code}[/red]\n" + f"{del_resp.text[:300]}" + ) + raise typer.Exit(1) + + +if __name__ == "__main__": + app() diff --git a/pulumi/gandi/README.md b/pulumi/gandi/README.md index 9d7b7aa..70d2821 100644 --- a/pulumi/gandi/README.md +++ b/pulumi/gandi/README.md @@ -27,50 +27,19 @@ pulumi stack select eblu-me # or: pulumi stack init eblu-me ## Authentication -This project requires a Gandi Personal Access Token (PAT) with LiveDNS permissions. +This project uses a Gandi Personal Access Token (PAT) shared with Caddy. See the [Gandi reference card](../../docs/reference/infrastructure/gandi.md) and [Rotate the Gandi PAT](../../docs/how-to/configuration/rotate-gandi-pat.md). -**The PAT expires every 30 days and must be cycled manually.** - -### Cycling the PAT - -1. Go to [Gandi PAT Management](https://admin.gandi.net/organizations/1db8d76a-f729-11ed-b8d1-00163e94b645/account/pat) - -2. Create a new PAT: - - Name: `blumeops-pulumi` (or similar) - - Expiration: 30 days (maximum is 90; shorter is fine if used rarely) - - Permissions required: - - **Manage domain name technical configurations** (required for DNS records) - - See and renew domain names - - Optional permissions (enabled but not strictly required): - - See & download SSL certificates - - Manage Cloud resources - - See Cloud resources - - View Organization - - Deploy Web Hosting instances - - Manage Web Hosting instances - - See and renew Web Hosting instances - -3. Update 1Password: - ```bash - # Update the existing item with the new PAT value - op item edit mco6ka3dc3rmw7zkg2dhia5d2m pat="" --vault vg6xf6vvfmoh5hqjjhlhbeoaie - ``` - -4. 
Delete the old PAT from Gandi admin console - -### Running with Authentication - -The mise task handles fetching the PAT from 1Password: +The mise tasks handle fetching the PAT from 1Password: ```bash -mise run dns-up # Preview and apply changes mise run dns-preview # Preview only +mise run dns-up # Preview and apply ``` Or manually: ```bash -export GANDI_PERSONAL_ACCESS_TOKEN=$(op read "op://vg6xf6vvfmoh5hqjjhlhbeoaie/mco6ka3dc3rmw7zkg2dhia5d2m/pat") +export GANDI_PERSONAL_ACCESS_TOKEN=$(op read "op://blumeops/gandi - blumeops/pat") pulumi up ``` diff --git a/pulumi/gandi/__main__.py b/pulumi/gandi/__main__.py index e448ed2..bda7a8a 100644 --- a/pulumi/gandi/__main__.py +++ b/pulumi/gandi/__main__.py @@ -8,7 +8,7 @@ This program manages DNS records for blumeops infrastructure: Authentication: Set GANDI_PERSONAL_ACCESS_TOKEN environment variable. - See docs/how-to/gandi-operations.md for PAT management instructions. + See docs/how-to/configuration/rotate-gandi-pat.md for PAT management. """ import os
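
Note on the `dns-acme-cleanup` docstring: it attributes the orphaned TXT values to a comparison between unquoted certmagic values and Gandi-quoted stored values. The sketch below illustrates that kind of mismatch only. It is not the `libdns/gandi` code (which is Go), and it assumes the stored values carry one pair of surrounding double quotes. The function names and sample values are hypothetical.

```python
def strip_txt_quotes(value: str) -> str:
    """Drop one pair of surrounding double quotes, if present."""
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        return value[1:-1]
    return value


def naive_match(stored: list[str], wanted: str) -> list[str]:
    # Direct string comparison: a quoted stored value never equals an
    # unquoted wanted value, so nothing is selected for deletion.
    return [v for v in stored if v == wanted]


def normalized_match(stored: list[str], wanted: str) -> list[str]:
    # Compare after stripping quotes on both sides.
    return [v for v in stored if strip_txt_quotes(v) == strip_txt_quotes(wanted)]


# Hypothetical sample data: quoted values as stored, unquoted value as presented.
stored = ['"token-aaa"', '"token-bbb"']
wanted = "token-aaa"

print(naive_match(stored, wanted))       # []
print(normalized_match(stored, wanted))  # ['"token-aaa"']
```

The task in this patch sidesteps value matching altogether by deleting the whole `_acme-challenge.ops` TXT rrset, which is safe only because the rrset should be absent in healthy steady state.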