blumeops/plans/completed/k8s-migration/P3_postgresql.complete.md

Phase 3: PostgreSQL Disaster Recovery & Backup

Goal: Test disaster recovery and configure borgmatic backups for k8s-pg

Status: Complete (2026-01-19)

Prerequisites: Phase 2 complete


Overview

Phase 3 establishes disaster recovery capabilities for the k8s PostgreSQL cluster:

  1. Fix borgmatic backup issues - Resolve borg: command not found error
  2. Test disaster recovery - Restore miniflux data from borgmatic backup to k8s-pg
  3. Create borgmatic user - Read-only backup user in k8s-pg via CloudNativePG
  4. Configure dual database backup - Back up both brew PostgreSQL and k8s-pg during migration

This phase prepares for Phase 4 (miniflux migration) by verifying we can restore data to k8s-pg.


Key Decisions

Backup Both Databases During Transition

Decision: Configure borgmatic to back up both localhost:5432/miniflux (brew) and k8s-pg.tail8d86e.ts.net:5432/miniflux (k8s) until the migration is complete.

Why: Provides redundancy during migration. After Phase 4, remove the localhost entry.

Reuse Existing borgmatic Password

Decision: Use the same borgmatic password from 1Password for the k8s-pg user.

Why: Simpler credential management; the existing password is already considered secure.

CloudNativePG Managed Roles

Decision: Declare borgmatic user via CloudNativePG managed.roles instead of SQL commands.

Why: Declarative, version-controlled, matches eblume user pattern.

Disable selfHeal on apps App

Decision: Remove selfHeal: true from argocd/apps/apps.yaml.

Why: Allows temporarily pointing child apps to feature branches during development without ArgoCD reverting the change.


Steps

1. Fix borgmatic borg path issue

Problem: borgmatic was failing with borg: command not found

Cause: The LaunchAgent's PATH does not include Homebrew's bin directory, so the borg binary is not found.

Solution: Add local_path to borgmatic config template.

File: ansible/roles/borgmatic/templates/config.yaml.j2

# Path to borg binary (LaunchAgent doesn't have homebrew in PATH)
local_path: {{ borgmatic_local_path }}

File: ansible/roles/borgmatic/defaults/main.yml

borgmatic_local_path: /opt/homebrew/bin/borg

2. Run manual backup to verify fix

mise run provision-indri -- --tags borgmatic
ssh indri '/opt/homebrew/bin/borgmatic --verbosity 1'

3. Extract miniflux dump from borgmatic

ssh indri 'borgmatic list --archive latest'
ssh indri 'borgmatic restore --archive latest --destination /tmp/restore'

4. Add ACL grant for homelab → k8s

Problem: The connection from indri to k8s-pg was blocked; the Tailscale proxy logs showed "no rules matched".

Solution: Add ACL grant in Pulumi.

File: pulumi/policy.hujson

// Homelab can reach k8s PostgreSQL for borgmatic backups
{
  "src": ["tag:homelab"],
  "dst": ["tag:k8s"],
  "ip":  ["tcp:5432"],
},

Deploy: mise run tailnet-up


5. Restore data to k8s-pg

# Using eblume superuser credentials from 1Password
ssh indri "psql 'postgres://eblume@k8s-pg.tail8d86e.ts.net:5432/miniflux' -f /tmp/restore/localhost/miniflux/miniflux"

Verification:

psql 'postgres://eblume@k8s-pg.tail8d86e.ts.net:5432/miniflux' -c 'SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM feeds; SELECT COUNT(*) FROM entries;'
# Result: 2 users, 2 feeds, 44 entries
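The same count check can be run against two databases at once to compare them table by table. This is a minimal sketch, not the procedure used in this phase: compare_counts and the PSQL override are hypothetical names, and it assumes both connection URIs are reachable with credentials supplied via .pgpass or the environment.

```shell
# Hypothetical helper: compare per-table row counts between two databases.
# Usage: compare_counts SOURCE_URI TARGET_URI TABLE...
# PSQL may be overridden (for example with a wrapper); it defaults to psql.
compare_counts() {
    src="$1"; dst="$2"; shift 2
    rc=0
    for table in "$@"; do
        # -A: unaligned output, -t: tuples only, so only the number is printed
        a=$(${PSQL:-psql} -At -c "SELECT COUNT(*) FROM $table" "$src") || return 2
        b=$(${PSQL:-psql} -At -c "SELECT COUNT(*) FROM $table" "$dst") || return 2
        if [ "$a" = "$b" ]; then
            echo "ok   $table: $a"
        else
            echo "DIFF $table: source=$a target=$b"
            rc=1
        fi
    done
    return $rc
}

# Example (the source URI is an assumption, not taken from this plan):
# compare_counts 'postgres://eblume@localhost:5432/miniflux' \
#     'postgres://eblume@k8s-pg.tail8d86e.ts.net:5432/miniflux' users feeds entries
```

Because the restore came from a backup, a live source database may legitimately have more entries than the target; a DIFF line is a prompt to investigate rather than proof of a failed restore.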

6. Create borgmatic user in k8s-pg via CloudNativePG

File: argocd/manifests/databases/secret-borgmatic.yaml.tpl

# Template for borgmatic backup user password
# Apply with: op inject -i secret-borgmatic.yaml.tpl | kubectl apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: blumeops-pg-borgmatic
  namespace: databases
type: kubernetes.io/basic-auth
stringData:
  username: borgmatic
  password: {{ op://vg6xf6vvfmoh5hqjjhlhbeoaie/mw2bv5we7woicjza7hc6s44yvy/db-password }}

File: argocd/manifests/databases/blumeops-pg.yaml (add to managed roles)

managed:
  roles:
    # ... existing eblume role ...
    # borgmatic read-only user for backups
    - name: borgmatic
      login: true
      connectionLimit: -1
      ensure: present
      inherit: true
      inRoles:
        - pg_read_all_data
      passwordSecret:
        name: blumeops-pg-borgmatic

Deploy:

op inject -i argocd/manifests/databases/secret-borgmatic.yaml.tpl | kubectl apply -f -
argocd app set blumeops-pg --revision feature/p3-postgresql-borgmatic
argocd app sync blumeops-pg
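Once the sync completes, the role's membership in pg_read_all_data can be confirmed from SQL. A minimal sketch, assuming a superuser connection URI such as the eblume one used earlier; role_is_member and the PSQL override are hypothetical names, while pg_has_role is a built-in PostgreSQL function.

```shell
# Hypothetical helper: succeed if ROLE is a member of GROUP_ROLE.
# Usage: role_is_member CONNECTION_URI ROLE GROUP_ROLE
# Returns 0 if member, 1 if not, 2 if the query itself failed
# (for example because the role does not exist yet).
role_is_member() {
    # With -At, psql prints a bare "t" or "f" for a boolean result.
    result=$(${PSQL:-psql} -At -c "SELECT pg_has_role('$2', '$3', 'MEMBER')" "$1") || return 2
    [ "$result" = "t" ]
}

# Example:
# role_is_member 'postgres://eblume@k8s-pg.tail8d86e.ts.net:5432/miniflux' \
#     borgmatic pg_read_all_data && echo "borgmatic can read all data"
```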

7. Configure borgmatic for dual database backup

File: ansible/roles/borgmatic/defaults/main.yml

borgmatic_postgresql_databases:
  # Brew PostgreSQL on indri (current production)
  - name: miniflux
    hostname: localhost
    port: 5432
    username: borgmatic
  # k8s PostgreSQL (CloudNativePG) - backup both during migration
  - name: miniflux
    hostname: k8s-pg.tail8d86e.ts.net
    port: 5432
    username: borgmatic

File: ansible/roles/postgresql/tasks/main.yml (update .pgpass)

- name: Write .pgpass file for borgmatic backups
  ansible.builtin.copy:
    content: |
      # Managed by ansible - only read-only roles
      localhost:{{ postgresql_port }}:*:borgmatic:{{ postgresql_user_passwords['borgmatic'] }}
      k8s-pg.tail8d86e.ts.net:5432:*:borgmatic:{{ postgresql_user_passwords['borgmatic'] }}
    dest: ~/.pgpass
    mode: '0600'
  no_log: true
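After provisioning, it can be useful to confirm that .pgpass really contains an entry for the new host. Each line has the form hostname:port:database:username:password, and libpq ignores the file if its permissions are looser than 0600. The helper below is a sketch with a hypothetical name (pgpass_has_entry); it ignores the database field and does not handle escaped colons.

```shell
# Hypothetical helper: succeed if FILE has a .pgpass entry matching
# HOST, PORT and USER ("*" in the file acts as a wildcard).
# Usage: pgpass_has_entry FILE HOST PORT USER
pgpass_has_entry() {
    awk -F: -v h="$2" -v p="$3" -v u="$4" '
        /^#/ { next }                      # skip comment lines
        ($1 == h || $1 == "*") &&
        ($2 == p || $2 == "*") &&
        ($4 == u || $4 == "*") { found = 1 }
        END { exit !found }
    ' "$1"
}

# Example:
# pgpass_has_entry ~/.pgpass k8s-pg.tail8d86e.ts.net 5432 borgmatic \
#     || echo "no .pgpass entry for k8s-pg"
```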

8. Verify complete backup pipeline

mise run provision-indri -- --tags borgmatic,postgresql
ssh indri '/opt/homebrew/bin/borgmatic --verbosity 1'
ssh indri 'borgmatic list --archive latest'

Expected output: Archive contains both dumps:

  • localhost/miniflux/miniflux
  • k8s-pg.tail8d86e.ts.net/miniflux/miniflux
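The check for both dumps can be scripted rather than read by eye. A minimal sketch, assuming the archive listing contains the dump paths as plain substrings; missing_dumps is a hypothetical helper name.

```shell
# Hypothetical helper: read an archive listing on stdin and report every
# expected dump path (given as arguments) that does not appear in it.
# Returns 0 if all paths are present, 1 otherwise.
missing_dumps() {
    listing=$(cat)
    rc=0
    for expected in "$@"; do
        # -F: match the path as a fixed string, not a regular expression
        if ! printf '%s\n' "$listing" | grep -qF -- "$expected"; then
            echo "missing: $expected"
            rc=1
        fi
    done
    return $rc
}

# Example:
# ssh indri 'borgmatic list --archive latest' | missing_dumps \
#     localhost/miniflux/miniflux k8s-pg.tail8d86e.ts.net/miniflux/miniflux
```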

9. Fix ArgoCD drift from CNPG defaults

Problem: ArgoCD showed blumeops-pg as OutOfSync due to CNPG operator adding default values.

Solution: Add CNPG defaults explicitly to managed roles.

File: argocd/manifests/databases/blumeops-pg.yaml

managed:
  roles:
    - name: eblume
      # ... existing fields ...
      connectionLimit: -1
      ensure: present
      inherit: true
    - name: borgmatic
      # ... existing fields ...
      connectionLimit: -1
      ensure: present
      inherit: true

10. Update zk documentation

Updated:

  • ~/code/personal/zk/borgmatic.md - k8s-pg backup documentation and log entry
  • ~/code/personal/zk/postgresql.md - k8s PostgreSQL section and log entry

New Files

| Path | Purpose |
|------|---------|
| argocd/manifests/databases/secret-borgmatic.yaml.tpl | borgmatic user password template |

Modified Files

| Path | Change |
|------|--------|
| ansible/roles/borgmatic/defaults/main.yml | Added borgmatic_local_path, k8s-pg database entry |
| ansible/roles/borgmatic/templates/config.yaml.j2 | Added local_path option |
| ansible/roles/postgresql/tasks/main.yml | Added k8s-pg to .pgpass |
| argocd/apps/apps.yaml | Disabled selfHeal |
| argocd/manifests/databases/blumeops-pg.yaml | Added borgmatic managed role, CNPG defaults |
| pulumi/policy.hujson | Added ACL grant homelab → k8s on tcp:5432 |

Verification

  • borgmatic backup runs successfully
  • Miniflux data restored to k8s-pg (2 users, 2 feeds, 44 entries)
  • borgmatic user created in k8s-pg with pg_read_all_data role
  • Both localhost and k8s-pg databases in backup archive
  • ArgoCD shows blumeops-pg as Synced
  • zk documentation updated

Rollback

Keep brew PostgreSQL running until Phase 4 is verified. To revert:

  1. Remove k8s-pg entry from borgmatic databases
  2. Remove k8s-pg from .pgpass
  3. mise run provision-indri -- --tags borgmatic,postgresql

Implementation Notes

Added during implementation for retrospective review

borgmatic LaunchAgent PATH Issue

Problem: borgmatic LaunchAgent failed with borg: command not found

Root cause: LaunchAgents run with minimal PATH that doesn't include /opt/homebrew/bin

Solution: Added local_path: /opt/homebrew/bin/borg to borgmatic config. This was already done for pg_dump_command but not for borg itself.

Lesson: Any tool invoked by borgmatic needs absolute path when running from LaunchAgent.
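The failure can be reproduced outside launchd by resolving a command against a restricted PATH. This is a sketch under an assumption: the LaunchAgent's actual PATH was not recorded here, so /usr/bin:/bin:/usr/sbin:/sbin is used only as a plausible stand-in. resolve_with_path is a hypothetical helper name.

```shell
# Hypothetical helper: resolve COMMAND while searching only PATH_VALUE.
# Prints the resolved location and returns 0, or returns non-zero if the
# command cannot be found on that PATH.
# Usage: resolve_with_path PATH_VALUE COMMAND
resolve_with_path() {
    # The subshell keeps the caller's PATH untouched.
    ( PATH="$1"; command -v "$2" )
}

# Example (the PATH values are assumptions):
# LAUNCHD_PATH="/usr/bin:/bin:/usr/sbin:/sbin"
# resolve_with_path "$LAUNCHD_PATH" borg || echo "borg not found on restricted PATH"
# resolve_with_path "/opt/homebrew/bin:$LAUNCHD_PATH" borg
```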

1Password Field Name Mismatch

Issue: The initial secret template referenced a password field, but the field in the 1Password item is named db-password.

Discovery: Error message from op inject indicated field not found.

Fix: Updated template to use correct field name db-password.

ACL Grant Discovery

Problem: Connection from indri (tag:homelab) to k8s-pg (tag:k8s) failed.

Diagnosis: The Tailscale operator proxy logs showed "no rules matched", a clear indication of a missing ACL.

Solution: Added explicit grant in pulumi/policy.hujson for tag:homelab → tag:k8s on tcp:5432.
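A quick reachability probe from indri helps tell a blocked network path from a database-side problem. The sketch below wraps pg_isready, which ships with the PostgreSQL client tools and exits 0 when the server accepts connections, 1 when it rejects them, 2 when there is no response, and 3 when no attempt was made. pg_reachability and the PG_ISREADY override are hypothetical names.

```shell
# Hypothetical helper: describe whether HOST:PORT answers as a PostgreSQL server.
# Usage: pg_reachability HOST [PORT]
pg_reachability() {
    rc=0
    # -t 5: give up after five seconds
    ${PG_ISREADY:-pg_isready} -h "$1" -p "${2:-5432}" -t 5 >/dev/null 2>&1 || rc=$?
    case $rc in
        0) echo "accepting connections" ;;
        1) echo "reachable, but rejecting connections" ;;
        2) echo "no response (network path or ACL may be blocking)" ;;
        *) echo "check not attempted (bad parameters or pg_isready missing)" ;;
    esac
}

# Example:
# ssh indri 'pg_isready -h k8s-pg.tail8d86e.ts.net -p 5432 -t 5'
```

A "no response" result does not identify the cause by itself; the proxy logs remain the authoritative source for ACL decisions.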

ArgoCD selfHeal and Feature Branch Development

Problem: While testing changes, the blumeops-pg app was temporarily pointed at the feature branch via argocd app set --revision, but ArgoCD's selfHeal kept reverting it to main.

Discussion: Two options considered:

  • Option A: Disable selfHeal on apps app (manual sync required for new apps)
  • Option B: Keep selfHeal, use different workflow

Decision: Option A chosen. The apps app now only has prune: true, not selfHeal. This allows:

  1. Temporarily testing feature branches
  2. Manual control over when app manifest changes are applied

Trade-off: Must manually sync apps app when adding/removing Application manifests.
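Since the feature-branch workflow depends on selfHeal staying off, a small guard can catch it being re-enabled by accident. A minimal sketch, assuming apps.yaml sets selfHeal as a plain YAML key on its own line; selfheal_enabled is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the manifest enables selfHeal.
# Usage: selfheal_enabled FILE
selfheal_enabled() {
    # Matches lines such as "      selfHeal: true", ignoring indentation.
    grep -Eq '^[[:space:]]*selfHeal:[[:space:]]*true' "$1"
}

# Example:
# if selfheal_enabled argocd/apps/apps.yaml; then
#     echo "selfHeal is enabled; feature-branch revisions will be reverted"
# fi
```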

CloudNativePG Managed Role Reconciliation

Issue: After creating borgmatic secret with correct password, CNPG didn't immediately update the user.

Solution: Annotated the Cluster to trigger reconciliation:

kubectl annotate cluster blumeops-pg -n databases cnpg.io/reconcile=$(date +%s) --overwrite
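To see whether the operator has picked up the role after the annotation, the Cluster status can be inspected. This sketch assumes the installed CNPG version reports managed role state under status.managedRolesStatus; if that field is absent, the output will simply be empty. managed_roles_status and the KUBECTL override are hypothetical names.

```shell
# Hypothetical helper: print the managed-roles section of a CNPG Cluster status.
# Usage: managed_roles_status CLUSTER NAMESPACE
managed_roles_status() {
    ${KUBECTL:-kubectl} get cluster "$1" -n "$2" \
        -o jsonpath='{.status.managedRolesStatus}'
}

# Example:
# managed_roles_status blumeops-pg databases
```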

ArgoCD Drift from CNPG Defaults

Problem: blumeops-pg showed OutOfSync despite successful syncs.

Cause: The CNPG operator adds default values (connectionLimit: -1, ensure: present, inherit: true) to managed roles when those fields are absent from our spec, so the live object no longer matches the manifest in git.

Solution: Added these defaults explicitly to our spec to match what CNPG generates.

Comment added: Documented in blumeops-pg.yaml that these are "CNPG defaults added to prevent ArgoCD drift".
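Drift of this kind can be detected from the command line with argocd app diff, which exits 0 when there is no diff, 1 when a diff is found, and 2 on general errors. The wrapper below is a sketch; app_drift and the ARGOCD override are hypothetical names.

```shell
# Hypothetical helper: summarize whether an ArgoCD app differs from git.
# Returns the exit code of "argocd app diff".
# Usage: app_drift APP_NAME
app_drift() {
    rc=0
    ${ARGOCD:-argocd} app diff "$1" >/dev/null 2>&1 || rc=$?
    case $rc in
        0) echo "in sync" ;;
        1) echo "drift detected" ;;
        *) echo "error running argocd app diff (exit $rc)" ;;
    esac
    return $rc
}

# Example:
# app_drift blumeops-pg || argocd app diff blumeops-pg
```

Running the plain diff after a "drift detected" result shows which fields the operator filled in, which is how defaults such as connectionLimit can be identified and copied into the spec.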

Git Workflow for Phase 3

  1. Created feature branch: feature/p3-postgresql-borgmatic
  2. Made commits throughout implementation
  3. Pointed blumeops-pg app to feature branch for testing
  4. Created PR #32 for review
  5. After merge, reset app to main: argocd app set blumeops-pg --revision main

This workflow was enabled by disabling selfHeal (see above).