blumeops

Author	SHA1	Message	Date
Erich Blume	377efeb22c	Fix XID Age graph to show threshold context - Add fixed Y-axis (0-220M) so the autovacuum threshold is always visible - Add dashed threshold lines at 150M (yellow) and 200M (red) - Update title to mention the 200M threshold The raw XID age will always trend upward between vacuum freezes, which looked alarming without context. Now the graph shows how far the value is from the autovacuum_freeze_max_age threshold. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-29 07:05:44 -08:00
Erich Blume	0604877db2	Add 'Tesla' prefix to all TeslaMate dashboard titles (#68) ## Summary - Renamed all 18 TeslaMate Grafana dashboards to include "Tesla" prefix - Improves organization and discoverability in the dashboard list ## Deployment and Testing - [ ] Sync grafana-config app: `argocd app set grafana-config --revision feature/rename-tesla-dashboards && argocd app sync grafana-config` - [ ] Verify dashboards display with "Tesla" prefix in Grafana 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/68	2026-01-29 06:55:44 -08:00
Erich Blume	46081f5f10	Update rule 10: also require permission to push to main Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 20:45:44 -08:00
Erich Blume	55522579dc	Add rule: never merge PRs without explicit user request Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 20:43:51 -08:00
Erich Blume	a93f2a77e1	Merge pull request 'Migrate remaining secrets to ExternalSecrets' (#67) from feature/migrate-remaining-secrets into main	2026-01-28 20:41:45 -08:00
Erich Blume	8f4660915d	Fix argocd SSH key format for 1Password Connect 1Password Connect doesn't support ?ssh-format=openssh, so we need a separate Secure Note item with the OpenSSH-formatted key. Created new 1Password item: argocd-forge-ssh-key Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 20:31:16 -08:00
Erich Blume	9114aac8f6	Switch all ExternalSecrets to creationPolicy: Owner ESO now has full ownership of these secrets. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 20:27:16 -08:00
Erich Blume	dd6cf20d51	Remove obsolete secret templates - Delete 13 .yaml.tpl files replaced by ExternalSecrets - Update immich/README.md with direct CNPG secret copy instructions - Update miniflux/README.md with context flag and ESO note Only 1password-connect/secret-credentials.yaml.tpl remains (bootstrap). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 20:26:37 -08:00
Erich Blume	351528474c	Add ExternalSecrets for remaining k8s secrets Migrate 10 secret templates to ESO ExternalSecrets with 1Password Connect: - databases: eblume, borgmatic, teslamate passwords - tailscale-operator: OAuth client credentials - grafana-config: admin password, teslamate datasource - teslamate: db password, encryption key - forgejo-runner: runner registration token - argocd: forge SSH credentials All use creationPolicy: Merge for safe migration from existing secrets. Skipped: - miniflux/secret-db: Uses CNPG secret, not 1Password directly - immich/secret-db: Requires 1Password item creation first - 1password-connect: Bootstrap secret, must stay as template Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-28 19:50:38 -08:00
Erich Blume	482414346e	Add External Secrets Operator with 1Password Connect (#66) ## Summary - Add 1Password Connect server for secrets automation API - Add External Secrets Operator (ESO) to sync secrets from 1Password to K8s - Add ClusterSecretStore connecting ESO to 1Password Connect - Convert devpi secret to ExternalSecret as proof of concept ## Architecture ``` 1Password Cloud → 1Password Connect (k8s) → ESO → Native K8s Secrets ``` ## Deployment and Testing - [ ] Mirror Helm charts to forge (connect-helm-charts, external-secrets) - DONE - [ ] Create 1Password Connect credentials (`op connect server create`) - [ ] Store credentials in 1Password item "1Password Connect" - [ ] Bootstrap secret: `op inject -i argocd/manifests/1password-connect/secret-credentials.yaml.tpl | kubectl apply -f -` - [ ] Deploy 1password-connect: `argocd app sync 1password-connect` - [ ] Deploy external-secrets: `argocd app sync external-secrets` - [ ] Deploy external-secrets-config: `argocd app sync external-secrets-config` - [ ] Test devpi ExternalSecret: `argocd app sync devpi` - [ ] Verify secret synced: `kubectl get externalsecret -n devpi` ## Future Work After PoC validated, migrate remaining 12 secret templates to ExternalSecrets: - databases (3), tailscale-operator (1), grafana-config (2), teslamate (2) - forgejo-runner (1), argocd (1), immich (1), 1password-connect (1 - self-bootstrap) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/66	2026-01-28 19:30:10 -08:00
Erich Blume	3971670832	Remove immich-sync ansible role (#65) ## Summary - Remove immich_sync ansible role (server-side photo sync via osxphotos) - The Immich iOS app has built-in automatic backup that replaces this functionality - iOS app supports foreground/background backup and can sync iCloud photos directly ## Deployment and Testing - [ ] Clean up files on indri (see manual cleanup commands below) - [ ] Configure Immich iOS app for automatic backup ### Manual cleanup on indri: ```bash # Unload and remove LaunchAgent launchctl unload ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist rm ~/Library/LaunchAgents/mcquack.eblume.immich-sync.plist # Remove script and credentials rm ~/bin/immich-sync.sh rm ~/.immich-api-key # Remove logs rm ~/Library/Logs/mcquack.immich-sync.*.log # Optionally remove export directory (check if empty first) ls ~/Pictures/immich-export # rm -r ~/Pictures/immich-export ``` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/65	2026-01-28 08:49:22 -08:00
Erich Blume	aa4464f84e	Upgrade Immich from v2.4.1 to v2.5.0 (#64) ## Summary - Upgrades Immich image tag from v2.4.1 to v2.5.0 ## Deployment and Testing - [ ] Point immich ArgoCD app at feature branch and sync - [ ] Verify pods come up healthy - [ ] Verify Immich web UI accessible - [ ] Reset to main and sync after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/64	2026-01-27 20:51:09 -08:00
Erich Blume	54945be0e3	Add immich-sync ansible role for photo library sync (#63) ## Summary - Add `immich_sync` role that syncs macOS Photos Library to Immich - Uses osxphotos to export photos with metadata to staging directory - Uses immich-cli (via Docker) to upload to Immich server - LaunchAgent schedules hourly syncs following mcquack pattern - API key fetched from 1Password in playbook pre_tasks ## Architecture ``` Photos Library → osxphotos export → ~/Pictures/immich-export/ → immich-cli upload → Immich ``` ## Prerequisites (manual) - Install osxphotos on indri: Add `"pipx:osxphotos" = "latest"` to `~/.config/mise/config.toml`, run `mise install` - Docker is already installed on indri ## Deployment and Testing - [ ] Dry run: `mise run provision-indri -- --tags immich_sync --check --diff` - [ ] Deploy: `mise run provision-indri -- --tags immich_sync` - [ ] Verify LaunchAgent: `ssh indri 'launchctl list | grep immich'` - [ ] Test manual sync: `ssh indri '~/bin/immich-sync.sh'` - [ ] Check logs: `ssh indri 'tail -50 ~/Library/Logs/mcquack.immich-sync.out.log'` - [ ] Verify photos in Immich at https://photos.ops.eblu.me 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/63	2026-01-26 12:38:38 -08:00
Erich Blume	8621996343	Add Immich photo management + migrate forge URLs (#62) ## Summary - Migrate all ArgoCD app repo URLs from `indri.tail8d86e.ts.net:2200` to `forge.ops.eblu.me:2222` - Add Immich self-hosted photo management service with: - Helm chart deployment via ArgoCD - PostgreSQL cluster with pgvecto.rs for AI vector search (immich-pg) - NFS storage on sifaka for photo library (2Ti) - Tailscale Ingress + Caddy proxy for `photos.ops.eblu.me` - Machine learning service for face/object recognition ## Deployment and Testing - [x] Update ArgoCD repo-creds-forge secret with new URL (one-time manual step) - [ ] Sync `apps` to pick up new applications - [ ] Sync all existing apps to verify new forge URL works - [ ] Sync `blumeops-pg` to deploy immich-pg cluster - [ ] Wait for immich-pg to be healthy - [ ] Create immich-db secret from auto-generated password - [ ] Sync `immich-storage` (PV, PVC, Ingress) - [ ] Sync `immich` (Helm chart) - [ ] Run `mise run provision-indri -- --tags caddy` to add photos.ops.eblu.me - [ ] Verify Immich UI is accessible 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/62	2026-01-26 11:20:11 -08:00
Erich Blume	c8b655f177	Build local containers for k8s services (#61) ## Summary - Move devpi Dockerfile from argocd/manifests to containers/devpi/ - Add containers for: transmission, teslamate, miniflux, kiwix-serve, kubectl - Update all k8s deployments to use local images (registry.ops.eblu.me/blumeops/*) - All containers use v1.0.0 tag for initial release ## Containers Added - devpi: python:3.12-slim (existing, moved to containers/) - kubectl: alpine + download (for zim-watcher CronJob) - miniflux: Go build from source (v2.2.16) - kiwix-serve: download pre-built binary (v3.8.1) - transmission: alpine + apk install (simpler than linuxserver image) - teslamate: Elixir build from source (v2.2.0) ## Deployment and Testing - [ ] Build and tag devpi-v1.0.0 - [ ] Build and tag kubectl-v1.0.0 - [ ] Build and tag miniflux-v1.0.0 - [ ] Build and tag kiwix-serve-v1.0.0 - [ ] Build and tag transmission-v1.0.0 - [ ] Build and tag teslamate-v1.0.0 - [ ] Sync ArgoCD apps and verify services 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/61	2026-01-25 21:35:57 -08:00
Erich Blume	ea42362b6f	Migrate Forgejo runner to Kubernetes with DinD (#60) ## Summary - Deploy Forgejo runner to k8s with Docker-in-Docker sidecar - Add job execution image with Node.js and Docker CLI - Retire host-mode runner on indri - All CI jobs now run containerized in k8s ## Components Added - `containers/forgejo-runner/Dockerfile` - Job execution image - `argocd/apps/forgejo-runner.yaml` - ArgoCD Application - `argocd/manifests/forgejo-runner/` - Kubernetes manifests ## Components Removed - `ansible/roles/forgejo_runner/` - No longer needed ## Changes to Existing Files - `.forgejo/workflows/build-container.yaml` - Use `k8s` runner with `DOCKER_HOST` env - `.github/actionlint.yaml` - Only `k8s` label now valid ## Deployment 1. Apply secret: `op inject -i argocd/manifests/forgejo-runner/secret.yaml.tpl | kubectl --context=minikube-indri apply -f -` 2. Sync ArgoCD: `argocd app sync forgejo-runner` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/60	2026-01-25 19:56:17 -08:00
Erich Blume	66badfafd1	Migrate k8s services to Caddy (.ops.eblu.me) (#59) ## Summary - Add Caddy reverse proxy routes for all k8s services (grafana, argocd, prometheus, loki, miniflux, devpi, kiwix, torrent, teslamate) - Add PostgreSQL via Caddy L4 TCP proxy on port 5432 - Caddy proxies to existing Tailscale endpoints - traffic stays local on indri - Both `.ops.eblu.me` and `.tail8d86e.ts.net` URLs continue to work ## Updated References - Alloy: prometheus/loki push endpoints → `.ops.eblu.me` - Borgmatic: PostgreSQL backup host → `pg.ops.eblu.me` - Devpi: DEVPI_OUTSIDE_URL → `pypi.ops.eblu.me` - indri-services-check: health check URLs - CLAUDE.md: argocd login command ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags caddy` to deploy new Caddy config - [ ] Test HTTP services: `curl https://grafana.ops.eblu.me/api/health` - [ ] Test PostgreSQL: `pg_isready -h pg.ops.eblu.me -p 5432` - [ ] Run `mise run provision-indri -- --tags alloy` to update Alloy endpoints - [ ] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic - [ ] Sync devpi in ArgoCD: `argocd app sync devpi` - [ ] Re-login to ArgoCD: `argocd login argocd.ops.eblu.me ...` - [ ] Run `mise run indri-services-check` to verify all services 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/59	2026-01-25 12:56:31 -08:00
Erich Blume	d6e6b48f6a	Migrate registry to Caddy (registry.ops.eblu.me) (#58) ## Summary - Update all references from `registry.tail8d86e.ts.net` to `registry.ops.eblu.me` - Remove `tailscale_serve` ansible role (no longer needed - all services migrated to Caddy) - Update minikube containerd config for new registry URL - Update devpi manifest, CI actions, and mise tasks ## Deployment and Testing - [ ] Run `mise run provision-indri -- --check --diff` (dry run) - [ ] Run `mise run provision-indri -- --tags minikube` to update containerd config - [ ] Sync devpi ArgoCD app: `argocd app sync devpi` - [ ] Manually remove old Tailscale serve entry: `ssh indri 'tailscale serve --service=svc:registry off'` - [ ] Test registry access: `curl https://registry.ops.eblu.me/v2/_catalog` - [ ] Run `mise run indri-services-check` to verify all services healthy 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/58	2026-01-25 12:06:15 -08:00
Erich Blume	9c1b7c7ca1	Update docs for Caddy migration (#57) ## Summary - Update CLAUDE.md with new service routing documentation - Document the two DNS domains: `.ops.eblu.me` (Caddy) vs `.tail8d86e.ts.net` (Tailscale) - Fix incorrect service listings (Prometheus/Loki are in k8s, not indri) ## ZK Updates (not in this PR) Also updated the blumeops zk card with: - Source code URL (forge is primary, GitHub is mirror) - Services split into Caddy vs Tailscale sections - Updated port map for Caddy - Updated "Adding a New Service" instructions 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/57	2026-01-25 11:52:35 -08:00
Erich Blume	1184b4de1d	Add Caddy layer4 for Forgejo SSH (#56) ## Summary - Add layer4 TCP proxy configuration to Caddyfile template for SSH services - Configure Forgejo SSH on port 2222 → localhost:2200 - Switch HTTPS from port 8443 (testing) to 443 (production) - Requires Caddy rebuilt with `github.com/mholt/caddy-l4` plugin ## What This Enables `forge.ops.eblu.me:2222` is now reachable over Git+SSH from: - Tailnet clients (gilbert) - Docker containers on indri - Kubernetes pods in minikube This solves the DNS resolution issues where containers couldn't reach Tailscale MagicDNS names. ## Testing Done - [x] Caddy rebuilt with layer4 plugin - [x] Validated Caddyfile syntax - [x] Cleared `svc:forge` from tailscale serve - [x] Verified HTTPS works: `curl https://forge.ops.eblu.me` - [x] Verified SSH works: `ssh -p 2222 forgejo@forge.ops.eblu.me` - [x] Verified git clone works via new endpoint - [x] Verified minikube pods can reach both HTTPS and SSH endpoints ## Deployment Caddy is already running with the new config on indri. This PR captures the ansible changes. ## Next Steps - Update zk docs with new git remote format - Migrate registry and other services to Caddy - Retire tailscale_services ansible role 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/56	2026-01-25 11:37:23 -08:00
Erich Blume	682a68dc9c	Add Caddy reverse proxy for blumeops services (#55) ## Summary - Add Caddy ansible role following zot pattern (manual build, ansible deploy) - Caddy built with Gandi DNS plugin for ACME DNS-01 challenges - Gandi PAT fetched from 1Password and written to secured file on indri - Configure wildcard TLS for `*.ops.eblu.me` - Initial services: forge, registry (indri-local) - Uses port 8443 during testing to avoid Tailscale serve conflicts ## Build Instructions (already done) On indri: ```bash cd ~/code/3rd/caddy && mise run build ``` ## Deployment and Testing - [ ] Review Caddyfile configuration - [ ] Run `mise run provision-indri -- --tags caddy` to deploy - [ ] Test: `curl -v https://forge.ops.eblu.me:8443` (should get TLS cert) - [ ] Test: `curl -v https://registry.ops.eblu.me:8443/v2/` (should return `{}`) - [ ] Once verified, switch to port 443 and migrate services from Tailscale serve ## Files Changed - `ansible/playbooks/indri.yml` - Add pre_task for Gandi PAT, add caddy role - `ansible/roles/caddy/` - New role with Caddyfile and LaunchAgent templates 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/55	2026-01-25 09:35:06 -08:00
Erich Blume	b08faa50cc	Add Gandi DNS management via Pulumi (#54) ## Summary - Restructure Pulumi into separate projects: `pulumi/tailscale/` and `pulumi/gandi/` - Add Gandi LiveDNS management for `eblu.me` domain - Create wildcard DNS record `*.ops.eblu.me` → indri's Tailscale IP (100.98.163.89) - Add mise tasks: `dns-up`, `dns-preview` - Update `tailnet-up` to pass `--yes` by default - Document PAT cycling process (expires every 30 days) ## Background This enables using real DNS names (`.ops.eblu.me`) that resolve to Tailscale IPs, which allows containers and other systems to resolve services without depending on MagicDNS. Since Tailscale IPs (100.x.x.x) are not publicly routable, services remain tailnet-only while using standard DNS. ## Deployment and Testing - [ ] Run `cd pulumi/gandi && uv sync` to install dependencies - [ ] Run `cd pulumi/gandi && pulumi stack init eblu-me` to create stack - [ ] Run `mise run dns-preview` to verify configuration - [ ] Run `mise run dns-up` to apply DNS records - [ ] Verify that `dig +short test.ops.eblu.me` returns `100.98.163.89` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/54	2026-01-25 08:15:46 -08:00
Erich Blume	af3536bc17	Simplify indri IP extraction from tailscale status Use simple grep and awk to parse plain text tailscale status output instead of trying to parse JSON. Also show the status output for debugging. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 20:04:28 -08:00
Erich Blume	04fdd5906d	Get indri's IP from tailscale status for registry access Use 'tailscale status' to get indri's Tailscale IP and add it to /etc/hosts for registry hostname resolution. The registry service runs on indri, so we need indri's IP specifically. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 20:00:57 -08:00
Erich Blume	16ccabbc34	Resolve registry IP via Tailscale and add to /etc/hosts The Tailscale container's DNS doesn't work because it runs in userspace mode. Instead, resolve the registry IP using 'tailscale ip' and add it to /etc/hosts inside the container before running skopeo. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 19:44:39 -08:00
Erich Blume	b4b8abb2d9	Add tag:ci-gateway to Tailscale ACL for CI builds - Add tag:ci-gateway to tagOwners - Grant ci-gateway access to registry on port 443 - Add test for ci-gateway -> registry access Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 19:38:02 -08:00
Erich Blume	9f98b3007e	Add debugging for Tailscale container failures Capture container logs when the Tailscale sidecar exits unexpectedly. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 19:32:40 -08:00
Erich Blume	424647cd93	Use Tailscale sidecar for container registry push Docker Desktop's VM can't resolve tailnet hostnames. Work around this by: 1. Starting a Tailscale container that joins the tailnet 2. Building the image with docker build 3. Saving to tarball with docker save 4. Pushing via skopeo inside the Tailscale container Uses TS_CI_GATEWAY_AUTHKEY repository secret for authentication. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 19:29:01 -08:00
Erich Blume	b598ea8d5a	Add indri-runner-logs script to fetch workflow logs Fetches logs for Forgejo Actions runs from indri's local storage. Logs are stored as zstd-compressed files in the forgejo data directory. Usage: mise run indri-runner-logs <run_id> Only works for runs executed by the indri-host-runner. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-24 17:25:32 -08:00
Erich Blume	31697b4d63	Add nettest container for CI/CD network debugging (#52) ## Summary - Add `containers/nettest/` with Alpine-based Dockerfile and connectivity test script - Add `.forgejo/workflows/build-nettest.yaml` workflow triggered by `nettest-v*` tags - Test script checks DNS resolution and HTTPS connectivity to forge and registry ## Deployment and Testing - [ ] Merge PR to main - [ ] Run `mise run container-release nettest v0.1.0` to trigger first build - [ ] Verify workflow runs successfully and container can reach tailnet services - [ ] Manually test from minikube: `kubectl run nettest --rm -it --image=registry.tail8d86e.ts.net/blumeops/nettest:v0.1.0` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/52	2026-01-24 16:54:35 -08:00
Erich Blume	13b8d16eb2	Remove mention of plans dir	2026-01-24 16:22:39 -08:00
Erich Blume	ceba6b3c2c	Remove plans, they don't seem to work	2026-01-24 16:21:49 -08:00
Erich Blume	8ca8798121	Switch to Buildah for container builds (#51) ## Summary - Replace Docker with Buildah for container image builds - No Docker socket required - buildah is daemonless - Cleaner security model (no privileged containers or socket mounting) - Remove Docker-related security context from deployment ## Changes - Update Dockerfile to install buildah/podman instead of docker-cli - Configure buildah storage with overlay driver and fuse-overlayfs - Update composite action to use `buildah bud` and `buildah push` - Add `imagePullPolicy: Always` to ensure fresh image pulls - Update test workflow to verify buildah/podman ## Testing - [ ] Runner pod starts successfully - [ ] Buildah is available in runner - [ ] Test workflow verifies buildah/podman versions - [ ] Container build workflow builds and pushes to zot 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/51	2026-01-24 13:30:26 -08:00
Erich Blume	5fcd122494	Reorganize CI/CD bootstrap phases and add custom runner Dockerfile (#50) ## Summary - Reorder CI/CD bootstrap phases to address chicken-and-egg problem - P2 is now "Custom Runner Image" (stock runner lacks Node.js) - Add P3 for "Mirror Forgejo & Build from Source" - Rename P3 -> P4 (Self-Deploy), P4 -> P5 (Container Builds) - Add Dockerfile for custom runner with Node.js, npm, docker, build tools - Update overview with new phase structure, host mode notes, and cross-compilation challenge ## Key Changes ### Phase Reordering - P1 -> P1: Enable Actions (complete) - P2 -> P2: Custom Runner Image (new focus) - (none) -> P3: Mirror Forgejo & Build (new) - P3 -> P4: Self-Deploy - P4 -> P5: Container Builds ### Custom Runner Dockerfile The stock `forgejo/runner:3.5.1` image lacks Node.js, so `actions/checkout@v4` doesn't work. The new Dockerfile adds: - Node.js + npm (for GitHub Actions) - Docker CLI (for container builds) - Build tools (gcc, make, curl, jq) ### Bootstrap Strategy 1. Build custom runner image manually on gilbert (podman build) 2. Push to zot registry 3. Update deployment to use custom image 4. Then enable auto-build workflow for runner ## Deployment and Testing - [x] Review plan changes - [x] Build custom runner image manually and verify - [x] Update runner deployment - [x] Test `actions/checkout@v4` works 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/50	2026-01-23 18:50:27 -08:00
Erich Blume	3bcad4189f	Add actionlint pre-commit hook for workflow validation (#49) ## Summary - Fix workflow to use `github.` context variables (Forgejo schema validator only recognizes GitHub Actions syntax, not `gitea.` aliases) - Pass untrusted inputs through environment variables (security best practice per actionlint) - Add actionlint to Brewfile and pre-commit config to catch workflow validation errors locally ## Deployment and Testing - [x] Pre-commit hooks all pass - [x] actionlint validates `.forgejo/workflows/test.yaml` successfully - [ ] Verify workflow runs without errors on Forge after merge 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/49	2026-01-23 17:56:24 -08:00
Erich Blume	6a436d141a	Update CI/CD plan: mark Phase 1 complete, add runner observability - Mark Phase 1 (Enable Actions) as completed with date - Check off all verification items in P1 - Add Step 6 to Phase 4 for runner logging and metrics - Update overview table with status column Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-23 17:10:14 -08:00
Erich Blume	7893c41020	Enable Forgejo Actions (Phase 1) (#48) ## Summary - Refactor Forgejo app.ini to be managed by ansible with secrets from 1Password - Enable Forgejo Actions in config (`[actions] ENABLED = true`) - Add `repo.actions` to DEFAULT_REPO_UNITS - Clean up unused MySQL database fields (we use SQLite) ## Phase 1 Progress This PR covers the first part of Phase 1 (ci-cd-bootstrap plan): - [x] Refactor app.ini to ansible template - [x] Store secrets in 1Password - [x] Enable Actions in config - [ ] Deploy config changes (pending review) - [ ] Create runner registration token - [ ] Deploy runner to k8s - [ ] Test with simple workflow ## Deployment and Testing - [ ] Run `mise run provision-indri -- --tags forgejo` to deploy - [ ] Verify Forgejo restarts correctly - [ ] Verify Actions tab appears in repo settings 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/48	2026-01-23 17:00:12 -08:00
Erich Blume	016f1043c8	Retire k8s-migration plan and create ci-cd-bootstrap plan	2026-01-23 14:13:01 -08:00
Erich Blume	25fa2ea665	Update indri-services-check	2026-01-22 21:31:11 -08:00
Erich Blume	272ddb213b	Add TeslaMate deployment for Tesla Model Y data logging (#47) ## Summary - Add TeslaMate k8s deployment with Tailscale ingress at tesla.tail8d86e.ts.net - Add teslamate user to CloudNativePG blumeops-pg cluster - Add TeslaMate PostgreSQL datasource to Grafana - Import 18 TeslaMate Grafana dashboards for charging, drives, efficiency, etc. - Add teslamate database to borgmatic backup configuration ## Deployment and Testing - [ ] Create 1Password items: "TeslaMate DB Password" and "TeslaMate Encryption Key" - [ ] Apply database user secret: `op inject -i argocd/manifests/databases/secret-teslamate.yaml.tpl | kubectl apply -f -` - [ ] Sync blumeops-pg: `argocd app sync blumeops-pg` - [ ] Create teslamate database - [ ] Apply teslamate secrets (encryption key, db connection) - [ ] Apply Grafana datasource secret: `op inject -i argocd/manifests/grafana-config/secret-teslamate-datasource.yaml.tpl | kubectl apply -f -` - [ ] Sync apps and teslamate: `argocd app sync apps teslamate grafana grafana-config` - [ ] Complete Tesla API OAuth flow at https://tesla.tail8d86e.ts.net - [ ] Verify data collection starts - [ ] Verify Grafana dashboards show data 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/47	2026-01-22 21:25:44 -08:00
Erich Blume	11075d4517	Remove logfmt parsing stage from Alloy k8s config The stage.match selector wasn't preventing Alloy from logging decode errors internally. Removing logfmt parsing entirely - JSON parsing handles most structured logs, and plain text logs still get collected. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 18:06:34 -08:00
Erich Blume	e6de7ba391	Fix Alloy logfmt decode errors for JSON logs (#46) ## Summary - Use `stage.match` to conditionally apply logfmt parsing only to lines that don't start with `{` - This prevents error spam like `"failed to decode logfmt" component_path=/ component_id=loki.process.pods component=stage type=logfmt err="logfmt syntax error at pos 2 on line 1: unexpected '\"'"` when JSON-formatted logs hit the logfmt parser ## Deployment and Testing - [ ] Sync alloy-k8s app to feature branch and verify errors stop appearing - [ ] Verify JSON logs are still parsed correctly - [ ] Verify logfmt logs (from Loki, Prometheus etc.) are still parsed correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/46	2026-01-22 18:00:34 -08:00
Erich Blume	16bfe06b7b	Fix LaunchDaemon check to use become: true LaunchDaemons run in the system domain and require sudo to query. Without become: true, the check always fails and tries to reload. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>	2026-01-22 17:34:23 -08:00
Erich Blume	57bf8512dc	Log filtering cleanup and observability improvements (#45) ## Summary - Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue) - Disable thermal collector on indri Alloy (not supported on macOS M1) - Add macOS power/thermal metrics collection via powermetrics LaunchDaemon - Add Power & Thermal section to macOS Grafana dashboard - Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.) - Extract more fields from JSON logs (zot compatibility - uses "message" not "msg") - Silence logfmt parse errors for non-logfmt logs - Fix JSON escaping in devpi dashboard ## Deployment and Testing - [x] Deployed Alloy config changes to indri via ansible - [x] Synced alloy-k8s and grafana-config via ArgoCD - [x] Verified power metrics appearing in Prometheus - [x] Verified thermal collector errors stopped - [x] Verified logfmt parse errors silenced - [x] Verified devpi dashboard loads correctly 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45	2026-01-22 17:30:08 -08:00
Erich Blume	af39067e1f	Pin ArgoCD to v3.2.6 (#44) ## Summary - Pin ArgoCD kustomization to v3.2.6 tag instead of `stable` branch - This gives intentional control over ArgoCD version upgrades ## Deployment and Testing - [ ] Sync the `apps` application: `argocd app sync apps` - [ ] Point argocd at feature branch: `argocd app set argocd --revision feature/pin-argocd-v3.2.6` - [ ] Sync argocd: `argocd app sync argocd` - [ ] Verify ArgoCD is running v3.2.6 - [ ] After merge, reset to main: `argocd app set argocd --revision main && argocd app sync argocd` 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/44	2026-01-22 16:38:27 -08:00
Erich Blume	e4a8405de7	Observability cleanup and k8s service monitoring (#43) ## Summary - Remove stale `/opt/homebrew/var/loki` from borgmatic backup (Loki migrated to k8s) - Add Alloy k8s DaemonSet for automatic pod log collection with auto-discovery - Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd - Add transmission-exporter sidecar for full metrics (speed, torrent counts, ratios) - Replace stale devpi dashboard with probe-based metrics (status, response time, uptime) - Add unified "K8s Services Health" dashboard for service uptime/response monitoring ## Manual cleanup already performed - Deleted stale textfile metrics on indri: `devpi.prom`, `transmission.prom` - Deleted stale data directories on indri: `/opt/homebrew/var/loki/`, `/opt/homebrew/var/prometheus/` ## Deployment and Testing - [x] Sync `apps` application to pick up new alloy-k8s app - [x] Deploy alloy-k8s on feature branch: `argocd app set alloy-k8s --revision feature/observability-cleanup && argocd app sync alloy-k8s` - [x] Deploy torrent on feature branch (for transmission exporter): `argocd app set torrent --revision feature/observability-cleanup && argocd app sync torrent` - [x] Deploy prometheus on feature branch (for new scrape config): `argocd app set prometheus --revision feature/observability-cleanup && argocd app sync prometheus` - [x] Deploy grafana-config on feature branch (for dashboards): `argocd app set grafana-config --revision feature/observability-cleanup && argocd app sync grafana-config` - [x] Verify pod logs appear in Loki/Grafana - [x] Verify transmission metrics appear in Prometheus - [x] Verify service probe metrics appear in Prometheus - [x] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic config - [ ] After merge, reset apps to main and resync 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/43	2026-01-22 13:51:01 -08:00
Erich Blume	17023085cb	Migrate observability stack to Kubernetes (#42) Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack. Summary - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal) - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net) - Add ACL rule for port 9187 (CNPG metrics) - Delete obsolete ansible roles for prometheus and loki Changes - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints - pulumi/policy.hujson - ACL for port 9187 - Deleted ansible/roles/prometheus/ and ansible/roles/loki/ Deployment and Testing - Stop prometheus and loki on indri - Sync ArgoCD apps (apps, prometheus, loki, grafana) - Run mise run provision-indri -- --tags alloy - Verify Grafana dashboards show data 🤖 Generated with https://claude.ai/claude-code Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42	2026-01-22 12:06:02 -08:00
Erich Blume	5a829e0afd	Remove unused indri tags and ansible roles (#41 ) ## Summary - Remove ansible roles for services migrated to k8s: devpi, kiwix, transmission - Also remove unused node_exporter and podman ansible roles - Remove service tags from indri for k8s-hosted services (grafana, kiwix, devpi, pg, feed) - Update indri description to reflect current architecture ## Changes Ansible roles removed (34 files, ~1000 lines): - devpi, devpi_metrics - kiwix - transmission, transmission_metrics - node_exporter - podman Pulumi indri tags removed: - tag:grafana, tag:kiwix, tag:devpi, tag:pg, tag:feed These services now run in k8s with their own Tailscale devices via tailscale-operator. ## Deployment and Testing - [x] Verified remaining ansible roles match indri.yml - [x] Verified no playbooks or role dependencies reference removed roles - [ ] Run `pulumi preview` to verify tag changes - [ ] Run `pulumi up` to apply tag changes 🤖 Generated with [Claude Code](https://claude.com/claude-code) Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/41	2026-01-21 20:18:53 -08:00
Erich Blume	6a140107c6	P7 forgejo plan updated	2026-01-21 20:04:18 -08:00
Erich Blume	4dd74dfff8	complete P6	2026-01-21 19:16:04 -08:00

135 commits