Commit graph

59 commits

Author SHA1 Message Date
be54cc3411 C1: migrate homepage dashboard to ringtail k3s
Repoint the ArgoCD Application destination from minikube to ringtail and
bump the image tag to the new amd64 nix-built v1.11.0-b87f62e-nix.

Rework services.yaml for the autodiscovery shift: 11 services that
previously auto-populated via minikube Ingress annotations (ArgoCD,
Immich, Kiwix, Mealie, Miniflux, Grafana, Prometheus, Navidrome,
Paperless, TeslaMate, Transmission) become explicit static entries with
their widget configs preserved. Conversely, the ringtail services that
will now auto-populate (Frigate/NVR, Authentik, Ntfy) are removed from
the static list to avoid duplicates; Ollama becomes newly visible.

Add a Content group for Immich/Kiwix/Miniflux which previously lived
under the autodiscovered "Content" group from annotations.
2026-05-10 20:37:03 -07:00
5096223b48 C1: clean up cv + docs minikube artifacts (#343)
## Summary

Follow-up to #342. The cv and docs services are now live on indri (Caddy file_server backed by ansible-managed tarball extraction) and verified working. This PR removes the dead minikube artifacts and the tooling shims that referenced them.

## Changes

**Deletions:**
- ``argocd/apps/{cv,docs}.yaml``
- ``argocd/manifests/{cv,docs}/`` (deployment, service, ingress, pdb, kustomization)
- ``containers/{cv,quartz}/`` (Dockerfiles + start scripts)

**Tooling:**
- ``mise-tasks/container-version-check``: remove the ``quartz``→``docs`` CONTAINER_TO_SERVICE mapping (containers/quartz no longer exists)
- ``service-versions.yaml``: bump ``docs.current-version`` to ``v1.16.0`` (the blumeops docs release tag) and trim the migration-window comment

## Live state context

The argocd Applications ``cv`` and ``docs`` were already deleted from the cluster manually as part of the cutover; this PR just removes the YAML files that the ``apps`` app-of-apps was still ingesting. After merge, ``argocd app sync apps`` will reconcile and the ``apps`` Application returns to Synced.

The Caddyfile ``handle_errors`` bug that briefly crashed all ``*.ops.eblu.me`` services during cutover is fixed in a separate C0 (``2ee53fe``) on main, not here.

## Test plan
- [x] ``mise run container-version-check --all-files`` clean
- [x] ``mise run service-review --type ansible`` shows cv at 1.0.3, docs at v1.16.0
- [ ] After merge: ``argocd app sync apps`` returns clean (cv/docs entries gone, no children to reconcile)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #343
2026-04-29 15:18:39 -07:00
14ca0160ba Migrate devpi from minikube to indri (launchd) (#341)
## Summary

Devpi was crash-looping under memory pressure on the minikube StatefulSet, breaking the Python toolchain across the repo (`mise run docs-mikado`, `prek`, every `uv pip install`). It moves to indri as a native LaunchAgent.

## What changed

- **New ansible role** `ansible/roles/devpi/`: installs `devpi-server` + `devpi-web` into a uv-managed venv, initializes the server-dir on first run via 1Password root password, runs as a LaunchAgent (`mcquack.eblume.devpi`) bound to `127.0.0.1:3141`. Bootstraps from upstream PyPI (so devpi can install itself on a fresh box).
- **Caddy**: `pypi.ops.eblu.me` now proxies to `http://localhost:3141`.
- **Playbook**: `indri.yml` gains pre_tasks for the root password and the new role.
- **service-versions.yaml**: devpi flipped from `type: argocd` to `type: ansible`.
- **ArgoCD**: removed `apps/devpi.yaml` and `manifests/devpi/`. The in-cluster Application, namespace, and PVC have been deleted.
- **Docs**: new how-to `docs/how-to/operations/devpi-on-indri.md`; `restart-indri.md` lists devpi in the LaunchAgent stop list.

## Already deployed (live on indri)

- Service running: `launchctl list mcquack.eblume.devpi` → PID 53888
- `curl https://pypi.ops.eblu.me/+api` returns 200 
- `mise run docs-mikado` works again 
- 1.0G of cached PyPI data was migrated from the PVC to `~erichblume/devpi/server-dir/`
- Minikube namespace and PVC fully reclaimed

## Test plan

- [ ] `mise run services-check` (after merge)
- [ ] CI workflows that use devpi succeed
- [ ] No regressions in tools that depend on `pypi.ops.eblu.me` (prek, uv-script tasks, dagger pipelines)

## Context

This is the C1 prelude to a planned C2 chain (`mikado/retire-minikube-indri`) to retire minikube on indri entirely. Doing devpi as a standalone C1 was the right call because (a) it was urgent — it was breaking the toolchain — and (b) it shakes out the migration recipe before we commit to a multi-leaf chain.

Reviewed-on: #341
2026-04-29 13:38:36 -07:00
07f52e9488 Deploy Paperless-ngx document management (#328)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-dockerfile (paperless) (push) Successful in 9s
## Summary

- Add paperless-ngx (v2.20.13) as a new ArgoCD-managed service on indri
- Dockerfile built from forge mirror (`mirrors/paperless-ngx`), multi-stage with s6-overlay
- PostgreSQL database via `blumeops-pg` CNPG cluster, Redis sidecar for Celery
- NFS document storage on sifaka (`/volume1/paperless`)
- Authentik OIDC SSO via baked JSON blob from 1Password
- Caddy route at `paperless.ops.eblu.me`
- 1Password item "Paperless (blumeops)" created with all secrets

## Files

- `containers/paperless/Dockerfile` — multi-stage build
- `argocd/manifests/paperless/` — full k8s manifest set
- `argocd/apps/paperless.yaml` — ArgoCD application
- `argocd/manifests/databases/` — CNPG role + ExternalSecret
- `ansible/roles/caddy/defaults/main.yml` — Caddy route
- `service-versions.yaml` — version tracking entry
- `docs/reference/services/paperless.md` — reference card

## Remaining deploy steps

1. Build container: `mise run container-build-and-release paperless`
2. Update kustomization.yaml `newTag` with actual image tag
3. Create Authentik application/provider for paperless
4. Create `paperless` database on blumeops-pg
5. Sync ArgoCD apps, then sync paperless from branch
6. Provision Caddy: `mise run provision-indri -- --tags caddy`
7. Verify at https://paperless.ops.eblu.me

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #328
2026-04-08 17:54:12 -07:00
c7e5af6d51 Migrate 1Password Connect from Helm to kustomize (1.8.1 → 1.8.2) (#326)
## Summary

- Renders manifests from `connect-helm-charts v2.4.1` as plain kustomize (deployment + service)
- Bumps 1Password Connect from 1.8.1 → 1.8.2
- Completes the no-helm-policy migration — all services now use kustomize
- Retains all production hardening from the Helm chart (securityContext, runAsNonRoot, drop ALL, seccomp, resource limits)

## Changes

- **New:** `deployment.yaml`, `service.yaml`, `kustomization.yaml` in `argocd/manifests/1password-connect/`
- **Rewritten:** Both ArgoCD app definitions (indri + ringtail) — single source kustomize instead of multi-source Helm
- **Deleted:** `values.yaml` (Helm values no longer needed)
- **Updated:** `no-helm-policy.md`, `service-versions.yaml`, `README.md`

## Deployment plan

1. Sync `apps` app to pick up the new app definitions
2. `argocd app set 1password-connect --revision 1password-connect-kustomize`
3. `argocd app sync 1password-connect` — verify on indri
4. Repeat for ringtail
5. After merge: reset revision to main, re-sync both

## Test plan

- [ ] `kubectl kustomize` renders cleanly (verified locally)
- [ ] ArgoCD diff shows expected changes (Helm labels removed, images bumped)
- [ ] Pods come up healthy on indri
- [ ] External Secrets still resolves 1Password items
- [ ] Repeat on ringtail

Reviewed-on: #326
2026-04-06 07:31:40 -07:00
64200a55c5 Migrate Immich from Helm chart to kustomize manifests (v2.5.6 → v2.6.3)
Replace the Helm chart deployment with plain kustomize manifests following
the Authentik pattern (separate deployments per component). Consolidate
the immich-storage ArgoCD app into the main immich app. Add no-helm-policy
doc establishing kustomize as the standard deployment mechanism.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-04 09:42:25 -07:00
f9206bf10b Build custom Kingfisher container from sporked deploy branch (#318)
All checks were successful
Build Container / detect (push) Successful in 2s
Build Container / build-nix (kingfisher) (push) Successful in 12s
## Summary

- Add Dockerfile for Kingfisher built from source (sporked deploy branch)
- Multi-stage: Rust build with Boost/vectorscan, debian-slim runtime
- Switch CronJob from upstream `ghcr.io/mongodb/kingfisher` to `registry.ops.eblu.me/blumeops/kingfisher`
- Add kingfisher to service-versions.yaml (version tracks upstream main SHA)
- Document spork workflow in CLAUDE.md

## Test plan

- [ ] Build container: `mise run container-build-and-release kingfisher 1d37d29`
- [ ] Verify image on registry: `mise run container-list`
- [ ] Update kustomization newTag
- [ ] Sync ArgoCD kingfisher app from branch
- [ ] Trigger manual CronJob and verify scan completes
- [ ] Verify reports on sifaka

Reviewed-on: #318
2026-03-30 06:34:49 -07:00
35705faca2 Add Kingfisher secret scanner CronJob (#317)
## Summary

- Deploys MongoDB Kingfisher as a weekly CronJob on minikube-indri
- Scans all Forgejo repos (eblume + all orgs) for leaked secrets with live validation
- Produces timestamped HTML and JSON reports on sifaka NFS (`/volume1/reports/kingfisher/`)
- Forgejo API token sourced from 1Password via ExternalSecret
- Uses official `ghcr.io/mongodb/kingfisher:1.91.0` container image
- Runs Sunday 4am (after Prowler's 3am k8s scan)

## Resources

- CronJob, PV/PVC (sifaka NFS), ExternalSecret
- ArgoCD Application with manual sync + CreateNamespace

## Test plan

- [x] Sync ArgoCD `apps` app to pick up new kingfisher Application
- [x] Set `--revision feature/kingfisher-cronjob` on kingfisher app
- [x] Verify ExternalSecret creates the `kingfisher-forgejo-token` Secret
- [x] Trigger manual job: `kubectl create job --from=cronjob/kingfisher kingfisher-manual -n kingfisher --context=minikube-indri`
- [ ] Verify reports appear on sifaka at `/volume1/reports/kingfisher/`
- [ ] After merge: set `--revision main` and re-sync

Reviewed-on: #317
2026-03-28 21:39:55 -07:00
796baaa41a Upgrade External Secrets Operator v2.2.0 + migrate Helm to kustomize (#312)
## Summary

- Upgrade External Secrets Operator from v1.3.2 (helm-chart-2.0.0) to v2.2.0
- Migrate from Helm chart deployment to static kustomize manifests, matching the repo's kustomize-first pattern
- Merge separate `-config` ArgoCD apps into the main operator apps (6 → 4 apps)
- Clean up Helm-specific labels (`helm.sh/chart`, `managed-by: Helm`)
- Update README example from v1beta1 to v1 API

## Breaking changes assessment

Low risk — v2.0.0 removed Alibaba and Device42 providers (we use neither). No templating changes affect us. All ExternalSecrets already use v1 API.

## Deployment steps

1. Sync CRDs first on both clusters (new CRD version)
2. Sync operator apps (now kustomize-based)
3. Verify ClusterSecretStore and all ExternalSecrets are healthy
4. Delete orphaned config apps: `argocd app delete external-secrets-config` and `-config-ringtail`
5. `mise run services-check`

Reviewed-on: #312
2026-03-25 15:56:41 -07:00
d021b3534f Deploy Prowler CIS scanner (#310)
All checks were successful
Build Container / detect (push) Successful in 4s
Build Container / build-dockerfile (prowler) (push) Successful in 10s
## Summary
- Deploy Prowler 5 as a weekly CronJob on minikube-indri for CIS Kubernetes Benchmark v1.11 scanning
- Custom slim container build (strips PowerShell, Trivy, and non-K8s providers from upstream)
- Reports (HTML, CSV, JSON-OCSF) written to NFS share on sifaka at `/volume1/reports/prowler/`
- Read-only ClusterRole for pod, RBAC, and control plane inspection
- Host path mounts + hostPID for kubelet file permission checks

## Follow-ups
- Mirror prowler-cloud/prowler on forge for supply chain control
- Build and push container image, update kustomization.yaml newTag
- Consider adding k3s-ringtail scanning (core + RBAC checks only)

## Test plan
- [ ] Build container: `mise run container-release prowler v5.22.0`
- [ ] Update `argocd/manifests/prowler/kustomization.yaml` newTag to built image tag
- [ ] Sync ArgoCD: `argocd app sync apps && argocd app set prowler --revision deploy-prowler && argocd app sync prowler`
- [ ] Trigger manual job: `kubectl create job --from=cronjob/prowler prowler-manual -n prowler --context=minikube-indri`
- [ ] Verify reports appear on sifaka NFS share
- [ ] `mise run services-check`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #310
2026-03-24 16:08:09 -07:00
fc45989a6c Decommission JobSync service (#308)
All checks were successful
Build Container / detect (push) Successful in 3s
## Summary

- Remove all JobSync infrastructure: ArgoCD app, k8s manifests, container build (nix), Caddy reverse proxy entry, Homepage dashboard entry, service-versions tracking, and all documentation
- Runtime teardown already completed: ArgoCD app cascade-deleted (removes deployment, PVC, service, ingress, external-secret), forge mirror deleted, 1Password item archived, local clone removed

## Motivation

Replacing JobSync with a datasette-based job tracking pipeline driven by mise tasks and a Claude agent frontend. JobSync's Next.js server actions don't expose a useful API for automation.

## Remaining manual steps after merge

- Provision Caddy to remove the stale proxy route: `mise run provision-indri -- --tags caddy`
- Sync Homepage: `argocd app sync homepage`
- Verify namespace cleanup on ringtail: `kubectl get ns jobsync --context=k3s-ringtail` (should be gone)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #308
2026-03-24 08:44:23 -07:00
3750428b58 Fix ArgoCD apps app permanent OutOfSync
Remove `group: ""` from ignoreDifferences in tailscale-operator and
tailscale-operator-ringtail — ArgoCD normalizes away the empty string
field, so the live state never matches git.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-22 20:42:37 -07:00
11330ebea0 Deploy Mealie recipe manager (#299)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (mealie) (push) Successful in 2s
Build Container / build (mealie) (push) Successful in 8s
## Summary

- Deploy Mealie (self-hosted recipe manager) on minikube-indri via ArgoCD
- Build container from source via forge mirror (`mirrors/mealie`) — multi-stage Dockerfile with Node.js frontend + Python/uv backend
- Add Caddy proxy entry for `meals.ops.eblu.me`
- Part of a larger meal planning pipeline: Mealie stores categorized recipes, a planner script selects balanced meals, and Ollama generates unified cooking timelines

## Status

- [x] Mirror mealie repo on forge
- [x] Dockerfile (from-source build)
- [x] ArgoCD app + k8s manifests
- [x] Caddy proxy entry
- [x] Service docs, routing table, app registry
- [ ] Local Dagger build test
- [ ] Container build + push to registry
- [ ] Update kustomization.yaml with real image tag
- [ ] Deploy and verify
- [ ] Provision Caddy

## Test plan

- Build container locally via `dagger call build --src=. --container-name=mealie`
- Trigger CI build via `mise run container-build-and-release mealie`
- Deploy from branch: `argocd app set mealie --revision deploy-mealie && argocd app sync mealie`
- Verify Mealie UI at `https://meals.ops.eblu.me`
- Verify API docs at `https://meals.ops.eblu.me/docs`

Reviewed-on: #299
2026-03-16 21:59:10 -07:00
4dc3e5cae2 Add UnPoller for UniFi network metrics (#298)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (unpoller) (push) Successful in 2s
Build Container / build (unpoller) (push) Successful in 7s
## Summary
- Deploy UnPoller as a k8s service on indri to export UniFi controller metrics to Prometheus
- Custom-built container from forge mirror (`containers/unpoller/Dockerfile`)
- Credentials pulled from 1Password via external-secrets
- Prometheus scrape job added, docs and service-versions updated

## Test plan
- [ ] Build container: `mise run container-release unpoller v2.34.0`
- [ ] Update kustomization tag with built image tag
- [ ] Deploy from branch: `argocd app set unpoller --revision feature/unpoller && argocd app sync unpoller`
- [ ] Verify pod connects to UX7 controller (check logs)
- [ ] Confirm `unpoller` target appears in Prometheus
- [ ] Query `unifi_` metrics in Grafana

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #298
2026-03-16 15:52:45 -07:00
40f1568088 Remove unused Mosquitto MQTT broker from ringtail
Mosquitto has been dormant since frigate-notify switched from MQTT to
webapi polling (529ba10). Tear down live infra (ArgoCD app, namespace)
and remove all manifests, service-versions entry, services-check, and
doc references.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 18:37:31 -07:00
3a811fb188 Deploy JobSync — job search tracker on ringtail k3s (#288)
All checks were successful
Build Container (Nix) / detect (push) Successful in 1s
Build Container / detect (push) Successful in 2s
Build Container / build (jobsync) (push) Successful in 2s
Build Container (Nix) / build (jobsync) (push) Successful in 8s
## Summary

C2 Mikado chain to deploy [JobSync](https://github.com/Gsync/jobsync) — a self-hosted job application tracker — to ringtail's k3s cluster.

### Mikado Graph

```
deploy-jobsync (goal)
├── build-jobsync-container
│   └── mirror-jobsync
└── integrate-jobsync-ollama
```

### What is JobSync?

Next.js app with SQLite for tracking job applications. Features resume management, application pipeline tracking, and AI-powered resume review/job matching.

### Key Decisions

- **Ringtail k3s** (not minikube-indri) — colocates with Ollama for zero-latency AI
- **Nix container** via `buildLayeredImage` — no Dockerfile, mirrors upstream source on forge
- **Ollama for AI** — uses existing deployment, no API keys needed for AI features
- **No upstream fork** — vanilla JobSync, Anthropic AI deferred to future work if needed

### Current Status

Planning phase — cards committed, ready for review before implementation begins.

Reviewed-on: #288
2026-03-08 11:02:05 -07:00
c281fb5403 Add OpenTelemetry distributed tracing (Tempo + Beyla eBPF) (#286)
## Summary

Adds the third observability pillar — **distributed tracing** — alongside existing metrics (Prometheus) and logs (Loki).

- **Grafana Tempo 2.10.1** on minikube-indri for trace storage with 7d retention, OTLP receivers, and `metrics_generator` that remote-writes span-metrics (RED) to Prometheus
- **Beyla eBPF auto-instrumentation** via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes
- **Grafana integration** — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking
- **Prometheus** scrapes Tempo operational metrics

### Architecture

```
ringtail (k3s)                                indri (minikube)
┌──────────────────────┐                      ┌─────────────────────┐
│ Alloy+Beyla (eBPF)   │──OTLP HTTP────────→ │ Tempo               │
│  ↳ Frigate, ntfy,    │  via tailnet         │  ↳ trace storage    │
│    Ollama, Immich     │                      │  ↳ RED → Prometheus │
└──────────────────────┘                      │                     │
                                              │ Grafana             │
                                              │  ↳ Tempo datasource │
                                              └─────────────────────┘
```

### New files (12)
- `docs/reference/services/tempo.md` — reference doc
- `docs/changelog.d/feature-otel-tracing.feature.md`
- `argocd/apps/tempo.yaml` + `argocd/manifests/tempo/` (6 files)
- `argocd/apps/alloy-tracing-ringtail.yaml` + `argocd/manifests/alloy-tracing-ringtail/` (4 files)

### Modified files (6)
- `argocd/manifests/grafana/datasources.yaml` — Tempo datasource + Loki derivedFields
- `argocd/manifests/prometheus/prometheus.yml` — Tempo scrape target
- `service-versions.yaml` — tempo + alloy-tracing-ringtail entries
- `docs/reference/services/grafana.md` — Tempo in datasources table
- `docs/reference/reference.md` — Tempo in services index
- `docs/reference/operations/observability.md` — Tempo in components list

## Deployment and Testing

- [ ] Sync `apps` app to pick up new Application definitions
- [ ] `argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo`
- [ ] Verify Tempo pod: `kubectl --context=minikube-indri get pods -n monitoring -l app=tempo`
- [ ] Verify Tempo ready: port-forward 3200 and `curl localhost:3200/ready`
- [ ] Verify Tailscale ingresses: `kubectl --context=minikube-indri get ingress -n monitoring`
- [ ] `argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail`
- [ ] Check Beyla discovery in alloy-tracing logs on ringtail
- [ ] Sync grafana-config for updated datasources
- [ ] Sync prometheus for updated scrape config
- [ ] Test Grafana Tempo datasource connection
- [ ] Generate test traffic and search traces in Grafana Explore → Tempo
- [ ] After merge: reset all ArgoCD app revisions back to main

Reviewed-on: #286
2026-03-05 10:51:07 -08:00
01d3b4d1c7 Switch forgejo-runner ArgoCD app to internal SSH repo URL
Was the only app still using https://forge.eblu.me (public proxy) for
git polling. All other apps already use the internal SSH endpoint at
forge.ops.eblu.me.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:43:01 -08:00
a87c997ee1 Expose Forgejo publicly at forge.eblu.me (#278)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m28s
## Summary

Expose Forgejo publicly at `forge.eblu.me` via the Fly.io reverse proxy — the first dynamic, authenticated public-facing service.

- **Forgejo hardening:** Domain changed to forge.eblu.me, SSH stays on forge.ops.eblu.me, reverse proxy trust headers configured, local registration locked to external-only (Authentik SSO)
- **Tailscale Ingress:** ExternalName Service + Ingress in tailscale-operator creates forge.tail8d86e.ts.net endpoint
- **Fly.io proxy:** nginx server block with rate-limited auth endpoints (3r/s), fail2ban with custom nginx-deny action, security headers, /swagger blocked, WebSocket support, 512m body limit
- **Authentik:** OAuth callback updated to forge.eblu.me
- **DNS/TLS:** CNAME record in Pulumi, cert in fly-setup
- **Rename:** ~29 files updated from forge.ops.eblu.me to forge.eblu.me (HTTPS refs only; SSH, container builds, and Caddy table kept as-is)

## Deployment Order

1. `mise run provision-indri -- --tags forgejo` (config changes)
2. Verify forge.ops.eblu.me still works
3. `argocd app set tailscale-operator --revision feature/forge-public && argocd app sync tailscale-operator`
4. Verify `curl https://forge.tail8d86e.ts.net`
5. `cd fly && fly deploy`
6. Verify pre-DNS: `curl -H "Host: forge.eblu.me" https://blumeops-proxy.fly.dev/`
7. `fly certs add forge.eblu.me -a blumeops-proxy`
8. `argocd app set authentik --revision feature/forge-public && argocd app sync authentik`
9. `mise run dns-preview && mise run dns-up`
10. Full verification (see below)
11. Rehearse `mise run fly-shutoff`
12. After merge: reset ArgoCD revisions to main, re-sync

## Verification Checklist

- [ ] forge.eblu.me loads, shows public repos
- [ ] forge.ops.eblu.me still works from tailnet
- [ ] SSH clone via forge.ops.eblu.me:2222 works
- [ ] HTTPS clone via forge.eblu.me works
- [ ] UI shows forge.eblu.me for HTTPS clone, forge.ops.eblu.me for SSH
- [ ] /swagger returns 403
- [ ] Rapid login attempts trigger 429 rate limit
- [ ] fail2ban bans after 5 failed logins in 10 minutes
- [ ] ArgoCD can still sync (SSH unaffected)
- [ ] `mise run fly-shutoff` stops all public traffic
- [ ] `mise run services-check` passes

Reviewed-on: #278
2026-03-03 08:40:41 -08:00
31d925814f Deploy Ollama LLM server on ringtail (#277)
## Summary
- Deploy Ollama as a new ArgoCD-managed service on ringtail's k3s cluster with GPU acceleration
- Declarative model management via `models.txt` + sidecar sync script (mirrors kiwix torrent pattern)
- Initial models: `qwen2.5:14b`, `deepseek-r1:14b`, `phi4:14b`, `gemma3:12b`
- hostPath PV on `/mnt/storage1/ollama` for fast local model storage (200Gi)
- Tailscale ingress at `ollama.ops.eblu.me` for API access from tailnet
- Enable GPU time-slicing (`replicas: 2`) on nvidia-device-plugin so Frigate and Ollama share the RTX 4080

## Deployment and Testing
- [ ] Deploy nvidia-device-plugin changes first: `argocd app sync nvidia-device-plugin`
- [ ] Verify GPU time-slicing: `kubectl describe node ringtail --context=k3s-ringtail` shows `nvidia.com/gpu: 2`
- [ ] Sync `apps` app with `--revision feature/ollama-ringtail`
- [ ] Set ollama app to branch: `argocd app set ollama --revision feature/ollama-ringtail && argocd app sync ollama`
- [ ] Verify model-sync sidecar pulls models: `kubectl logs -n ollama deploy/ollama -c model-sync --context=k3s-ringtail`
- [ ] Test API: `curl https://ollama.ops.eblu.me/api/tags`
- [ ] Test inference: `curl https://ollama.ops.eblu.me/api/generate -d '{"model":"qwen2.5:14b","prompt":"Hello"}'`
- [ ] Verify Frigate still works after GPU sharing change
- [ ] After merge: `argocd app set ollama --revision main && argocd app sync ollama`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/277
2026-03-02 20:39:51 -08:00
db561c6b0e Upgrade ArgoCD v3.2.6 → v3.3.2 with Server-Side Apply (#272)
## Summary
- Upgrade ArgoCD from v3.2.6 to v3.3.2
- Enable `ServerSideApply=true` sync option (required by v3.3 — ApplicationSet CRD exceeds client-side apply annotation limit)
- Update service-versions.yaml with review for argocd and 1password-connect

## Breaking changes reviewed
- **Server-Side Apply required**: Added to syncOptions 
- **Source Hydrator git notes**: Not used — N/A
- **Application path cleaning removed**: Not used — N/A
- **Settings API field restriction**: Authenticated access only — N/A

## Deployment and Testing
- [ ] Sync the `apps` app first (picks up SSA syncOption change)
- [ ] `argocd app set argocd --revision feature/argocd-v3.3.2`
- [ ] `argocd app sync argocd`
- [ ] Verify all argocd pods running with v3.3.2 images
- [ ] Verify other apps still sync correctly
- [ ] After merge: `argocd app set argocd --revision main && argocd app sync argocd`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/272
2026-02-26 06:51:50 -08:00
03d71544ec Add multi-cluster observability with ringtail metrics and dashboards (#270)
## Summary
- Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs
- Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests)
- Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki
- Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with:
  - **Kubernetes Clusters** dashboard — multi-cluster with `cluster` and `namespace` template variables
  - **Ringtail (k3s)** dashboard — dedicated ringtail view with GPU usage panels

## Deployment and Testing
1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`)
2. Sync `prometheus` → verify `cluster` label on scraped metrics
3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs
4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs
5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail
6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}`
7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values
8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods

## Notes
- Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later
- DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve
- The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270
2026-02-25 22:01:00 -08:00
de54b4e33d Port CloudNative-PG off Helm to direct release manifest (#268)
## Summary
- Point ArgoCD app directly at forge-mirrored upstream repo (`mirrors/cloudnative-pg`) instead of the Helm charts repo
- Use `directory.include` to select the specific release manifest (`cnpg-1.27.1.yaml`) from the `releases/` directory
- No vendored files, no Helm — upgrades are a two-line change (`targetRevision` + `directory.include`)
- Delete unused `values.yaml` (was empty, all Helm defaults)

## Deployment and Testing
- [ ] Register mirror repo in ArgoCD: `argocd repo add ssh://forgejo@forge.ops.eblu.me:2222/mirrors/cloudnative-pg.git --ssh-private-key-path <key>`
- [ ] `argocd app set cloudnative-pg --revision feature/cnpg-direct-source && argocd app sync cloudnative-pg`
- [ ] Verify operator pod running: `kubectl get pods -n cnpg-system --context=minikube-indri`
- [ ] Verify CRDs exist: `kubectl get crd --context=minikube-indri | grep cnpg`
- [ ] Verify existing clusters healthy: `kubectl get clusters -A --context=minikube-indri`
- [ ] After merge: `argocd app set cloudnative-pg --revision main && argocd app sync cloudnative-pg`

## Notes
- The forge mirror was created via `mise run mirror-create` from `https://github.com/cloudnative-pg/cloudnative-pg.git`
- ArgoCD may need the mirror repo added to its known repositories if the credential template doesn't already match `mirrors/*`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/268
2026-02-25 17:37:53 -08:00
5f9bc20345 Fix mirror org refs in ArgoCD apps and widen credential template (#266)
## Summary

- Widen `repo-creds-forge` URL prefix from `/eblume/` to host-wide `/` so it matches repos in all forge orgs (fixes `mirrors/` repos not getting SSH credentials)
- Update 8 ArgoCD app definitions from `eblume/<mirror>` → `mirrors/<mirror>` (immich-charts, cloudnative-pg-charts, external-secrets, connect-helm-charts)
- Fix stale alloy clone comment in Ansible defaults
- Bump immich v2.5.2 → v2.5.6 (bug-fix patches only)
- Update ArgoCD README bootstrap command and credential docs

## Context

Mirrors were migrated from `forge.ops.eblu.me/eblume/` to `forge.ops.eblu.me/mirrors/` in commit `cd57814`. Container Dockerfiles and image tags were updated, but ArgoCD app definitions and the repo credential template were missed, causing `ComparisonError` on apps that source Helm charts from mirrored repos.

## Deployment

1. Sync the ArgoCD `argocd` app first (picks up the widened credential template)
2. Sync the `apps` app (picks up new repo URLs for all 8 apps)
3. Verify immich resolves its ComparisonError: `argocd app get immich`
4. Sync immich to deploy v2.5.6: `argocd app sync immich`
5. Spot-check: `argocd app get external-secrets`, `argocd app get cloudnative-pg`, `argocd app get 1password-connect`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/266
2026-02-25 06:55:53 -08:00
d05d2fbaff C2: Upgrade Grafana to 12.x with Nix container and Kustomize (#260)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 1s
Build Container (Nix) / build (grafana) (push) Successful in 2s
Build Container / build (grafana) (push) Successful in 7s
## Summary

Mikado chain to upgrade Grafana from 11.4.0 (Helm chart) to 12.x with:
- Home-built Nix container image (`forge.ops.eblu.me/eblume/grafana`)
- Kustomize manifests replacing the Helm chart
- Single-source ArgoCD app

## Chain

Goal: `upgrade-grafana`
Leaves: `build-grafana-container`, `kustomize-grafana-deployment`

Track with: `mise run docs-mikado upgrade-grafana`

## Test plan
- [ ] Container builds successfully via Nix
- [ ] Container pushed to registry
- [ ] Kustomize manifests produce equivalent resources to current Helm
- [ ] Pod runs, UI loads, OIDC works, datasources healthy
- [ ] `mise run services-check` passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/260
2026-02-23 18:07:18 -08:00
71cb256527 Deploy Authentik identity provider (C2 Mikado) (#227)
## Summary
C2 Mikado chain for deploying Authentik as the SSO identity provider, replacing Dex.

This PR will evolve over multiple sessions. Each iteration adds documentation (prerequisite cards) and eventually code as leaf nodes are resolved.

## Current Mikado State
- **Goal:** `deploy-authentik` (active)
- **Leaf prerequisites:**
  - `build-authentik-container` — Build Nix container image
  - `provision-authentik-database` — Create PostgreSQL database on CNPG cluster
  - `create-authentik-secrets` — Create 1Password item with credentials

## Process refinements
- Updated agent-change-process with lessons from first attempt: reset code before committing cards, open PRs early

## Test plan
- [ ] `mise run docs-mikado` shows correct dependency chain
- [ ] Leaf nodes can be worked independently
- [ ] Container builds on ringtail
- [ ] Authentik starts and reaches healthy state
- [ ] Forgejo OAuth2 connector works

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/227
2026-02-20 12:55:59 -08:00
0cdc143227 Deploy Dex OIDC identity provider with Grafana SSO (#222)
## Summary
- Deploys Dex OIDC identity provider on ringtail k3s cluster as central authentication service
- Integrates Grafana as first SSO client via `auth.generic_oauth`
- Uses Kubernetes CRD storage backend (no PVC needed)
- All secrets (bcrypt hash, client secrets) injected via ExternalSecrets from 1Password item "Dex (blumeops)"
- NixOS-built container image via `containers/dex/default.nix`

## Pre-requisites (manual, before deployment)
1. Create 1Password item "Dex (blumeops)" in `blumeops` vault with fields:
   - `password`: strong generated password for Dex login
   - `static-password-hash`: bcrypt hash of above (`htpasswd -BnC 10 eblume`, copy hash after `eblume:`)
   - `grafana-client-secret`: random 32-char hex (`openssl rand -hex 16`)
2. Build container: `mise run container-tag-and-release dex v1.0.0`

## Deployment sequence
1. Build container: `mise run container-tag-and-release dex v1.0.0`
2. Deploy Caddy: `mise run provision-indri -- --tags caddy`
3. Sync ArgoCD: `argocd app sync apps` → `argocd app sync dex`
4. Verify Dex: `curl https://dex.ops.eblu.me/.well-known/openid-configuration`
5. Sync Grafana: `argocd app sync grafana-config` → `argocd app sync grafana`
6. Test SSO: Visit `https://grafana.ops.eblu.me/login`, click "Sign in with Dex"

## Verification
- [ ] Container image exists: `mise run container-list` shows `dex:v1.0.0-nix`
- [ ] `curl https://dex.ops.eblu.me/.well-known/openid-configuration` returns valid OIDC discovery
- [ ] `curl https://dex.ops.eblu.me/healthz` returns healthy
- [ ] Grafana login shows "Sign in with Dex" button alongside local login
- [ ] OIDC flow: click Dex → enter credentials → redirect back → logged in as Admin
- [ ] Break-glass: local admin login still works
- [ ] `mise run services-check` passes

## Files changed
| File | Action | Purpose |
|------|--------|---------|
| `containers/dex/default.nix` | Create | NixOS container build |
| `argocd/apps/dex.yaml` | Create | ArgoCD app targeting ringtail |
| `argocd/manifests/dex/*` (8 files) | Create | K8s manifests (RBAC, ExternalSecret, Deployment, Service, Ingress) |
| `argocd/manifests/grafana-config/external-secret-dex-oauth.yaml` | Create | Grafana OIDC client secret |
| `argocd/manifests/grafana-config/kustomization.yaml` | Modify | Add new ExternalSecret resource |
| `argocd/manifests/grafana/values.yaml` | Modify | Add `auth.generic_oauth` config + envFromSecrets |
| `ansible/roles/caddy/defaults/main.yml` | Modify | Add `dex.ops.eblu.me` reverse proxy entry |
| `docs/changelog.d/feature-dex-oidc.feature.md` | Create | Changelog fragment |

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/222
2026-02-19 20:24:24 -08:00
b876e39981 Replace Homepage Helm chart with kustomize manifests and custom Dockerfile (#221)
## Summary
- Replace third-party Helm chart (jameswynn/homepage v2.1.0, pinned at app v1.2.0) with plain kustomize manifests and a custom Dockerfile building from forge mirror at v1.10.1
- Adds Dockerfile (`containers/homepage/`) with multi-stage build (node:22-slim builder, node:22-alpine runtime)
- Creates kustomize manifests: Deployment, Service, ConfigMap (6 config files), ServiceAccount, ClusterRole, ClusterRoleBinding
- Keeps existing ingress-tailscale.yaml and all 6 ExternalSecret resources unchanged
- Updates ArgoCD app definition from multi-source Helm to single directory source

## Prerequisite
- Homepage source mirrored at forge.ops.eblu.me/eblume/homepage.git 
- Container must be built and pushed before syncing: `mise run container-release homepage v1.10.1`

## Deployment and Testing
- [ ] Build and push container image: `mise run container-release homepage v1.10.1`
- [ ] Branch-test via ArgoCD: `argocd app set homepage --revision feature/homepage-kustomize && argocd app sync homepage`
- [ ] Verify dashboard loads at go.ops.eblu.me / go.tail8d86e.ts.net
- [ ] Verify k8s autodiscovery works (services appear on dashboard)
- [ ] Verify widgets load (weather, Forgejo, Jellyfin, etc.)
- [ ] After merge: `argocd app set homepage --revision main && argocd app sync homepage`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/221
2026-02-19 18:29:19 -08:00
d5d32fe91f Port Frigate NVR to ringtail k3s with GPU acceleration (#217)
## Summary

- Enable NVIDIA container toolkit on ringtail NixOS and configure k3s containerd with nvidia runtime
- Add NVIDIA device plugin ArgoCD app (RuntimeClass + DaemonSet) to expose `nvidia.com/gpu` resources
- Re-target Frigate from indri minikube (arm64, ZMQ detector) to ringtail k3s (x86_64, TensorRT/ONNX)
- Switch Frigate image to `-tensorrt` variant with GPU resource limits and increased shared memory

## Manual Prerequisites

1. **NFS access**: Verify ringtail can mount `sifaka:/volume1/frigate`
   ```fish
   ssh ringtail 'sudo mount -t nfs sifaka:/volume1/frigate /mnt/storage1 && ls /mnt/storage1 && sudo umount /mnt/storage1'
   ```
2. **YOLO model**: Verify `/volume1/frigate/models/yolov9m.onnx` exists on sifaka

## Deployment Steps

1. Provision ringtail: `mise run provision-ringtail`
2. Sync ArgoCD apps: `argocd app sync apps --prune`
3. Deploy NVIDIA device plugin: `argocd app sync nvidia-device-plugin`
4. Verify GPU: `kubectl --context=k3s-ringtail get nodes -o json | jq '.items[].status.capacity'`
5. Deploy Frigate: `argocd app sync frigate`

## Verification

- [ ] `nvidia.com/gpu: 1` visible in node capacity
- [ ] Frigate pod running with GPU allocated
- [ ] Frigate UI loads at `https://nvr.ops.eblu.me`
- [ ] Detector shows ONNX/TensorRT on System page
- [ ] Camera feed with bounding boxes in live view
- [ ] TensorRT engine build completes (watch logs on first start)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/217
2026-02-19 14:27:04 -08:00
16a4a9a616 Port Mosquitto and ntfy to ringtail k3s, retire Apple Silicon Detector (#216)
## Summary
- Delete `ansible/roles/frigate_detector/` and remove from indri playbook — the Apple Silicon Detector is retired
- Move Mosquitto (MQTT) ArgoCD app from indri minikube to ringtail k3s
- Move ntfy ArgoCD app from indri minikube to ringtail k3s
- Update Frigate docs to reflect detector removal and planned RTX 4080 migration
- Manifests are reused as-is (same `argocd/manifests/mosquitto/` and `argocd/manifests/ntfy/`), just pointed at ringtail

## Deployment

After merge:
1. Sync indri ArgoCD `apps` app with prune to remove old mosquitto/ntfy apps:
   ```
   argocd app sync apps --prune
   ```
2. Sync new ringtail apps:
   ```
   argocd app sync mosquitto-ringtail
   argocd app sync ntfy-ringtail
   ```
3. Manually clean up the detector LaunchAgent on indri:
   ```
   ssh indri 'launchctl unload ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist'
   ssh indri 'rm ~/Library/LaunchAgents/mcquack.eblume.frigate-detector.plist'
   ```

## Notes
- Frigate on indri will lose MQTT/ntfy connectivity — this is expected (user confirmed no downtime concerns)
- ntfy Tailscale Ingress hostname `ntfy` will transfer from indri ProxyGroup to ringtail ProxyGroup
- Caddy on indri proxies `ntfy.ops.eblu.me` → `ntfy.tail8d86e.ts.net`, so no Caddy changes needed
- Frigate + frigate-notify will be ported to ringtail in a follow-up PR

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/216
2026-02-19 11:22:44 -08:00
61ca1ca305 Deploy Tailscale operator on ringtail k3s cluster (#215)
## Summary
- Extract shared Tailscale operator resources (CRDs, RBAC, Deployment, ProxyClass, DNSConfig) into `tailscale-operator-base/` so both clusters reference the same manifests
- Add `tailscale-operator-ringtail/` overlay with 1-replica ProxyGroup and ExternalSecret for the shared OAuth client
- Add ArgoCD Application targeting `ringtail.tail8d86e.ts.net:6443`
- Update `.yamllint.yaml` ignore path for the moved `operator.yaml`

## Deployment and Testing
- [ ] Sync `apps` app to pick up the new Application definition
- [ ] `argocd app sync tailscale-operator-ringtail`
- [ ] Verify ExternalSecret syncs: `kubectl --context=k3s-ringtail -n tailscale get externalsecret`
- [ ] Verify operator pod runs: `kubectl --context=k3s-ringtail -n tailscale get pods`
- [ ] Verify ProxyGroup ready: `kubectl --context=k3s-ringtail -n tailscale get proxygroups`
- [ ] Verify indri operator still works: `argocd app diff tailscale-operator`
- [ ] Check Tailscale admin for new operator device with `tag:k8s-operator`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/215
2026-02-19 09:33:05 -08:00
918df9e642 Add k3s, 1Password Connect, and systemd nix-container-builder to ringtail (#209)
## Summary

  Extends ringtail from a desktop/gaming NixOS box into an infrastructure node with a k3s cluster, secrets management, and a Forgejo Actions
  runner for building containers with Nix.

  ### K3s cluster
  - Single-node k3s with Traefik/ServiceLB/metrics-server disabled (minimal footprint)
  - TLS SAN set to `ringtail.tail8d86e.ts.net` so ArgoCD on indri can manage it via Tailscale
  - Containerd registry mirrors pull through Zot on indri (`k3s-registries.yaml`)
  - Tailscale interface added to `trustedInterfaces` for cross-node ArgoCD access
  - `kubectl` added to system packages

  ### 1Password Connect + External Secrets Operator
  - Four new ArgoCD apps targeting `k3s-ringtail`: `1password-connect-ringtail`, `external-secrets-crds-ringtail`, `external-secrets-ringtail`,
  `external-secrets-config-ringtail`
  - Reuses the same Helm charts/values as indri, just pointed at ringtail's k3s API server
  - Bootstrap secrets (`op-credentials`, `onepassword-token`) provisioned by Ansible pre_tasks via `op read`, then applied to the `1password`
  namespace in post_tasks

  ### Systemd Forgejo Actions runner
  - Native `services.gitea-actions-runner` with `forgejo-runner` package — no DinD, no k8s pod, runs directly on the NixOS host
  - Label `nix-container-builder:host` — jobs execute on the host with `nix`, `skopeo`, `nodejs`, etc. in PATH
  - Registration token fetched from 1Password (`Forgejo Secrets/runner_reg`) by Ansible and written to `/etc/forgejo-runner/token.env`
  - Runner's dynamic user (`gitea-runner`) added to `nix.settings.trusted-users` for nix daemon access

  ### Nix container build workflow
  - New `.forgejo/workflows/build-container-nix.yaml` triggers on `*-nix-v[0-9]*` tags (e.g. `nettest-nix-v1.0.0`)
  - Builds with `nix build -f containers/<name>/default.nix`, pushes to Zot via `skopeo copy`
  - Existing Dockerfile workflow guarded with `if: !contains(github.ref_name, '-nix-v')` to avoid double-triggering

  ### Mise task updates
  - `container-tag-and-release` auto-detects `default.nix` vs `Dockerfile` and uses the appropriate tag format (`-nix-v` vs `-v`)
  - `container-list` shows build type indicator (`[nix]` / `[dockerfile]`)

  ## Post-merge

  1. `mise run provision-ringtail` — deploys k3s token, runner token, NixOS rebuild
  2. Register k3s cluster in ArgoCD (first time only):
     ```fish
     ssh ringtail 'sudo cat /etc/rancher/k3s/k3s.yaml' | \
       sed 's|127.0.0.1|ringtail.tail8d86e.ts.net|' > /tmp/k3s-ringtail.yaml
     set -x KUBECONFIG /tmp/k3s-ringtail.yaml
     argocd cluster add default --name k3s-ringtail
  3. Sync ArgoCD apps in order: 1password-connect-ringtail -> external-secrets-crds-ringtail -> external-secrets-ringtail ->
  external-secrets-config-ringtail
  4. Verify runner: ssh ringtail 'systemctl status gitea-runner-nix-container-builder'
  5. Check Forgejo admin panel for ringtail-nix-builder runner online
  6. Test: create containers/<name>/default.nix, tag with <name>-nix-v0.1.0

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/209
2026-02-18 21:15:30 -08:00
105a2c8c08 Update External Secrets Helm chart 1.3.1 → 2.0.0 (#203)
## Summary
- Bump External Secrets Operator Helm chart from `helm-chart-1.3.1` to `helm-chart-2.0.0` (operator v1.3.2)
- Updates both the operator app and CRDs app `targetRevision`
- No Helm values changes needed — `installCRDs`, `resources`, `webhook`, `certController` keys are unchanged

## Breaking changes in chart 2.0.0
- **Removed providers:** Alibaba and Device42 (unmaintained) — does not affect our 1Password setup
- **Templating engine v1 deprecated** — our ExternalSecrets don't set `engineVersion`, so they use the default (v2)
- **Webhook `failurePolicy`** for SecretStore is now dynamic

## Deployment
1. Sync CRDs first: `argocd app set external-secrets-crds --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets-crds`
2. Sync operator: `argocd app set external-secrets --revision update/external-secrets-helm-2.0.0 && argocd app sync external-secrets`
3. Verify: `kubectl --context=minikube-indri -n external-secrets get pods`
4. After merge, set both apps back to `--revision main`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/203
2026-02-17 10:43:21 -08:00
22f418d0dc Doc review: connect-to-postgres, create-release-artifact-workflow, deploy-k8s-service (#191)
## Summary

Review session covering 3 docs, plus a codebase-wide cleanup:

### Docs reviewed
- **connect-to-postgres** — verified end-to-end (psql connection tested), stamped
- **create-release-artifact-workflow** — clarified that `build-blumeops.yaml` is only a version bump example (not a packages API example)
- **deploy-k8s-service** — fixed stale repoURL (`indri:2200` → `forge.ops.eblu.me:2222`), wrong Caddy config keys (`upstream` → `backend`, added missing `host`), updated Homepage group to "Services", added Tailscale tag documentation

### Codebase cleanup
- Migrated all remaining `op item get --fields` calls to `op read` URI syntax across 7 files (docs, READMEs, YAML comments)
- Simplified the `op read` vs `op item get` guidance in CLAUDE.md

## Side findings (not addressed)
- New `immich-pg` CNPG cluster not yet documented in the postgresql reference card

## Test plan
- [x] `psql` connection to `pg.ops.eblu.me` verified
- [x] All pre-commit hooks pass
- [x] `docs-check-links`, `docs-check-index`, `docs-check-frontmatter` pass

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/191
2026-02-15 07:42:01 -08:00
04c7f3c45a Deploy Frigate NVR stack with Mosquitto, Ntfy, and frigate-notify (#190)
## Summary

Deploy a cloud-free NVR stack for the GableCam (ReoLink Elite Floodlight at 192.168.1.159):

- **Mosquitto** — shared MQTT broker in `mqtt` namespace (cluster-internal, no auth)
- **Ntfy** — self-hosted push notifications in `ntfy` namespace, exposed at `ntfy.tail8d86e.ts.net` / `ntfy.ops.eblu.me`
- **Frigate** — NVR with GableCam via HTTP-FLV, ONNX CPU detection, NFS recordings on sifaka, exposed at `nvr.tail8d86e.ts.net` / `nvr.ops.eblu.me`
- **frigate-notify** — bridges Frigate detection events (person, car, dog, cat) to Ntfy alerts via MQTT

Also includes:
- Prometheus scrape target for Frigate metrics
- Grafana dashboard for Frigate (status, inference speed, FPS, CPU/memory, storage)
- Caddy reverse proxy entries for `nvr.ops.eblu.me` and `ntfy.ops.eblu.me`

## Prerequisites

- [ ] Create NFS share `frigate` on sifaka (`/volume1/frigate`, RW for indri)
- [ ] Create 1Password item "Reolink Floodlight Camera" in `blumeops` vault with `username` and `password` fields

## Deployment (after merge)

```bash
argocd app sync apps
argocd app sync mosquitto
argocd app sync ntfy
argocd app sync frigate
argocd app sync grafana-config
argocd app sync prometheus
mise run provision-indri -- --tags caddy
mise run services-check
```

## Verification

- [ ] Mosquitto pod running, accepting connections on 1883
- [ ] Ntfy web UI accessible at `ntfy.ops.eblu.me`
- [ ] Frigate web UI at `nvr.ops.eblu.me` showing GableCam live feed
- [ ] Object detection working (ONNX, person/car/dog/cat)
- [ ] Recordings appearing in NFS share on sifaka
- [ ] frigate-notify sending detection alerts to Ntfy
- [ ] Prometheus scraping Frigate metrics
- [ ] Grafana dashboard showing Frigate data

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/190
2026-02-14 21:27:44 -08:00
b3747f6c95 Tier 1 version bumps (#186)
All checks were successful
Build Container / build (push) Successful in 8s
## Summary

Audit and upgrade of all deployed images, helm charts, and custom container Dockerfiles to latest stable versions. This PR covers Tier 1 (low-risk minor/patch bumps only).

### Upstream images
| Image | Old | New |
|-------|-----|-----|
| kube-state-metrics | v2.13.0 | v2.18.0 |
| prometheus | v3.2.1 | v3.9.1 |
| loki | 3.3.2 | 3.6.5 |
| alloy | v1.5.1 | v1.13.1 |
| tailscale (proxy + operator) | v1.92.5 | v1.94.1 |
| navidrome | :latest | v0.60.3 (pinned) |

### Helm charts
| Chart | Old | New |
|-------|-----|-----|
| CloudNativePG | v0.27.0 | v0.27.1 |
| 1Password Connect | 2.2.1 | 2.3.0 |

### Custom containers (Dockerfiles updated, images not yet tagged)
| Container | Changes | New tag |
|-----------|---------|---------|
| miniflux | 2.2.16→2.2.17 (security), alpine 3.22 | v1.1.0 |
| kubectl | v1.34.1→v1.34.4, alpine 3.22 | v1.1.0 |
| kiwix-serve | alpine 3.22 | v1.1.0 |
| nettest | alpine 3.22 | v0.14.0 |
| transmission | alpine 3.22, pkg 4.0.6-r4 | v1.1.0 |

All custom containers verified with local `dagger call build`.

### Deferred to Tier 2 (separate PRs)
- Forgejo runner 6→12 (major version scheme change)
- Docker DinD 27→29
- Grafana chart 8→11 (repo migration)
- External Secrets 1→2 (breaking changes)
- Python 3.12→3.13, Elixir 1.18→1.19, Node 22→24
- Transmission 4.0.6→4.1.0 (not in Alpine yet)

## Deployment

After merge:
1. Tag custom containers: `mise run container-tag-and-release <name> <version>` for each
2. Wait for CI builds to complete
3. `argocd app sync apps` then sync individual apps, or let ArgoCD auto-detect

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/186
2026-02-13 17:16:37 -08:00
01e19023ee Add CV/resume web app at cv.ops.eblu.me (#169)
## Summary
- nginx container (`containers/cv/`) downloads and serves a content tarball at startup (same pattern as quartz)
- ArgoCD app + k8s manifests (deployment, service, Tailscale ingress)
- Caddy route for `cv.ops.eblu.me`
- Deploy workflow: resolves "latest" or specific version from Forgejo packages, updates deployment, syncs ArgoCD
- Content is built and released from the separate [cv repo](https://forge.ops.eblu.me/eblume/cv)

## Deployment steps (after merge)
1. `mise run container-tag-and-release cv v1.0.0`
2. Run "Release CV" workflow in cv repo (SPECIFIC_VERSION `v0.1.0`)
3. Run "Deploy CV" workflow in blumeops (default: latest)
4. `mise run provision-indri -- --tags caddy`
5. Verify at `https://cv.ops.eblu.me/`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/169
2026-02-12 11:09:41 -08:00
1c86134a62 Phase 1b: Deploy docs hosting with Quartz (#85)
## Summary
- Add ArgoCD Application and manifests for `quartz` service
- Add `docs.ops.eblu.me` to Caddy reverse proxy configuration
- ConfigMap points to blumeops v1.0.0 release tarball
- Tailscale ingress with homepage annotations for auto-discovery

## Deployment and Testing

**Pre-deployment (container build):**
- [ ] Build and tag quartz container: `mise run container-tag-and-release quartz v1.0.0`

**K8s deployment:**
- [ ] Sync apps: `argocd app sync apps`
- [ ] Point quartz at feature branch: `argocd app set quartz --revision feature/docs-phase-1b-hosting`
- [ ] Sync quartz: `argocd app sync quartz`
- [ ] Verify pod is running: `kubectl --context=minikube-indri get pods -n quartz`
- [ ] Verify Tailscale ingress: `kubectl --context=minikube-indri get ingress -n quartz`

**Caddy deployment:**
- [ ] Dry run: `mise run provision-indri -- --tags caddy --check --diff`
- [ ] Apply: `mise run provision-indri -- --tags caddy`

**Verification:**
- [ ] Test https://docs.tail8d86e.ts.net
- [ ] Test https://docs.ops.eblu.me
- [ ] Verify homepage dashboard shows docs link

**Post-merge:**
- [ ] Reset to main: `argocd app set quartz --revision main && argocd app sync quartz`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/85
2026-02-03 10:52:20 -08:00
ade21cc49e Add Navidrome music streaming server (#79)
## Summary
- Deploy Navidrome music streaming server to k8s
- NFS mount for music library from sifaka:/volume1/music (read-only)
- Local PVC for SQLite database and config (10Gi)
- Tailscale ingress for dj.tail8d86e.ts.net
- Caddy reverse proxy for dj.ops.eblu.me
- Homepage annotations for dashboard discovery in Media group

## Deployment and Testing
- [ ] Sync `apps` application to pick up new Application definition
- [ ] Set navidrome app to feature branch and sync
- [ ] Verify NFS mount with `kubectl exec`
- [ ] Provision Caddy for dj.ops.eblu.me
- [ ] Access https://dj.ops.eblu.me and create initial admin user
- [ ] Verify Homepage shows DJ in Media group
- [ ] Reset to main and resync after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/79
2026-01-31 20:19:31 -08:00
e830a0f26b Homepage dashboard improvements (#76)
## Summary
- Fix ArgoCD icon (use `argo-cd.png` per Dashboard Icons naming)
- Add Borgmatic backup metrics widget (time since last backup, archive size)
- Add Sifaka NAS disk usage widget (used/total space)
- Create `[[grafana]]` zk card with management notes

## What didn't work
Attempted Grafana iframe embedding for a metrics panel but reverted:
- Homepage iframe widget only supports height classes, not width
- Some panels fail to load even with anonymous auth enabled
- Documented in grafana zk card for future reference

## Deployment and Testing
- [x] ArgoCD icon displays correctly
- [x] Borgmatic metrics show time since backup and archive size
- [x] NAS disk usage shows used/total bytes
- [x] Grafana reverted to authenticated-only access

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/76
2026-01-30 15:05:02 -08:00
38538ad5f0 Replace hajimari with gethomepage (#75)
## Summary
- Remove hajimari (unmaintained since Oct 2022, broken helm deps)
- Add gethomepage (28k stars, actively maintained, monthly releases)
- Migrate custom apps, bookmarks, and search config
- Enable k8s RBAC for service autodiscovery
- Configure Tailscale ingress at go.tail8d86e.ts.net

## Why the switch
Hajimari hasn't released since October 2022. The helm chart has a broken
dependency (bjw-s/common URL is 404), and unreleased code on main has bugs.
gethomepage has similar k8s autodiscovery via ingress annotations and is
very actively maintained.

## Deployment and Testing
- [ ] Delete hajimari app from ArgoCD
- [ ] Delete hajimari namespace
- [ ] Sync apps to pick up new homepage app
- [ ] Sync homepage app
- [ ] Verify go.ops.eblu.me loads

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/75
2026-01-30 13:21:12 -08:00
9fd70c3892 Switch Hajimari to custom fork
- Use chart from forge.ops.eblu.me/eblume/hajimari fork
- Use custom image from registry.ops.eblu.me/blumeops/hajimari
- Enables future customizations (search auto-focus, weather widget)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-30 11:26:20 -08:00
d1164c8aac Add Hajimari service dashboard (#73)
## Summary
- Add Hajimari as a service dashboard/start page at `go.ops.eblu.me`
- Auto-discovers k8s services from ingress annotations
- Custom apps for non-k8s services: Forgejo, Registry, Sifaka NAS
- Add `nas.ops.eblu.me` Caddy proxy to Synology dashboard

## Services Configured

**Auto-discovered (k8s ingresses with hajimari.io annotations):**
- Grafana, ArgoCD, Prometheus, Loki (Observability)
- Miniflux, Kiwix, Transmission, TeslaMate, Immich (Apps)
- PyPI/devpi (Infrastructure)

**Custom apps (non-k8s):**
- Forgejo (forge.ops.eblu.me)
- Registry (registry.ops.eblu.me)
- Sifaka NAS (nas.ops.eblu.me)

**Bookmarks:**
- Tailscale Admin, 1Password, Pulumi

## Deployment and Testing
- [ ] Sync `apps` application to pick up new Hajimari Application
- [ ] Sync `hajimari` application
- [ ] Run `mise run provision-indri -- --tags caddy` for go/nas proxy entries
- [ ] Re-sync all k8s apps with hajimari annotations (or wait for natural drift)
- [ ] Verify https://go.ops.eblu.me shows dashboard with all services

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/73
2026-01-29 15:51:42 -08:00
0e2df9645d Fix ArgoCD sync drift for apps and immich (#71)
## Summary
- Fix immich app to track `main` branch instead of `feature/immich` for values
- The tailscale-operator ignoreDifferences schema drift will be fixed by syncing the `apps` app

## Deployment and Testing
- [ ] Sync `apps` to fix tailscale-operator schema drift
- [ ] Sync `immich` to pick up correct image versions from main

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/71
2026-01-29 10:24:26 -08:00
482414346e Add External Secrets Operator with 1Password Connect (#66) (#66)
## Summary
- Add 1Password Connect server for secrets automation API
- Add External Secrets Operator (ESO) to sync secrets from 1Password to K8s
- Add ClusterSecretStore connecting ESO to 1Password Connect
- Convert devpi secret to ExternalSecret as proof of concept

## Architecture
```
1Password Cloud → 1Password Connect (k8s) → ESO → Native K8s Secrets
```

## Deployment and Testing
- [ ] Mirror Helm charts to forge (connect-helm-charts, external-secrets) - DONE
- [ ] Create 1Password Connect credentials (`op connect server create`)
- [ ] Store credentials in 1Password item "1Password Connect"
- [ ] Bootstrap secret: `op inject -i argocd/manifests/1password-connect/secret-credentials.yaml.tpl | kubectl apply -f -`
- [ ] Deploy 1password-connect: `argocd app sync 1password-connect`
- [ ] Deploy external-secrets: `argocd app sync external-secrets`
- [ ] Deploy external-secrets-config: `argocd app sync external-secrets-config`
- [ ] Test devpi ExternalSecret: `argocd app sync devpi`
- [ ] Verify secret synced: `kubectl get externalsecret -n devpi`

## Future Work
After PoC validated, migrate remaining 12 secret templates to ExternalSecrets:
- databases (3), tailscale-operator (1), grafana-config (2), teslamate (2)
- forgejo-runner (1), argocd (1), immich (1), 1password-connect (1 - self-bootstrap)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/66
2026-01-28 19:30:10 -08:00
8621996343 Add Immich photo management + migrate forge URLs (#62)
## Summary
- Migrate all ArgoCD app repo URLs from `indri.tail8d86e.ts.net:2200` to `forge.ops.eblu.me:2222`
- Add Immich self-hosted photo management service with:
  - Helm chart deployment via ArgoCD
  - PostgreSQL cluster with pgvecto.rs for AI vector search (immich-pg)
  - NFS storage on sifaka for photo library (2Ti)
  - Tailscale Ingress + Caddy proxy for `photos.ops.eblu.me`
  - Machine learning service for face/object recognition

## Deployment and Testing
- [x] Update ArgoCD repo-creds-forge secret with new URL (one-time manual step)
- [ ] Sync `apps` to pick up new applications
- [ ] Sync all existing apps to verify new forge URL works
- [ ] Sync `blumeops-pg` to deploy immich-pg cluster
- [ ] Wait for immich-pg to be healthy
- [ ] Create immich-db secret from auto-generated password
- [ ] Sync `immich-storage` (PV, PVC, Ingress)
- [ ] Sync `immich` (Helm chart)
- [ ] Run `mise run provision-indri -- --tags caddy` to add photos.ops.eblu.me
- [ ] Verify Immich UI is accessible

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/62
2026-01-26 11:20:11 -08:00
ea42362b6f Migrate Forgejo runner to Kubernetes with DinD (#60)
## Summary
- Deploy Forgejo runner to k8s with Docker-in-Docker sidecar
- Add job execution image with Node.js and Docker CLI
- Retire host-mode runner on indri
- All CI jobs now run containerized in k8s

## Components Added
- `containers/forgejo-runner/Dockerfile` - Job execution image
- `argocd/apps/forgejo-runner.yaml` - ArgoCD Application
- `argocd/manifests/forgejo-runner/` - Kubernetes manifests

## Components Removed
- `ansible/roles/forgejo_runner/` - No longer needed

## Changes to Existing Files
- `.forgejo/workflows/build-container.yaml` - Use `k8s` runner with `DOCKER_HOST` env
- `.github/actionlint.yaml` - Only `k8s` label now valid

## Deployment
1. Apply secret: `op inject -i argocd/manifests/forgejo-runner/secret.yaml.tpl | kubectl --context=minikube-indri apply -f -`
2. Sync ArgoCD: `argocd app sync forgejo-runner`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/60
2026-01-25 19:56:17 -08:00
8ca8798121 Switch to Buildah for container builds (#51)
All checks were successful
Test CI / test (push) Successful in 4s
## Summary
- Replace Docker with Buildah for container image builds
- No Docker socket required - buildah is daemonless
- Cleaner security model (no privileged containers or socket mounting)
- Remove Docker-related security context from deployment

## Changes
- Update Dockerfile to install buildah/podman instead of docker-cli
- Configure buildah storage with overlay driver and fuse-overlayfs
- Update composite action to use `buildah bud` and `buildah push`
- Add `imagePullPolicy: Always` to ensure fresh image pulls
- Update test workflow to verify buildah/podman

## Testing
- [ ] Runner pod starts successfully
- [ ] Buildah is available in runner
- [ ] Test workflow verifies buildah/podman versions
- [ ] Container build workflow builds and pushes to zot

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/51
2026-01-24 13:30:26 -08:00
7893c41020 Enable Forgejo Actions (Phase 1) (#48)
All checks were successful
Test CI / test (push) Successful in 0s
## Summary
- Refactor Forgejo app.ini to be managed by ansible with secrets from 1Password
- Enable Forgejo Actions in config (`[actions] ENABLED = true`)
- Add `repo.actions` to DEFAULT_REPO_UNITS
- Clean up unused MySQL database fields (we use SQLite)

## Phase 1 Progress
This PR covers the first part of Phase 1 (ci-cd-bootstrap plan):
- [x] Refactor app.ini to ansible template
- [x] Store secrets in 1Password
- [x] Enable Actions in config
- [ ] Deploy config changes (pending review)
- [ ] Create runner registration token
- [ ] Deploy runner to k8s
- [ ] Test with simple workflow

## Deployment and Testing
- [ ] Run `mise run provision-indri -- --tags forgejo` to deploy
- [ ] Verify Forgejo restarts correctly
- [ ] Verify Actions tab appears in repo settings

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/48
2026-01-23 17:00:12 -08:00
272ddb213b Add TeslaMate deployment for Tesla Model Y data logging (#47)
## Summary
- Add TeslaMate k8s deployment with Tailscale ingress at tesla.tail8d86e.ts.net
- Add teslamate user to CloudNativePG blumeops-pg cluster
- Add TeslaMate PostgreSQL datasource to Grafana
- Import 18 TeslaMate Grafana dashboards for charging, drives, efficiency, etc.
- Add teslamate database to borgmatic backup configuration

## Deployment and Testing
- [ ] Create 1Password items: "TeslaMate DB Password" and "TeslaMate Encryption Key"
- [ ] Apply database user secret: `op inject -i argocd/manifests/databases/secret-teslamate.yaml.tpl | kubectl apply -f -`
- [ ] Sync blumeops-pg: `argocd app sync blumeops-pg`
- [ ] Create teslamate database
- [ ] Apply teslamate secrets (encryption key, db connection)
- [ ] Apply Grafana datasource secret: `op inject -i argocd/manifests/grafana-config/secret-teslamate-datasource.yaml.tpl | kubectl apply -f -`
- [ ] Sync apps and teslamate: `argocd app sync apps teslamate grafana grafana-config`
- [ ] Complete Tesla API OAuth flow at https://tesla.tail8d86e.ts.net
- [ ] Verify data collection starts
- [ ] Verify Grafana dashboards show data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/47
2026-01-22 21:25:44 -08:00