blumeops

Author	SHA1	Message	Date
Erich Blume	1897eb1c5b	C0: move immich blackbox probe to ringtail alloy Immich migrated to ringtail's k3s cluster but the probe still targeted the in-cluster service DNS on indri's minikube, firing ServiceProbeFailure indefinitely. Moved the target into alloy-ringtail's config so the probe runs in the cluster where immich actually lives. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-17 08:46:22 -07:00
Erich Blume	55563afc7e	C0: alloy — bump kustomization tags to main-branch SHA Per the build-container-image squash-merge convention, rebuild alloy v1.16.0 container images from the main SHA (`9564435`) and update the three alloy kustomizations to reference :v1.16.0-9564435[-nix] instead of the branch SHA :v1.16.0-26a3ab5[-nix] left over from #345. Both images were rebuilt locally on gilbert (dagger) and ringtail (nix) because indri is still under heavy macOS memory-compressor pressure (see separate ticket); CI on indri can't reliably run the dagger publish step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-01 08:31:27 -07:00
Erich Blume	9564435b11	Alloy V1.16.0 (#345 ) Bump Grafana Alloy v1.14.0 → v1.16.0 across all four services (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail; alloy native ansible). Also migrate the indri build path from `Dockerfile` to a native Dagger `container.py` per the build-container-image migration playbook. ## Highlights from upstream - v1.15: database observability promoted to stable, OTel Collector → v0.147.0 - v1.16: clustering for `loki.source.kubernetes_events`, MySQL exporter 0.19.0 - One pre-existing breaking change in v1.15 (`loki.source.awsfirehose` undocumented metric prefix rename) — not used here. ## Build infra Alloy v1.16.0's go.mod requires Go 1.26.2. The nix derivation now uses `pkgs.go_1_26` with `GOTOOLCHAIN=local` to avoid auto-downloading a toolchain blob that violated the fixed-output rule. ## Test plan - [ ] CI: `mise run container-build-and-release alloy --ref alloy-v1.16.0` (dispatched as run 522; nix job to be re-triggered with the v1.16.0 goModules outputHash once the local ringtail build surfaces it) - [ ] After CI green, bump `images[].newTag` in three kustomizations to the new `-<sha>` and `-<sha>-nix` tags, deploy from this branch via `argocd app set <app> --revision alloy-v1.16.0 && argocd app sync <app>` - [ ] Manual rebuild of macOS native binary on gilbert (per ansible/roles/alloy README) and `mise run provision-indri -- --tags alloy --check --diff` - [ ] `mise run services-check` after merge & redeploy Reviewed-on: #345	2026-05-01 08:05:37 -07:00
Erich Blume	b97e37543f	Deploy Tor Snowflake proxy on ringtail (#311 ) ## Summary - Add Snowflake proxy as a native systemd service on ringtail (NixOS) - Uses `pkgs.snowflake` from nixpkgs (v2.11.0) - Hardened systemd unit with DynamicUser, ProtectSystem=strict, 512MB memory limit - Prometheus metrics enabled on localhost:9999 ## What is Snowflake? A Tor pluggable transport that helps censored users reach the Tor network via WebRTC. This is NOT a Tor exit node — traffic exits through Tor exit nodes operated by others. The proxy operator cannot see traffic content (double-encrypted) and destination servers never see the proxy's IP. ## Changes - `nixos/ringtail/configuration.nix` — new systemd service definition - `docs/reference/services/snowflake-proxy.md` — service reference card - `docs/reference/infrastructure/ringtail.md` — updated systemd services section - `service-versions.yaml` — added entry (type: nixos) ## Deploy plan After review, deploy via `mise run provision-ringtail`. Service starts automatically. ## Test plan - [ ] `mise run provision-ringtail` succeeds - [ ] `ssh ringtail 'systemctl status snowflake-proxy'` shows active - [ ] `ssh ringtail 'journalctl -u snowflake-proxy --no-pager -n 20'` shows broker connections - [ ] `ssh ringtail 'curl -s localhost:9999/metrics'` returns Prometheus metrics Reviewed-on: #311	2026-03-24 20:51:40 -07:00
Erich Blume	3b7abbd689	Update container tags to `fd0bebb` (post-merge rebuild) C0 follow-up to #309: update kustomization newTag for all containers rebuilt by the merge (authentik, authentik-redis, ntfy, alloy). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 13:39:26 -07:00
Erich Blume	3d2a97aaf9	Update kustomization tags to OCI-labeled builds (`613f05d`) Point all services at the `613f05d` images which carry the new consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4 version, no matching build) and docs/quartz (no `613f05d` build). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-19 06:34:12 -07:00
Erich Blume	4f99b7edaa	Update alloy kustomizations to local container tags Point alloy-k8s at v1.14.0-61f02a0 (Dockerfile) and both ringtail deployments at v1.14.0-61f02a0-nix (Nix build). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-17 16:55:55 -07:00
Erich Blume	61f02a0335	Localize Alloy container image (#300 ) All checks were successful Build Container (Nix) / detect (push) Successful in 2s Details Build Container / detect (push) Successful in 2s Details Build Container (Nix) / build (alloy) (push) Successful in 14s Details Build Container / build (alloy) (push) Successful in 38m34s Details ## Summary - Add `containers/alloy/` with dual Dockerfile + Nix build files for Grafana Alloy v1.14.0 - Both builds fetch source from forge mirror (`forge.ops.eblu.me/mirrors/alloy.git`), build the web UI (Node), then compile the Go binary with `netgo embedalloyui` tags - Update all three alloy deployments (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail) to use `registry.ops.eblu.me/blumeops/alloy` - `promtail_journal_enabled` tag omitted — requires systemd headers and none of our configs use `loki.source.journal` ## Build verification - Dockerfile: Tested locally via `docker build`, binary reports `v1.14.0` with correct tags - Nix: Tested on ringtail via `nix-build`, all three hashes (fetchgit, npmDeps, goModules) resolved and build succeeds ## Post-merge steps 1. Wait for CI to build the container from main (both Dockerfile and Nix workflows) 2. `mise run container-list alloy` to find the `[main]` tagged image 3. C0 follow-up to update `newTag` in all three kustomizations from `v1.14.0-placeholder` to the real tag 4. Sync ArgoCD apps and verify pods come up healthy Reviewed-on: #300	2026-03-17 16:42:53 -07:00
Erich Blume	ab8ea6f301	Bump Grafana Alloy to v1.14.0 (#292 ) ## Summary - Bump alloy-k8s, alloy-ringtail, and alloy-tracing-ringtail image tags from v1.13.1 to v1.14.0 - Mark indri alloy (ansible) as reviewed at v1.14.0 — source rebuild from forge mirror needed - Add missing alloy-ringtail entry to service-versions.yaml - Update alloy reference doc ## Breaking changes reviewed - `loki.secretfilter` options removed — not used in our configs - OTel Collector upgraded to v0.142.0 — Kafka receiver changes don't affect us - Exporter queue default changes — our tracing pipeline (Beyla → batch → otlphttp) uses simple config, low risk ## Deployment and Testing - [ ] Sync alloy-k8s: `argocd app set alloy-k8s --revision bump/alloy-v1.14.0 && argocd app sync alloy-k8s` - [ ] Sync alloy-ringtail: `argocd app set alloy-ringtail --revision bump/alloy-v1.14.0 --server ringtail-argocd && argocd app sync alloy-ringtail` - [ ] Sync alloy-tracing-ringtail similarly - [ ] Verify metrics flowing in Grafana - [ ] Verify traces flowing to Tempo (ringtail) - [ ] Rebuild indri alloy from source (`v1.14.0` tag on forge mirror), SCP to indri, restart - [ ] After merge: reset ArgoCD revisions to main, re-sync Reviewed-on: #292	2026-03-13 16:25:27 -07:00
Erich Blume	14e931591b	Fix 1Password Connect numeric log levels misclassified in Grafana (#287 ) ## Summary - 1Password Connect uses non-standard numeric log levels (`1`=error, `2`=warn, `3`=info, `4`=debug, `5`=trace) per [1Password/connect#44](https://github.com/1Password/connect/issues/44) - Alloy extracts the `level` JSON field as-is, so info-level health checks get `level="3"` in Loki - Grafana expects string level labels — numeric values are unrecognized, causing misclassified log severity/coloring - Adds a `stage.match` + `stage.template` in the Alloy pipeline scoped to `{namespace="1password"}` to normalize numeric levels to standard strings - Other services are completely unaffected (scoped by namespace, not global) ## Deployment and Testing - [ ] Sync alloy-k8s from branch: `argocd app set alloy-k8s --revision fix/onepassword-numeric-log-levels && argocd app sync alloy-k8s` - [ ] Wait ~2 minutes for new logs to flow - [ ] Verify level labels: `curl -sG "http://localhost:3100/loki/api/v1/label/level/values" --data-urlencode 'query={namespace="1password"}'` should show `"info"` and `"warn"` instead of `"3"` and `"2"` - [ ] Check Grafana log panel for 1password namespace — logs should no longer appear as errors - [ ] After merge: `argocd app set alloy-k8s --revision main && argocd app sync alloy-k8s` Reviewed-on: #287	2026-03-07 13:57:04 -08:00
Erich Blume	6e8d11c6bb	Add :kustomized sentinel tag to manifest images, review devpi Bare image references in manifests were ambiguous — unclear whether the tag was intentionally omitted or managed by kustomize. Add :kustomized sentinel to all 37 image refs overridden by kustomize images transformer. Add sync notes for tailscale-operator proxyclass (CRD fields not processed by kustomize). Mark devpi reviewed (6.19.1 is current). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-06 08:15:06 -08:00
Erich Blume	03d71544ec	Add multi-cluster observability with ringtail metrics and dashboards (#270 ) ## Summary - Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs - Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests) - Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki - Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with: - Kubernetes Clusters dashboard — multi-cluster with `cluster` and `namespace` template variables - Ringtail (k3s) dashboard — dedicated ringtail view with GPU usage panels ## Deployment and Testing 1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`) 2. Sync `prometheus` → verify `cluster` label on scraped metrics 3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs 4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs 5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail 6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}` 7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values 8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods ## Notes - Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later - DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve - The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270	2026-02-25 22:01:00 -08:00

12 commits