Commit graph

12 commits

Author SHA1 Message Date
1897eb1c5b C0: move immich blackbox probe to ringtail alloy
Immich migrated to ringtail's k3s cluster but the probe still targeted
the in-cluster service DNS on indri's minikube, firing ServiceProbeFailure
indefinitely. Moved the target into alloy-ringtail's config so the probe
runs in the cluster where immich actually lives.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-17 08:46:22 -07:00
55563afc7e C0: alloy — bump kustomization tags to main-branch SHA
Per the build-container-image squash-merge convention, rebuild alloy v1.16.0
container images from the main SHA (9564435) and update the three alloy
kustomizations to reference :v1.16.0-9564435[-nix] instead of the branch
SHA :v1.16.0-26a3ab5[-nix] left over from #345.

Both images were rebuilt locally on gilbert (dagger) and ringtail (nix)
because indri is still under heavy macOS memory-compressor pressure (see
separate ticket); CI on indri can't reliably run the dagger publish step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-01 08:31:27 -07:00
9564435b11 Alloy V1.16.0 (#345)
Bump Grafana Alloy v1.14.0 → v1.16.0 across all four services (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail; alloy native ansible). Also migrate the indri build path from `Dockerfile` to a native Dagger `container.py` per the build-container-image migration playbook.

## Highlights from upstream
- v1.15: database observability promoted to stable, OTel Collector → v0.147.0
- v1.16: clustering for `loki.source.kubernetes_events`, MySQL exporter 0.19.0
- One pre-existing breaking change in v1.15 (`loki.source.awsfirehose` undocumented metric prefix rename) — not used here.

## Build infra
Alloy v1.16.0's go.mod requires Go 1.26.2. The nix derivation now uses `pkgs.go_1_26` with `GOTOOLCHAIN=local` to avoid auto-downloading a toolchain blob that violated the fixed-output rule.

## Test plan
- [ ] CI: `mise run container-build-and-release alloy --ref alloy-v1.16.0` (dispatched as run 522; nix job to be re-triggered with the v1.16.0 goModules outputHash once the local ringtail build surfaces it)
- [ ] After CI green, bump `images[].newTag` in three kustomizations to the new `-<sha>` and `-<sha>-nix` tags, deploy from this branch via `argocd app set <app> --revision alloy-v1.16.0 && argocd app sync <app>`
- [ ] Manual rebuild of macOS native binary on gilbert (per ansible/roles/alloy README) and `mise run provision-indri -- --tags alloy --check --diff`
- [ ] `mise run services-check` after merge & redeploy

Reviewed-on: #345
2026-05-01 08:05:37 -07:00
b97e37543f Deploy Tor Snowflake proxy on ringtail (#311)
## Summary

- Add Snowflake proxy as a native systemd service on ringtail (NixOS)
- Uses `pkgs.snowflake` from nixpkgs (v2.11.0)
- Hardened systemd unit with DynamicUser, ProtectSystem=strict, 512MB memory limit
- Prometheus metrics enabled on localhost:9999

## What is Snowflake?

A Tor pluggable transport that helps censored users reach the Tor network via WebRTC. **This is NOT a Tor exit node** — traffic exits through Tor exit nodes operated by others. The proxy operator cannot see traffic content (double-encrypted) and destination servers never see the proxy's IP.

## Changes

- `nixos/ringtail/configuration.nix` — new systemd service definition
- `docs/reference/services/snowflake-proxy.md` — service reference card
- `docs/reference/infrastructure/ringtail.md` — updated systemd services section
- `service-versions.yaml` — added entry (type: nixos)

## Deploy plan

After review, deploy via `mise run provision-ringtail`. Service starts automatically.

## Test plan

- [ ] `mise run provision-ringtail` succeeds
- [ ] `ssh ringtail 'systemctl status snowflake-proxy'` shows active
- [ ] `ssh ringtail 'journalctl -u snowflake-proxy --no-pager -n 20'` shows broker connections
- [ ] `ssh ringtail 'curl -s localhost:9999/metrics'` returns Prometheus metrics

Reviewed-on: #311
2026-03-24 20:51:40 -07:00
3b7abbd689 Update container tags to fd0bebb (post-merge rebuild)
C0 follow-up to #309: update kustomization newTag for all containers
rebuilt by the merge (authentik, authentik-redis, ntfy, alloy).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-24 13:39:26 -07:00
3d2a97aaf9 Update kustomization tags to OCI-labeled builds (613f05d)
Point all services at the 613f05d images which carry the new
consistent OCI labels. Skipped kiwix/transmission (old v4.0.6-r4
version, no matching build) and docs/quartz (no 613f05d build).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-19 06:34:12 -07:00
4f99b7edaa Update alloy kustomizations to local container tags
Point alloy-k8s at v1.14.0-61f02a0 (Dockerfile) and both ringtail
deployments at v1.14.0-61f02a0-nix (Nix build).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-17 16:55:55 -07:00
61f02a0335 Localize Alloy container image (#300)
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (alloy) (push) Successful in 14s
Build Container / build (alloy) (push) Successful in 38m34s
## Summary

- Add `containers/alloy/` with dual Dockerfile + Nix build files for Grafana Alloy v1.14.0
- Both builds fetch source from forge mirror (`forge.ops.eblu.me/mirrors/alloy.git`), build the web UI (Node), then compile the Go binary with `netgo embedalloyui` tags
- Update all three alloy deployments (alloy-k8s, alloy-ringtail, alloy-tracing-ringtail) to use `registry.ops.eblu.me/blumeops/alloy`
- `promtail_journal_enabled` tag omitted — requires systemd headers and none of our configs use `loki.source.journal`

## Build verification

- **Dockerfile:** Tested locally via `docker build`, binary reports `v1.14.0` with correct tags
- **Nix:** Tested on ringtail via `nix-build`, all three hashes (fetchgit, npmDeps, goModules) resolved and build succeeds

## Post-merge steps

1. Wait for CI to build the container from main (both Dockerfile and Nix workflows)
2. `mise run container-list alloy` to find the `[main]` tagged image
3. C0 follow-up to update `newTag` in all three kustomizations from `v1.14.0-placeholder` to the real tag
4. Sync ArgoCD apps and verify pods come up healthy

Reviewed-on: #300
2026-03-17 16:42:53 -07:00
ab8ea6f301 Bump Grafana Alloy to v1.14.0 (#292)
## Summary
- Bump alloy-k8s, alloy-ringtail, and alloy-tracing-ringtail image tags from v1.13.1 to v1.14.0
- Mark indri alloy (ansible) as reviewed at v1.14.0 — source rebuild from forge mirror needed
- Add missing alloy-ringtail entry to service-versions.yaml
- Update alloy reference doc

## Breaking changes reviewed
- `loki.secretfilter` options removed — not used in our configs
- OTel Collector upgraded to v0.142.0 — Kafka receiver changes don't affect us
- Exporter queue default changes — our tracing pipeline (Beyla → batch → otlphttp) uses simple config, low risk

## Deployment and Testing
- [ ] Sync alloy-k8s: `argocd app set alloy-k8s --revision bump/alloy-v1.14.0 && argocd app sync alloy-k8s`
- [ ] Sync alloy-ringtail: `argocd app set alloy-ringtail --revision bump/alloy-v1.14.0 --server ringtail-argocd && argocd app sync alloy-ringtail`
- [ ] Sync alloy-tracing-ringtail similarly
- [ ] Verify metrics flowing in Grafana
- [ ] Verify traces flowing to Tempo (ringtail)
- [ ] Rebuild indri alloy from source (`v1.14.0` tag on forge mirror), SCP to indri, restart
- [ ] After merge: reset ArgoCD revisions to main, re-sync

Reviewed-on: #292
2026-03-13 16:25:27 -07:00
14e931591b Fix 1Password Connect numeric log levels misclassified in Grafana (#287)
## Summary
- 1Password Connect uses non-standard numeric log levels (`1`=error, `2`=warn, `3`=info, `4`=debug, `5`=trace) per [1Password/connect#44](https://github.com/1Password/connect/issues/44)
- Alloy extracts the `level` JSON field as-is, so info-level health checks get `level="3"` in Loki
- Grafana expects string level labels — numeric values are unrecognized, causing misclassified log severity/coloring
- Adds a `stage.match` + `stage.template` in the Alloy pipeline scoped to `{namespace="1password"}` to normalize numeric levels to standard strings
- Other services are completely unaffected (scoped by namespace, not global)

## Deployment and Testing
- [ ] Sync alloy-k8s from branch: `argocd app set alloy-k8s --revision fix/onepassword-numeric-log-levels && argocd app sync alloy-k8s`
- [ ] Wait ~2 minutes for new logs to flow
- [ ] Verify level labels: `curl -sG "http://localhost:3100/loki/api/v1/label/level/values" --data-urlencode 'query={namespace="1password"}'` should show `"info"` and `"warn"` instead of `"3"` and `"2"`
- [ ] Check Grafana log panel for 1password namespace — logs should no longer appear as errors
- [ ] After merge: `argocd app set alloy-k8s --revision main && argocd app sync alloy-k8s`

Reviewed-on: #287
2026-03-07 13:57:04 -08:00
6e8d11c6bb Add :kustomized sentinel tag to manifest images, review devpi
Bare image references in manifests were ambiguous — unclear whether the
tag was intentionally omitted or managed by kustomize. Add :kustomized
sentinel to all 37 image refs overridden by kustomize images transformer.
Add sync notes for tailscale-operator proxyclass (CRD fields not processed
by kustomize). Mark devpi reviewed (6.19.1 is current).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-06 08:15:06 -08:00
03d71544ec Add multi-cluster observability with ringtail metrics and dashboards (#270)
## Summary
- Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs
- Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests)
- Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki
- Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with:
  - **Kubernetes Clusters** dashboard — multi-cluster with `cluster` and `namespace` template variables
  - **Ringtail (k3s)** dashboard — dedicated ringtail view with GPU usage panels

## Deployment and Testing
1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`)
2. Sync `prometheus` → verify `cluster` label on scraped metrics
3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs
4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs
5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail
6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}`
7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values
8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods

## Notes
- Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later
- DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve
- The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270
2026-02-25 22:01:00 -08:00