Commit graph

36 commits

Author SHA1 Message Date
292d354902 C1: deploy adelaide-baby-shower-app to ringtail k3s (#349)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m12s
## Summary

Brings up the Adelaide / Heidi / Addie baby shower app on ringtail k3s with the public/private split that the app's hosting contract calls for: `shower.eblu.me` (public, via Fly proxy) and `shower.ops.eblu.me` (tailnet). App is consumed as a wheel from the Forgejo PyPI index — source lives at [`adelaide-baby-shower-app`](https://forge.eblu.me/eblume/adelaide-baby-shower-app).

### What's included

- **ArgoCD app + manifests** under `argocd/manifests/shower/` (deployment, service, ProxyGroup ingress, ConfigMap for `DJANGO_DEBUG`/`DJANGO_ADMIN_URL`, ExternalSecret for `DJANGO_SECRET_KEY` from 1Password item `Shower (blumeops)`, NFS PV on sifaka, RWX media PVC, RWO local-path data PVC for SQLite). Recreate rollout because SQLite is single-writer.
- **Public surface** (`fly/`): new `shower.eblu.me` server block proxying to `shower.ops.eblu.me`. `/admin/` returns 403 at the edge except `/admin/login/` and `/admin/logout/`, which are rate-limited via a new `shower_auth` zone. `X-Clacks-Overhead` on. GNU Terry Pratchett.
- **fail2ban** filter (`shower-admin-login.conf`) matching 401/403/429 on `/admin/login/` and jail (`shower.conf`) with `maxretry=5/findtime=600/bantime=3600`. The `nginx-deny` action was generalized to take a per-jail `nginx_deny_file` so the shower has its own deny list (forge keeps using the legacy default).
- **Caddy** route on indri (`shower.ops.eblu.me` → `https://shower.tail8d86e.ts.net`).
- **Pulumi** Gandi CNAME `shower.eblu.me → blumeops-proxy.fly.dev.`.
- **Grafana** APM dashboard `configmap-shower-apm.yaml` (request rate, error rate, failed admin login count, latency percentiles, bandwidth, access logs) mirroring `docs-apm.json` with a `host="shower.eblu.me"` filter.
- **Container** `containers/shower/default.nix` — `dockerTools.buildLayeredImage` with a nixpkgs Python and a startup wrapper that creates `/app/data/.venv`, pip-installs `adelaide-baby-shower-app==1.0.0` from the forge PyPI index on first boot, runs migrations + collectstatic, and execs gunicorn. A `local_settings.py` shim pins `DATABASES.NAME`/`MEDIA_ROOT`/`STATIC_ROOT` to absolute paths so they don't end up in site-packages.
- **Docs** runbook at `docs/how-to/operations/shower-app.md` linked from the apps registry, plus changelog fragments.

### Defense layers on the public surface

1. fly nginx geo+fail2ban `$shower_banned` (per-service deny list)
2. fly nginx `limit_req zone=shower_auth` (3 r/s per Fly-Client-IP)
3. django-axes (5 fails / 1h, keyed on username+ip_address)
4. edge `/admin/` block (returns 403 for anything that isn't login/logout)

## Prerequisites for the user to do (NOT in this PR)

Halted on these per request — they touch shared/manual systems:

- [x] **NFS share** on sifaka: `/volume1/shower`, NFS rule for ringtail RW, `chown 1000:1000`
- [ ] **1Password item** `Shower (blumeops)` in the blumeops vault with a freshly minted `secret-key` field (`openssl rand -base64 48`) — do NOT reuse anything that has lived in git
- [ ] **Container build**: `mise run container-build-and-release shower`, then update `images[].newTag` in `argocd/manifests/shower/kustomization.yaml` to the resulting `v1.0.0-<sha>-nix`
- [x] **DNS**: `mise run dns-up` after merge
- [x] **Fly cert**: `fly certs add shower.eblu.me -a blumeops-proxy`
- [ ] **Caddy push**: `mise run provision-indri -- --tags caddy`
- [ ] **Fly redeploy** to pick up the new nginx block + fail2ban jail: `mise run fly-deploy`
- [ ] **ArgoCD sync**: `argocd app set shower --revision shower-app-deploy && argocd app sync shower` to test from this branch before merging

## Test plan

- [ ] Container builds successfully on nix-container-builder runner
- [ ] Pod starts, migrations run, gunicorn answers on :8000
- [ ] `kubectl --context=k3s-ringtail -n shower logs deploy/shower` clean
- [ ] `curl -sf https://shower.ops.eblu.me/` returns the splash page (tailnet)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/` returns 200 (pre-DNS verification)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/users/` returns 403 (edge block)
- [ ] `curl -I -H "Host: shower.eblu.me" https://blumeops-proxy.fly.dev/admin/login/` returns a Django login response
- [ ] After DNS is up: `curl -I https://shower.eblu.me/` returns 200 with `X-Clacks-Overhead`
- [ ] Grafana dashboard "Shower APM" appears and starts showing traffic
- [ ] `mise run services-check` passes

Reviewed-on: #349
2026-05-11 13:47:18 -07:00
1c0ee099fb Move forge-specific latency panels to Forgejo dashboard
Fly.io dashboard keeps aggregate all-hosts p50/p90/p99. Forge-filtered
upstream response time panel moves to Forgejo's "Public Proxy" section.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:13:40 -07:00
d7af004842 Add Forgejo metrics + upstream latency histogram to Fly proxy dashboard
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m53s
- Enable Forgejo /metrics endpoint (app.ini [metrics] section)
- Add Alloy scrape target for Forgejo metrics on indri
- Add upstream_response_time histogram to Fly proxy Alloy config
- Replace single p95 panel with p50/p90/p99 + upstream breakdown
  filtered to forge.eblu.me host

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-04-17 15:05:59 -07:00
c78b86c72c Add offsite backup for immich photo library to BorgBase (#315)
## Summary

- Adds a second borgmatic config (`photos.yaml`) that backs up `/Volumes/photos` (sifaka SMB mount, ~128 GB) to a dedicated BorgBase repo (`immich-photos`), running daily at 4 AM
- Separate launchd agent (`mcquack.eblume.borgmatic-photos`) so photo backups run independently from the main backup
- Refactors `borgmatic_metrics` script to support multiple repos with a `repo` Prometheus label
- Updates Grafana "Borg Backups" dashboard with a `repo` template variable so you can filter/compare repos
- Docs updated: `backups.md`, `borgmatic.md`

## Prerequisites (manual)

- [x] Create `immich-photos` repo on BorgBase with same SSH key
- [ ] Upgrade BorgBase plan to Small ($24/yr) if currently on free tier (128 GB exceeds 10 GB limit)
- [ ] After deploy: `borg init` the new repo (borgmatic does this automatically on first run)

## Test plan

- [ ] Dry run: `mise run provision-indri -- --check --diff --tags borgmatic,borgmatic_metrics`
- [ ] Deploy borgmatic role and verify both configs deployed
- [ ] Run `borgmatic --config ~/.config/borgmatic/photos.yaml create --verbosity 1` manually for first backup (will take hours)
- [ ] Verify metrics script collects from both repos: `~/.local/bin/borgmatic-metrics && cat /opt/homebrew/var/node_exporter/textfile/borgmatic.prom`
- [ ] Sync grafana-config in ArgoCD and verify dashboard repo selector works

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: #315
2026-03-27 19:43:05 -07:00
b97e37543f Deploy Tor Snowflake proxy on ringtail (#311)
## Summary

- Add Snowflake proxy as a native systemd service on ringtail (NixOS)
- Uses `pkgs.snowflake` from nixpkgs (v2.11.0)
- Hardened systemd unit with DynamicUser, ProtectSystem=strict, 512MB memory limit
- Prometheus metrics enabled on localhost:9999

## What is Snowflake?

A Tor pluggable transport that helps censored users reach the Tor network via WebRTC. **This is NOT a Tor exit node** — traffic exits through Tor exit nodes operated by others. The proxy operator cannot see traffic content (double-encrypted) and destination servers never see the proxy's IP.

## Changes

- `nixos/ringtail/configuration.nix` — new systemd service definition
- `docs/reference/services/snowflake-proxy.md` — service reference card
- `docs/reference/infrastructure/ringtail.md` — updated systemd services section
- `service-versions.yaml` — added entry (type: nixos)

## Deploy plan

After review, deploy via `mise run provision-ringtail`. Service starts automatically.

## Test plan

- [ ] `mise run provision-ringtail` succeeds
- [ ] `ssh ringtail 'systemctl status snowflake-proxy'` shows active
- [ ] `ssh ringtail 'journalctl -u snowflake-proxy --no-pager -n 20'` shows broker connections
- [ ] `ssh ringtail 'curl -s localhost:9999/metrics'` returns Prometheus metrics

Reviewed-on: #311
2026-03-24 20:51:40 -07:00
b96b4dad47 Move Alerts dashboard into Infrastructure Alerts folder
Uses the grafana_folder annotation to place the dashboard in the
existing folder created by alert rule provisioning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 21:20:14 -07:00
9024d41230 Add Grafana alerts dashboard for mobile-friendly alert overview
Two panels: currently firing alerts (firing/pending/noData/error) and
recent state changes. Refreshes every 30s. Uses Grafana's built-in
alertlist panel type — no datasource needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-23 21:16:54 -07:00
e8bdecdb11 Rename Borgmatic dashboard to Borg Backups, add duration graph
Rename dashboard title since borgmatic is just the execution layer.
Add Backup Duration Over Time panel next to New Data Per Backup.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-18 06:33:27 -07:00
4ca2e39901 Externalize TeslaMate dashboards to forge mirror (#296)
## Summary
- Replaces 18 TeslaMate dashboard ConfigMaps (713 KB / 22,080 lines) with a Grafana init container
- Init container fetches dashboard JSON directly from `mirrors/teslamate` on forge, pinned to `v3.0.0`
- Grafana's file provider picks them up from `/tmp/dashboards/TeslaMate/` via `foldersFromFilesStructure`
- Non-TeslaMate dashboards remain as ConfigMaps (unchanged)

## How it works
- New `init-teslamate-dashboards` init container uses busybox `wget` to fetch each JSON file from `https://forge.eblu.me/mirrors/teslamate/raw/tag/v3.0.0/grafana/dashboards/`
- Files land in `/tmp/dashboards/TeslaMate/`, same emptyDir volume the sidecar uses
- The sidecar continues to handle ConfigMap-based dashboards; the init container handles TeslaMate
- Version pin is in the init container args (TESLAMATE_VERSION)

## Deployment and Testing
- [ ] Sync `grafana` app from branch — verify init container runs and fetches dashboards
- [ ] Sync `grafana-config` app from branch — verify TeslaMate ConfigMaps are pruned
- [ ] Check Grafana UI: TeslaMate folder should still contain all 18 dashboards
- [ ] Verify non-TeslaMate dashboards are unaffected
- [ ] After merge: sync both apps from main

Reviewed-on: #296
2026-03-15 18:31:19 -07:00
c281fb5403 Add OpenTelemetry distributed tracing (Tempo + Beyla eBPF) (#286)
## Summary

Adds the third observability pillar — **distributed tracing** — alongside existing metrics (Prometheus) and logs (Loki).

- **Grafana Tempo 2.10.1** on minikube-indri for trace storage with 7d retention, OTLP receivers, and `metrics_generator` that remote-writes span-metrics (RED) to Prometheus
- **Beyla eBPF auto-instrumentation** via a privileged Alloy DaemonSet on ringtail — instruments HTTP services (Frigate, ntfy, Ollama, Immich) without code changes
- **Grafana integration** — Tempo datasource with trace↔log and trace↔metrics correlation, plus Loki derivedFields for trace ID linking
- **Prometheus** scrapes Tempo operational metrics

### Architecture

```
ringtail (k3s)                                indri (minikube)
┌──────────────────────┐                      ┌─────────────────────┐
│ Alloy+Beyla (eBPF)   │──OTLP HTTP────────→ │ Tempo               │
│  ↳ Frigate, ntfy,    │  via tailnet         │  ↳ trace storage    │
│    Ollama, Immich     │                      │  ↳ RED → Prometheus │
└──────────────────────┘                      │                     │
                                              │ Grafana             │
                                              │  ↳ Tempo datasource │
                                              └─────────────────────┘
```

### New files (12)
- `docs/reference/services/tempo.md` — reference doc
- `docs/changelog.d/feature-otel-tracing.feature.md`
- `argocd/apps/tempo.yaml` + `argocd/manifests/tempo/` (6 files)
- `argocd/apps/alloy-tracing-ringtail.yaml` + `argocd/manifests/alloy-tracing-ringtail/` (4 files)

### Modified files (6)
- `argocd/manifests/grafana/datasources.yaml` — Tempo datasource + Loki derivedFields
- `argocd/manifests/prometheus/prometheus.yml` — Tempo scrape target
- `service-versions.yaml` — tempo + alloy-tracing-ringtail entries
- `docs/reference/services/grafana.md` — Tempo in datasources table
- `docs/reference/reference.md` — Tempo in services index
- `docs/reference/operations/observability.md` — Tempo in components list

## Deployment and Testing

- [ ] Sync `apps` app to pick up new Application definitions
- [ ] `argocd app set tempo --revision feature/otel-tracing && argocd app sync tempo`
- [ ] Verify Tempo pod: `kubectl --context=minikube-indri get pods -n monitoring -l app=tempo`
- [ ] Verify Tempo ready: port-forward 3200 and `curl localhost:3200/ready`
- [ ] Verify Tailscale ingresses: `kubectl --context=minikube-indri get ingress -n monitoring`
- [ ] `argocd app set alloy-tracing-ringtail --revision feature/otel-tracing && argocd app sync alloy-tracing-ringtail`
- [ ] Check Beyla discovery in alloy-tracing logs on ringtail
- [ ] Sync grafana-config for updated datasources
- [ ] Sync prometheus for updated scrape config
- [ ] Test Grafana Tempo datasource connection
- [ ] Generate test traffic and search traces in Grafana Explore → Tempo
- [ ] After merge: reset all ArgoCD app revisions back to main

Reviewed-on: #286
2026-03-05 10:51:07 -08:00
797133b28e Fix per-torrent rate panels showing cumulative bytes instead of rates
All checks were successful
Build Container (Nix) / detect (push) Successful in 2s
Build Container / detect (push) Successful in 2s
Build Container (Nix) / build (transmission-exporter) (push) Successful in 2s
Build Container / build (transmission-exporter) (push) Successful in 38s
Dashboard "Download/Upload Rate by Torrent" panels were querying
transmission_torrent_download_bytes (total_size * percent_done) and
transmission_torrent_upload_bytes (uploaded_ever) — cumulative byte
gauges, not rates. Added new metrics using Transmission's native
rate_download/rate_upload and updated dashboard queries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-05 08:01:37 -08:00
91d84e54d5 Replace OOMKilled stat with detail table, shrink waiting reason panel
The count-only stat wasn't actionable. New table shows pod name, container,
restart count, and memory limit for each OOMKilled container. Waiting reason
panel narrowed to make room.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 20:58:11 -08:00
008da43736 Add OOMKill observability to Kubernetes Clusters dashboard
OOMKilled containers previously only appeared briefly in "Unhealthy Pods"
while dying, then vanished on restart. New panels use persistent metrics
(last_terminated_reason) and restart rate tracking.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-04 20:53:07 -08:00
7b68be2e80 Add fly.io proxy observability and app logs to Forgejo dashboard
Rename "Forgejo Repository Health" to "Forgejo" and add proxy metrics
(request rate, error rate, RPS, latency, bandwidth), proxy access logs,
and Forgejo application logs from Loki.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-03 10:24:53 -08:00
95c8424e62 Add Transmission metrics exporter and Grafana dashboard (#271)
## Summary
- Add `metalmatze/transmission-exporter` as a sidecar container in the torrent deployment, exposing Prometheus metrics on port 19091
- Add metrics port to the torrent service for Prometheus scraping
- Add Prometheus scrape job targeting the transmission exporter
- Create Grafana dashboard with:
  - Overview stats (download/upload speed, active/total torrents)
  - Transfer speed timeseries (download + upload over time)
  - Transfer volume stats (total downloaded/uploaded in selected range)
  - Per-torrent download and upload rate timeseries
  - Per-torrent details table (ratio, uploaded, percent done)

## Deployment and Testing
- [ ] Sync ArgoCD `torrent` app from branch — verify exporter sidecar starts
- [ ] Verify exporter metrics: `kubectl exec` into pod, `curl localhost:19091/metrics`
- [ ] Verify Prometheus scrapes it: check targets at prometheus.ops.eblu.me
- [ ] Open Grafana, find "Transmission" dashboard, verify panels populate
- [ ] Sync ArgoCD `prometheus` app from branch
- [ ] Sync ArgoCD `grafana-config` app from branch

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/271
2026-02-25 22:23:33 -08:00
03d71544ec Add multi-cluster observability with ringtail metrics and dashboards (#270)
## Summary
- Add `cluster` label (indri/ringtail) to all Prometheus scrape jobs, Alloy k8s metrics/logs, and Alloy host metrics/logs
- Deploy kube-state-metrics on ringtail's k3s cluster (ArgoCD app + manifests)
- Deploy Alloy on ringtail to collect pod metrics and logs, remote-writing to indri's Prometheus and Loki
- Replace single-cluster "Minikube Kubernetes" and "K8s Services Health" dashboards with:
  - **Kubernetes Clusters** dashboard — multi-cluster with `cluster` and `namespace` template variables
  - **Ringtail (k3s)** dashboard — dedicated ringtail view with GPU usage panels

## Deployment and Testing
1. Sync `apps` on indri ArgoCD to pick up new app definitions (`kube-state-metrics-ringtail`, `alloy-ringtail`)
2. Sync `prometheus` → verify `cluster` label on scraped metrics
3. Sync `alloy-k8s` → verify `cluster=indri` on remote-written metrics and logs
4. Run `mise run provision-indri -- --tags alloy` → verify `cluster=indri` on host Alloy metrics/logs
5. Sync `kube-state-metrics-ringtail` → verify pods running on ringtail
6. Sync `alloy-ringtail` → verify pods running, check Prometheus for `kube_pod_info{cluster="ringtail"}`
7. Sync `grafana-config` → verify dashboards appear, cluster variable populates both values
8. Check Loki for `{cluster="ringtail"}` logs from ringtail pods

## Notes
- Alloy on ringtail uses `insecure_skip_verify=true` for TLS to Prometheus/Loki (Tailscale-managed certs not in container trust store) — tighten later
- DNS resolution for `*.tail8d86e.ts.net` from ringtail pods depends on CoreDNS inheriting host's MagicDNS resolver; may need CoreDNS forwarding rules if pods can't resolve
- The old services dashboard (blackbox probes) is removed — those probes are still running in alloy-k8s and the data is still in Prometheus, just not in a dedicated dashboard

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/270
2026-02-25 22:01:00 -08:00
285ad4141f Fix Frigate detection events rate metric name in Grafana dashboard
The panel queried frigate_camera_events but the actual metric exposed
by Frigate is frigate_camera_events_total with a "camera" label
(not "camera_name").

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-25 16:51:57 -08:00
6871bf32a8 Remove unused gpu panel in frigate dashboard 2026-02-22 18:26:47 -08:00
2c6c6a244a Fix Frigate Prometheus metrics & rebuild Grafana dashboard (#252)
## Summary

- **Prometheus scrape target:** Changed from `frigate.frigate.svc.cluster.local:5000` (broken after ringtail migration) to `nvr.ops.eblu.me` via HTTPS through Caddy on indri
- **Grafana dashboard:** Rebuilt for Frigate 0.17 metrics — 12 panels total:
  - Row 1 (stats): Uptime, Inference Speed, Camera FPS, Detection FPS, GPU Usage, GPU Temp
  - Row 2 (timeseries): CPU Usage, Memory Usage
  - Row 3 (timeseries): Camera FPS + Skipped FPS, GPU Usage + Memory over time
  - Row 4 (timeseries): Storage Usage, Detection Events (rate by camera/label)

## Deployment and Testing

1. Sync prometheus app on branch:
   ```
   argocd app set prometheus --revision fix/frigate-metrics-dashboard && argocd app sync prometheus
   ```
2. Check `prometheus.ops.eblu.me/targets` — frigate job should show UP
3. Sync grafana-config:
   ```
   argocd app sync grafana-config
   ```
4. Check `grafana.ops.eblu.me` — Frigate NVR dashboard should show live data
5. After merge: reset both apps to `--revision main` and sync

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/252
2026-02-22 18:14:17 -08:00
2c081eed28 Add Forgejo repository health metrics and Grafana dashboard (#245)
## Summary
- New `forgejo_metrics` Ansible role that queries the Forgejo REST API every 60s and writes Prometheus textfile metrics (open PRs, issues, languages, releases, commits, Actions runs/duration/success)
- Grafana dashboard "Forgejo Repository Health" with 12 panels across 4 rows: overview stats, CI/CD health, repository info, and staleness tracking
- Deletes superseded `forgejo-actions-dashboard` plan doc (this implementation covers a broader scope)

## Deployment and Testing
- [ ] `mise run provision-indri -- --tags forgejo_metrics` to deploy the collector
- [ ] `ssh indri 'cat /opt/homebrew/var/node_exporter/textfile/forgejo.prom'` to verify metrics
- [ ] `argocd app sync grafana-config` to deploy the dashboard
- [ ] Check Grafana dashboard "Forgejo Repository Health" loads with data
- [ ] `mise run services-check` passes

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/245
2026-02-22 11:16:03 -08:00
04c7f3c45a Deploy Frigate NVR stack with Mosquitto, Ntfy, and frigate-notify (#190)
## Summary

Deploy a cloud-free NVR stack for the GableCam (ReoLink Elite Floodlight at 192.168.1.159):

- **Mosquitto** — shared MQTT broker in `mqtt` namespace (cluster-internal, no auth)
- **Ntfy** — self-hosted push notifications in `ntfy` namespace, exposed at `ntfy.tail8d86e.ts.net` / `ntfy.ops.eblu.me`
- **Frigate** — NVR with GableCam via HTTP-FLV, ONNX CPU detection, NFS recordings on sifaka, exposed at `nvr.tail8d86e.ts.net` / `nvr.ops.eblu.me`
- **frigate-notify** — bridges Frigate detection events (person, car, dog, cat) to Ntfy alerts via MQTT

Also includes:
- Prometheus scrape target for Frigate metrics
- Grafana dashboard for Frigate (status, inference speed, FPS, CPU/memory, storage)
- Caddy reverse proxy entries for `nvr.ops.eblu.me` and `ntfy.ops.eblu.me`

## Prerequisites

- [ ] Create NFS share `frigate` on sifaka (`/volume1/frigate`, RW for indri)
- [ ] Create 1Password item "Reolink Floodlight Camera" in `blumeops` vault with `username` and `password` fields

## Deployment (after merge)

```bash
argocd app sync apps
argocd app sync mosquitto
argocd app sync ntfy
argocd app sync frigate
argocd app sync grafana-config
argocd app sync prometheus
mise run provision-indri -- --tags caddy
mise run services-check
```

## Verification

- [ ] Mosquitto pod running, accepting connections on 1883
- [ ] Ntfy web UI accessible at `ntfy.ops.eblu.me`
- [ ] Frigate web UI at `nvr.ops.eblu.me` showing GableCam live feed
- [ ] Object detection working (ONNX, person/car/dog/cat)
- [ ] Recordings appearing in NFS share on sifaka
- [ ] frigate-notify sending detection alerts to Ntfy
- [ ] Prometheus scraping Frigate metrics
- [ ] Grafana dashboard showing Frigate data

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/190
2026-02-14 21:27:44 -08:00
9c789a1868 Fix cache hit rate on APM and Fly.io dashboards (#177)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m19s
## Summary
- Remove `match_all = true` from `flyio_nginx_cache_requests_total` in Alloy so the metric only counts requests that go through the proxy cache (excludes health checks with empty `cache_status`)
- Change dashboard queries from `rate(...[5m])` to `increase(...[$__range])` — aggregates over the full dashboard time window instead of a 5-minute sliding window, giving meaningful ratios for low-traffic static sites
- Add null/NaN value mapping to show "No traffic" in neutral color instead of blank/red

## Root cause
Health check requests from Fly.io hit the default nginx server block (no `proxy_cache`), producing entries with empty `upstream_cache_status`. With `match_all = true`, these were counted in the cache metric, diluting the Fly.io dashboard ratio. For APM dashboards, `rate()[5m]` on low-traffic sites with 24h cache validity almost always returns either all-HITs (100%) or no data (blank → red background).

## Deployment
- Fly.io proxy redeploy needed for Alloy config change
- ArgoCD sync for dashboard ConfigMap changes

## Test plan
- [ ] Redeploy Fly.io proxy
- [ ] Sync grafana-config in ArgoCD
- [ ] Verify CV APM cache hit ratio shows a real percentage (not 100%)
- [ ] Verify Docs APM shows "No traffic" in neutral color when idle, real ratio when visited
- [ ] Verify Fly.io proxy dashboard cache ratio excludes health checks

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/177
2026-02-12 18:40:48 -08:00
df372fccb6 Expose CV publicly at cv.eblu.me (#173)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m57s
## Summary
- Add nginx server block for `cv.eblu.me` (static site, same pattern as docs)
- Add DNS CNAME record in Pulumi (`cv.eblu.me` → `blumeops-proxy.fly.dev`)
- Add `cv.eblu.me` cert to `fly-setup` mise task
- Tag CV Tailscale ingress with `tag:flyio-target` for ACL access
- Remove `/_error` test endpoint from docs proxy

## Deployment and Testing
- [ ] `argocd app set cv --revision cv/public-cv-eblu-me && argocd app sync cv`
- [ ] `fly certs add cv.eblu.me -a blumeops-proxy`
- [ ] `mise run fly-deploy`
- [ ] Verify proxy: `curl -I -H "Host: cv.eblu.me" https://blumeops-proxy.fly.dev/`
- [ ] `mise run dns-preview` then `mise run dns-up`
- [ ] Verify live: `curl -I https://cv.eblu.me`
- [ ] Merge, then `argocd app set cv --revision main && argocd app sync cv`

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/173
2026-02-12 14:05:00 -08:00
85e36cd807 Operations and observability for sifaka NAS (#135)
## Summary
- Add `smartctl_exporter` Docker container to sifaka for SMART disk health monitoring
- Formalize existing `node_exporter` container under Ansible management
- Route both exporters through Caddy L4 TCP proxy (`nas.ops.eblu.me:9100`, `nas.ops.eblu.me:9633`), replacing the hardcoded LAN IP in Prometheus
- Create "Sifaka Disk Health" Grafana dashboard (health status, temperature, wear indicators, lifetime)
- Introduce `ansible/playbooks/sifaka.yml` and `mise run provision-sifaka` — first Ansible playbook for the NAS
- Shared exporter port variables in `group_vars/all.yml` to avoid duplication between Caddy and sifaka roles

## Prerequisites before deploy
- [ ] Enable SSH on sifaka (DSM Control Panel > Terminal & SNMP)
- [ ] Verify `ssh eblume@sifaka 'docker ps'` works
- [ ] Run `mise run provision-sifaka` to deploy containers
- [ ] Run `mise run provision-indri -- --tags caddy` to add L4 routes
- [ ] `argocd app sync prometheus` + `argocd app sync grafana-config`

## Test plan
- [ ] Verify smartctl_exporter metrics: `curl http://nas.ops.eblu.me:9633/metrics`
- [ ] Verify Prometheus targets page shows both sifaka jobs as UP
- [ ] Verify Grafana "Sifaka Disk Health" dashboard loads with data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/135
2026-02-09 17:44:05 -08:00
3415cad38c Log real client IPs via Fly-Client-IP header (#130)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 59s
## Summary
- Add `client_ip` field to the Fly.io nginx JSON log format, sourced from `Fly-Client-IP` header
- Extract `client_ip` in the Alloy pipeline so it's available as a parsed field in Loki
- Keeps `remote_addr` (the internal proxy IP) for debugging

Fixes: Grafana access logs for docs.eblu.me showing 172.16.11.178 for every request instead of real visitor IPs.

## Deployment and Testing
- [ ] Deploy updated fly.io proxy: `fly deploy` from `fly/` directory
- [ ] Verify in Grafana that new log lines include `client_ip` with real IPs
- [ ] Confirm `remote_addr` still shows the proxy IP (preserved for debugging)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/130
2026-02-09 11:02:06 -08:00
cc54b4f565 Add Fly.io proxy observability via embedded Alloy (#123)
All checks were successful
Deploy Fly.io Proxy / deploy (push) Successful in 1m16s
## Summary

- Embed Grafana Alloy in the Fly.io proxy container to collect nginx JSON access logs (→ Loki) and derive request rate, latency histogram, cache status, and bandwidth metrics (→ Prometheus)
- Add nginx `stub_status` endpoint for connection-level metrics (active/reading/writing/waiting)
- Create two Grafana dashboards: **Docs APM** (per-service view filtered by `host="docs.eblu.me"`) and **Fly.io Proxy Health** (aggregate proxy health across all upstream services)

## Changed Files

| File | Change |
|------|--------|
| `fly/nginx.conf` | Add JSON `log_format` + `access_log`, add `stub_status` endpoint |
| `fly/Dockerfile` | COPY Alloy binary from `grafana/alloy:v1.5.1`, COPY `alloy.river` config |
| `fly/alloy.river` | **New** — Alloy config: log tailing, metric extraction, remote_write |
| `fly/start.sh` | Start Alloy after Tailscale, before nginx |
| `argocd/manifests/grafana-config/dashboards/configmap-docs-apm.yaml` | **New** — Docs APM dashboard |
| `argocd/manifests/grafana-config/dashboards/configmap-flyio.yaml` | **New** — Fly.io Proxy Health dashboard |
| `argocd/manifests/grafana-config/kustomization.yaml` | Register new dashboard configmaps |
| `docs/reference/services/flyio-proxy.md` | Document observability setup |

## Deployment and Testing

- [ ] `mise run fly-deploy` — rebuild container with Alloy
- [ ] `curl https://docs.eblu.me/` — generate traffic
- [ ] `fly logs -a blumeops-proxy` — verify Alloy startup
- [ ] Query Prometheus: `flyio_nginx_http_requests_total{instance="flyio-proxy"}`
- [ ] Query Loki: `{instance="flyio-proxy", job="flyio-nginx"}`
- [ ] `argocd app sync grafana-config` — deploy dashboards
- [ ] Verify dashboards show data in Grafana
- [ ] `mise run services-check` — no regressions

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/123
2026-02-08 10:05:38 -08:00
737371ab59 Add pod state observability to minikube dashboard (#83)
## Summary
- Add "Unhealthy Pods" stat panel showing count of pods in error states (ImagePullBackOff, CrashLoopBackOff, etc.) with red background when > 0
- Add "Pods by Waiting Reason" time series chart showing container waiting states over time
- Provides visibility into stuck pods that ArgoCD doesn't track (since it manages CronJobs, not the Jobs/Pods they spawn)

## Context
This addresses the issue where a `zim-watcher` cronjob pod was stuck in `ImagePullBackOff` for 11 days without any alerting. ArgoCD showed the CronJob as "Synced, Healthy" because it only manages the CronJob resource, not its spawned Jobs/Pods.

## Deployment and Testing
- [ ] Sync grafana-config app to test branch
- [ ] Verify dashboard renders correctly
- [ ] Confirm "Unhealthy Pods" shows 0 (green) when no issues
- [ ] Reset to main after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/83
2026-02-03 07:20:05 -08:00
b8b33b76c8 Remove Plex media server (#78)
## Summary
- Remove plex_metrics ansible role
- Remove Plex Grafana dashboard
- Remove Plex log collection from Alloy config
- Update indri-services-check to check Jellyfin instead of Plex

## Deployment and Testing
- [x] Unloaded plex-metrics LaunchAgent on indri
- [x] Deleted plex-metrics plist and script
- [x] Deleted plex.prom textfile
- [ ] Deploy Alloy config update
- [ ] Sync grafana-config to remove dashboard

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/78
2026-01-30 17:06:00 -08:00
bcc8685316 Add Jellyfin media server deployment (#77)
## Summary
- Add Jellyfin ansible role for native macOS deployment via Homebrew cask
- Add jellyfin_metrics role for Prometheus textfile metrics collection
- Add Caddy routing for jellyfin.ops.eblu.me
- Add Alloy log collection for Jellyfin stdout/stderr
- Add Grafana dashboard for Jellyfin monitoring

## Architecture
Jellyfin runs natively on indri (not in k8s) for full VideoToolbox hardware transcoding support. The M1 Mac Mini can handle ~3 concurrent 4K HDR→SDR transcoding streams.

## Deployment and Testing
- [ ] Deploy Jellyfin: `mise run provision-indri -- --tags jellyfin,jellyfin_metrics,caddy,alloy`
- [ ] Sync Grafana dashboard: `argocd app sync grafana-config`
- [ ] Complete Jellyfin setup wizard at https://jellyfin.ops.eblu.me
- [ ] Generate API key and save to `~/.jellyfin-api-key`
- [ ] Add media libraries (/Volumes/allisonflix/Movies, /Volumes/allisonflix/TV)
- [ ] Enable VideoToolbox hardware transcoding
- [ ] Verify metrics in Grafana dashboard
- [ ] Verify logs in Loki: `{service="jellyfin"}`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/77
2026-01-30 16:57:26 -08:00
0d8eb651d4 Fix XID Age graph to show threshold context (#69)
All checks were successful
Build Container / build (push) Successful in 1m10s
## Summary
- Add fixed Y-axis (0-220M) so the 200M autovacuum threshold is always visible
- Add dashed threshold lines at 150M (yellow warning) and 200M (red danger)
- Update title to clarify the threshold

## Context
The raw XID age naturally trends upward between vacuum freezes, which looked alarming without context. Current values (~143K-216K) are at 0.1% of the threshold - completely healthy.

## Deployment and Testing
- [ ] Sync grafana-config app to feature branch
- [ ] Verify threshold lines appear on PostgreSQL dashboard

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/69
2026-01-29 07:08:21 -08:00
0604877db2 Add 'Tesla' prefix to all TeslaMate dashboard titles (#68)
## Summary
- Renamed all 18 TeslaMate Grafana dashboards to include "Tesla" prefix
- Improves organization and discoverability in the dashboard list

## Deployment and Testing
- [ ] Sync grafana-config app: `argocd app set grafana-config --revision feature/rename-tesla-dashboards && argocd app sync grafana-config`
- [ ] Verify dashboards display with "Tesla" prefix in Grafana

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/68
2026-01-29 06:55:44 -08:00
272ddb213b Add TeslaMate deployment for Tesla Model Y data logging (#47)
## Summary
- Add TeslaMate k8s deployment with Tailscale ingress at tesla.tail8d86e.ts.net
- Add teslamate user to CloudNativePG blumeops-pg cluster
- Add TeslaMate PostgreSQL datasource to Grafana
- Import 18 TeslaMate Grafana dashboards for charging, drives, efficiency, etc.
- Add teslamate database to borgmatic backup configuration

## Deployment and Testing
- [ ] Create 1Password items: "TeslaMate DB Password" and "TeslaMate Encryption Key"
- [ ] Apply database user secret: `op inject -i argocd/manifests/databases/secret-teslamate.yaml.tpl | kubectl apply -f -`
- [ ] Sync blumeops-pg: `argocd app sync blumeops-pg`
- [ ] Create teslamate database
- [ ] Apply teslamate secrets (encryption key, db connection)
- [ ] Apply Grafana datasource secret: `op inject -i argocd/manifests/grafana-config/secret-teslamate-datasource.yaml.tpl | kubectl apply -f -`
- [ ] Sync apps and teslamate: `argocd app sync apps teslamate grafana grafana-config`
- [ ] Complete Tesla API OAuth flow at https://tesla.tail8d86e.ts.net
- [ ] Verify data collection starts
- [ ] Verify Grafana dashboards show data

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/47
2026-01-22 21:25:44 -08:00
57bf8512dc Log filtering cleanup and observability improvements (#45)
## Summary
- Suppress noisy storage-provisioner Endpoints deprecation warning (upstream minikube issue)
- Disable thermal collector on indri Alloy (not supported on macOS M1)
- Add macOS power/thermal metrics collection via powermetrics LaunchDaemon
- Add Power & Thermal section to macOS Grafana dashboard
- Add logfmt parser for k8s log level extraction (Loki, Prometheus, etc.)
- Extract more fields from JSON logs (zot compatibility - uses "message" not "msg")
- Silence logfmt parse errors for non-logfmt logs
- Fix JSON escaping in devpi dashboard

## Deployment and Testing
- [x] Deployed Alloy config changes to indri via ansible
- [x] Synced alloy-k8s and grafana-config via ArgoCD
- [x] Verified power metrics appearing in Prometheus
- [x] Verified thermal collector errors stopped
- [x] Verified logfmt parse errors silenced
- [x] Verified devpi dashboard loads correctly

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/45
2026-01-22 17:30:08 -08:00
e4a8405de7 Observability cleanup and k8s service monitoring (#43) (#43)
## Summary
- Remove stale `/opt/homebrew/var/loki` from borgmatic backup (Loki migrated to k8s)
- Add Alloy k8s DaemonSet for automatic pod log collection with auto-discovery
- Add blackbox probes for miniflux, kiwix, transmission, devpi, argocd
- Add transmission-exporter sidecar for full metrics (speed, torrent counts, ratios)
- Replace stale devpi dashboard with probe-based metrics (status, response time, uptime)
- Add unified "K8s Services Health" dashboard for service uptime/response monitoring

## Manual cleanup already performed
- Deleted stale textfile metrics on indri: `devpi.prom`, `transmission.prom`
- Deleted stale data directories on indri: `/opt/homebrew/var/loki/`, `/opt/homebrew/var/prometheus/`

## Deployment and Testing
- [x] Sync `apps` application to pick up new alloy-k8s app
- [x] Deploy alloy-k8s on feature branch: `argocd app set alloy-k8s --revision feature/observability-cleanup && argocd app sync alloy-k8s`
- [x] Deploy torrent on feature branch (for transmission exporter): `argocd app set torrent --revision feature/observability-cleanup && argocd app sync torrent`
- [x] Deploy prometheus on feature branch (for new scrape config): `argocd app set prometheus --revision feature/observability-cleanup && argocd app sync prometheus`
- [x] Deploy grafana-config on feature branch (for dashboards): `argocd app set grafana-config --revision feature/observability-cleanup && argocd app sync grafana-config`
- [x] Verify pod logs appear in Loki/Grafana
- [x] Verify transmission metrics appear in Prometheus
- [x] Verify service probe metrics appear in Prometheus
- [x] Run `mise run provision-indri -- --tags borgmatic` to update borgmatic config
- [ ] After merge, reset apps to main and resync

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/43
2026-01-22 13:51:01 -08:00
17023085cb Migrate observability stack to Kubernetes (#42)
Note: the name of this branch was chosen before the scope widened to encompass the entire observability stack.

Summary

  - Fix Grafana data source URLs (docker driver uses host.minikube.internal, not host.containers.internal)
  - Migrate Prometheus and Loki from indri to Kubernetes with Tailscale Ingresses
  - Expose CNPG PostgreSQL metrics via Tailscale and update dashboard to use cnpg_* metrics
  - Update Alloy to push metrics/logs to k8s endpoints (prometheus.tail8d86e.ts.net, loki.tail8d86e.ts.net)
  - Add ACL rule for port 9187 (CNPG metrics)
  - Delete obsolete ansible roles for prometheus and loki

Changes

  - argocd/manifests/prometheus/ - New Prometheus StatefulSet with 20Gi PVC and Tailscale Ingress
  - argocd/manifests/loki/ - New Loki StatefulSet with 20Gi PVC and Tailscale Ingress
  - argocd/apps/prometheus.yaml, argocd/apps/loki.yaml - ArgoCD Applications
  - argocd/manifests/grafana/values.yaml - Data sources now use k8s internal DNS
  - argocd/manifests/databases/service-metrics-tailscale.yaml - CNPG metrics endpoint
  - argocd/manifests/grafana-config/dashboards/configmap-postgresql.yaml - Updated to cnpg_* metrics
  - ansible/roles/alloy/defaults/main.yml - Push to k8s Tailscale endpoints
  - pulumi/policy.hujson - ACL for port 9187
  - Deleted ansible/roles/prometheus/ and ansible/roles/loki/

Deployment and Testing

  - Stop prometheus and loki on indri
  - Sync ArgoCD apps (apps, prometheus, loki, grafana)
  - Run mise run provision-indri -- --tags alloy
  - Verify Grafana dashboards show data

🤖 Generated with https://claude.ai/claude-code

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/42
2026-01-22 12:06:02 -08:00
7e6742ad24 K8s Migration Phase 2: Grafana to Kubernetes (#30)
## Summary
- Migrate Grafana from Homebrew/Ansible to Kubernetes deployment
- Switch CloudNativePG to use forge-mirrored Helm chart (HTTPS, no auth needed)
- Add Grafana Helm chart deployment via ArgoCD with multi-source pattern
- Add Grafana config (Tailscale Ingress, 9 dashboard ConfigMaps)
- Update Loki to bind 0.0.0.0 for k8s pod access via `host.containers.internal`

## Key Changes
- `argocd/apps/grafana.yaml` - Grafana Helm chart Application
- `argocd/apps/grafana-config.yaml` - Ingress + dashboard ConfigMaps
- `argocd/apps/cloudnative-pg.yaml` - Now uses forge mirror instead of external Helm repo
- `ansible/roles/loki/templates/loki-config.yaml.j2` - Bind 0.0.0.0

## Deployment and Testing
- [x] Deploy Loki config change: `mise run provision-indri -- --tags loki`
- [x] Create namespace: `ki create namespace monitoring`
- [x] Create secret: `op inject -i argocd/manifests/grafana-config/secret-admin.yaml.tpl | ki apply -f -`
- [x] Sync ArgoCD apps (grafana, grafana-config)
- [x] Verify Grafana works at https://grafana.tail8d86e.ts.net
- [x] Remove svc:grafana from ansible tailscale_serve
- [x] Stop brew grafana: `ssh indri 'brew services stop grafana'`
- [x] Delete ansible grafana role

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.tail8d86e.ts.net/eblume/blumeops/pulls/30
2026-01-19 14:40:25 -08:00