From 0eaf8680fd5c953a5e101cf773624c76207f91db Mon Sep 17 00:00:00 2001
From: Erich Blume <blume.erich@gmail.com>
Date: Mon, 6 Apr 2026 07:52:35 -0700
Subject: [PATCH] Rewrite observability stack tutorial to match actual
 practices

Replace generic Helm install instructions with kustomize/ArgoCD patterns
that reflect how BlumeOps actually deploys Prometheus, Loki, Grafana, and
Alloy. Fix "BluemeOps" typos, document Alloy as a core (not optional)
component, remove hardcoded admin password, add proper prerequisites and
cross-references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---
 .../+rewrite-observability-tutorial.doc.md    |   1 +
 .../replication/observability-stack.md        | 307 +++++++++---------
 2 files changed, 158 insertions(+), 150 deletions(-)
 create mode 100644 docs/changelog.d/+rewrite-observability-tutorial.doc.md

diff --git a/docs/changelog.d/+rewrite-observability-tutorial.doc.md b/docs/changelog.d/+rewrite-observability-tutorial.doc.md
new file mode 100644
index 0000000..5b727c2
--- /dev/null
+++ b/docs/changelog.d/+rewrite-observability-tutorial.doc.md
@@ -0,0 +1 @@
+Rewrite observability stack tutorial: replace Helm instructions with actual kustomize/ArgoCD patterns, fix typos, document Alloy as core component
diff --git a/docs/tutorials/replication/observability-stack.md b/docs/tutorials/replication/observability-stack.md
index db98683..d62731e 100644
--- a/docs/tutorials/replication/observability-stack.md
+++ b/docs/tutorials/replication/observability-stack.md
@@ -1,6 +1,7 @@
 ---
 title: Observability Stack
-modified: 2026-02-07
+modified: 2026-04-06
+last-reviewed: 2026-04-06
 tags:
   - tutorials
   - replication
@@ -10,12 +11,14 @@ tags:
 # Building the Observability Stack
 
 > **Audiences:** Replicator
+>
+> **Prerequisites:** [[kubernetes-bootstrap|Kubernetes Bootstrap]], [[argocd-config|ArgoCD Config]]
 
-This tutorial walks through deploying metrics, logs, and dashboards for your homelab - because you can't fix what you can't see.
+This tutorial walks through deploying metrics, logs, and dashboards for your homelab — because you can't fix what you can't see.
 
 ## The Stack
 
-A complete observability solution has three pillars:
+A complete observability solution has three pillars plus a collection layer:
 
 | Component | Purpose | BlumeOps Uses |
 |-----------|---------|---------------|
@@ -24,9 +27,11 @@ A complete observability solution has three pillars:
 | **Dashboards** | Visualization and alerting | [[grafana]] |
 | **Collection** | Gathering and forwarding data | [[alloy]] |
 
-For BlumeOps specifics, see [[observability|Observability Reference]].
+BlumeOps deploys all of these as plain kustomize manifests managed by ArgoCD — no Helm charts. See [[no-helm-policy]] for the rationale and [[observability]] for the full reference.
 
-## Step 1: Create Monitoring Namespace
+## Step 1: Create the Monitoring Namespace
+
+ArgoCD can create this namespace automatically when `CreateNamespace=true` is set under `syncPolicy.syncOptions` in the Application spec. If you're bootstrapping manually, create it yourself:
 
 ```bash
 kubectl create namespace monitoring
@@ -34,20 +39,46 @@ kubectl create namespace monitoring
 
 ## Step 2: Deploy Prometheus
 
-Prometheus collects and stores metrics.
+Prometheus collects and stores metrics. BlumeOps runs it as a StatefulSet with local persistent storage.
 
-### Using Helm
+### Write the Manifests
 
-```bash
-helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
-helm install prometheus prometheus-community/prometheus \
-  --namespace monitoring \
-  --set server.persistentVolume.size=10Gi
+Create `argocd/manifests/prometheus/` with:
+
+- **`kustomization.yaml`** — references the manifests and patches the container image
+- **`statefulset.yaml`** — a single-replica StatefulSet with a 20Gi PVC for `/prometheus`
+- **`configmap.yaml`** — the `prometheus.yml` scrape configuration
+- **`service.yaml`** — exposes port 9090 within the cluster
+
+Key StatefulSet settings:
+
+```yaml
+args:
+  - "--config.file=/etc/prometheus/prometheus.yml"
+  - "--storage.tsdb.retention.time=3650d"
+  - "--web.enable-remote-write-receiver"
+  - "--web.enable-lifecycle"
 ```
 
-### Or via ArgoCD
+The `--web.enable-remote-write-receiver` flag is essential: it makes Prometheus accept pushed samples, which is how [[alloy]] delivers metrics from both the host and in-cluster collectors.
+
+### Tag the Image
+
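+Pull the image from your local container registry. With the `:kustomized` sentinel pattern, the StatefulSet references the image with a placeholder `:kustomized` tag, and the `images` entry in `kustomization.yaml` substitutes the real tag at build time: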
+
+```yaml
+# kustomization.yaml
+images:
+  - name: registry.ops.eblu.me/blumeops/prometheus
+    newTag: v3.10.0-abcdef0
+```
+
+See [[build-container-image]] for how to build and tag images.
+
+### Create the ArgoCD Application
+
+Add `argocd/apps/prometheus.yaml`:
 
-Create an Application pointing to a values file in your repo:
 ```yaml
 apiVersion: argoproj.io/v1alpha1
 kind: Application
@@ -57,17 +88,15 @@ metadata:
 spec:
   project: default
   source:
-    repoURL: https://prometheus-community.github.io/helm-charts
-    chart: prometheus
-    targetRevision: 25.0.0
-    helm:
-      values: |
-        server:
-          persistentVolume:
-            size: 10Gi
+    repoURL: ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git
+    path: argocd/manifests/prometheus
+    targetRevision: main
   destination:
     server: https://kubernetes.default.svc
     namespace: monitoring
+  syncPolicy:
+    syncOptions:
+      - CreateNamespace=true
 ```
 
 ### Verify
@@ -78,155 +107,133 @@ kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus
 
 ## Step 3: Deploy Loki
 
-Loki aggregates logs (like Prometheus but for logs).
+Loki aggregates logs — think Prometheus, but for log lines instead of metrics.
 
-```bash
-helm repo add grafana https://grafana.github.io/helm-charts
-helm install loki grafana/loki-stack \
-  --namespace monitoring \
-  --set loki.persistence.enabled=true \
-  --set loki.persistence.size=10Gi
-```
+### Write the Manifests
 
-This also installs Promtail for log collection from pods.
+Create `argocd/manifests/loki/` with a StatefulSet, ConfigMap, and Service similar to Prometheus. Loki listens on port 3100 (HTTP) and 9096 (gRPC).
+
+The config file (`loki-config.yaml`) defines storage, compaction, and retention. For a homelab, single-binary mode with local filesystem storage works well; there is no need for S3 or distributed mode.
+
+### Create the ArgoCD Application
+
+Use the same pattern as the Prometheus Application: set `path` to `argocd/manifests/loki` and keep the destination namespace as `monitoring`.
 
 ## Step 4: Deploy Grafana
 
-Grafana provides dashboards and visualization.
+Grafana provides dashboards, visualization, and alerting.
 
-```bash
-helm install grafana grafana/grafana \
-  --namespace monitoring \
-  --set persistence.enabled=true \
-  --set persistence.size=1Gi \
-  --set adminPassword=admin  # Change this!
+### Write the Manifests
+
+Grafana has more moving parts than Prometheus or Loki:
+
+- **Deployment** with a PVC for `/var/lib/grafana`
+- **ConfigMap** containing `grafana.ini`, `datasources.yaml`, and `alerting.yaml`
+- **Dashboard ConfigMaps** labeled `grafana_dashboard: "1"` — a sidecar container watches for these and auto-loads them
+- **ExternalSecret** for the admin password (from 1Password via [[external-secrets]])
+
+Configure data sources declaratively in the ConfigMap:
+
+```yaml
+# datasources.yaml
+apiVersion: 1
+datasources:
+  - name: Prometheus
+    type: prometheus
+    url: http://prometheus.monitoring.svc:9090
+    isDefault: true
+  - name: Loki
+    type: loki
+    url: http://loki.monitoring.svc:3100
 ```
 
-### Configure Data Sources
+### Secrets
 
-After installation, add data sources in Grafana UI or via ConfigMap:
+Grafana's admin password and any OAuth credentials (for [[authentik]] SSO) should come from 1Password via ExternalSecret — never hardcode passwords in manifests. See [[external-secrets]] and [[security-model]].
+
+### Expose via Caddy
+
+BlumeOps exposes Grafana at `grafana.ops.eblu.me` through [[caddy]] on [[indri]], which reverse-proxies to the Kubernetes service via its Tailscale Ingress endpoint. This is the standard pattern for all services — see [[routing]] for details.
+
+## Step 5: Deploy Alloy
+
+Grafana Alloy is a unified telemetry collector that replaces multiple agents (Promtail, node_exporter, etc.). BlumeOps runs Alloy in **two places** — it is not optional; it's the glue that connects everything.
+
+### In-Cluster (DaemonSet)
+
+Create `argocd/manifests/alloy-k8s/` with:
+
+- **DaemonSet** — runs on every node, mounts `/var/log` read-only for pod log access
+- **ServiceAccount + RBAC** — needs pod list/watch for Kubernetes discovery
+- **ConfigMap** — the `config.alloy` file defining:
+  - Kubernetes pod log discovery and collection
+  - Service health probes (blackbox-style checks for key services)
+  - Remote write to Prometheus (`/api/v1/write`) and Loki (`/loki/api/v1/push`)
+
+The DaemonSet goes in a dedicated `alloy` namespace, separate from `monitoring`.
+
+### On the Host (Ansible)
+
+For metrics and logs from native services (Forgejo, Zot, Caddy, Borgmatic), Alloy runs directly on [[indri]] as a macOS LaunchAgent, managed by [[ansible]].
+
+The host Alloy collects:
+- System metrics via `prometheus.exporter.unix`
+- Logs from Homebrew services and LaunchAgents
+- Optional: PostgreSQL metrics, container registry metrics
+
+It pushes to the same Prometheus and Loki endpoints via `*.ops.eblu.me`.
+
+## What You Now Have
+
+- **Prometheus** storing metrics from all services, both scraped directly and pushed by Alloy via remote write
+- **Loki** aggregating logs from all pods and host services
+- **Grafana** with declarative dashboards and data sources
+- **Alloy** collecting from both Kubernetes and the host
+- A foundation for alerting via Grafana Unified Alerting
+
+## Adding Alerts
+
+BlumeOps uses Grafana Unified Alerting (not Prometheus Alertmanager). Alerts are defined declaratively in `alerting.yaml` within the Grafana ConfigMap. Notifications go to [[ntfy]] — a self-hosted push notification service.
+
+Example alert categories:
+- Service probe failures (is Grafana/Prometheus/Loki reachable?)
+- Pod readiness (are pods healthy?)
+- Metrics freshness (is data still flowing?)
+- Storage and resource thresholds
+
+See [[observability]] for the full alerting reference.
+
+## Adding Dashboards
+
+Import community dashboards or create custom ones. BlumeOps uses a sidecar pattern — any ConfigMap in the `monitoring` namespace with the label `grafana_dashboard: "1"` is automatically loaded by Grafana's sidecar container.
+
+Create dashboard ConfigMaps in `argocd/manifests/grafana-config/dashboards/`:
 
 ```yaml
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: grafana-datasources
-  namespace: monitoring
+  name: grafana-dashboard-my-service
   labels:
-    grafana_datasource: "1"
+    grafana_dashboard: "1"
 data:
-  datasources.yaml: |
-    apiVersion: 1
-    datasources:
-    - name: Prometheus
-      type: prometheus
-      url: http://prometheus-server.monitoring.svc:80
-      isDefault: true
-    - name: Loki
-      type: loki
-      url: http://loki.monitoring.svc:3100
+  my-service.json: |
+    { ... dashboard JSON ... }
 ```
 
-## Step 5: Access Grafana
-
-Expose via Tailscale:
-```bash
-kubectl -n monitoring port-forward svc/grafana 3000:80 &
-tailscale serve --bg --https 3000 http://localhost:3000
-```
-
-Or create an Ingress.
-
-Default credentials: `admin` / (password you set or retrieve from secret)
-
-## Step 6: Add Dashboards
-
-Import community dashboards from [grafana.com/grafana/dashboards](https://grafana.com/grafana/dashboards/):
-
-| Dashboard | ID | Shows |
-|-----------|-----|-------|
-| Node Exporter Full | 1860 | Host metrics |
-| Kubernetes Cluster | 7249 | Cluster overview |
-| Loki Logs | 13639 | Log exploration |
-
-In Grafana: Dashboards > Import > Enter ID
-
-## Step 7: Deploy Alloy (Optional)
-
-Grafana Alloy is a unified collector that replaces multiple agents (Promtail, node_exporter, etc.).
-
-```yaml
-apiVersion: argoproj.io/v1alpha1
-kind: Application
-metadata:
-  name: alloy
-  namespace: argocd
-spec:
-  project: default
-  source:
-    repoURL: https://grafana.github.io/helm-charts
-    chart: alloy
-    targetRevision: 0.1.0
-    helm:
-      values: |
-        alloy:
-          configMap:
-            content: |
-              // Alloy configuration here
-  destination:
-    server: https://kubernetes.default.svc
-    namespace: monitoring
-```
-
-BluemeOps uses Alloy on both [[indri]] (for host metrics, via [[ansible|Ansible role]]) and in the [[cluster]] (for pod logs and service probes).
-
-## What You Now Have
-
-- Metrics collection and storage (Prometheus)
-- Log aggregation (Loki)
-- Dashboards and visualization (Grafana)
-- Foundation for alerting
-
-## Adding Alerts
-
-Configure alerting rules in Prometheus:
-
-```yaml
-groups:
-- name: example
-  rules:
-  - alert: HighMemoryUsage
-    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
-    for: 5m
-    labels:
-      severity: warning
-    annotations:
-      summary: "High memory usage detected"
-```
-
-And notification channels in Grafana (email, Slack, PagerDuty, etc.).
-
 ## Next Steps
 
+- Set up [[authentik]] SSO for Grafana login (see [[federated-login]])
 - Create custom dashboards for your services
-- Set up alerting for critical conditions
+- Configure alerting rules and notification channels
 - Add service-specific metrics exporters
 
-## BluemeOps Specifics
+## Related
 
-BlumeOps' observability setup includes:
-- Prometheus scraping all services via annotations
-- Loki collecting logs from all pods and [[indri]] services
-- Custom dashboards for [[jellyfin]], [[teslamate]], and cluster health
-- [[alloy]] running on both host and in-cluster
-
-See [[observability|Observability Reference]] for full details.
-
-## Troubleshooting
-
-| Problem | Solution |
-|---------|----------|
-| No metrics appearing | Check Prometheus targets (`/targets` endpoint) |
-| No logs in Loki | Verify Promtail/Alloy is collecting (`/ready` endpoint) |
-| Dashboard shows no data | Check data source configuration and time range |
-| High storage usage | Adjust retention settings in Prometheus/Loki |
+- [[observability]] — Full observability reference
+- [[no-helm-policy]] — Why kustomize instead of Helm
+- [[alloy]] — Alloy collector reference
+- [[prometheus]] — Prometheus reference
+- [[loki]] — Loki reference
+- [[grafana]] — Grafana reference
+- [[routing]] — Service routing and exposure