From 0eaf8680fd5c953a5e101cf773624c76207f91db Mon Sep 17 00:00:00 2001
From: Erich Blume
Date: Mon, 6 Apr 2026 07:52:35 -0700
Subject: [PATCH] Rewrite observability stack tutorial to match actual practices

Replace generic Helm install instructions with kustomize/ArgoCD patterns
that reflect how BlumeOps actually deploys Prometheus, Loki, Grafana, and
Alloy. Fix "BluemeOps" typos, document Alloy as a core (not optional)
component, remove hardcoded admin password, add proper prerequisites and
cross-references.

Co-Authored-By: Claude Opus 4.6 (1M context)
---
 .../+rewrite-observability-tutorial.doc.md         |   1 +
 .../replication/observability-stack.md             | 307 +++++++++---------
 2 files changed, 158 insertions(+), 150 deletions(-)
 create mode 100644 docs/changelog.d/+rewrite-observability-tutorial.doc.md

diff --git a/docs/changelog.d/+rewrite-observability-tutorial.doc.md b/docs/changelog.d/+rewrite-observability-tutorial.doc.md
new file mode 100644
index 0000000..5b727c2
--- /dev/null
+++ b/docs/changelog.d/+rewrite-observability-tutorial.doc.md
@@ -0,0 +1 @@
+Rewrite observability stack tutorial: replace Helm instructions with actual kustomize/ArgoCD patterns, fix typos, document Alloy as core component
diff --git a/docs/tutorials/replication/observability-stack.md b/docs/tutorials/replication/observability-stack.md
index db98683..d62731e 100644
--- a/docs/tutorials/replication/observability-stack.md
+++ b/docs/tutorials/replication/observability-stack.md
@@ -1,6 +1,7 @@
 ---
 title: Observability Stack
-modified: 2026-02-07
+modified: 2026-04-06
+last-reviewed: 2026-04-06
 tags:
   - tutorials
   - replication
@@ -10,12 +11,14 @@ tags:
 # Building the Observability Stack

 > **Audiences:** Replicator
+>
+> **Prerequisites:** [[kubernetes-bootstrap|Kubernetes Bootstrap]], [[argocd-config|ArgoCD Config]]

-This tutorial walks through deploying metrics, logs, and dashboards for your homelab - because you can't fix what you can't see.
+This tutorial walks through deploying metrics, logs, and dashboards for your homelab — because you can't fix what you can't see.

 ## The Stack

-A complete observability solution has three pillars:
+A complete observability solution has three pillars plus a collection layer:

 | Component | Purpose | BlumeOps Uses |
 |-----------|---------|---------------|
@@ -24,9 +27,11 @@ A complete observability solution has three pillars:
 | **Dashboards** | Visualization and alerting | [[grafana]] |
 | **Collection** | Gathering and forwarding data | [[alloy]] |

-For BlumeOps specifics, see [[observability|Observability Reference]].
+BlumeOps deploys all of these as plain kustomize manifests managed by ArgoCD — no Helm charts. See [[no-helm-policy]] for the rationale and [[observability]] for the full reference.

-## Step 1: Create Monitoring Namespace
+## Step 1: Create the Monitoring Namespace
+
+ArgoCD can create this automatically via `CreateNamespace=true` in the Application spec, but if you're bootstrapping manually:

 ```bash
 kubectl create namespace monitoring
@@ -34,20 +39,46 @@ kubectl create namespace monitoring

 ## Step 2: Deploy Prometheus

-Prometheus collects and stores metrics.
+Prometheus collects and stores metrics. BlumeOps runs it as a StatefulSet with local persistent storage.

-### Using Helm
+### Write the Manifests

-```bash
-helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
-helm install prometheus prometheus-community/prometheus \
-  --namespace monitoring \
-  --set server.persistentVolume.size=10Gi
+Create `argocd/manifests/prometheus/` with:
+
+- **`kustomization.yaml`** — references the manifests and patches the container image
+- **`statefulset.yaml`** — a single-replica StatefulSet with a 20Gi PVC for `/prometheus`
+- **`configmap.yaml`** — the `prometheus.yml` scrape configuration
+- **`service.yaml`** — exposes port 9090 within the cluster
+
+Key StatefulSet settings:
+
+```yaml
+args:
+  - "--config.file=/etc/prometheus/prometheus.yml"
+  - "--storage.tsdb.retention.time=3650d"
+  - "--web.enable-remote-write-receiver"
+  - "--web.enable-lifecycle"
 ```

-### Or via ArgoCD
+The remote-write-receiver flag is important — it lets [[alloy]] push metrics into Prometheus from both the host and in-cluster collectors.
+
+### Tag the Image
+
+Use your local container registry and the `:kustomized` sentinel pattern:
+
+```yaml
+# kustomization.yaml
+images:
+  - name: registry.ops.eblu.me/blumeops/prometheus
+    newTag: v3.10.0-abcdef0
+```
+
+See [[build-container-image]] for how to build and tag images.
+
+### Create the ArgoCD Application
+
+Add `argocd/apps/prometheus.yaml`:

-Create an Application pointing to a values file in your repo:
 ```yaml
 apiVersion: argoproj.io/v1alpha1
 kind: Application
@@ -57,17 +88,15 @@ metadata:
 spec:
   project: default
   source:
-    repoURL: https://prometheus-community.github.io/helm-charts
-    chart: prometheus
-    targetRevision: 25.0.0
-    helm:
-      values: |
-        server:
-          persistentVolume:
-            size: 10Gi
+    repoURL: ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git
+    path: argocd/manifests/prometheus
+    targetRevision: main
   destination:
     server: https://kubernetes.default.svc
     namespace: monitoring
+  syncPolicy:
+    syncOptions:
+      - CreateNamespace=true
 ```

 ### Verify
@@ -78,155 +107,133 @@ kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus

 ## Step 3: Deploy Loki

-Loki aggregates logs (like Prometheus but for logs).
+Loki aggregates logs — think Prometheus, but for log lines instead of metrics.

-```bash
-helm repo add grafana https://grafana.github.io/helm-charts
-helm install loki grafana/loki-stack \
-  --namespace monitoring \
-  --set loki.persistence.enabled=true \
-  --set loki.persistence.size=10Gi
-```
+### Write the Manifests

-This also installs Promtail for log collection from pods.
+Create `argocd/manifests/loki/` with a StatefulSet, ConfigMap, and Service similar to Prometheus. Loki listens on port 3100 (HTTP) and 9096 (gRPC).
+
+The config file (`loki-config.yaml`) defines storage, compaction, and retention. For a homelab, a simple single-binary mode with local filesystem storage works well — no need for S3 or distributed mode.
+
+### Create the ArgoCD Application
+
+Same pattern as Prometheus — point to `argocd/manifests/loki`, target `monitoring` namespace.

 ## Step 4: Deploy Grafana

-Grafana provides dashboards and visualization.
+Grafana provides dashboards, visualization, and alerting.

-```bash
-helm install grafana grafana/grafana \
-  --namespace monitoring \
-  --set persistence.enabled=true \
-  --set persistence.size=1Gi \
-  --set adminPassword=admin # Change this!
+### Write the Manifests
+
+Grafana has more moving parts than Prometheus or Loki:
+
+- **Deployment** with a PVC for `/var/lib/grafana`
+- **ConfigMap** containing `grafana.ini`, `datasources.yaml`, and `alerting.yaml`
+- **Dashboard ConfigMaps** labeled `grafana_dashboard: "1"` — a sidecar container watches for these and auto-loads them
+- **ExternalSecret** for the admin password (from 1Password via [[external-secrets]])
+
+Configure data sources declaratively in the ConfigMap:
+
+```yaml
+# datasources.yaml
+apiVersion: 1
+datasources:
+  - name: Prometheus
+    type: prometheus
+    url: http://prometheus.monitoring.svc:9090
+    isDefault: true
+  - name: Loki
+    type: loki
+    url: http://loki.monitoring.svc:3100
 ```

-### Configure Data Sources
+### Secrets

-After installation, add data sources in Grafana UI or via ConfigMap:
+Grafana's admin password and any OAuth credentials (for [[authentik]] SSO) should come from 1Password via ExternalSecret — never hardcode passwords in manifests. See [[external-secrets]] and [[security-model]].
+
+### Expose via Caddy
+
+BlumeOps exposes Grafana at `grafana.ops.eblu.me` through [[caddy]] on [[indri]], which reverse-proxies to the Kubernetes service via its Tailscale Ingress endpoint. This is the standard pattern for all services — see [[routing]] for details.
+
+## Step 5: Deploy Alloy
+
+Grafana Alloy is a unified telemetry collector that replaces multiple agents (Promtail, node_exporter, etc.). BlumeOps runs Alloy in **two places** — it is not optional; it's the glue that connects everything.
+
+### In-Cluster (DaemonSet)
+
+Create `argocd/manifests/alloy-k8s/` with:
+
+- **DaemonSet** — runs on every node, mounts `/var/log` read-only for pod log access
+- **ServiceAccount + RBAC** — needs pod list/watch for Kubernetes discovery
+- **ConfigMap** — the `config.alloy` file defining:
+  - Kubernetes pod log discovery and collection
+  - Service health probes (blackbox-style checks for key services)
+  - Remote write to Prometheus (`/api/v1/write`) and Loki (`/loki/api/v1/push`)
+
+The DaemonSet goes in a dedicated `alloy` namespace, separate from `monitoring`.
+
+### On the Host (Ansible)
+
+For metrics and logs from native services (Forgejo, Zot, Caddy, Borgmatic), Alloy runs directly on [[indri]] as a macOS LaunchAgent, managed by [[ansible]].
+
+The host Alloy collects:
+- System metrics via `prometheus.exporter.unix`
+- Logs from Homebrew services and LaunchAgents
+- Optional: PostgreSQL metrics, container registry metrics
+
+It pushes to the same Prometheus and Loki endpoints via `*.ops.eblu.me`.
+
+## What You Now Have
+
+- **Prometheus** scraping metrics from all services
+- **Loki** aggregating logs from all pods and host services
+- **Grafana** with declarative dashboards and data sources
+- **Alloy** collecting from both Kubernetes and the host
+- A foundation for alerting via Grafana Unified Alerting
+
+## Adding Alerts
+
+BlumeOps uses Grafana Unified Alerting (not Prometheus Alertmanager). Alerts are defined declaratively in `alerting.yaml` within the Grafana ConfigMap. Notifications go to [[ntfy]] — a self-hosted push notification service.
+
+Example alert categories:
+- Service probe failures (is Grafana/Prometheus/Loki reachable?)
+- Pod readiness (are pods healthy?)
+- Metrics freshness (is data still flowing?)
+- Storage and resource thresholds
+
+See [[observability]] for the full alerting reference.
+
+## Adding Dashboards
+
+Import community dashboards or create custom ones. BlumeOps uses a sidecar pattern — any ConfigMap in the `monitoring` namespace with the label `grafana_dashboard: "1"` is automatically loaded by Grafana's sidecar container.
+
+Create dashboard ConfigMaps in `argocd/manifests/grafana-config/dashboards/`:

 ```yaml
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  name: grafana-datasources
-  namespace: monitoring
+  name: grafana-dashboard-my-service
   labels:
-    grafana_datasource: "1"
+    grafana_dashboard: "1"
 data:
-  datasources.yaml: |
-    apiVersion: 1
-    datasources:
-      - name: Prometheus
-        type: prometheus
-        url: http://prometheus-server.monitoring.svc:80
-        isDefault: true
-      - name: Loki
-        type: loki
-        url: http://loki.monitoring.svc:3100
+  my-service.json: |
+    { ... dashboard JSON ... }
 ```

-## Step 5: Access Grafana
-
-Expose via Tailscale:
-```bash
-kubectl -n monitoring port-forward svc/grafana 3000:80 &
-tailscale serve --bg --https 3000 http://localhost:3000
-```
-
-Or create an Ingress.
-
-Default credentials: `admin` / (password you set or retrieve from secret)
-
-## Step 6: Add Dashboards
-
-Import community dashboards from [grafana.com/grafana/dashboards](https://grafana.com/grafana/dashboards/):
-
-| Dashboard | ID | Shows |
-|-----------|-----|-------|
-| Node Exporter Full | 1860 | Host metrics |
-| Kubernetes Cluster | 7249 | Cluster overview |
-| Loki Logs | 13639 | Log exploration |
-
-In Grafana: Dashboards > Import > Enter ID
-
-## Step 7: Deploy Alloy (Optional)
-
-Grafana Alloy is a unified collector that replaces multiple agents (Promtail, node_exporter, etc.).
-
-```yaml
-apiVersion: argoproj.io/v1alpha1
-kind: Application
-metadata:
-  name: alloy
-  namespace: argocd
-spec:
-  project: default
-  source:
-    repoURL: https://grafana.github.io/helm-charts
-    chart: alloy
-    targetRevision: 0.1.0
-    helm:
-      values: |
-        alloy:
-          configMap:
-            content: |
-              // Alloy configuration here
-  destination:
-    server: https://kubernetes.default.svc
-    namespace: monitoring
-```
-
-BluemeOps uses Alloy on both [[indri]] (for host metrics, via [[ansible|Ansible role]]) and in the [[cluster]] (for pod logs and service probes).
-
-## What You Now Have
-
-- Metrics collection and storage (Prometheus)
-- Log aggregation (Loki)
-- Dashboards and visualization (Grafana)
-- Foundation for alerting
-
-## Adding Alerts
-
-Configure alerting rules in Prometheus:
-
-```yaml
-groups:
-- name: example
-  rules:
-  - alert: HighMemoryUsage
-    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
-    for: 5m
-    labels:
-      severity: warning
-    annotations:
-      summary: "High memory usage detected"
-```
-
-And notification channels in Grafana (email, Slack, PagerDuty, etc.).
-
 ## Next Steps

+- Set up [[authentik]] SSO for Grafana login (see [[federated-login]])
 - Create custom dashboards for your services
-- Set up alerting for critical conditions
+- Configure alerting rules and notification channels
 - Add service-specific metrics exporters

-## BluemeOps Specifics
+## Related

-BlumeOps' observability setup includes:
-- Prometheus scraping all services via annotations
-- Loki collecting logs from all pods and [[indri]] services
-- Custom dashboards for [[jellyfin]], [[teslamate]], and cluster health
-- [[alloy]] running on both host and in-cluster
-
-See [[observability|Observability Reference]] for full details.
-
-## Troubleshooting
-
-| Problem | Solution |
-|---------|----------|
-| No metrics appearing | Check Prometheus targets (`/targets` endpoint) |
-| No logs in Loki | Verify Promtail/Alloy is collecting (`/ready` endpoint) |
-| Dashboard shows no data | Check data source configuration and time range |
-| High storage usage | Adjust retention settings in Prometheus/Loki |
+- [[observability]] — Full observability reference
+- [[no-helm-policy]] — Why kustomize instead of Helm
+- [[alloy]] — Alloy collector reference
+- [[prometheus]] — Prometheus reference
+- [[loki]] — Loki reference
+- [[grafana]] — Grafana reference
+- [[routing]] — Service routing and exposure