Rewrite observability stack tutorial to match actual practices
Replace generic Helm install instructions with kustomize/ArgoCD patterns that reflect how BlumeOps actually deploys Prometheus, Loki, Grafana, and Alloy. Fix "BluemeOps" typos, document Alloy as a core (not optional) component, remove hardcoded admin password, add proper prerequisites and cross-references.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
parent f42fa2d558
commit 0eaf8680fd
2 changed files with 143 additions and 135 deletions
1
docs/changelog.d/+rewrite-observability-tutorial.doc.md
Normal file
|
|
@@ -0,0 +1 @@
|
||||||
|
Rewrite observability stack tutorial: replace Helm instructions with actual kustomize/ArgoCD patterns, fix typos, document Alloy as core component
|
||||||
|
|
@@ -1,6 +1,7 @@
|
||||||
---
|
---
|
||||||
title: Observability Stack
|
title: Observability Stack
|
||||||
modified: 2026-02-07
|
modified: 2026-04-06
|
||||||
|
last-reviewed: 2026-04-06
|
||||||
tags:
|
tags:
|
||||||
- tutorials
|
- tutorials
|
||||||
- replication
|
- replication
|
||||||
|
|
@@ -10,12 +11,14 @@ tags:
|
||||||
# Building the Observability Stack
|
# Building the Observability Stack
|
||||||
|
|
||||||
> **Audiences:** Replicator
|
> **Audiences:** Replicator
|
||||||
|
>
|
||||||
|
> **Prerequisites:** [[kubernetes-bootstrap|Kubernetes Bootstrap]], [[argocd-config|ArgoCD Config]]
|
||||||
|
|
||||||
This tutorial walks through deploying metrics, logs, and dashboards for your homelab - because you can't fix what you can't see.
|
This tutorial walks through deploying metrics, logs, and dashboards for your homelab — because you can't fix what you can't see.
|
||||||
|
|
||||||
## The Stack
|
## The Stack
|
||||||
|
|
||||||
A complete observability solution has three pillars:
|
A complete observability solution has three pillars plus a collection layer:
|
||||||
|
|
||||||
| Component | Purpose | BlumeOps Uses |
|
| Component | Purpose | BlumeOps Uses |
|
||||||
|-----------|---------|---------------|
|
|-----------|---------|---------------|
|
||||||
|
|
@@ -24,9 +27,11 @@ A complete observability solution has three pillars:
|
||||||
| **Dashboards** | Visualization and alerting | [[grafana]] |
|
| **Dashboards** | Visualization and alerting | [[grafana]] |
|
||||||
| **Collection** | Gathering and forwarding data | [[alloy]] |
|
| **Collection** | Gathering and forwarding data | [[alloy]] |
|
||||||
|
|
||||||
For BlumeOps specifics, see [[observability|Observability Reference]].
|
BlumeOps deploys all of these as plain kustomize manifests managed by ArgoCD — no Helm charts. See [[no-helm-policy]] for the rationale and [[observability]] for the full reference.
|
||||||
|
|
||||||
## Step 1: Create Monitoring Namespace
|
## Step 1: Create the Monitoring Namespace
|
||||||
|
|
||||||
|
ArgoCD can create this automatically via `CreateNamespace=true` in the Application spec, but if you're bootstrapping manually:
|
||||||
|
|
||||||
```bash
|
```bash
|
||||||
kubectl create namespace monitoring
|
kubectl create namespace monitoring
|
||||||
|
|
@@ -34,20 +39,46 @@ kubectl create namespace monitoring
|
||||||
|
|
||||||
## Step 2: Deploy Prometheus
|
## Step 2: Deploy Prometheus
|
||||||
|
|
||||||
Prometheus collects and stores metrics.
|
Prometheus collects and stores metrics. BlumeOps runs it as a StatefulSet with local persistent storage.
|
||||||
|
|
||||||
### Using Helm
|
### Write the Manifests
|
||||||
|
|
||||||
```bash
|
Create `argocd/manifests/prometheus/` with:
|
||||||
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
|
||||||
helm install prometheus prometheus-community/prometheus \
|
- **`kustomization.yaml`** — references the manifests and patches the container image
|
||||||
--namespace monitoring \
|
- **`statefulset.yaml`** — a single-replica StatefulSet with a 20Gi PVC for `/prometheus`
|
||||||
--set server.persistentVolume.size=10Gi
|
- **`configmap.yaml`** — the `prometheus.yml` scrape configuration
|
||||||
|
- **`service.yaml`** — exposes port 9090 within the cluster
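
As a minimal sketch of how these files could be tied together, the `kustomization.yaml` might look like the following. The `namespace` field and the exact resource ordering are assumptions for illustration, not taken from the BlumeOps repository; the image override is omitted here.

```yaml
# argocd/manifests/prometheus/kustomization.yaml (illustrative sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Assumption: every resource below is placed in the monitoring namespace.
namespace: monitoring

# The three manifests described in the list above.
resources:
  - statefulset.yaml
  - configmap.yaml
  - service.yaml
```

Running `kubectl kustomize argocd/manifests/prometheus` locally renders the combined output, which is a quick way to check the manifests before ArgoCD syncs them.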
|
||||||
|
|
||||||
|
Key StatefulSet settings:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
args:
|
||||||
|
- "--config.file=/etc/prometheus/prometheus.yml"
|
||||||
|
- "--storage.tsdb.retention.time=3650d"
|
||||||
|
- "--web.enable-remote-write-receiver"
|
||||||
|
- "--web.enable-lifecycle"
|
||||||
```
|
```
|
||||||
|
|
||||||
### Or via ArgoCD
|
The remote-write-receiver flag is important — it lets [[alloy]] push metrics into Prometheus from both the host and in-cluster collectors.
|
||||||
|
|
||||||
|
### Tag the Image
|
||||||
|
|
||||||
|
Use your local container registry and the `:kustomized` sentinel pattern:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# kustomization.yaml
|
||||||
|
images:
|
||||||
|
- name: registry.ops.eblu.me/blumeops/prometheus
|
||||||
|
newTag: v3.10.0-abcdef0
|
||||||
|
```
|
||||||
|
|
||||||
|
See [[build-container-image]] for how to build and tag images.
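
One plausible reading of the `:kustomized` sentinel pattern, sketched under the assumption that the workload manifest carries a placeholder tag which kustomize rewrites at build time. The container name and the surrounding layout are hypothetical.

```yaml
# statefulset.yaml (excerpt, assumed layout)
spec:
  template:
    spec:
      containers:
        - name: prometheus
          # Placeholder tag. The matching entry under images: in
          # kustomization.yaml replaces it with the real newTag, so an
          # unrendered manifest can never pull a working image by accident.
          image: registry.ops.eblu.me/blumeops/prometheus:kustomized
```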
|
||||||
|
|
||||||
|
### Create the ArgoCD Application
|
||||||
|
|
||||||
|
Add `argocd/apps/prometheus.yaml`:
|
||||||
|
|
||||||
Create an Application pointing to a values file in your repo:
|
|
||||||
```yaml
|
```yaml
|
||||||
apiVersion: argoproj.io/v1alpha1
|
apiVersion: argoproj.io/v1alpha1
|
||||||
kind: Application
|
kind: Application
|
||||||
|
|
@@ -57,17 +88,15 @@ metadata:
|
||||||
spec:
|
spec:
|
||||||
project: default
|
project: default
|
||||||
source:
|
source:
|
||||||
repoURL: https://prometheus-community.github.io/helm-charts
|
repoURL: ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git
|
||||||
chart: prometheus
|
path: argocd/manifests/prometheus
|
||||||
targetRevision: 25.0.0
|
targetRevision: main
|
||||||
helm:
|
|
||||||
values: |
|
|
||||||
server:
|
|
||||||
persistentVolume:
|
|
||||||
size: 10Gi
|
|
||||||
destination:
|
destination:
|
||||||
server: https://kubernetes.default.svc
|
server: https://kubernetes.default.svc
|
||||||
namespace: monitoring
|
namespace: monitoring
|
||||||
|
syncPolicy:
|
||||||
|
syncOptions:
|
||||||
|
- CreateNamespace=true
|
||||||
```
|
```
|
||||||
|
|
||||||
### Verify
|
### Verify
|
||||||
|
|
@@ -78,155 +107,133 @@ kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus
|
||||||
|
|
||||||
## Step 3: Deploy Loki
|
## Step 3: Deploy Loki
|
||||||
|
|
||||||
Loki aggregates logs (like Prometheus but for logs).
|
Loki aggregates logs — think Prometheus, but for log lines instead of metrics.
|
||||||
|
|
||||||
```bash
|
### Write the Manifests
|
||||||
helm repo add grafana https://grafana.github.io/helm-charts
|
|
||||||
helm install loki grafana/loki-stack \
|
|
||||||
--namespace monitoring \
|
|
||||||
--set loki.persistence.enabled=true \
|
|
||||||
--set loki.persistence.size=10Gi
|
|
||||||
```
|
|
||||||
|
|
||||||
This also installs Promtail for log collection from pods.
|
Create `argocd/manifests/loki/` with a StatefulSet, ConfigMap, and Service similar to Prometheus. Loki listens on port 3100 (HTTP) and 9096 (gRPC).
|
||||||
|
|
||||||
|
The config file (`loki-config.yaml`) defines storage, compaction, and retention. For a homelab, a simple single-binary mode with local filesystem storage works well — no need for S3 or distributed mode.
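
A minimal sketch of what a single-binary, filesystem-backed `loki-config.yaml` could contain. The paths under `/loki`, the schema start date, and the 31-day retention are assumptions chosen for illustration, not the BlumeOps values; the key names follow the Loki 3.x configuration format.

```yaml
# loki-config.yaml (illustrative sketch, Loki 3.x key names)
auth_enabled: false            # single-tenant homelab setup

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /loki           # assumed mount point of the PVC
  replication_factor: 1        # single binary, no replicas
  ring:
    kvstore:
      store: inmemory
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules

schema_config:
  configs:
    - from: "2024-01-01"       # hypothetical start date
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

compactor:
  working_directory: /loki/compactor
  retention_enabled: true
  delete_request_store: filesystem

limits_config:
  retention_period: 744h       # assumed 31-day retention
```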
|
||||||
|
|
||||||
|
### Create the ArgoCD Application
|
||||||
|
|
||||||
|
Same pattern as Prometheus — point to `argocd/manifests/loki`, target `monitoring` namespace.
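
Spelled out, the Loki Application could look like the sketch below. It mirrors the Prometheus Application from Step 2; the `metadata` values are assumptions, since that part of the Prometheus manifest is not shown here.

```yaml
# argocd/apps/loki.yaml (illustrative sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki            # assumed name
  namespace: argocd     # assumed ArgoCD namespace
spec:
  project: default
  source:
    repoURL: ssh://forgejo@forge.ops.eblu.me:2222/eblume/blumeops.git
    path: argocd/manifests/loki
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    syncOptions:
      - CreateNamespace=true
```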
|
||||||
|
|
||||||
## Step 4: Deploy Grafana
|
## Step 4: Deploy Grafana
|
||||||
|
|
||||||
Grafana provides dashboards and visualization.
|
Grafana provides dashboards, visualization, and alerting.
|
||||||
|
|
||||||
```bash
|
### Write the Manifests
|
||||||
helm install grafana grafana/grafana \
|
|
||||||
--namespace monitoring \
|
Grafana has more moving parts than Prometheus or Loki:
|
||||||
--set persistence.enabled=true \
|
|
||||||
--set persistence.size=1Gi \
|
- **Deployment** with a PVC for `/var/lib/grafana`
|
||||||
--set adminPassword=admin # Change this!
|
- **ConfigMap** containing `grafana.ini`, `datasources.yaml`, and `alerting.yaml`
|
||||||
|
- **Dashboard ConfigMaps** labeled `grafana_dashboard: "1"` — a sidecar container watches for these and auto-loads them
|
||||||
|
- **ExternalSecret** for the admin password (from 1Password via [[external-secrets]])
|
||||||
|
|
||||||
|
Configure data sources declaratively in the ConfigMap:
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
# datasources.yaml
|
||||||
|
apiVersion: 1
|
||||||
|
datasources:
|
||||||
|
- name: Prometheus
|
||||||
|
type: prometheus
|
||||||
|
url: http://prometheus.monitoring.svc:9090
|
||||||
|
isDefault: true
|
||||||
|
- name: Loki
|
||||||
|
type: loki
|
||||||
|
url: http://loki.monitoring.svc:3100
|
||||||
```
|
```
|
||||||
|
|
||||||
### Configure Data Sources
|
### Secrets
|
||||||
|
|
||||||
After installation, add data sources in Grafana UI or via ConfigMap:
|
Grafana's admin password and any OAuth credentials (for [[authentik]] SSO) should come from 1Password via ExternalSecret — never hardcode passwords in manifests. See [[external-secrets]] and [[security-model]].
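
A sketch of what such an ExternalSecret might look like. The store name `onepassword`, the item key `grafana`, and the property `admin-password` are hypothetical placeholders; substitute whatever your secret store actually exposes.

```yaml
# grafana admin credentials (illustrative sketch)
# Older External Secrets Operator releases serve this resource as
# external-secrets.io/v1beta1 instead of v1.
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: onepassword          # hypothetical store name
  target:
    name: grafana-admin        # Kubernetes Secret created by the operator
  data:
    - secretKey: admin-password
      remoteRef:
        key: grafana           # hypothetical 1Password item
        property: admin-password
```

The Grafana Deployment can then read the resulting Secret through an environment variable such as `GF_SECURITY_ADMIN_PASSWORD`, so the password never appears in the repository.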
|
||||||
|
|
||||||
|
### Expose via Caddy
|
||||||
|
|
||||||
|
BlumeOps exposes Grafana at `grafana.ops.eblu.me` through [[caddy]] on [[indri]], which reverse-proxies to the Kubernetes service via its Tailscale Ingress endpoint. This is the standard pattern for all services — see [[routing]] for details.
|
||||||
|
|
||||||
|
## Step 5: Deploy Alloy
|
||||||
|
|
||||||
|
Grafana Alloy is a unified telemetry collector that replaces multiple agents (Promtail, node_exporter, etc.). BlumeOps runs Alloy in **two places** — it is not optional; it's the glue that connects everything.
|
||||||
|
|
||||||
|
### In-Cluster (DaemonSet)
|
||||||
|
|
||||||
|
Create `argocd/manifests/alloy-k8s/` with:
|
||||||
|
|
||||||
|
- **DaemonSet** — runs on every node, mounts `/var/log` read-only for pod log access
|
||||||
|
- **ServiceAccount + RBAC** — needs pod list/watch for Kubernetes discovery
|
||||||
|
- **ConfigMap** — the `config.alloy` file defining:
|
||||||
|
- Kubernetes pod log discovery and collection
|
||||||
|
- Service health probes (blackbox-style checks for key services)
|
||||||
|
- Remote write to Prometheus (`/api/v1/write`) and Loki (`/loki/api/v1/push`)
|
||||||
|
|
||||||
|
The DaemonSet goes in a dedicated `alloy` namespace, separate from `monitoring`.
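
The RBAC piece can be sketched roughly as follows, assuming a ServiceAccount named `alloy` in the `alloy` namespace. The resource names are hypothetical, and a real deployment may need additional resources (for example nodes or services) depending on which discovery roles the Alloy config uses.

```yaml
# alloy-k8s RBAC (illustrative sketch)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alloy
  namespace: alloy
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: alloy
rules:
  # Pod discovery needs read access to pods across the cluster.
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: alloy
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: alloy
subjects:
  - kind: ServiceAccount
    name: alloy
    namespace: alloy
```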
|
||||||
|
|
||||||
|
### On the Host (Ansible)
|
||||||
|
|
||||||
|
For metrics and logs from native services (Forgejo, Zot, Caddy, Borgmatic), Alloy runs directly on [[indri]] as a macOS LaunchAgent, managed by [[ansible]].
|
||||||
|
|
||||||
|
The host Alloy collects:
|
||||||
|
- System metrics via `prometheus.exporter.unix`
|
||||||
|
- Logs from Homebrew services and LaunchAgents
|
||||||
|
- Optional: PostgreSQL metrics, container registry metrics
|
||||||
|
|
||||||
|
It pushes to the same Prometheus and Loki endpoints via `*.ops.eblu.me`.
|
||||||
|
|
||||||
|
## What You Now Have
|
||||||
|
|
||||||
|
- **Prometheus** scraping metrics from all services
|
||||||
|
- **Loki** aggregating logs from all pods and host services
|
||||||
|
- **Grafana** with declarative dashboards and data sources
|
||||||
|
- **Alloy** collecting from both Kubernetes and the host
|
||||||
|
- A foundation for alerting via Grafana Unified Alerting
|
||||||
|
|
||||||
|
## Adding Alerts
|
||||||
|
|
||||||
|
BlumeOps uses Grafana Unified Alerting (not Prometheus Alertmanager). Alerts are defined declaratively in `alerting.yaml` within the Grafana ConfigMap. Notifications go to [[ntfy]] — a self-hosted push notification service.
|
||||||
|
|
||||||
|
Example alert categories:
|
||||||
|
- Service probe failures (is Grafana/Prometheus/Loki reachable?)
|
||||||
|
- Pod readiness (are pods healthy?)
|
||||||
|
- Metrics freshness (is data still flowing?)
|
||||||
|
- Storage and resource thresholds
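
For orientation, a Grafana alerting provisioning file generally has the shape sketched below. The contact point name, the `uid`, and the ntfy URL are hypothetical, and routing alerts to ntfy through a plain webhook is an assumption for illustration rather than the confirmed BlumeOps setup.

```yaml
# alerting.yaml (illustrative sketch of Grafana provisioning format)
apiVersion: 1

contactPoints:
  - orgId: 1
    name: ntfy
    receivers:
      - uid: ntfy-webhook                        # hypothetical uid
        type: webhook
        settings:
          url: https://ntfy.example.com/alerts   # hypothetical topic URL

policies:
  - orgId: 1
    receiver: ntfy        # default route: send everything to ntfy
    group_by: ["alertname"]
```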
|
||||||
|
|
||||||
|
See [[observability]] for the full alerting reference.
|
||||||
|
|
||||||
|
## Adding Dashboards
|
||||||
|
|
||||||
|
Import community dashboards or create custom ones. BlumeOps uses a sidecar pattern — any ConfigMap in the `monitoring` namespace with the label `grafana_dashboard: "1"` is automatically loaded by Grafana's sidecar container.
|
||||||
|
|
||||||
|
Create dashboard ConfigMaps in `argocd/manifests/grafana-config/dashboards/`:
|
||||||
|
|
||||||
```yaml
|
```yaml
|
||||||
apiVersion: v1
|
apiVersion: v1
|
||||||
kind: ConfigMap
|
kind: ConfigMap
|
||||||
metadata:
|
metadata:
|
||||||
name: grafana-datasources
|
name: grafana-dashboard-my-service
|
||||||
namespace: monitoring
|
|
||||||
labels:
|
labels:
|
||||||
grafana_datasource: "1"
|
grafana_dashboard: "1"
|
||||||
data:
|
data:
|
||||||
datasources.yaml: |
|
my-service.json: |
|
||||||
apiVersion: 1
|
{ ... dashboard JSON ... }
|
||||||
datasources:
|
|
||||||
- name: Prometheus
|
|
||||||
type: prometheus
|
|
||||||
url: http://prometheus-server.monitoring.svc:80
|
|
||||||
isDefault: true
|
|
||||||
- name: Loki
|
|
||||||
type: loki
|
|
||||||
url: http://loki.monitoring.svc:3100
|
|
||||||
```
|
```
|
||||||
|
|
||||||
## Step 5: Access Grafana
|
|
||||||
|
|
||||||
Expose via Tailscale:
|
|
||||||
```bash
|
|
||||||
kubectl -n monitoring port-forward svc/grafana 3000:80 &
|
|
||||||
tailscale serve --bg --https 3000 http://localhost:3000
|
|
||||||
```
|
|
||||||
|
|
||||||
Or create an Ingress.
|
|
||||||
|
|
||||||
Default credentials: `admin` / (password you set or retrieve from secret)
|
|
||||||
|
|
||||||
## Step 6: Add Dashboards
|
|
||||||
|
|
||||||
Import community dashboards from [grafana.com/grafana/dashboards](https://grafana.com/grafana/dashboards/):
|
|
||||||
|
|
||||||
| Dashboard | ID | Shows |
|
|
||||||
|-----------|-----|-------|
|
|
||||||
| Node Exporter Full | 1860 | Host metrics |
|
|
||||||
| Kubernetes Cluster | 7249 | Cluster overview |
|
|
||||||
| Loki Logs | 13639 | Log exploration |
|
|
||||||
|
|
||||||
In Grafana: Dashboards > Import > Enter ID
|
|
||||||
|
|
||||||
## Step 7: Deploy Alloy (Optional)
|
|
||||||
|
|
||||||
Grafana Alloy is a unified collector that replaces multiple agents (Promtail, node_exporter, etc.).
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
apiVersion: argoproj.io/v1alpha1
|
|
||||||
kind: Application
|
|
||||||
metadata:
|
|
||||||
name: alloy
|
|
||||||
namespace: argocd
|
|
||||||
spec:
|
|
||||||
project: default
|
|
||||||
source:
|
|
||||||
repoURL: https://grafana.github.io/helm-charts
|
|
||||||
chart: alloy
|
|
||||||
targetRevision: 0.1.0
|
|
||||||
helm:
|
|
||||||
values: |
|
|
||||||
alloy:
|
|
||||||
configMap:
|
|
||||||
content: |
|
|
||||||
// Alloy configuration here
|
|
||||||
destination:
|
|
||||||
server: https://kubernetes.default.svc
|
|
||||||
namespace: monitoring
|
|
||||||
```
|
|
||||||
|
|
||||||
BluemeOps uses Alloy on both [[indri]] (for host metrics, via [[ansible|Ansible role]]) and in the [[cluster]] (for pod logs and service probes).
|
|
||||||
|
|
||||||
## What You Now Have
|
|
||||||
|
|
||||||
- Metrics collection and storage (Prometheus)
|
|
||||||
- Log aggregation (Loki)
|
|
||||||
- Dashboards and visualization (Grafana)
|
|
||||||
- Foundation for alerting
|
|
||||||
|
|
||||||
## Adding Alerts
|
|
||||||
|
|
||||||
Configure alerting rules in Prometheus:
|
|
||||||
|
|
||||||
```yaml
|
|
||||||
groups:
|
|
||||||
- name: example
|
|
||||||
rules:
|
|
||||||
- alert: HighMemoryUsage
|
|
||||||
expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
|
|
||||||
for: 5m
|
|
||||||
labels:
|
|
||||||
severity: warning
|
|
||||||
annotations:
|
|
||||||
summary: "High memory usage detected"
|
|
||||||
```
|
|
||||||
|
|
||||||
And notification channels in Grafana (email, Slack, PagerDuty, etc.).
|
|
||||||
|
|
||||||
## Next Steps
|
## Next Steps
|
||||||
|
|
||||||
|
- Set up [[authentik]] SSO for Grafana login (see [[federated-login]])
|
||||||
- Create custom dashboards for your services
|
- Create custom dashboards for your services
|
||||||
- Set up alerting for critical conditions
|
- Configure alerting rules and notification channels
|
||||||
- Add service-specific metrics exporters
|
- Add service-specific metrics exporters
|
||||||
|
|
||||||
## BluemeOps Specifics
|
## Related
|
||||||
|
|
||||||
BlumeOps' observability setup includes:
|
- [[observability]] — Full observability reference
|
||||||
- Prometheus scraping all services via annotations
|
- [[no-helm-policy]] — Why kustomize instead of Helm
|
||||||
- Loki collecting logs from all pods and [[indri]] services
|
- [[alloy]] — Alloy collector reference
|
||||||
- Custom dashboards for [[jellyfin]], [[teslamate]], and cluster health
|
- [[prometheus]] — Prometheus reference
|
||||||
- [[alloy]] running on both host and in-cluster
|
- [[loki]] — Loki reference
|
||||||
|
- [[grafana]] — Grafana reference
|
||||||
See [[observability|Observability Reference]] for full details.
|
- [[routing]] — Service routing and exposure
|
||||||
|
|
||||||
## Troubleshooting
|
|
||||||
|
|
||||||
| Problem | Solution |
|
|
||||||
|---------|----------|
|
|
||||||
| No metrics appearing | Check Prometheus targets (`/targets` endpoint) |
|
|
||||||
| No logs in Loki | Verify Promtail/Alloy is collecting (`/ready` endpoint) |
|
|
||||||
| Dashboard shows no data | Check data source configuration and time range |
|
|
||||||
| High storage usage | Adjust retention settings in Prometheus/Loki |
|
|
||||||
|
|
|
||||||