---
title: observability-stack
tags:
  - tutorials
  - replication
  - observability
---

# Building the Observability Stack

> **Audiences:** Replicator

This tutorial walks through deploying metrics, logs, and dashboards for your homelab - because you can't fix what you can't see.

## The Stack

The stack covers metrics, logs, and dashboards, plus a collection layer that feeds them:

| Component | Purpose | BlumeOps Uses |
|-----------|---------|---------------|
| **Metrics** | Numeric measurements over time | [[prometheus]] |
| **Logs** | Text output from applications | [[loki]] |
| **Dashboards** | Visualization and alerting | [[grafana]] |
| **Collection** | Gathering and forwarding data | [[alloy]] |

For BlumeOps specifics, see [[observability|Observability Reference]].

## Step 1: Create Monitoring Namespace

```bash
kubectl create namespace monitoring
```
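
If you re-run this tutorial, `kubectl create namespace` fails once the namespace exists. A minimal sketch of an idempotent alternative, using the common client-side dry-run idiom; the helper name `ensure_namespace` is illustrative and not part of BlumeOps:

```bash
# Create the namespace only if it is missing; safe to re-run.
# ensure_namespace is a hypothetical helper; $1 = namespace name.
ensure_namespace() {
  # Render the Namespace manifest locally, then apply it declaratively.
  kubectl create namespace "$1" --dry-run=client -o yaml | kubectl apply -f -
}
```

Call it as `ensure_namespace monitoring`.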

## Step 2: Deploy Prometheus

Prometheus collects and stores metrics.

### Using Helm

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --set server.persistentVolume.size=10Gi
```

### Or via ArgoCD

Create an Application that pulls the chart from the upstream Helm repository, with values supplied inline:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: prometheus
    targetRevision: 25.0.0
    helm:
      values: |
        server:
          persistentVolume:
            size: 10Gi
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
```
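
The Application above declares no `syncPolicy`, so Argo CD registers it but does not deploy anything until it is synced. A sketch of applying and syncing it from the command line, assuming the manifest is saved as `prometheus-app.yaml` (a hypothetical filename) and the `argocd` CLI is already logged in to your instance:

```bash
# Apply an Argo CD Application manifest, trigger a sync, and wait for health.
# deploy_argocd_app is a hypothetical helper; $1 = manifest path, $2 = app name.
deploy_argocd_app() {
  kubectl apply -f "$1" &&
    argocd app sync "$2" &&
    argocd app wait "$2" --health
}
```

Call it as `deploy_argocd_app prometheus-app.yaml prometheus`. You can also sync from the Argo CD UI instead.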

### Verify

```bash
kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus
```
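
Listing pods gives a snapshot; to block until the server is actually up, you can wait on the rollout instead. A sketch, assuming the Helm release is named `prometheus` so that the chart creates a Deployment called `prometheus-server`; adjust the name if your release or chart version differs:

```bash
# Wait until the Prometheus server Deployment finishes rolling out.
# wait_for_prometheus is a hypothetical helper; $1 = timeout (default 120s).
wait_for_prometheus() {
  kubectl -n monitoring rollout status deployment/prometheus-server \
    --timeout="${1:-120s}"
}
```

Call it as `wait_for_prometheus` or, for a slow disk, `wait_for_prometheus 300s`.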

## Step 3: Deploy Loki

Loki aggregates logs and indexes them by label, much as Prometheus does for metrics.

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi
```

The `loki-stack` chart also installs Promtail, which collects logs from pods.
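
To confirm Loki is accepting traffic, you can probe its `/ready` endpoint through a temporary port-forward. A sketch, assuming the chart exposes a Service named `loki` on port 3100 and that local port 3100 is free; the helper name is illustrative:

```bash
# Probe Loki's readiness endpoint through a short-lived port-forward.
# check_loki_ready is a hypothetical helper name.
check_loki_ready() {
  kubectl -n monitoring port-forward svc/loki 3100:3100 >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2   # give the port-forward a moment to establish
  if curl -fsS http://localhost:3100/ready; then status=0; else status=1; fi
  # Stop the port-forward whether or not the probe succeeded.
  kill "$pf_pid" 2>/dev/null || true
  return "$status"
}
```

Run `check_loki_ready`; a non-zero exit status means the endpoint did not return a success response. Loki can take a short while after startup before it reports ready.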

## Step 4: Deploy Grafana

Grafana provides dashboards and visualization.

```bash
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set persistence.size=1Gi \
  --set adminPassword=admin  # Change this!
```

### Configure Data Sources

After installation, add data sources in the Grafana UI or via a ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus-server.monitoring.svc:80
      isDefault: true
    - name: Loki
      type: loki
      url: http://loki.monitoring.svc:3100
```
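
A labelled ConfigMap is only loaded if the Grafana chart's datasource sidecar is running, which the install command above does not enable. A sketch of turning it on and applying the ConfigMap, assuming the manifest is saved as `grafana-datasources.yaml` (a hypothetical filename) and that `sidecar.datasources.enabled` is the relevant value in your chart version:

```bash
# Enable the datasource sidecar on the existing release, then apply the ConfigMap.
# enable_grafana_datasources is a hypothetical helper; $1 = ConfigMap manifest path.
enable_grafana_datasources() {
  helm upgrade grafana grafana/grafana \
    --namespace monitoring \
    --reuse-values \
    --set sidecar.datasources.enabled=true &&
    kubectl apply -f "$1"
}
```

Call it as `enable_grafana_datasources grafana-datasources.yaml`. The `--reuse-values` flag keeps the persistence and password settings from the original install.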

## Step 5: Access Grafana

Expose via Tailscale:
```bash
kubectl -n monitoring port-forward svc/grafana 3000:80 &
tailscale serve --bg --https 3000 http://localhost:3000
```

Or create an Ingress.

Log in as `admin` with the password you set in Step 4, or retrieve it from the chart's Secret.
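
If you did not pass `adminPassword`, the Grafana chart generates one for you. A sketch of reading it back, assuming the release is named `grafana` so that the chart stores the value in a Secret called `grafana` under the key `admin-password`:

```bash
# Print the Grafana admin password stored by the Helm chart.
# grafana_admin_password is a hypothetical helper name.
grafana_admin_password() {
  # Secret values are base64-encoded, so decode before printing.
  kubectl -n monitoring get secret grafana \
    -o jsonpath='{.data.admin-password}' | base64 --decode
}
```

Run `grafana_admin_password` and paste the result into the login form.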

## Step 6: Add Dashboards

Import community dashboards from [grafana.com/grafana/dashboards](https://grafana.com/grafana/dashboards/):

| Dashboard | ID | Shows |
|-----------|-----|-------|
| Node Exporter Full | 1860 | Host metrics |
| Kubernetes Cluster | 7249 | Cluster overview |
| Loki Logs | 13639 | Log exploration |

In Grafana, go to Dashboards > Import and enter the dashboard ID.

## Step 7: Deploy Alloy (Optional)

Grafana Alloy is a unified collector that replaces multiple agents (Promtail, node_exporter, etc.).

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: alloy
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: alloy
    targetRevision: 0.1.0
    helm:
      values: |
        alloy:
          configMap:
            content: |
              // Alloy configuration here
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
```

BlumeOps uses Alloy on both [[indri]] (for host metrics, via [[roles|Ansible role]]) and in the [[cluster]] (for pod logs and service probes).

## What You Now Have

- Metrics collection and storage (Prometheus)
- Log aggregation (Loki)
- Dashboards and visualization (Grafana)
- Foundation for alerting

## Adding Alerts

Configure alerting rules in Prometheus:

```yaml
groups:
- name: example
  rules:
  - alert: HighMemoryUsage
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
```

Then configure notification channels in Grafana (email, Slack, PagerDuty, etc.).
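
It is worth validating a rules file before loading it, since a malformed file can prevent Prometheus from loading its rules. A sketch using `promtool check rules`, which ships with Prometheus; the helper name and the `alerts.yml` filename are illustrative:

```bash
# Validate a Prometheus rules file before loading it.
# validate_rules is a hypothetical helper; $1 = path to the rules file.
validate_rules() {
  if ! command -v promtool >/dev/null 2>&1; then
    echo "promtool not found; skipping validation of $1" >&2
    return 0
  fi
  promtool check rules "$1"
}
```

Save the rule group above as `alerts.yml` and run `validate_rules alerts.yml`.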

## Next Steps

- Create custom dashboards for your services
- Set up alerting for critical conditions
- Add service-specific metrics exporters

## BlumeOps Specifics

BlumeOps' observability setup includes:
- Prometheus scraping all services via annotations
- Loki collecting logs from all pods and [[indri]] services
- Custom dashboards for [[jellyfin]], [[teslamate]], and cluster health
- [[alloy]] running on both host and in-cluster

See [[observability|Observability Reference]] for full details.

## Troubleshooting

| Problem | Solution |
|---------|----------|
| No metrics appearing | Check Prometheus targets (`/targets` endpoint) |
| No logs in Loki | Verify Promtail/Alloy is collecting (`/ready` endpoint) |
| Dashboard shows no data | Check data source configuration and time range |
| High storage usage | Adjust retention settings in Prometheus/Loki |
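
For the first row, the information on the `/targets` page is also available from the Prometheus HTTP API at `/api/v1/targets`. A rough sketch that counts targets by health without needing `jq`, assuming Prometheus is reachable on `localhost:9090` (for example via `kubectl -n monitoring port-forward svc/prometheus-server 9090:80`) and that the response is compact JSON:

```bash
# Count scrape targets in a given health state ("up", "down", or "unknown")
# from /api/v1/targets JSON read on stdin.
# count_targets_by_health is a hypothetical helper; $1 = health state.
count_targets_by_health() {
  # grep -o prints one line per match; a no-match exit code is not an error here.
  { grep -o "\"health\":\"$1\"" || true; } | wc -l | tr -d ' '
}
```

Use it as `curl -fsS http://localhost:9090/api/v1/targets | count_targets_by_health down`. Any count above zero points to a target worth inspecting on the `/targets` page.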