---
title: Observability Stack
modified: 2026-02-07
tags:
  - tutorials
  - replication
  - observability
---

# Building the Observability Stack

> **Audiences:** Replicator

This tutorial walks through deploying metrics, logs, and dashboards for your homelab, because you can't fix what you can't see.

## The Stack

The observability stack in this tutorial has four components:

| Component | Purpose | BlumeOps Uses |
|-----------|---------|---------------|
| **Metrics** | Numeric measurements over time | [[prometheus]] |
| **Logs** | Text output from applications | [[loki]] |
| **Dashboards** | Visualization and alerting | [[grafana]] |
| **Collection** | Gathering and forwarding data | [[alloy]] |

For BlumeOps specifics, see [[observability|Observability Reference]].

## Step 1: Create Monitoring Namespace

```bash
kubectl create namespace monitoring
```
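If you would rather keep the namespace in Git alongside the rest of the stack, the same thing can be written as a manifest. This is a minimal sketch of the equivalent object; whether you apply it with `kubectl apply -f` or sync it through ArgoCD is left to your workflow.

```yaml
# Declarative equivalent of `kubectl create namespace monitoring`
apiVersion: v1
kind: Namespace
metadata:
  name: monitoring
```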

## Step 2: Deploy Prometheus

Prometheus collects and stores metrics.

### Using Helm

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --set server.persistentVolume.size=10Gi
```

### Or via ArgoCD

Create an Application that installs the chart from the Helm repository, with values set inline:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: prometheus
    targetRevision: 25.0.0
    helm:
      values: |
        server:
          persistentVolume:
            size: 10Gi
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
```
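Without a sync policy, the Application above syncs only when you trigger it. As a sketch, assuming you want ArgoCD to apply changes on its own and to create the namespace if Step 1 was skipped, a `syncPolicy` block can be added under `spec`. Whether automated pruning and self-healing suit your setup is a judgment call rather than something this tutorial prescribes.

```yaml
# Fragment: merge into the Application's existing `spec:`
spec:
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from the chart or values
      selfHeal: true   # revert manual changes made directly in the cluster
    syncOptions:
    - CreateNamespace=true   # create `monitoring` if it does not exist yet
```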

### Verify

```bash
kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus
```

## Step 3: Deploy Loki

Loki aggregates logs and makes them queryable by label, much as Prometheus does for metrics.

```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi
```

This also installs Promtail for log collection from pods.

## Step 4: Deploy Grafana

Grafana provides dashboards and visualization.

```bash
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set persistence.size=1Gi \
  --set adminPassword=admin  # Change this!
```
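Passing the admin password on the command line leaves it in shell history and in the Helm release values. One alternative, sketched here, is a values file that points the chart at a Secret you create beforehand. The Secret name `grafana-admin` and the file name `grafana-values.yaml` are hypothetical; `admin.existingSecret`, `admin.userKey`, and `admin.passwordKey` are keys from the Grafana chart, but check them against the chart version you install.

```yaml
# grafana-values.yaml (hypothetical file name)
persistence:
  enabled: true
  size: 1Gi
admin:
  existingSecret: grafana-admin   # hypothetical Secret, created before installing
  userKey: admin-user             # key in the Secret holding the username
  passwordKey: admin-password     # key in the Secret holding the password
```

Pass the file to the install command with `-f grafana-values.yaml` in place of the `--set` flags.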

### Configure Data Sources

After installation, add data sources in the Grafana UI or via a ConfigMap that the chart's datasource sidecar picks up by its `grafana_datasource` label:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus-server.monitoring.svc:80
      isDefault: true
    - name: Loki
      type: loki
      url: http://loki.monitoring.svc:3100
```
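A ConfigMap like the one above is loaded by the Grafana chart's datasource sidecar, which is disabled unless you turn it on. A minimal sketch of the Helm values that enable it, assuming the chart's `sidecar.datasources` keys and the label used above:

```yaml
# Grafana Helm values fragment
sidecar:
  datasources:
    enabled: true               # run the sidecar that watches for labeled ConfigMaps
    label: grafana_datasource   # must match the label key on the ConfigMap
```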

## Step 5: Access Grafana

Expose via Tailscale:
```bash
kubectl -n monitoring port-forward svc/grafana 3000:80 &
tailscale serve --bg --https 3000 http://localhost:3000
```

Or create an Ingress.
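For the Ingress route, a sketch of a manifest follows. The hostname `grafana.example.com` and the `nginx` ingress class are assumptions for illustration; substitute whatever ingress controller and DNS name your cluster actually uses. The backend matches the `grafana` Service on port 80 used in the port-forward above.

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: grafana
  namespace: monitoring
spec:
  ingressClassName: nginx          # assumption: an nginx ingress controller is installed
  rules:
  - host: grafana.example.com      # hypothetical hostname
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: grafana          # Service created by the Helm release in Step 4
            port:
              number: 80
```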

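Log in as `admin` with the password you set in Step 4. If you did not set one, the chart generates a password and stores it in the `grafana` Secret under the `admin-password` key.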

## Step 6: Add Dashboards

Import community dashboards from [grafana.com/grafana/dashboards](https://grafana.com/grafana/dashboards/):

| Dashboard | ID | Shows |
|-----------|-----|-------|
| Node Exporter Full | 1860 | Host metrics |
| Kubernetes Cluster | 7249 | Cluster overview |
| Loki Logs | 13639 | Log exploration |

In Grafana: Dashboards > Import, enter the ID, then select the matching data source when prompted.
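Dashboards imported through the UI live only in Grafana's own storage. To make them reproducible, the Grafana chart can also download dashboards by ID at startup. The following is a sketch of such values, assuming the `Prometheus` data source name from Step 4. The `revision` numbers are placeholders, so look up the revision you want on each dashboard's page.

```yaml
# Grafana Helm values fragment
dashboardProviders:
  dashboardproviders.yaml:
    apiVersion: 1
    providers:
    - name: default
      orgId: 1
      folder: ""
      type: file
      disableDeletion: false
      editable: true
      options:
        path: /var/lib/grafana/dashboards/default
dashboards:
  default:                      # must match the provider name above
    node-exporter-full:
      gnetId: 1860              # Node Exporter Full
      revision: 1               # placeholder revision
      datasource: Prometheus
    kubernetes-cluster:
      gnetId: 7249              # Kubernetes Cluster
      revision: 1               # placeholder revision
      datasource: Prometheus
```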

## Step 7: Deploy Alloy (Optional)

Grafana Alloy is a unified collector that can replace multiple agents (Promtail, node_exporter, etc.).

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: alloy
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: alloy
    targetRevision: 0.1.0
    helm:
      values: |
        alloy:
          configMap:
            content: |
              // Alloy configuration here
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
```

BlumeOps uses Alloy on both [[indri]] (for host metrics, via [[roles|Ansible role]]) and in the [[cluster]] (for pod logs and service probes).
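As an illustration of what could replace the placeholder comment in the Application above, here is a sketch of Helm values whose Alloy configuration tails pod logs and pushes them to the Loki service from Step 3. The component labels (`pods`, `default`) are arbitrary, the Loki URL assumes the release name used earlier, and this is not BlumeOps' actual configuration.

```yaml
# Alloy Helm values fragment (goes under `helm.values` in the Application)
alloy:
  configMap:
    content: |
      // Discover every pod in the cluster.
      discovery.kubernetes "pods" {
        role = "pod"
      }

      // Tail the logs of discovered pods through the Kubernetes API.
      loki.source.kubernetes "pods" {
        targets    = discovery.kubernetes.pods.targets
        forward_to = [loki.write.default.receiver]
      }

      // Push collected log lines to Loki.
      loki.write "default" {
        endpoint {
          url = "http://loki.monitoring.svc:3100/loki/api/v1/push"
        }
      }
```

If Promtail from Step 3 is still running, both collectors will ship the same pod logs, so you would normally run only one of them.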

## What You Now Have

- Metrics collection and storage (Prometheus)
- Log aggregation (Loki)
- Dashboards and visualization (Grafana)
- Foundation for alerting

## Adding Alerts

Configure alerting rules in Prometheus:

```yaml
groups:
- name: example
  rules:
  - alert: HighMemoryUsage
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"
```

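Prometheus sends firing alerts to Alertmanager, which routes notifications for rules like this one. Grafana has its own contact points (email, Slack, PagerDuty, etc.) for alerts defined in Grafana.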
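The rule group above still has to be loaded by the Prometheus server. With the Helm chart from Step 2, one option is the chart's `serverFiles` value, sketched below. This assumes the `prometheus-community/prometheus` chart layout; other installation methods load rules differently.

```yaml
# Prometheus Helm values fragment
serverFiles:
  alerting_rules.yml:
    groups:
    - name: example
      rules:
      - alert: HighMemoryUsage
        # Fires when less than 10% of memory has been available for 5 minutes
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
```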

## Next Steps

- Create custom dashboards for your services
- Set up alerting for critical conditions
- Add service-specific metrics exporters

## BlumeOps Specifics

BlumeOps' observability setup includes:
- Prometheus scraping all services via annotations
- Loki collecting logs from all pods and [[indri]] services
- Custom dashboards for [[jellyfin]], [[teslamate]], and cluster health
- [[alloy]] running on both host and in-cluster

See [[observability|Observability Reference]] for full details.

## Troubleshooting

| Problem | Solution |
|---------|----------|
| No metrics appearing | Check Prometheus targets (`/targets` endpoint) |
| No logs in Loki | Verify Promtail/Alloy is collecting (`/ready` endpoint) |
| Dashboard shows no data | Check data source configuration and time range |
| High storage usage | Adjust retention settings in Prometheus/Loki |
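
For the "High storage usage" row, retention on the Prometheus side can be set through the Helm values from Step 2. This is a sketch assuming the `prometheus-community/prometheus` chart's `server.retention` key; the `15d` figure is only an example, and Loki retention is configured separately.

```yaml
# Prometheus Helm values fragment
server:
  retention: 15d          # example: keep 15 days of metrics
  persistentVolume:
    size: 10Gi            # size the volume to fit the retention window
```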