blumeops/docs/tutorials/replication/observability-stack.md
Erich Blume b197bd5f58 Adopt Dagger CI for docs build (Phase 2) (#157)
## Summary

Migrates the docs build pipeline to Dagger (Phase 2 of the Dagger CI adoption plan).

- **Backfill `date-modified` frontmatter** on all 80 docs — Dagger's `--src=.` excludes `.git`, so Quartz can't use git history for page dates. Frontmatter dates work with or without git.
- **New `docs-check-frontmatter` mise task + pre-commit hook** — validates all docs have `title`, `tags`, and `date-modified`
- **New Dagger functions** — `build_changelog` (towncrier in Python container) and `build_docs` (chains changelog → Quartz build in Node container, returns tarball)
- **Simplified CI workflow** — the ~44-line inline Quartz build (clone, npm ci, build, tar, cleanup) is replaced by `dagger call build-docs`. Changelog step remains local on the runner since towncrier needs to modify the host working tree for the git commit.

### Design decisions

- **Towncrier runs twice in CI**: once inside Dagger (for the docs tarball) and once on the runner (for the git commit). This is intentional — Dagger's directory export is additive and can't delete the consumed changelog fragments from the host.
- **Artifact hosting stays on Forgejo Releases** (not migrated to Forgejo Packages as the plan doc originally suggested). That migration can happen independently.
- **`date-modified` frontmatter** preserved even though `build_changelog` installs git — the git there is only for towncrier's `git add` call, not for history. The local iteration story (`dagger call build-docs --src=. --version=dev` with uncommitted changes) depends on frontmatter dates.

### Local iteration

```bash
dagger call build-docs --src=. --version=dev export --path=./docs-dev.tar.gz
tar tf docs-dev.tar.gz | head -20
```

## Deployment and Testing

- [x] `dagger call build-docs --src=. --version=dev` produces valid 1.1MB tarball (149 HTML pages)
- [x] Pre-commit hooks pass (including new `docs-check-frontmatter`)
- [ ] Full `workflow_dispatch` run after merge

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Reviewed-on: https://forge.ops.eblu.me/eblume/blumeops/pulls/157
2026-02-11 16:33:16 -08:00

232 lines
5.6 KiB
Markdown

---
title: Observability Stack
date-modified: 2026-02-07
tags:
- tutorials
- replication
- observability
---
# Building the Observability Stack
> **Audiences:** Replicator
This tutorial walks through deploying metrics, logs, and dashboards for your homelab - because you can't fix what you can't see.
## The Stack
A complete observability solution has three pillars:
| Component | Purpose | BlumeOps Uses |
|-----------|---------|---------------|
| **Metrics** | Numeric measurements over time | [[prometheus]] |
| **Logs** | Text output from applications | [[loki]] |
| **Dashboards** | Visualization and alerting | [[grafana]] |
| **Collection** | Gathering and forwarding data | [[alloy]] |
For BlumeOps specifics, see [[observability|Observability Reference]].
## Step 1: Create Monitoring Namespace
```bash
kubectl create namespace monitoring
```
## Step 2: Deploy Prometheus
Prometheus collects and stores metrics.
### Using Helm
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus \
--namespace monitoring \
--set server.persistentVolume.size=10Gi
```
### Or via ArgoCD
Create an Application pointing to a values file in your repo:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: prometheus
namespace: argocd
spec:
project: default
source:
repoURL: https://prometheus-community.github.io/helm-charts
chart: prometheus
targetRevision: 25.0.0
helm:
values: |
server:
persistentVolume:
size: 10Gi
destination:
server: https://kubernetes.default.svc
namespace: monitoring
```
### Verify
```bash
kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus
```
## Step 3: Deploy Loki
Loki aggregates logs (like Prometheus but for logs).
```bash
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
--namespace monitoring \
--set loki.persistence.enabled=true \
--set loki.persistence.size=10Gi
```
This also installs Promtail for log collection from pods.
## Step 4: Deploy Grafana
Grafana provides dashboards and visualization.
```bash
helm install grafana grafana/grafana \
--namespace monitoring \
--set persistence.enabled=true \
--set persistence.size=1Gi \
--set adminPassword=admin # Change this!
```
### Configure Data Sources
After installation, add data sources in Grafana UI or via ConfigMap:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-datasources
namespace: monitoring
labels:
grafana_datasource: "1"
data:
datasources.yaml: |
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
url: http://prometheus-server.monitoring.svc:80
isDefault: true
- name: Loki
type: loki
url: http://loki.monitoring.svc:3100
```
## Step 5: Access Grafana
Expose via Tailscale:
```bash
kubectl -n monitoring port-forward svc/grafana 3000:80 &
tailscale serve --bg --https 3000 http://localhost:3000
```
Or create an Ingress.
Default credentials: `admin` / (password you set or retrieve from secret)
## Step 6: Add Dashboards
Import community dashboards from [grafana.com/grafana/dashboards](https://grafana.com/grafana/dashboards/):
| Dashboard | ID | Shows |
|-----------|-----|-------|
| Node Exporter Full | 1860 | Host metrics |
| Kubernetes Cluster | 7249 | Cluster overview |
| Loki Logs | 13639 | Log exploration |
In Grafana: Dashboards > Import > Enter ID
## Step 7: Deploy Alloy (Optional)
Grafana Alloy is a unified collector that replaces multiple agents (Promtail, node_exporter, etc.).
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: alloy
namespace: argocd
spec:
project: default
source:
repoURL: https://grafana.github.io/helm-charts
chart: alloy
targetRevision: 0.1.0
helm:
values: |
alloy:
configMap:
content: |
// Alloy configuration here
destination:
server: https://kubernetes.default.svc
namespace: monitoring
```
BluemeOps uses Alloy on both [[indri]] (for host metrics, via [[roles|Ansible role]]) and in the [[cluster]] (for pod logs and service probes).
## What You Now Have
- Metrics collection and storage (Prometheus)
- Log aggregation (Loki)
- Dashboards and visualization (Grafana)
- Foundation for alerting
## Adding Alerts
Configure alerting rules in Prometheus:
```yaml
groups:
- name: example
rules:
- alert: HighMemoryUsage
expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage detected"
```
And notification channels in Grafana (email, Slack, PagerDuty, etc.).
## Next Steps
- Create custom dashboards for your services
- Set up alerting for critical conditions
- Add service-specific metrics exporters
## BluemeOps Specifics
BlumeOps' observability setup includes:
- Prometheus scraping all services via annotations
- Loki collecting logs from all pods and [[indri]] services
- Custom dashboards for [[jellyfin]], [[teslamate]], and cluster health
- [[alloy]] running on both host and in-cluster
See [[observability|Observability Reference]] for full details.
## Troubleshooting
| Problem | Solution |
|---------|----------|
| No metrics appearing | Check Prometheus targets (`/targets` endpoint) |
| No logs in Loki | Verify Promtail/Alloy is collecting (`/ready` endpoint) |
| Dashboard shows no data | Check data source configuration and time range |
| High storage usage | Adjust retention settings in Prometheus/Loki |