blumeops/docs/tutorials/replication/observability-stack.md
Erich Blume b197bd5f58 Adopt Dagger CI for docs build (Phase 2) (#157)
2026-02-11 16:33:16 -08:00


title: Observability Stack
date-modified: 2026-02-07
tags: tutorials, replication, observability

Building the Observability Stack

Audiences: Replicator

This tutorial walks through deploying metrics, logs, and dashboards for your homelab, because you can't fix what you can't see.

The Stack

A complete observability solution has three pillars (metrics, logs, and dashboards), plus a collection layer that feeds them:

| Component | Purpose | BlumeOps Uses |
|---|---|---|
| Metrics | Numeric measurements over time | prometheus |
| Logs | Text output from applications | loki |
| Dashboards | Visualization and alerting | grafana |
| Collection | Gathering and forwarding data | alloy |

For BlumeOps specifics, see observability.

Step 1: Create Monitoring Namespace

kubectl create namespace monitoring
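
If you expect to re-run this step, a plain `kubectl create` fails once the namespace exists. A minimal sketch of an idempotent variant follows; the function name `ensure_namespace` is hypothetical and not part of BlumeOps.

```bash
# Render the namespace manifest client-side, then apply it.
# "apply" succeeds whether or not the namespace already exists.
ensure_namespace() {
  ns="$1"
  kubectl create namespace "$ns" --dry-run=client -o yaml | kubectl apply -f -
}

# Usage: ensure_namespace monitoring
```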

Step 2: Deploy Prometheus

Prometheus collects and stores metrics.

Using Helm

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/prometheus \
  --namespace monitoring \
  --set server.persistentVolume.size=10Gi

Or via ArgoCD

Create an Application pointing to a values file in your repo:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: prometheus
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://prometheus-community.github.io/helm-charts
    chart: prometheus
    targetRevision: 25.0.0
    helm:
      values: |
        server:
          persistentVolume:
            size: 10Gi
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
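
The Application above has no automated sync policy, so it will not deploy until it is synced. The sketch below applies the manifest, triggers a sync, and waits for a healthy state. It assumes the manifest is saved to a file (the name `prometheus-app.yaml` is hypothetical) and that the `argocd` CLI is already logged in.

```bash
# Apply an Argo CD Application manifest, sync it, and wait until it is healthy.
deploy_argocd_app() {
  manifest="$1"   # path to the Application manifest
  app="$2"        # must match metadata.name in the manifest
  kubectl apply -f "$manifest" &&
    argocd app sync "$app" &&
    argocd app wait "$app" --health --timeout 300
}

# Usage: deploy_argocd_app prometheus-app.yaml prometheus
```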

Verify

kubectl -n monitoring get pods -l app.kubernetes.io/name=prometheus
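
Listing pods shows a snapshot; to block until the server is actually rolled out, something like the following may help. It assumes the chart's default naming, where a release called `prometheus` produces a Deployment named `prometheus-server`; adjust if your release name differs.

```bash
# Wait (up to two minutes) for the Prometheus server Deployment to finish rolling out.
wait_for_prometheus() {
  kubectl -n monitoring rollout status deployment/prometheus-server --timeout=120s
}

# Usage: wait_for_prometheus
```

Once the rollout completes, port-forwarding `svc/prometheus-server` and requesting `/-/ready` is a quick second check.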

Step 3: Deploy Loki

Loki aggregates logs. It works like Prometheus, but for log lines instead of metrics.

helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set loki.persistence.enabled=true \
  --set loki.persistence.size=10Gi

This also installs Promtail for log collection from pods.
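
To confirm Loki is up and that Promtail is delivering logs, you can query Loki's HTTP API. The sketch below assumes you have port-forwarded the service locally (for example `kubectl -n monitoring port-forward svc/loki 3100:3100`); the function name `loki_check` is hypothetical.

```bash
# Check Loki readiness, then list the label names it has indexed so far.
# An empty label list usually means no logs have arrived yet.
loki_check() {
  base_url="${1:-http://localhost:3100}"
  curl -fsS "$base_url/ready" &&
    curl -fsS "$base_url/loki/api/v1/labels"
}

# Usage: loki_check
```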

Step 4: Deploy Grafana

Grafana provides dashboards and visualization.

helm install grafana grafana/grafana \
  --namespace monitoring \
  --set persistence.enabled=true \
  --set persistence.size=1Gi \
  --set adminPassword=admin  # Change this!
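
As an alternative to putting a password on the command line, the Grafana chart can read credentials from an existing Secret via `admin.existingSecret`. This is a sketch under that assumption; the Secret name `grafana-admin` and both function names are hypothetical.

```bash
# Generate a random password: 24 random bytes encode to 32 base64 characters.
gen_password() {
  head -c 24 /dev/urandom | base64
}

# Create the Secret first, then install Grafana pointing at it.
# The keys admin-user and admin-password are the chart's default key names.
install_grafana_with_secret() {
  kubectl -n monitoring create secret generic grafana-admin \
    --from-literal=admin-user=admin \
    --from-literal=admin-password="$(gen_password)" &&
  helm install grafana grafana/grafana \
    --namespace monitoring \
    --set persistence.enabled=true \
    --set persistence.size=1Gi \
    --set admin.existingSecret=grafana-admin
}

# Usage: install_grafana_with_secret
```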

Configure Data Sources

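After installation, add data sources in the Grafana UI or via a ConfigMap. The ConfigMap approach relies on the Grafana chart's datasource sidecar (`sidecar.datasources.enabled=true`), which loads any ConfigMap labeled `grafana_datasource`: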

apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
    - name: Prometheus
      type: prometheus
      url: http://prometheus-server.monitoring.svc:80
      isDefault: true
    - name: Loki
      type: loki
      url: http://loki.monitoring.svc:3100
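
The ConfigMap is only picked up if the chart's datasource sidecar is running, and it is disabled by default. A minimal sketch for enabling it on the existing release and applying the ConfigMap; the filename `grafana-datasources.yaml` is a hypothetical name for wherever you saved the manifest above.

```bash
# Turn on the sidecar that watches for ConfigMaps labeled grafana_datasource,
# keeping every other value from the original install.
enable_datasource_sidecar() {
  helm upgrade grafana grafana/grafana \
    --namespace monitoring \
    --reuse-values \
    --set sidecar.datasources.enabled=true
}

# Apply the data source ConfigMap from a local file.
apply_datasources() {
  kubectl apply -f "$1"
}

# Usage:
#   enable_datasource_sidecar
#   apply_datasources grafana-datasources.yaml
```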

Step 5: Access Grafana

Expose via Tailscale:

kubectl -n monitoring port-forward svc/grafana 3000:80 &
tailscale serve --bg --https 3000 http://localhost:3000

Or create an Ingress.

Default credentials: log in as admin with the password you set at install time. If you did not set one, the chart generates a password and stores it in a Secret.
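
To read the admin password back from the cluster, you can decode it from the Secret. This sketch assumes the release is named `grafana`, so the chart-managed Secret is also called `grafana`; pass a different name if you supplied your own Secret.

```bash
# Print the Grafana admin password stored in a Secret in the monitoring namespace.
grafana_admin_password() {
  secret="${1:-grafana}"
  kubectl -n monitoring get secret "$secret" \
    -o jsonpath='{.data.admin-password}' | base64 --decode
}

# Usage: grafana_admin_password
```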

Step 6: Add Dashboards

Import community dashboards from grafana.com/grafana/dashboards:

| Dashboard | ID | Shows |
|---|---|---|
| Node Exporter Full | 1860 | Host metrics |
| Kubernetes Cluster | 7249 | Cluster overview |
| Loki Logs | 13639 | Log exploration |

In Grafana: Dashboards > Import > Enter ID
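
If you would rather keep dashboard JSON in your repo than import by hand, the files can be downloaded by ID. The URL pattern below is an assumption based on how grafana.com serves dashboard revisions, so verify it against the dashboard's page before relying on it; both function names are hypothetical.

```bash
# Build the download URL for the latest revision of a community dashboard.
dashboard_url() {
  echo "https://grafana.com/api/dashboards/$1/revisions/latest/download"
}

# Save the dashboard JSON as dashboard-<id>.json in the current directory.
fetch_dashboard() {
  curl -fsSL "$(dashboard_url "$1")" -o "dashboard-$1.json"
}

# Usage: fetch_dashboard 1860
```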

Step 7: Deploy Alloy (Optional)

Grafana Alloy is a unified collector that replaces multiple agents (Promtail, node_exporter, etc.).

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: alloy
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: alloy
    targetRevision: 0.1.0
    helm:
      values: |
        alloy:
          configMap:
            content: |
              // Alloy configuration here
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring

BlumeOps uses Alloy on both indri (for host metrics, via roles) and in the cluster (for pod logs and service probes).
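
The `content` placeholder above needs an actual Alloy pipeline. As an illustration only (this is not the BlumeOps configuration), the sketch below writes a minimal pod-log pipeline to a file: discover pods, tail their logs, and push them to the Loki service from Step 3. The function name and the Loki URL are assumptions based on the earlier steps.

```bash
# Write a minimal Alloy config that ships pod logs to Loki.
write_alloy_config() {
  cat > "$1" <<'EOF'
// Discover every pod in the cluster.
discovery.kubernetes "pods" {
  role = "pod"
}

// Tail logs from the discovered pods through the Kubernetes API.
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

// Send log lines to the Loki service deployed earlier.
loki.write "default" {
  endpoint {
    url = "http://loki.monitoring.svc:3100/loki/api/v1/push"
  }
}
EOF
}

# Usage: write_alloy_config config.alloy
```

The file's contents can then be pasted under `alloy.configMap.content` in the Application's Helm values.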

What You Now Have

  • Metrics collection and storage (Prometheus)
  • Log aggregation (Loki)
  • Dashboards and visualization (Grafana)
  • Foundation for alerting

Adding Alerts

Configure alerting rules in Prometheus:

groups:
- name: example
  rules:
  - alert: HighMemoryUsage
    expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage detected"

Then configure notification channels in Grafana (email, Slack, PagerDuty, etc.).
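
With the Helm install from Step 2, one way to load the rule is through the chart's `serverFiles` values. This sketch assumes the prometheus-community chart's `alerting_rules.yml` key; the filename `alert-values.yaml` and the function name are hypothetical.

```bash
# Write a Helm values file that embeds the example alerting rule.
write_alert_values() {
  cat > "$1" <<'EOF'
serverFiles:
  alerting_rules.yml:
    groups:
      - name: example
        rules:
          - alert: HighMemoryUsage
            # Fires when less than 10% of memory has been available for 5 minutes.
            expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage detected"
EOF
}

# Usage:
#   write_alert_values alert-values.yaml
#   helm upgrade prometheus prometheus-community/prometheus \
#     --namespace monitoring --reuse-values -f alert-values.yaml
```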

Next Steps

  • Create custom dashboards for your services
  • Set up alerting for critical conditions
  • Add service-specific metrics exporters

BlumeOps Specifics

BlumeOps' observability setup includes:

  • Prometheus scraping all services via annotations
  • Loki collecting logs from all pods and indri services
  • Custom dashboards for jellyfin, teslamate, and cluster health
  • alloy running on both host and in-cluster

See observability for full details.

Troubleshooting

| Problem | Solution |
|---|---|
| No metrics appearing | Check Prometheus targets (/targets endpoint) |
| No logs in Loki | Verify Promtail/Alloy is collecting (/ready endpoint) |
| Dashboard shows no data | Check data source configuration and time range |
| High storage usage | Adjust retention settings in Prometheus/Loki |
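
For the "No metrics appearing" case, the targets check can also be scripted against the Prometheus HTTP API. The sketch below counts targets reporting `down`. It assumes the API returns compact JSON (no spaces after colons) and uses a plain `grep`, so treat it as a quick triage helper rather than a robust parser; the function name is hypothetical.

```bash
# Count scrape targets whose health is "down".
# Reads the JSON body of /api/v1/targets on stdin.
count_down_targets() {
  grep -o '"health":"down"' | wc -l | tr -d ' '
}

# Usage (with svc/prometheus-server port-forwarded to localhost:9090):
#   curl -fsS http://localhost:9090/api/v1/targets | count_down_targets
```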