---
title: Ollama
modified: 2026-05-01
last-reviewed: 2026-05-01
tags:
  - service
  - ai
---

# Ollama

LLM inference server with GPU acceleration. Runs on [[ringtail]] with declarative model management via a sidecar.

## Quick Reference

| Property | Value |
|----------|-------|
| **URL** | https://ollama.ops.eblu.me |
| **Tailscale URL** | https://ollama.tail8d86e.ts.net |
| **Namespace** | `ollama` |
| **Cluster** | ringtail k3s |
| **Image** | `ollama/ollama:0.20.4` |
| **Upstream** | https://github.com/ollama/ollama |
| **Manifests** | `argocd/manifests/ollama/` |
| **API Port** | 11434 |

## Architecture

```
models.txt (ConfigMap, declarative)
    │
    ▼
model-sync sidecar ──ollama pull──► Ollama server (GPU)
    │                                    │
    │ reads /config/models.txt           │ serves /api/*
    │ polls every 30 min                 │ NVIDIA runtime (RTX 4080, time-sliced)
    │                                    │
    └────────────────────────────────────┘
                     │
                /models (200 Gi hostPath PV)
                /mnt/storage1/ollama on ringtail
```

## Models

Declared in `argocd/manifests/ollama/models.txt`. The model-sync sidecar pulls missing models on startup and every 30 minutes.

| Model | Parameters |
|-------|------------|
| `qwen2.5:14b` | 14B |
| `deepseek-r1:14b` | 14B |
| `phi4:14b` | 14B |
| `gemma3:12b` | 12B |
| `qwen3.5:9b` | 9B |
| `qwen3.5:27b` | 27B |

To add or remove models, edit `models.txt` and sync via ArgoCD.
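The reconcile step the sidecar performs can be sketched as follows. This is an illustration, not the sidecar's actual script. It assumes `models.txt` holds one model name per line with optional `#` comments, and that the sidecar reaches Ollama at `localhost:11434` inside the pod. The helper names are hypothetical. `GET /api/tags` is Ollama's endpoint for listing local models.

```python
import json
import urllib.request

# Assumption: the sidecar shares the pod network with the Ollama container.
OLLAMA_URL = "http://localhost:11434"


def parse_models(text: str) -> list[str]:
    """Return model names from models.txt, skipping blanks, # comments, and duplicates.

    The one-name-per-line format with # comments is an assumption.
    """
    names: list[str] = []
    for line in text.splitlines():
        name = line.split("#", 1)[0].strip()
        if name and name not in names:
            names.append(name)
    return names


def missing_models(declared: list[str], installed: list[str]) -> list[str]:
    """Declared models that are not yet present locally, in declaration order."""
    have = set(installed)
    return [m for m in declared if m not in have]


def installed_models(base_url: str = OLLAMA_URL) -> list[str]:
    """List locally available models via Ollama's GET /api/tags endpoint."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return [m["name"] for m in json.load(resp)["models"]]
```

Each name returned by `missing_models` would then be passed to `ollama pull`, once at startup and again on every 30-minute poll.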

## GPU

Shares [[ringtail]]'s RTX 4080 with [[frigate]] via NVIDIA device plugin time-slicing (2 virtual slots). Time-slicing does not partition VRAM, so Ollama is limited to one loaded model and one parallel request to avoid contention.

| Setting | Value |
|---------|-------|
| `OLLAMA_MAX_LOADED_MODELS` | 1 |
| `OLLAMA_NUM_PARALLEL` | 1 |
| GPU limit | `nvidia.com/gpu: "1"` (time-sliced) |
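With only one model resident at a time, alternating between models forces an unload and reload on each switch. A client that sends batches can reduce this by grouping its requests per model first. The sketch below is a hypothetical client-side helper and not part of this deployment.

```python
from collections import defaultdict


def group_by_model(jobs: list[tuple[str, str]]) -> list[tuple[str, list[str]]]:
    """Group (model, prompt) pairs by model, keeping first-seen model order.

    Sending each group back-to-back means each model is loaded once per batch
    instead of once per switch.
    """
    grouped: dict[str, list[str]] = defaultdict(list)
    for model, prompt in jobs:
        grouped[model].append(prompt)
    return list(grouped.items())
```

For example, a batch that interleaves `qwen2.5:14b` and `phi4:14b` prompts is reordered into two runs, one per model.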

## Storage

| Mount | Backend | Size |
|-------|---------|------|
| `/models` | hostPath PV (`/mnt/storage1/ollama`) | 200 Gi |

PV reclaim policy is `Retain`: deleting the claim releases the PV but leaves the models in `/mnt/storage1/ollama` intact.

## Networking

| Endpoint | Reachable from |
|----------|----------------|
| `https://ollama.ops.eblu.me` | Public internet (Fly.io → Caddy) |
| `https://ollama.tail8d86e.ts.net` | Tailnet clients |
| `http://ollama.ollama.svc.cluster.local:11434` | In-cluster (ringtail) |

The Tailscale ingress uses ProxyGroup `ingress` and sets no explicit `host:` field (see [[tailscale-operator]]).
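As a usage illustration, a workload in the ringtail cluster could call the in-cluster endpoint as sketched below. The function names are hypothetical and the 300-second timeout is an assumption. `POST /api/generate` with `stream: false` is Ollama's non-streaming completion endpoint, which returns a JSON object containing a `response` field.

```python
import json
import urllib.request

# In-cluster service URL from the table above.
IN_CLUSTER_URL = "http://ollama.ollama.svc.cluster.local:11434"


def build_generate_request(
    model: str, prompt: str, base_url: str = IN_CLUSTER_URL
) -> urllib.request.Request:
    """Build a non-streaming POST /api/generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(
    model: str, prompt: str, base_url: str = IN_CLUSTER_URL, timeout: float = 300.0
) -> str:
    """Send the request and return the generated text.

    The generous timeout is an assumption: the first call to a model that is
    not yet loaded has to wait for it to be read into VRAM.
    """
    req = build_generate_request(model, prompt, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]
```

Passing one of the other endpoint URLs as `base_url` targets the same API from outside the cluster.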

## Related

- [[frigate]] — Shares GPU via time-slicing
- [[ringtail]] — Host node
- [[apps]] — ArgoCD application registry
- [[tailscale-operator]] — Tailscale ingress