Bumped documented image tag to 0.20.4 (matches kustomization newTag), added the two qwen3.5 models from models.txt, and stamped the card. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
92 lines
2.7 KiB
Markdown
92 lines
2.7 KiB
Markdown
---
|
|
title: Ollama
|
|
modified: 2026-05-01
|
|
last-reviewed: 2026-05-01
|
|
tags:
|
|
- service
|
|
- ai
|
|
---
|
|
|
|
# Ollama
|
|
|
|
LLM inference server with GPU acceleration. Runs on [[ringtail]] with declarative model management via a sidecar.
|
|
|
|
## Quick Reference
|
|
|
|
| Property | Value |
|
|
|----------|-------|
|
|
| **URL** | https://ollama.ops.eblu.me |
|
|
| **Tailscale URL** | https://ollama.tail8d86e.ts.net |
|
|
| **Namespace** | `ollama` |
|
|
| **Cluster** | ringtail k3s |
|
|
| **Image** | `ollama/ollama:0.20.4` |
|
|
| **Upstream** | https://github.com/ollama/ollama |
|
|
| **Manifests** | `argocd/manifests/ollama/` |
|
|
| **API Port** | 11434 |
|
|
|
|
## Architecture
|
|
|
|
```
|
|
models.txt (ConfigMap, declarative)
|
|
│
|
|
▼
|
|
model-sync sidecar ──ollama pull──► Ollama server (GPU)
|
|
│ │
|
|
│ reads /config/models.txt │ serves /api/*
|
|
│ polls every 30 min │ NVIDIA runtime (RTX 4080, time-sliced)
|
|
│ │
|
|
└────────────────────────────────────┘
|
|
│
|
|
/models (200 Gi hostPath PV)
|
|
/mnt/storage1/ollama on ringtail
|
|
```
|
|
|
|
## Models
|
|
|
|
Declared in `argocd/manifests/ollama/models.txt`. The model-sync sidecar pulls missing models on startup and every 30 minutes.
|
|
|
|
| Model | Parameters |
|
|
|-------|------------|
|
|
| `qwen2.5:14b` | 14B |
|
|
| `deepseek-r1:14b` | 14B |
|
|
| `phi4:14b` | 14B |
|
|
| `gemma3:12b` | 12B |
|
|
| `qwen3.5:9b` | 9B |
|
|
| `qwen3.5:27b` | 27B |
|
|
|
|
To add or remove models, edit `models.txt` and sync via ArgoCD.
|
|
|
|
## GPU
|
|
|
|
Shares [[ringtail]]'s RTX 4080 with [[frigate]] via NVIDIA device plugin time-slicing (2 virtual slots). Constrained to one loaded model and one parallel request to avoid VRAM contention.
|
|
|
|
| Setting | Value |
|
|
|---------|-------|
|
|
| `OLLAMA_MAX_LOADED_MODELS` | 1 |
|
|
| `OLLAMA_NUM_PARALLEL` | 1 |
|
|
| GPU limit | `nvidia.com/gpu: "1"` (time-sliced) |
|
|
|
|
## Storage
|
|
|
|
| Mount | Backend | Size |
|
|
|-------|---------|------|
|
|
| `/models` | hostPath PV (`/mnt/storage1/ollama`) | 200 Gi |
|
|
|
|
PV reclaim policy is `Retain` — models survive PV deletion.
|
|
|
|
## Networking
|
|
|
|
| Endpoint | Reachable from |
|
|
|----------|----------------|
|
|
| `https://ollama.ops.eblu.me` | Public internet (Fly.io → Caddy) |
|
|
| `https://ollama.tail8d86e.ts.net` | Tailnet clients |
|
|
| `http://ollama.ollama.svc.cluster.local:11434` | In-cluster (ringtail) |
|
|
|
|
Tailscale ingress uses ProxyGroup `ingress` — no explicit `host:` field (see [[tailscale-operator]]).
|
|
|
|
## Related
|
|
|
|
- [[frigate]] — Shares GPU via time-slicing
|
|
- [[ringtail]] — Host node
|
|
- [[apps]] — ArgoCD application registry
|
|
- [[tailscale-operator]] — Tailscale ingress
|