Bump Ollama memory to 24Gi and enable flash attention

The 27B Q4_K_M model needs ~7.3 GiB of system RAM for its CPU-offloaded
layers, but only 6.8 GiB was available within the 22Gi cgroup. Raising
the limit to 24Gi and enabling flash attention (reduces memory use as
context grows) should provide enough headroom.
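
The headroom arithmetic can be sketched as follows. This is a rough
estimate only: the figures come from this message, while the function
name and the assumption that the full 2 GiB of added limit becomes usable
are illustrative. Any savings from flash attention are not modeled.

```python
# Figures quoted in the commit message (GiB).
NEEDED = 7.3       # system RAM needed for CPU-offloaded layers
AVAILABLE = 6.8    # free memory observed inside the 22Gi cgroup
OLD_LIMIT = 22
NEW_LIMIT = 24


def headroom_gib(needed, available, old_limit, new_limit):
    """Estimated free memory after the bump.

    Assumption (hypothetical model): every GiB added to the limit
    becomes available to the model; nothing else grows to consume it.
    """
    return round(available + (new_limit - old_limit) - needed, 1)


# Shortfall under the old 22Gi limit.
shortfall = round(NEEDED - AVAILABLE, 1)

print(f"shortfall at {OLD_LIMIT}Gi: {shortfall} GiB")
print(f"headroom at {NEW_LIMIT}Gi: "
      f"{headroom_gib(NEEDED, AVAILABLE, OLD_LIMIT, NEW_LIMIT)} GiB")
```

Under these assumptions the old limit fell about 0.5 GiB short, and the
new limit leaves roughly 1.5 GiB spare before any flash attention savings.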

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Erich Blume 2026-03-11 20:33:22 -07:00
commit c26026f4e9

@@ -32,6 +32,8 @@ spec:
value: "1"
- name: OLLAMA_NUM_PARALLEL
value: "1"
+ - name: OLLAMA_FLASH_ATTENTION
+ value: "1"
volumeMounts:
- name: models
mountPath: /models
@@ -40,7 +42,7 @@ spec:
memory: "512Mi"
cpu: "500m"
limits:
- memory: "22Gi"
+ memory: "24Gi"
cpu: "4000m"
nvidia.com/gpu: "1"
livenessProbe: