Bump Ollama memory to 24Gi and enable flash attention

The 27B Q4_K_M model needs ~7.3 GiB of system RAM for its CPU-offloaded layers, but only 6.8 GiB was available within the 22Gi cgroup limit. Bumping the limit to 24Gi and enabling flash attention (which reduces KV cache memory) should provide enough headroom.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent 6d4929a66c
commit c26026f4e9
1 changed file with 3 additions and 1 deletion
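
The headroom arithmetic behind the commit message can be sketched as follows. This is a rough illustration only: the 7.3 GiB and 6.8 GiB figures come from the message, while the helper `headroom` and the assumption that the rest of the container uses the same amount of memory after the bump are hypothetical. Any KV cache savings from flash attention are not modeled.

```python
# Figures quoted in the commit message (GiB).
NEEDED_FOR_OFFLOAD = 7.3      # system RAM for CPU-offloaded layers
AVAILABLE_AT_OLD_LIMIT = 6.8  # free memory inside the 22Gi cgroup
OLD_LIMIT = 22.0
NEW_LIMIT = 24.0


def headroom(available: float, needed: float) -> float:
    """Memory left after loading the offloaded layers (negative means a shortfall)."""
    return round(available - needed, 2)


# Assumption: everything else in the container uses the same amount of memory
# after the bump, so the extra limit becomes extra free RAM one-for-one.
available_at_new_limit = AVAILABLE_AT_OLD_LIMIT + (NEW_LIMIT - OLD_LIMIT)

before = headroom(AVAILABLE_AT_OLD_LIMIT, NEEDED_FOR_OFFLOAD)
after = headroom(available_at_new_limit, NEEDED_FOR_OFFLOAD)

print(f"headroom at {OLD_LIMIT:.0f}Gi: {before:+.1f} GiB")
print(f"headroom at {NEW_LIMIT:.0f}Gi: {after:+.1f} GiB")
```

Under these assumptions the old limit leaves a shortfall and the new limit leaves a positive margin, before counting whatever flash attention saves.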
@@ -32,6 +32,8 @@ spec:
       value: "1"
     - name: OLLAMA_NUM_PARALLEL
       value: "1"
+    - name: OLLAMA_FLASH_ATTENTION
+      value: "1"
   volumeMounts:
     - name: models
       mountPath: /models
@@ -40,7 +42,7 @@ spec:
       memory: "512Mi"
       cpu: "500m"
     limits:
-      memory: "22Gi"
+      memory: "24Gi"
       cpu: "4000m"
       nvidia.com/gpu: "1"
 livenessProbe: