Bump Ollama memory to 24Gi and enable flash attention

The 27B Q4_K_M model needs ~7.3 GiB of system RAM for its CPU-offloaded layers, but only 6.8 GiB was available within the 22Gi cgroup limit. Raising the limit to 24Gi and enabling flash attention (which reduces KV cache memory) should provide enough headroom.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
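The headroom figures in the message can be checked with a short calculation. This is only a sketch: it assumes everything else in the cgroup keeps using the same amount of memory after the change, and it ignores the saving from flash attention because the message does not quantify it. The function name `headroom_gib` is hypothetical.

```python
def headroom_gib(limit_gib: float, other_usage_gib: float, required_gib: float) -> float:
    """Memory left in the cgroup after loading the CPU-offloaded layers, in GiB.

    A negative result means the layers do not fit under the limit.
    """
    available = limit_gib - other_usage_gib
    return available - required_gib


# Figures taken from the commit message.
OLD_LIMIT_GIB = 22.0
NEW_LIMIT_GIB = 24.0
REQUIRED_GIB = 7.3        # needed for CPU-offloaded layers
AVAILABLE_OLD_GIB = 6.8   # observed as available under the old limit

# Assumption: the rest of the cgroup's usage is unchanged by the new limit.
other_usage = OLD_LIMIT_GIB - AVAILABLE_OLD_GIB

old_headroom = headroom_gib(OLD_LIMIT_GIB, other_usage, REQUIRED_GIB)
new_headroom = headroom_gib(NEW_LIMIT_GIB, other_usage, REQUIRED_GIB)

print(f"headroom at 22Gi: {old_headroom:+.1f} GiB")
print(f"headroom at 24Gi: {new_headroom:+.1f} GiB")
```

Under that assumption the old limit falls about 0.5 GiB short, and the new limit leaves roughly 1.5 GiB spare before any KV cache saving from flash attention.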
2026-03-11 20:33:22 -07:00 · c26026f4e9
commit c26026f4e9
parent 6d4929a66c
1 changed file with 3 additions and 1 deletion
--- a/argocd/manifests/ollama/deployment.yaml
+++ b/argocd/manifests/ollama/deployment.yaml
@@ -32,6 +32,8 @@ spec:
              value: "1"
            - name: OLLAMA_NUM_PARALLEL
              value: "1"
+            - name: OLLAMA_FLASH_ATTENTION
+              value: "1"
          volumeMounts:
            - name: models
              mountPath: /models
@@ -40,7 +42,7 @@ spec:
              memory: "512Mi"
              cpu: "500m"
            limits:
-              memory: "22Gi"
+              memory: "24Gi"
              cpu: "4000m"
              nvidia.com/gpu: "1"
          livenessProbe: