
Gemma4 Local LLM DGX Spark llama.cpp optimization Qwen3.6 comparison LLM OOM fix
🔍 The conclusion first: I tore out every giant model
Let me give you the conclusion up front. After two months of constantly swapping the Gemma4 local LLM lineup on my DGX Spark, I pulled the giant coder model (Qwen) out entirely and kept just one small, fast Gemma4 26B (4B active MoE). I waved goodbye to the flashy 80B and 35B coders and settled on a single 14.25GB quantized model.
At first I was firmly in the “bigger is smarter, so use bigger” camp too. But the measured data pointed in exactly the opposite direction. The little Gemma4 was 3–12x faster at decode, and on practical quality it was basically on par or better across the board. This post is a dev log that walks through those two months of trial and error, the OOMs, and the surprising conclusion — all from an operator’s point of view. Every number here is a first-hand measurement taken directly on my own DGX Spark, not an estimate.
📋 Why local LLMs, and why the DGX Spark?
I wanted to run coding assistance and document processing locally. Even if I hand the hard reasoning off to a frontier model (Claude), shipping every little high-volume task to an API as well is a burden on both cost and latency. So what I picked was the NVIDIA DGX Spark. If you’re curious about the initial setup, take a look at my DGX Spark setup notes as well.
The DGX Spark is a desktop AI machine that puts 128GB of unified memory (LPDDR5x, 273GB/s bandwidth) on top of a GB10 Grace Blackwell Superchip. It launched in October 2025 at $3,999, but after DRAM prices spiked it went up to $4,699 in February 2026 (an official NVIDIA increase). The key point is that the CPU and GPU share the same memory pool. Thanks to that, you can load even a 70B-class model in one piece with no VRAM-copy overhead. That said, once you subtract the OS reservation the real usable amount is roughly 121GiB — and that “ceiling” is exactly what trips me up later.
In Gemma4 26B-A4B, the “A4B” means 4B active. It’s a Mixture-of-Experts architecture where, out of the full 26B, only 4B actually fire per token — so it runs far faster than its 26B size would suggest.
🛠️ The two-month evolution arc: I swapped models five times
It wasn’t a one-shot landing. I went through five stages, from baseline to where I am now. Each stage had a reason for the switch, so let me show you the table first.

| Stage | Period | Main coder (Qwen) | Assistant (Gemma) | Key measurement | Reason for switch |
|---|---|---|---|---|---|
| Pre | ~04-04 | Qwen3.5-122B-A10B Q5_K_M | (none) | 23.7 tok/s, ctx 200K | Pre-optimization baseline |
| S1 | 04-04~04-19 | Qwen3-Coder-Next 80B-A3B MXFP4 (~45GB) | Added gemma-4-26B-A4B MXFP4 | Qwen 50.98 tok/s | 122B→80B: speed +115%, size 81→45GB |
| S2 | 04-19~05-19 | Qwen3.6-35B-A3B NVFP4 (~25GB) | (same) | ctx 1M, SWE-bench Pro 44→49 | Half the size + omnimodal + higher reasoning score |
| S3 | 05-19~06-10 | Qwen3.6-27B dense → MTP | (same) | Qwen 28~40 / Gemma 55~56 | 35B went OOM at 1M ctx → downsized |
| S4 (current) | 06-10~ | Retired (~99GB reclaimed) | gemma-4-26B-A4B-it-qat-UD-Q4_K_XL (14.25GB) | Decode 109.3 tok/s | Gemma is 3~12x faster and on par or better in practice |
The start: 122B alone was a flat “can’t use this”
At first I loaded Qwen3.5-122B on its own. But 23.7 tok/s. Watching it drop one character at a time was so frustrating that I just couldn’t put it to real work. So I switched to the smaller, faster Qwen3-Coder-Next 80B and bolted Gemma4 26B alongside it as an assistant, going dual. I ran the two models as separate llama.cpp instances on ports 8080 and 8081. With this, Qwen’s speed jumped from 23.7 to 50.98 tok/s, a +115% gain. The size also shrank from 81GB to 45GB.
Both are 4-bit quantization formats. NVFP4 uses 16-element blocks + a two-level scale, so it fits the value range more finely than MXFP4 (32-element blocks) and ends up with smaller quantization error.
The greedy phase: I chased a “smarter coder”
This is where greed crept in. I wanted a smarter coder, so I climbed aboard Qwen3.6-35B-A3B. It’s 35B but only 3B active MoE so it’s light, you can stretch the context to 1M with YaRN, it’s omnimodal, and its SWE-bench Pro score had gone up from 44 to 49. Half the size and smarter — there was no reason not to switch. I briefly slotted in vLLM (NVFP4) at one point and then came back to llama.cpp, but the reason for that isn’t in my notes, so I won’t assert anything about it here.
The trial and error: turning on 1M context triggered an OOM
The problem blew up the moment greed met the ceiling. When I brought up the 35B at 1M context, memory usage shot up to ~51GB including the KV cache, and in the dual configuration with Gemma loaded alongside, it hit the unified-memory ceiling (~124GiB) and went OOM. My ambition to run the smartest model at the widest context just snapped right in front of physical memory.
So I dropped down a notch and moved to Qwen3.6-27B (dense). It stabilized at around 28~40 tok/s, but that’s when I noticed that the Gemma4 quietly running alongside it was clocking 55~56 tok/s. “Wait, the little one I bolted on as an assistant is faster?” This is where the story flips.

📊 Head to head: once I actually pitted them, Gemma won
So I put the 27B Qwen and Gemma4 head to head on the same tasks. The results surprised me. The decode speed wasn’t even a contest, and on practical quality Gemma was basically on par or better across the board. The only thing Qwen won was the hardest, AIME-grade math.
| Item | Qwen3.6-27B | Gemma4 26B-A4B | Result |
|---|---|---|---|
| Decode speed | 28~40 tok/s | 110~130 tok/s | Gemma wins big (3~12x) |
| Vision OCR | 6.3s | 1.4s | Gemma wins big |
| Hardest math | 738 ✓ | 657 ✗ | Qwen’s only edge |
Here’s the crux. The one area Qwen wins (hardest math) is something I don’t hand to local anyway. That kind of work goes to a frontier model (Claude). So the reason to keep hugging a 99GB Qwen locally just evaporated. It was the moment the data confirmed that the Gemma I’d kept as a sidekick could become the main act.
So I made the call. I took the Qwen service down and disabled it, reclaiming about 99GB.
systemctl stop llama-qwen
systemctl disable llama-qwen
# reclaim ~99GB of unified memory → pour it all into Gemma
⚡ The climax: I pushed Gemma alone up to 109 tok/s
Now that Gemma was running solo, I poured all the reclaimed memory headroom into optimizing it. Here I used two weapons: QAT requantization and an MTP draft.
A technique that simulates low-precision arithmetic ahead of time during training to reduce the quality drop after quantization. Google had just released the Gemma4 QAT checkpoints in early June 2026, which happened to line up perfectly with my timing.
First I switched from the default MXFP4 quantization to Google’s released QAT checkpoint (gemma-4-26B-A4B-it-qat-UD-Q4_K_XL). Its weight dropped from 16.7GB to 14.25GB with almost no quality loss. That swap alone bumped decode from 57.6 to 80.8 tok/s (+40%).
Then I attached an MTP draft head.
A speculative decoding technique that, with no separate draft model, attaches an auxiliary prediction head to the model to guess the next 2~4 tokens at once in a single forward pass. It gets a high acceptance rate on code and structured output.
In llama.cpp I turned MTP draft on with n=2 (a conservative, stable setting). The run command looks roughly like this.
llama-server \
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf \
--mmproj mmproj-gemma4-qat.gguf \
--spec-type mtp \
--spec-draft-n-max 2 \
-c 262144 \
--port 8080
In my environment the draft acceptance rate came out at about 82%, and decode climbed from 80.8 all the way up to 109.3 tok/s. That’s +90% over the starting point (57.6). Every single step is measured, so summarized in a table it looks like this.

| Configuration | Decode speed | vs. starting point |
|---|---|---|
| MXFP4 (default) | 57.6 tok/s | — |
| QAT UD-Q4_K_XL | 80.8 tok/s | +40% |
| QAT + MTP draft n=2 | 109.3 tok/s | +90% |
✅ The lesson: locally, a “fast and smart-enough assistant” wins
If I compress the two-month conclusion into one line, it’s this. Locally, the winning play in practice is not a giant do-everything model, but a fast, smart-enough assistant model — with the hard stuff handed off to a frontier model, splitting the roles.
I followed the “bigger is better” intuition and got greedy all the way from 122B → 80B → 35B → 27B, and it was only after I snapped once against the physical limit of OOM that I looked straight at the data. When I did, the little Gemma4 I’d kept as a sidekick was 3~12x faster and on par or better in practice, and the single area Qwen won (hardest math) was never local’s job to begin with. So I reclaimed 99GB, funneled it all into one Gemma, and made 109 tok/s with QAT and MTP.
If you’re mulling over a local LLM setup, my advice is: don’t start by hunting for “the biggest model” — start with “a smart-enough model that can handle 90% of my work fast.” Leave the remaining 10% to a frontier model. That’s the most expensive lesson I learned the hard way.
📚 References
- NVIDIA DGX Spark product page — nvidia.com (GB10, 128GB unified memory, 273GB/s)
- Google Gemma 4 QAT announcement — blog.google (2026-06)
- Gemma 4 26B-A4B model card — Hugging Face
- Qwen3.6-35B-A3B — qwen.ai (262K→1M context)
- llama.cpp speculative decoding (MTP) docs — GitHub
- NVFP4 vs MXFP4 — NVIDIA Technical Blog
- Related post: DGX Spark initial setup notes — covers the hardware configuration and the first llama.cpp install.
In the next post I’ll continue with how I attached vision OCR and tool calling to the Gemma4-only setup, plus the trial and error I hit with the mmproj configuration. Let’s enjoy plenty of operation with just one small model, together.
