🔍 An 80B and a 26B Model at Once on 128GB Unified Memory?
Bottom line: on a single NVIDIA DGX Spark, you can serve an 80B-class coding model and a 26B-class conversation model simultaneously. Zero API cost, zero network latency. It’s all possible thanks to 128GB unified memory and Blackwell’s native MXFP4 support.
This post walks through the full setup of a llama.cpp dual server on DGX Spark, running Qwen3-Coder-Next 80B and Gemma 4 26B-A4B side by side. It covers real-world performance, memory optimization, and systemd service configuration.
This is a hands-on continuation of the DGX Spark vs Mac Studio comparison covered in a previous post.
📋 DGX Spark — Why This Hardware
DGX Spark is a desktop AI supercomputer built on the NVIDIA Grace Blackwell architecture. The CPU and GPU share a single 128GB pool of LPDDR5x unified memory.
Key specs at a glance:
| Item | Value |
|---|---|
| GPU | NVIDIA GB10 (Blackwell, Compute 12.1) |
| Memory | 128GB Unified LPDDR5x (121GB usable) |
| Bandwidth | 273 GB/s |
| FP16 Performance | ~100 TFLOPS |
| CPU | ARM 20-core (10× Cortex-X925 + 10× A725) |
| Storage | 916GB NVMe |
| Power | ~4W idle / ~35W load |
| Price | $4,699 (raised Feb 2026) |
It launched at $3,999, then jumped 18% in February 2026 due to LPDDR5x supply issues. Even so, DGX Spark is the only option that gives you 128GB unified memory with a Blackwell GPU at this price point.
🛠️ Dual Server Setup — Model Separation by Use Case
Two llama.cpp server instances run on a single DGX Spark, separated by port. Each is managed as an independent systemd service.
| Item | Port 8080 — Qwen3 Coder | Port 8081 — Gemma 4 |
|---|---|---|
| Model | Qwen3-Coder-Next 80B | Gemma 4 26B-A4B |
| Quantization | MXFP4 MoE | MXFP4 MoE |
| Model size | ~48GB | ~16.7GB |
| Active params | 3B / 80B total | 3.8B / 26B total |
| Context | 800K (200K per slot) | 200K |
| Parallel slots | 4 | 1 |
| Threads | 16 | 8 |
| Generation speed | 43.5 tok/s | 57 tok/s |
| Use case | Coding, general, sub-agents | Conversation, AI agents |
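Both models use a MoE (Mixture of Experts) architecture, so only a small fraction of parameters is active per token: Qwen3 activates 3B out of 80B, and Gemma 4 activates 3.8B out of 26B. That keeps generation fast on 273 GB/s of bandwidth. All expert weights still have to sit in memory, though, so the reason both models fit within 128GB is MXFP4 quantization, which brings the weights down to roughly 48GB and 16.7GB.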
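As a rough check on the sizes in the table, the file size and total parameter count give an effective bits-per-weight figure. This is a sketch only: it assumes decimal GB and the rounded parameter counts quoted above, and the function names are made up for illustration.

```shell
#!/bin/sh
# bits_per_weight SIZE_GB PARAMS_BILLION -> average bits stored per parameter
bits_per_weight() {
  awk -v gb="$1" -v params="$2" 'BEGIN { printf "%.2f\n", gb * 8 / params }'
}

# Pure MXFP4: 32 four-bit values share one 8-bit scale
mxfp4_bits() {
  awk 'BEGIN { printf "%.2f\n", (32 * 4 + 8) / 32 }'
}

echo "Qwen3-Coder-Next: $(bits_per_weight 48 80) bits/weight"
echo "Gemma 4 26B-A4B:  $(bits_per_weight 16.7 26) bits/weight"
echo "Pure MXFP4:       $(mxfp4_bits) bits/weight"
```

Both files come out somewhat above the 4.25 bits of pure MXFP4, which would be consistent with some tensors being kept at higher precision, though the post does not break that down.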
Model Benchmarks
| Benchmark | Qwen3-Coder-Next 80B | Gemma 4 26B-A4B |
|---|---|---|
| SWE-Bench Verified | 70.6% | — |
| SWE-Bench Pro | 44.3% | — |
| LMArena | — | #6 (1441 pts) |
| vs Dense model | — | 97% perf (8× less compute) |
Qwen3-Coder-Next hits 70.6% on SWE-Bench Verified, edging out DeepSeek-V3.2 (70.2%). Gemma 4 26B ranks #6 on LMArena, delivering 97% of dense model performance with 8× less compute.
Shared Server Flags
Both servers share these optimization flags:
# Common llama-server options
--host 0.0.0.0
--n-gpu-layers 999 # Full GPU offload
--flash-attn # Flash Attention ON
--no-mmap # Avoid mmap perf loss on DGX Spark
--cache-type-k q8_0 # KV cache quantization
--cache-type-v q8_0
--no-mmap is mandatory on DGX Spark. On its unified memory architecture, mmap actually degrades performance. This has been confirmed by the HuggingFace community as well.
Qwen3 Server Start Script
The Port 8080 start script. 800K context split across 4 slots for concurrent request handling.
#!/bin/bash
# start-qwen.sh — Port 8080
MODEL="/mnt/nas/Data_Vol1/models/"
MODEL+="Qwen3-Coder-Next-MXFP4_MOE.gguf"
/home/terry/llama.cpp/build/bin/\
llama-server \
--model "$MODEL" \
--host 0.0.0.0 \
--port 8080 \
--n-gpu-layers 999 \
--flash-attn \
--no-mmap \
--ctx-size 800000 \
--parallel 4 \
--threads 16 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--samplers "top_k;top_p;temp" \
--temp 0.7 \
--top-p 0.95 \
--top-k 20
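The "200K per slot" figure follows from the two flags above. A tiny sketch, assuming the total context is divided evenly among the parallel slots (the function name is illustrative):

```shell
#!/bin/sh
# per_slot_ctx CTX_SIZE PARALLEL -> tokens of context available to each slot
per_slot_ctx() {
  echo $(( $1 / $2 ))
}

per_slot_ctx 800000 4   # prints 200000
```

Raising --parallel without raising --ctx-size therefore shrinks the window each request can use.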
Gemma 4 Server Start Script
The Port 8081 server is tuned for conversation — lightweight with 1 slot and 8 threads.
#!/bin/bash
# start-gemma4.sh — Port 8081
MODEL="/mnt/nas/Data_Vol1/models/"
MODEL+="gemma-4-26B-A4B-it-MXFP4_MOE.gguf"
/home/terry/llama.cpp/build/bin/\
llama-server \
--model "$MODEL" \
--host 0.0.0.0 \
--port 8081 \
--n-gpu-layers 999 \
--flash-attn \
--no-mmap \
--ctx-size 200000 \
--parallel 1 \
--threads 8 \
--ubatch-size 512 \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--samplers "top_k;min_p;temp" \
--temp 0.6 \
--min-p 0.05 \
--top-k 40
Registering as systemd Services
Register both servers as systemd services to auto-start on boot:
# /etc/systemd/system/llama-server.service
[Unit]
Description=Qwen3-Coder-Next (Port 8080)
After=network.target
[Service]
Type=simple
User=terry
ExecStart=/home/terry/start-qwen.sh
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# llama-server-gemma4.service follows the same structure
# Only ExecStart changes to start-gemma4.sh
# Reload unit files, then enable both services
sudo systemctl daemon-reload
sudo systemctl enable llama-server
sudo systemctl enable llama-server-gemma4
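Large models take a while to load, so it helps to confirm that each service is actually answering before pointing clients at it. The sketch below polls llama-server's /health endpoint. The LLAMA_HOST variable, the function names, and the retry defaults are assumptions for illustration, not part of the original setup.

```shell
#!/bin/sh
# health_url PORT -> URL of the server's health endpoint
health_url() {
  printf 'http://%s:%s/health\n' "${LLAMA_HOST:-127.0.0.1}" "$1"
}

# wait_for_server PORT [RETRIES] [DELAY_SECONDS]
wait_for_server() {
  port=$1; retries=${2:-30}; delay=${3:-10}
  i=0
  while [ "$i" -lt "$retries" ]; do
    # -f makes curl exit non-zero on HTTP errors, e.g. while the model is loading
    if curl -fsS -o /dev/null "$(health_url "$port")" 2>/dev/null; then
      echo "port $port: ready"
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  echo "port $port: not ready after $retries attempts" >&2
  return 1
}
```

After starting the services, `wait_for_server 8080 && wait_for_server 8081` reports when both are up. If one never becomes ready, `journalctl -u llama-server -f` shows the server log.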
⚡ 5 Key Optimizations
1. Blackwell Native Build
Build llama.cpp with CMAKE_CUDA_ARCHITECTURES=121a-real to use Blackwell-specific kernels.
# Run from the llama.cpp repository root
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="121a-real"
cmake --build build -j20
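To avoid hard-coding the architecture, the value can be derived from what the driver reports. A sketch under two assumptions: that the installed nvidia-smi supports the compute_cap query field, and that the arch-specific "a" suffix used in this post applies to your GPU. The function names are illustrative.

```shell
#!/bin/sh
# cuda_arch_from_cap "12.1" -> "121a-real" (suffix follows the value used in this post)
cuda_arch_from_cap() {
  printf '%sa-real\n' "$(printf '%s' "$1" | tr -d '.')"
}

# detect_cuda_arch: read the first GPU's compute capability from the driver
detect_cuda_arch() {
  cap=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
  cuda_arch_from_cap "$cap"
}
```

On the DGX Spark, `detect_cuda_arch` should print the same 121a-real string passed to CMake above, and the result can be substituted directly: `-DCMAKE_CUDA_ARCHITECTURES="$(detect_cuda_arch)"`.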
2. MXFP4 Quantization
Blackwell’s native MXFP4 support speeds up prompt processing by up to 25%. It uses FP4 Tensor Core instructions directly.
3. KV Cache q8_0 Quantization
--cache-type-k q8_0 --cache-type-v q8_0 cuts KV cache memory by about 47% compared with the default f16 cache. This is what makes 800K context feasible.
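The 47% figure can be reproduced from the storage formats alone. A minimal sketch, assuming the baseline is an f16 cache and that q8_0 stores each block of 32 values as 32 int8 values plus one 16-bit scale (the function name is made up for illustration):

```shell
#!/bin/sh
# kv_saving_pct: percent of KV cache memory saved by q8_0 relative to f16
kv_saving_pct() {
  awk 'BEGIN {
    q8_bits  = (32 * 8 + 16) / 32   # 8.5 bits per cached value
    f16_bits = 16
    printf "%.1f\n", (1 - q8_bits / f16_bits) * 100
  }'
}

kv_saving_pct   # prints 46.9
```

The saving applies to both the K and V caches because both flags are set to q8_0.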
4. NVMe Read-Ahead Optimization
Raising the block device's read_ahead_kb from the default 128KB to 8192KB significantly improves sequential read performance during model loading.
# Apply immediately
echo 8192 | sudo tee \
/sys/block/nvme0n1/queue/\
read_ahead_kb
# Persist across reboots (udev rule)
# /etc/udev/rules.d/99-nvme-readahead.rules
ACTION=="add|change", \
KERNEL=="nvme0n1", \
ATTR{queue/read_ahead_kb}="8192"
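It is worth confirming that the setting took effect, both right after writing it and after a reboot. The helper below is a sketch. The function names are illustrative, and the device name nvme0n1 is carried over from the rule above and may differ on other machines.

```shell
#!/bin/sh
# kb_to_sectors KB -> 512-byte sectors, the unit that `blockdev --getra` reports
kb_to_sectors() {
  echo $(( $1 * 1024 / 512 ))
}

# check_read_ahead DEVICE EXPECTED_KB: compare the live sysfs value with the target
check_read_ahead() {
  actual=$(cat "/sys/block/$1/queue/read_ahead_kb")
  if [ "$actual" -eq "$2" ]; then
    echo "$1: read_ahead_kb=$actual (ok)"
  else
    echo "$1: read_ahead_kb=$actual, expected $2" >&2
    return 1
  fi
}
```

To apply the rule without rebooting, run `sudo udevadm control --reload-rules && sudo udevadm trigger`, then `check_read_ahead nvme0n1 8192`. As a cross-check, `sudo blockdev --getra /dev/nvme0n1` should report the value given by `kb_to_sectors 8192`.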
5. --no-mmap Is Non-Negotiable
On DGX Spark’s unified memory architecture, mmap triggers unnecessary page faults. Loading directly into memory with --no-mmap is measurably faster.
📊 Performance Comparison — DGX Spark vs Mac Studio M4 Ultra

Based on Skorppio’s benchmark, here’s how the two platforms compare:
| Item | DGX Spark | Mac Studio M4 Ultra |
|---|---|---|
| Memory | 128GB LPDDR5x | 192GB LPDDR5x |
| Bandwidth | 273 GB/s | 819 GB/s |
| FP16 Compute | ~100 TFLOPS | ~26 TFLOPS |
| Prefill speed | 3.8× faster | 1× |
| Generation speed | 1× | 3.4× faster |
| CUDA support | ✅ (PyTorch, vLLM) | ❌ |
| Price | $4,699 | ~$6,299 |
Mac Studio wins on memory bandwidth (3×), which gives it faster token generation. DGX Spark wins on FP16 compute (4×), which dominates prefill (prompt processing). If you need CUDA workloads — fine-tuning, vLLM, PyTorch — DGX Spark is the only option.
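The link between bandwidth and generation speed can be sketched with a back-of-envelope bound: each generated token has to read at least the active weights once, so tokens per second cannot exceed bandwidth divided by active bytes. The numbers below assume 3B active parameters at pure MXFP4 (4.25 bits per weight). That is an idealization, and the function name is illustrative.

```shell
#!/bin/sh
# max_tok_per_s BANDWIDTH_GBPS ACTIVE_PARAMS_BILLION BITS_PER_WEIGHT
# -> theoretical ceiling on generated tokens per second
max_tok_per_s() {
  awk -v bw="$1" -v p="$2" -v bits="$3" \
    'BEGIN { printf "%.0f\n", bw / (p * bits / 8) }'
}

echo "273 GB/s ceiling: $(max_tok_per_s 273 3 4.25) tok/s"
echo "819 GB/s ceiling: $(max_tok_per_s 819 3 4.25) tok/s"
```

Measured speeds land well below these ceilings because KV cache reads, attention, and higher-precision tensors add to the memory traffic. The useful part is the ratio between the two platforms, which tracks the 3× bandwidth gap.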
Using EXO 1.0, you can cluster a DGX Spark and a Mac Studio together to combine both advantages. There are documented cases of achieving 4× faster inference by pairing DGX’s fast prefill with Mac’s fast generation.
🔮 CES 2026 Software Update
According to NVIDIA’s CES 2026 announcement, software optimizations alone have boosted DGX Spark performance by 2.5× from launch.
- NVFP4 + Eagle3 speculative decoding: 2.6× throughput over FP8
- Video processing: 8× speed improvement
- How to apply: TensorRT-LLM + aggressive quantization + speculative decoding
A 2.5× performance gain from software alone — with no hardware changes — demonstrates the platform’s long-term value.
💾 Measured Memory — Dual Server in Production
| Configuration | Memory Used | Remaining |
|---|---|---|
| Qwen3 80B (800K ctx) | ~70GB | — |
| Gemma 4 26B (200K ctx) | ~22GB | — |
| Total | ~92.65GB | ~28GB |
92.65GB out of 121GB usable, with 28GB to spare. That’s enough headroom for the OS and other processes. MXFP4 quantization + q8_0 KV cache makes this density possible.
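The same table also shows how much memory each server uses beyond its weights, which is mostly KV cache plus compute buffers. A small sketch using only the figures reported in this post (the function name is illustrative):

```shell
#!/bin/sh
# gb_diff A B -> A minus B, in GB with two decimals
gb_diff() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a - b }'
}

echo "Headroom:              $(gb_diff 121 92.65) GB"
echo "Qwen3 beyond weights:  $(gb_diff 70 48) GB"
echo "Gemma 4 beyond weights: $(gb_diff 22 16.7) GB"
```

Because the per-server figures in the table are rounded, the results are approximate. The point is that an 800K context costs on the order of 20GB on top of the Qwen3 weights even with the q8_0 cache.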
🌐 API Endpoints — OpenAI Compatible
Both servers expose llama-server's OpenAI-compatible API. No API key is required, since neither start script sets --api-key.
# Qwen3 Coder — coding / general
curl http://{DGX_IP}:8080/v1/\
chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen3-Coder-Next",
"messages": [{"role": "user",
"content": "Hello"}]
}'
# Gemma 4 — conversation / agents
curl http://{DGX_IP}:8081/v1/\
chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gemma-4-26B-A4B",
"messages": [{"role": "user",
"content": "Hello"}]
}'
Any tool — OpenAI SDK, LangChain, Ollama client — connects immediately by changing base_url.
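For tools built on the official OpenAI SDKs, the base URL can usually be set through environment variables instead of code changes. A sketch, assuming the SDK in use reads OPENAI_BASE_URL and OPENAI_API_KEY. The default address, the helper name, and the dummy key are placeholders.

```shell
#!/bin/sh
# DGX_IP is a placeholder: replace the default with the machine's address
DGX_IP="${DGX_IP:-127.0.0.1}"

# base_url PORT -> the /v1 root that OpenAI-style clients expect
base_url() {
  printf 'http://%s:%s/v1\n' "$DGX_IP" "$1"
}

# Point SDK-based tools at the Qwen3 server. The key is a dummy value:
# the servers above do not check it, but the SDK requires one to be set.
export OPENAI_BASE_URL="$(base_url 8080)"
export OPENAI_API_KEY="sk-no-key-required"
```

Switching a tool to Gemma 4 is then a matter of exporting `$(base_url 8081)` instead. `curl "$(base_url 8080)/models"` is a quick way to confirm which model a port is serving.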
📚 References
- NVIDIA DGX Spark Official Page
- Qwen3-Coder-Next Official Blog
- Google Gemma 4 Announcement
- llama.cpp DGX Spark Performance Discussion (GitHub)
- llama.cpp MXFP4 Blackwell PR (NVIDIA Forums)
- DGX Spark vs Mac Studio Efficiency Benchmark (Skorppio)
- CES 2026 DGX Spark Software Update (NVIDIA Blog)
- DGX Spark Price Increase (Tom’s Hardware)
- EXO 1.0 — DGX Spark + Mac Studio Hybrid
✅ Summary
Running Qwen3-Coder-Next 80B (coding) and Gemma 4 26B (conversation) simultaneously on a single DGX Spark is fully viable. The key combination is MoE architecture’s low active parameter count + MXFP4 quantization + q8_0 KV cache. 92.65GB covers both models with 28GB left over.
43–57 tok/s local inference with zero API cost is a compelling setup for individual developers. $4,699 is a real commitment, but cloud API costs make the payback period shorter than it looks.

