llama.cpp Dual Server on DGX Spark GB10

Dual llama.cpp Server on DGX Spark — Running Qwen3-Coder-Next 80B + Gemma 4 26B Simultaneously

🖥️ NVIDIA DGX Spark GB10 · Ubuntu 24.04.4 LTS · CUDA 13.0 · llama.cpp (Blackwell native build) · 128GB Unified LPDDR5x


🔍 An 80B and a 26B Model at Once on 128GB Unified Memory?

Bottom line: on a single NVIDIA DGX Spark, you can serve an 80B-class coding model and a 26B-class conversation model simultaneously. Zero API cost, zero network latency. It’s all possible thanks to 128GB unified memory and Blackwell’s native MXFP4 support.

This post walks through the full setup of a llama.cpp dual server on DGX Spark, running Qwen3-Coder-Next 80B and Gemma 4 26B-A4B side by side. It covers real-world performance, memory optimization, and systemd service configuration.

This is a hands-on continuation of the DGX Spark vs Mac Studio comparison covered in a previous post.

📋 DGX Spark — Why This Hardware

DGX Spark GB10

A desktop AI supercomputer based on the NVIDIA Grace Blackwell architecture. The CPU and GPU share 128GB LPDDR5x memory in a unified memory configuration.

Key specs at a glance:

Item	Value
GPU	NVIDIA GB10 (Blackwell, Compute 12.1)
Memory	128GB Unified LPDDR5x (121GB usable)
Bandwidth	273 GB/s
FP16 Performance	~100 TFLOPS
CPU	ARM 20-core (10× Cortex-X925 + 10× A725)
Storage	916GB NVMe
Power	~4W idle / ~35W load
Price	$4,699 (raised Feb 2026)

It launched at $3,999, then jumped 18% in February 2026 due to LPDDR5x supply issues. Even so, DGX Spark is the only option that gives you 128GB unified memory with a Blackwell GPU at this price point.

🛠️ Dual Server Setup — Model Separation by Use Case

Two llama.cpp server instances run on a single DGX Spark, separated by port. Each is managed as an independent systemd service.

Item	Port 8080 — Qwen3 Coder	Port 8081 — Gemma 4
Model	Qwen3-Coder-Next 80B	Gemma 4 26B-A4B
Quantization	MXFP4 MoE	MXFP4 MoE
Model size	~48GB	~16.7GB
Active params	3B / 80B total	3.8B / 26B total
Context	800K (200K per slot)	200K
Parallel slots	4	1
Threads	16	8
Generation speed	43.5 tok/s	57 tok/s
Use case	Coding, general, sub-agents	Conversation, AI agents

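Both models use a MoE (Mixture of Experts) architecture, so only a small fraction of parameters is active for any given token. Qwen3 activates 3B out of 80B; Gemma 4 activates 3.8B out of 26B. That keeps per-token compute and memory traffic low, which is why both models generate at usable speeds on 273 GB/s of bandwidth. It does not shrink the memory footprint: every expert must stay resident, so fitting both models within 128GB depends on MXFP4 quantization and the q8_0 KV cache.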
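As a quick sanity check on the table above, the sketch below derives the active-parameter fraction and the average bits per weight implied by each file size. The helper name moe_stats is made up for illustration, and GB is assumed to mean 10^9 bytes.

```shell
#!/bin/sh
# Figures copied from the comparison table; GB is assumed to be 10^9 bytes.
# Arguments: name, active params (B), total params (B), file size (GB)
moe_stats() {
  awk -v name="$1" -v active="$2" -v total="$3" -v size_gb="$4" 'BEGIN {
    printf "%s: %.2f%% active, %.2f bits/weight\n",
      name, active / total * 100, size_gb * 8 / total
  }'
}

moe_stats "Qwen3-Coder-Next" 3 80 48
moe_stats "Gemma-4-26B-A4B" 3.8 26 16.7
```

The active fraction governs how much data is read per generated token, while the file size governs how much memory the weights occupy. Both averages come out above the 4.25 bits per weight of pure MXFP4, presumably because some tensors are stored at higher precision.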

Model Benchmarks

Benchmark	Qwen3-Coder-Next 80B	Gemma 4 26B-A4B
SWE-Bench Verified	70.6%	—
SWE-Bench Pro	44.3%	—
LMArena	—	#6 (1441 pts)
vs Dense model	—	97% perf (8× less compute)

Qwen3-Coder-Next hits 70.6% on SWE-Bench Verified, edging out DeepSeek-V3.2 (70.2%). Gemma 4 26B ranks #6 on LMArena, delivering 97% of dense model performance with 8× less compute.

Shared Server Flags

Both servers share these optimization flags:

# Common llama-server options
--host 0.0.0.0
--n-gpu-layers 999          # Full GPU offload
--flash-attn                # Flash Attention ON
--no-mmap                   # Avoid mmap perf loss on DGX Spark
--cache-type-k q8_0         # KV cache quantization
--cache-type-v q8_0

⚠️ --no-mmap is mandatory on DGX Spark. On its unified memory architecture, mmap actually degrades performance. This has been confirmed by the HuggingFace community as well.

Qwen3 Server Start Script

The Port 8080 start script. 800K context split across 4 slots for concurrent request handling.

#!/bin/bash
# start-qwen.sh — Port 8080
MODEL="/mnt/nas/Data_Vol1/models/"
MODEL+="Qwen3-Coder-Next-MXFP4_MOE.gguf"

/home/terry/llama.cpp/build/bin/\
llama-server \
  --model "$MODEL" \
  --host 0.0.0.0 \
  --port 8080 \
  --n-gpu-layers 999 \
  --flash-attn \
  --no-mmap \
  --ctx-size 800000 \
  --parallel 4 \
  --threads 16 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --samplers "top_k;top_p;temp" \
  --temp 0.7 \
  --top-p 0.95 \
  --top-k 20
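The 200K-per-slot figure in the table follows from dividing --ctx-size by --parallel. A minimal sketch of that arithmetic, assuming the context is split evenly across slots as the table implies (per_slot_ctx is a hypothetical helper, not part of llama.cpp):

```shell
#!/bin/sh
# Hypothetical helper: context tokens available to each slot,
# assuming --ctx-size is divided evenly across --parallel slots.
per_slot_ctx() {
  echo $(( $1 / $2 ))
}

per_slot_ctx 800000 4   # Qwen3 server
per_slot_ctx 200000 1   # Gemma 4 server
```

Both calls print 200000, so a single request sees the same window on either server. Raising --parallel without raising --ctx-size would shrink each slot's share.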

Gemma 4 Server Start Script

The Port 8081 server is tuned for conversation — lightweight with 1 slot and 8 threads.

#!/bin/bash
# start-gemma4.sh — Port 8081
MODEL="/mnt/nas/Data_Vol1/models/"
MODEL+="gemma-4-26B-A4B-it-MXFP4_MOE.gguf"

/home/terry/llama.cpp/build/bin/\
llama-server \
  --model "$MODEL" \
  --host 0.0.0.0 \
  --port 8081 \
  --n-gpu-layers 999 \
  --flash-attn \
  --no-mmap \
  --ctx-size 200000 \
  --parallel 1 \
  --threads 8 \
  --ubatch-size 512 \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --samplers "top_k;min_p;temp" \
  --temp 0.6 \
  --min-p 0.05 \
  --top-k 40

Registering as systemd Services

# /etc/systemd/system/llama-server.service
[Unit]
Description=Qwen3-Coder-Next (Port 8080)
After=network.target

[Service]
Type=simple
User=terry
ExecStart=/home/terry/start-qwen.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# llama-server-gemma4.service follows the same structure
# Only ExecStart changes to start-gemma4.sh

sudo systemctl daemon-reload
sudo systemctl enable --now llama-server
sudo systemctl enable --now llama-server-gemma4
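Loading tens of gigabytes takes a while, so a client that connects right after boot may hit a server that is not ready yet. One way to handle this is to poll llama-server's /health endpoint until it answers successfully. The retry helper below is a sketch; wait_until and the attempt counts are illustrative, not something the original setup ships.

```shell
#!/bin/sh
# Hypothetical helper: retry a command until it succeeds or attempts run out.
# Usage: wait_until <attempts> <delay_seconds> <command...>
wait_until() (
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && exit 0
    i=$((i + 1))
    sleep "$delay"
  done
  exit 1
)

# Example (needs the services running):
#   wait_until 60 5 curl -sf http://127.0.0.1:8080/health
#   wait_until 60 5 curl -sf http://127.0.0.1:8081/health
```

The function body runs in a subshell, so its variables do not leak into the calling script, and curl's -f flag turns an HTTP error status into a non-zero exit code that triggers the next retry.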

⚡ 5 Key Optimizations

1. Blackwell Native Build

Build llama.cpp with CMAKE_CUDA_ARCHITECTURES=121a-real to use Blackwell-specific kernels.

# Run from the llama.cpp repository root
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="121a-real"
cmake --build build -j20

2. MXFP4 Quantization

Blackwell’s native MXFP4 support speeds up prompt processing by up to 25%. It uses FP4 Tensor Core instructions directly.

3. KV Cache q8_0 Quantization

--cache-type-k q8_0 --cache-type-v q8_0 cuts KV cache memory by 47%. This is what makes 800K context feasible.

💡 You can go further with q4_0, but on unified memory architecture the dequantization overhead drops generation speed by 34–37%. q8_0 is the optimal balance between memory savings and throughput.
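The 47% figure can be reproduced from ggml's block layouts: a block of 32 values takes 64 bytes in f16, 34 bytes in q8_0 (32 int8 values plus one fp16 scale), and 18 bytes in q4_0 (16 bytes of packed 4-bit values plus one fp16 scale). A small sketch of the arithmetic, with kv_saving_pct as an illustrative helper name:

```shell
#!/bin/sh
# Bytes per 32-value block in ggml: f16 = 64, q8_0 = 34, q4_0 = 18
# (quantized values plus one fp16 scale per block).
kv_saving_pct() {
  awk -v bytes="$1" 'BEGIN { printf "%.3f\n", (1 - bytes / 64) * 100 }'
}

kv_saving_pct 34   # q8_0: prints 46.875
kv_saving_pct 18   # q4_0: prints 71.875
```

So q8_0 saves just under 47% relative to an f16 cache, and q4_0 would save about 72% at the throughput cost described above.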

4. NVMe Read-Ahead Optimization

Raising read_ahead_kb from its 128KB default to 8192KB significantly improves sequential read performance during model loading.

# Apply immediately
echo 8192 | sudo tee \
  /sys/block/nvme0n1/queue/\
read_ahead_kb

# Persist across reboots (udev rule)
# /etc/udev/rules.d/
#   99-nvme-readahead.rules
ACTION=="add|change", \
  KERNEL=="nvme0n1", \
  ATTR{queue/read_ahead_kb}="8192"
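To confirm the setting survived a reboot, it helps to read the value back from sysfs. The helper below is a sketch: readahead_kb is a made-up name, and its optional second argument (an alternate sysfs root) exists only so it can be exercised against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: print read_ahead_kb for a block device.
# The optional second argument is an alternate sysfs root; it defaults to /sys.
readahead_kb() {
  file="${2:-/sys}/block/$1/queue/read_ahead_kb"
  if [ -r "$file" ]; then
    cat "$file"
  else
    echo "no such device: $1" >&2
    return 1
  fi
}

# Should print 8192 once the udev rule has been applied.
readahead_kb nvme0n1 || true
```

To load a new rule without rebooting, the usual route is sudo udevadm control --reload-rules followed by sudo udevadm trigger.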

5. --no-mmap Is Non-Negotiable

On DGX Spark’s unified memory architecture, mmap triggers unnecessary page faults. Loading directly into memory with --no-mmap is measurably faster.

📊 Performance Comparison — DGX Spark vs Mac Studio M4 Ultra


Based on Skorppio’s benchmark, here’s how the two platforms compare:

Item	DGX Spark	Mac Studio M4 Ultra
Memory	128GB LPDDR5x	192GB LPDDR5x
Bandwidth	273 GB/s	819 GB/s
FP16 Compute	~100 TFLOPS	~26 TFLOPS
Prefill speed	3.8× faster	1×
Generation speed	1×	3.4× faster
CUDA support	✅ (PyTorch, vLLM)	❌
Price	$4,699	~$6,299

Mac Studio wins on memory bandwidth (3×), which gives it faster token generation. DGX Spark wins on FP16 compute (4×), which dominates prefill (prompt processing). If you need CUDA workloads such as fine-tuning or vLLM, DGX Spark is the only option of the two.

Using EXO 1.0, you can cluster a DGX Spark and a Mac Studio together to combine both advantages. There are documented cases of achieving 4× faster inference by pairing DGX’s fast prefill with Mac’s fast generation.

🔮 CES 2026 Software Update

According to NVIDIA’s CES 2026 announcement, software optimizations alone have boosted DGX Spark performance by 2.5× from launch.

NVFP4 + Eagle3 speculative decoding: 2.6× throughput over FP8
Video processing: 8× speed improvement
How to apply: TensorRT-LLM + aggressive quantization + speculative decoding

A 2.5× performance gain from software alone — with no hardware changes — demonstrates the platform’s long-term value.

💾 Measured Memory — Dual Server in Production

Configuration	Memory Used	Remaining
Qwen3 80B (800K ctx)	~70GB	—
Gemma 4 26B (200K ctx)	~22GB	—
Total	~92.65GB	~28GB

92.65GB out of 121GB usable, with 28GB to spare. That’s enough headroom for the OS and other processes. MXFP4 quantization + q8_0 KV cache makes this density possible.
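The headroom figure is plain subtraction, but it is worth scripting when you are deciding whether a larger context or a third model would still fit. A minimal sketch using the numbers from the table (headroom_gb is an illustrative helper):

```shell
#!/bin/sh
# Usage: headroom_gb <usable_gb> <footprint_gb>...
headroom_gb() {
  usable=$1
  shift
  awk -v usable="$usable" 'BEGIN {
    used = 0
    for (i = 1; i < ARGC; i++) used += ARGV[i]
    printf "%.2f\n", usable - used
  }' "$@"
}

headroom_gb 121 92.65   # measured total from the table
headroom_gb 121 70 22   # rounded per-model figures
```

The first call prints 28.35 and the second 29.00; the small gap is just the rounding in the per-model rows.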

🌐 API Endpoints — OpenAI Compatible

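Both servers expose OpenAI-compatible endpoints such as /v1/chat/completions. No API key is required unless llama-server is started with --api-key. Because they bind to 0.0.0.0, anyone who can reach the machine on the network can use them.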

# Qwen3 Coder — coding / general
curl http://{DGX_IP}:8080/v1/\
chat/completions \
  -H "Content-Type: application/json"\
  -d '{
    "model": "Qwen3-Coder-Next",
    "messages": [{"role": "user",
      "content": "Hello"}]
  }'

# Gemma 4 — conversation / agents
curl http://{DGX_IP}:8081/v1/\
chat/completions \
  -H "Content-Type: application/json"\
  -d '{
    "model": "gemma-4-26B-A4B",
    "messages": [{"role": "user",
      "content": "Hello"}]
  }'

Any tool — OpenAI SDK, LangChain, Ollama client — connects immediately by changing base_url.
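For tools built on the official OpenAI SDKs, the base URL can usually be set through the OPENAI_BASE_URL environment variable instead of code changes. The sketch below wraps that in a small function; use_local_llm and the placeholder key are assumptions for illustration, and other tools may need their own config option.

```shell
#!/bin/sh
# Hypothetical helper: point OpenAI-SDK-based tools at one of the two servers.
# Usage: use_local_llm <host> <port>
use_local_llm() {
  export OPENAI_BASE_URL="http://$1:$2/v1"
  # The SDK expects a non-empty key; the server ignores it unless
  # it was started with --api-key.
  export OPENAI_API_KEY="${OPENAI_API_KEY:-sk-no-key-required}"
}

use_local_llm 127.0.0.1 8080   # Qwen3 Coder
echo "$OPENAI_BASE_URL"
```

Calling use_local_llm with port 8081 switches the same shell session over to the Gemma 4 server.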

📚 References

NVIDIA DGX Spark Official Page
Qwen3-Coder-Next Official Blog
Google Gemma 4 Announcement
llama.cpp DGX Spark Performance Discussion (GitHub)
llama.cpp MXFP4 Blackwell PR (NVIDIA Forums)
DGX Spark vs Mac Studio Efficiency Benchmark (Skorppio)
CES 2026 DGX Spark Software Update (NVIDIA Blog)
DGX Spark Price Increase (Tom’s Hardware)
EXO 1.0 — DGX Spark + Mac Studio Hybrid

✅ Summary

Running Qwen3-Coder-Next 80B (coding) and Gemma 4 26B (conversation) simultaneously on a single DGX Spark is fully viable. The key combination is MoE architecture’s low active parameter count + MXFP4 quantization + q8_0 KV cache. 92.65GB covers both models with 28GB left over.

43–57 tok/s local inference with zero API cost is a compelling setup for individual developers. $4,699 is a real commitment, but cloud API costs make the payback period shorter than it looks.
