The quick version: settings can matter as much as the model.
People often compare local models using wrapper defaults. That is convenient, but it can hide the knobs that decide whether your GPU is really doing the work. This benchmark used a deliberately simple workload: PP512 (processing a 512-token prompt) for prompt processing and TG128 (generating 128 tokens) for token generation.
One 35B model went from 49 tokens/sec to 1,096 tokens/sec while reading the prompt.
That was Qwen3.6 35B A3B Q4_K_XL. The same paired row also moved generation from about 22 tokens/sec to 61 tokens/sec. Same model class, same machine, much better use of the hardware.
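To put those numbers on a common footing, here is a minimal sketch of the arithmetic. It assumes "lift" means the relative gain over the default-style baseline; the function names are illustrative and not part of any tool.

```python
def speedup(before: float, after: float) -> float:
    """Return how many times faster `after` is than `before`."""
    if before <= 0:
        raise ValueError("baseline throughput must be positive")
    return after / before


def lift_percent(before: float, after: float) -> float:
    """Return the relative gain over the baseline, in percent."""
    return (speedup(before, after) - 1.0) * 100.0


# Figures quoted above for Qwen3.6 35B A3B Q4_K_XL (tokens/sec).
pp_default, pp_tuned = 49.0, 1096.0
tg_default, tg_tuned = 22.0, 61.0  # the 22 is approximate in the text

print(f"PP512: {speedup(pp_default, pp_tuned):.1f}x, "
      f"{lift_percent(pp_default, pp_tuned):.0f}% lift")
print(f"TG128: {speedup(tg_default, tg_tuned):.1f}x, "
      f"{lift_percent(tg_default, tg_tuned):.0f}% lift")
# PP512: 22.4x, 2137% lift
# TG128: 2.8x, 177% lift
```

Reporting both the multiplier and the percentage helps, because a four-digit percentage is easy to misread.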
This is not an official leaderboard. The “default-style” side represents a generic, untuned run: flash attention off, generic split mode, no model-specific cache choices, and no check that the hardware was actually being used well. That is exactly the real-world trap: if your tool hides the important flags, your benchmark may be measuring your configuration more than your model.
Prompt processing is where tuning looks ridiculous.
Prompt processing is the part where the model reads your prompt and context. If this is slow, every long chat, codebase question, RAG query, or pasted document feels sluggish before the model even starts writing.
The chart uses a log scale because some wins are so large that a normal scale would flatten the default-style bars into the floor.
Generation speed also changed in ways you can feel.
Token generation is the “typing speed” after the first token. This is what most people notice during chat. A 2x lift feels smoother. A 4x lift feels like a different machine. A 7x lift means the earlier result was never representative.
Selected before/after examples.
These are paired rows from the same hardware. “Default-style” is the generic pass. “Tuned” is the hardware-aware pass with GPU validation, flash attention, full offload, and model-specific cache/batch choices.
| Model | PP512 default (tokens/sec) | PP512 tuned (tokens/sec) | PP lift % | TG128 default (tokens/sec) | TG128 tuned (tokens/sec) | TG lift % |
|---|---|---|---|---|---|---|
The 35B-class results got a lot more interesting.
Once the run was tuned, several 35B-ish MoE models landed in a useful interactive range. The bars show the generic default-style pass beside the tuned pass, so each tuned short-decode speed can be read against its own baseline.
What changed?
There was no magic patch. The gains came from making llama.cpp match the machine and the model instead of trusting a generic command line.
-ngl 999: full GPU offload
GPU layers decide how much of the model runs on the GPU. If too much falls back to CPU, prefill can collapse. Full offload is the first thing to verify.
-fa 1: flash attention
Flash attention improves the memory and compute path for attention. On this setup, leaving it off made many models look far slower than they really are.
-sm row: split mode
Split mode controls how work is partitioned. Row split was the better general choice for the current Vulkan Strix Halo path in this sweep.
f16 vs q8_0 KV cache
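The KV cache stores the attention keys and values for every token already processed. A quantized cache such as q8_0 uses less memory than f16, which can help at long contexts, but the best choice is model-specific. Gemma liked f16; MedPsy and Qwen3.6 27B used q8_0 here.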
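For a rough sense of the memory at stake, here is a back-of-the-envelope sketch. It assumes a plain transformer where every layer keeps full-length keys and values; models with sliding-window or hybrid layers will differ. The model shape below is hypothetical and does not describe any model in this benchmark.

```python
# Storage cost per 32 cached values for two llama.cpp cache types.
# f16 uses 2 bytes per value (64 bytes per 32 values).
# q8_0 packs 32 int8 values plus one 2-byte scale (34 bytes per 32 values).
BYTES_PER_32_VALUES = {"f16": 64, "q8_0": 34}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type):
    """Estimate K+V cache size, assuming full attention in every layer."""
    values = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = keys + values
    return values * BYTES_PER_32_VALUES[cache_type] // 32


# Hypothetical model shape, for illustration only.
shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
for cache_type in ("f16", "q8_0"):
    size = kv_cache_bytes(cache_type=cache_type, **shape)
    print(f"{cache_type}: {size / 2**20:.0f} MiB")
# f16: 1024 MiB
# q8_0: 544 MiB
```

The saving is real, but it only says how much memory is freed. Whether throughput improves still has to be measured per model.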
-b and -ub: batch sizes
Batch and micro-batch affect how much work is fed to the GPU at once. Bigger is not always faster. The current 35B A3B default is -b 2048 with -ub 512.
Model profiles, not one global default
The best setting for a 4B model is not automatically the best setting for a 35B MoE. Treat each model family as its own profile.
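One way to keep per-family profiles is a small lookup table that expands into a command line. The sketch below assumes the benchmark tool is llama.cpp's llama-bench, that it is on your PATH, and that your build accepts -ctk and -ctv for the KV cache type (check llama-bench --help). The profile keys and the model path are hypothetical names.

```python
import shlex

# Settings the tuned runs in this article share.
BASE = {"-ngl": "999", "-fa": "1", "-sm": "row"}

# Per-family overrides. The values come from the text above;
# the profile keys themselves are made-up labels.
PROFILES = {
    "qwen3.6-35b-a3b": {"-b": "2048", "-ub": "512"},
    "gemma": {"-ctk": "f16", "-ctv": "f16"},
    "qwen3.6-27b": {"-ctk": "q8_0", "-ctv": "q8_0"},
}


def bench_command(model_path, profile):
    """Build a llama-bench argument list for a PP512 + TG128 run."""
    flags = {**BASE, **PROFILES[profile]}
    argv = ["llama-bench", "-m", model_path, "-p", "512", "-n", "128"]
    for flag, value in flags.items():
        argv += [flag, value]
    return argv


# "models/example.gguf" is a placeholder path.
print(shlex.join(bench_command("models/example.gguf", "qwen3.6-35b-a3b")))
# llama-bench -m models/example.gguf -p 512 -n 128 -ngl 999 -fa 1 -sm row -b 2048 -ub 512
```

Keeping shared flags separate from overrides makes it obvious which choices are model-specific and which are just the machine's baseline.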
Why wrapper defaults can mislead you.
Ollama and similar tools are great for getting started. The tradeoff is that they often abstract away the exact knobs that decide throughput. If your goal is “easy chat,” that is fine. If your goal is “what can my hardware really do,” you need the knobs.
Defaults optimize for broad compatibility.
Generic defaults try to boot on many machines, not squeeze every token out of a specific integrated GPU with shared memory. That is a good default product choice, but a bad final benchmark.
Tuning makes benchmarks reproducible.
Once you record the model, backend, cache type, flash attention, GPU layers, batch, micro-batch, context, and token count, your numbers become something you can learn from instead of vibes.
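A minimal sketch of such a record, written as one JSON line per run. The field names are illustrative. The cache type and context values below are placeholders, since the text does not report them for this row.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class BenchRecord:
    model: str
    backend: str
    cache_type: str
    flash_attention: bool
    gpu_layers: int
    batch: int
    micro_batch: int
    context: int
    test: str              # e.g. "pp512" or "tg128"
    tokens_per_sec: float


def to_jsonl(record: BenchRecord) -> str:
    """Serialize one run as a single JSON line, ready to append to a log."""
    return json.dumps(asdict(record), sort_keys=True)


record = BenchRecord(
    model="Qwen3.6 35B A3B Q4_K_XL",
    backend="vulkan",
    cache_type="f16",      # placeholder: not reported in the text
    flash_attention=True,
    gpu_layers=999,
    batch=2048,
    micro_batch=512,
    context=512,           # placeholder: not reported in the text
    test="pp512",
    tokens_per_sec=1096.0,
)
print(to_jsonl(record))
```

Appending one line per run to a plain file is usually enough. The point is that the number never travels without its settings.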
A practical tuning checklist.
If you are running local LLMs casually, you do not need to become a benchmark engineer. Start with these checks and you will avoid the biggest mistakes.
- Verify your GPU backend is actually visible before trusting any speed number.
- Run one short prompt-processing test and one token-generation test. They measure different bottlenecks.
- Try flash attention on and off. On modern hardware it is often a major win.
- Try full GPU offload, then confirm memory use and logs match what you expect.
- Test KV cache types per model family. Do not assume one global answer.
- Sweep micro-batch sizes. 512, 1024, and 2048 can behave very differently.
- Write the settings down with the result. Future you will thank you.
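The flash-attention and micro-batch checks above can be turned into a small sweep. This sketch only builds the command lines. It assumes llama-bench from llama.cpp, and the model path is a placeholder.

```python
from itertools import product

FLASH_ATTENTION = ("0", "1")
MICRO_BATCHES = ("512", "1024", "2048")


def sweep_commands(model_path, batch="2048"):
    """Yield one llama-bench argument list per (flash attention, micro-batch) pair."""
    for fa, ub in product(FLASH_ATTENTION, MICRO_BATCHES):
        yield [
            "llama-bench", "-m", model_path,
            "-p", "512", "-n", "128",   # PP512 and TG128
            "-ngl", "999",              # full GPU offload
            "-fa", fa, "-b", batch, "-ub", ub,
        ]


# "models/example.gguf" is a placeholder path.
for argv in sweep_commands("models/example.gguf"):
    print(" ".join(argv))
```

Each list can be passed to subprocess.run(argv, check=True) to execute it. llama-bench builds generally also accept comma-separated values for a flag (for example -ub 512,1024,2048), which may be simpler if you do not need per-run control. Confirm against your build's --help output.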