Your Local Model Settings Matter More Than You Think
How tuned llama.cpp settings changed prompt processing and generation speed on the same Strix Halo machine, with paired before/after bars and percent lifts.
Crown Citadel Performance Lab
Model throughput, memory use, long-context behavior, serving latency, speculative decoding, and task quality measured on the Crown Citadel Ryzen AI Max+ 395 lab system.
Articles about running AI locally: hardware, settings, model behavior, tooling, and lessons from the lab.
GPU memory on the x-axis, standalone decode speed on the y-axis, and the selected context speed shown as color.
Context-side prompt processing speed on the x-axis and standalone decode speed on the y-axis, grouped by model family.
Only models with at least four measured context points and at least one result at 64k context or longer are drawn, so each line reflects measured scaling behavior rather than a one-off test.
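The inclusion rule above can be sketched as a small filter. This is an illustrative sketch only, not the dashboard's actual code: `ContextPoint`, `is_plottable`, and the reading of "64k" as 64,000 tokens or more are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical thresholds taken from the caption; "64k" is assumed to mean
# 64,000 tokens or more, so a 65,536-token run also qualifies.
MIN_POINTS = 4
LONG_CONTEXT_TOKENS = 64_000


@dataclass
class ContextPoint:
    context_tokens: int
    tokens_per_second: Optional[float]  # None means the run was not measured


def is_plottable(points: List[ContextPoint]) -> bool:
    """Return True when a model has enough measured points to draw a line."""
    measured = [p for p in points if p.tokens_per_second is not None]
    has_long_context = any(
        p.context_tokens >= LONG_CONTEXT_TOKENS for p in measured
    )
    return len(measured) >= MIN_POINTS and has_long_context
```

A model with four measured points that all sit below 64k is filtered out, as is a model with a single 128k result and nothing else.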
Full benchmark matrix. Context columns show prompt-processing or combined prompt+generation throughput, while standalone decode (token generation) is kept separate in the TG column.
| Profile | Backend | KV / Shape | 4k | 8k | 16k | 32k | 64k | 80k | 100k | 128k | TG | Memory | Measurements |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Speculative serving sweeps: draft settings, acceptance rate, request time, decode speed, and memory deltas.
| Run | Config | Total | Prompt | Decode | Acceptance | Memory |
|---|---|---|---|---|---|---|
End-to-end local serving measurements: first-token delay, total request time, prompt size, generated output, and sampled memory.
| Run | First token | Total | Prompt | Output | Memory |
|---|---|---|---|---|---|
These tests run Hermes-style support tasks through candidate auxiliary models. They measure whether an aux model can reliably handle agent side work, while also tracking first-token latency, total latency, score, pass rate, and coverage. Dot size shows how many eval rows support the point.
| Candidate | Pass | Score | Total | TTFP | Coverage |
|---|---|---|---|---|---|
Loadouts test the Hermes main model and auxiliary model together as a routing plan. They measure more than throughput because an agent loadout has to be correct, responsive at p95, and small enough to stay loaded beside the main model without crowding memory.
| Loadout | Main | Aux | P95 | VRAM |
|---|---|---|---|---|
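The three loadout criteria (correct, responsive at p95, and small enough to fit beside the main model) can be expressed as one check. The sketch below is an assumption-laden illustration, not the lab's harness: the function names, the nearest-rank percentile method, and both budget parameters are hypothetical.

```python
from typing import List


def p95_nearest_rank(latencies_s: List[float]) -> float:
    """95th percentile using the nearest-rank method (one common definition)."""
    if not latencies_s:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_s)
    # Integer ceiling of 0.95 * n, avoiding floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def loadout_ok(
    all_tasks_correct: bool,
    latencies_s: List[float],
    main_gib: float,
    aux_gib: float,
    p95_budget_s: float,       # hypothetical latency budget
    memory_budget_gib: float,  # hypothetical memory budget
) -> bool:
    """A loadout passes only if it meets all three criteria at once."""
    responsive = p95_nearest_rank(latencies_s) <= p95_budget_s
    fits = main_gib + aux_gib <= memory_budget_gib
    return all_tasks_correct and responsive and fits
```

Treating the criteria as a conjunction reflects the point made above: a fast loadout that gives wrong answers, or a correct one that crowds memory, still fails.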
Each card explains what the race tests before you open it. These are focused head-to-head pages for speculative decoding, long-output behavior, task correctness, run health, and memory pressure.
Code review security tasks from the Hermes auxiliary model suite. Rows invalidated by harness, launch, or config problems are excluded from model scoring and shown as excluded coverage.
Ranked by adjusted pass rate, then score, then latency. Coverage shows scored code-review tasks separately from harness/config rows.
| Candidate | Pass | Score | Total | Coverage |
|---|---|---|---|---|
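The ranking order described for this table (adjusted pass rate, then score, then latency) can be sketched as a multi-key sort. This is a minimal illustration under stated assumptions: the `Candidate` fields are hypothetical, "adjusted" is taken to mean that excluded harness/config rows are left out of the denominator, and lower latency is assumed to rank higher.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    passed: int             # scored tasks that passed
    scored: int             # tasks that count toward model scoring
    excluded: int           # harness/config rows, reported as coverage only
    score: float
    total_latency_s: float


def adjusted_pass_rate(c: Candidate) -> float:
    """Pass rate over scored tasks only; excluded rows do not count."""
    return c.passed / c.scored if c.scored else 0.0


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Sort by pass rate (high first), score (high first), latency (low first)."""
    return sorted(
        candidates,
        key=lambda c: (-adjusted_pass_rate(c), -c.score, c.total_latency_s),
    )
```

Under this reading, a candidate with 8 of 8 scored tasks passed and 2 excluded rows outranks one with 9 of 10 passed and no exclusions, which is why coverage is shown alongside the pass rate.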
How to read the dashboard without knowing the benchmark harness internals.