
Benchmarks

We run benchmarks to understand the real-world performance of different models and swarm configurations. Every benchmark is reproducible — you can run it yourself with the provided YAML definition.


Task: “Investigate the current state of hydrogen energy in Norway.”

Methodology: Each configuration was run 3 times. Quality and Norwegian language scores are averaged from 3 independent human evaluations (blind).
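The averaging step can be sketched as follows. This is a minimal illustration, assuming each of the 3 runs was scored by 3 evaluators (the text above does not say whether the 3 evaluations are per run or per configuration). The scores are hypothetical placeholders, not benchmark data, and `average_score` is an illustrative helper name.

```python
from statistics import mean

# Hypothetical placeholder scores, NOT real benchmark data:
# one inner list per run, one score per evaluator (1-5 scale).
quality_scores = [
    [4, 5, 4],  # run 1
    [4, 4, 5],  # run 2
    [5, 4, 4],  # run 3
]


def average_score(scores_by_run):
    """Average evaluator scores within each run, then average across runs."""
    per_run = [mean(run) for run in scores_by_run]
    return round(mean(per_run), 1)


print(average_score(quality_scores))
```

Averaging within each run first keeps every run equally weighted, which matters only if a run ends up with a different number of evaluations.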

| Configuration              | Time | Cost  | Quality* | Norwegian** |
|----------------------------|------|-------|----------|-------------|
| DeepSeek-V3 (all agents)   | 22s  | $0.07 | 4.2/5    | 4.5/5       |
| Claude Sonnet (all agents) | 18s  | $0.82 | 4.5/5    | 4.8/5       |
| GPT-4o (all agents)        | 25s  | $1.34 | 4.3/5    | 3.8/5       |
| Mixed (auto-route)         | 20s  | $0.11 | 4.3/5    | 4.4/5       |
| GPT-4o-mini (all)          | 14s  | $0.01 | 2.8/5    | 2.1/5       |

*Quality: factual accuracy, structure, completeness, and usefulness, scored 1-5 by human evaluators.
**Norwegian: grammar, naturalness, æ/ø/å accuracy, and idiomatic expression, scored 1-5 by native speakers.

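  • DeepSeek-V3 is the value leader: quality within 0.3 points of Claude Sonnet (4.2 vs. 4.5) for about 9% of the cost ($0.07 vs. $0.82).
  • Claude Sonnet is the quality leader, especially for Norwegian (4.8/5). Worth it for high-stakes output.
  • GPT-4o-mini alone is not suitable for research tasks: it scored lowest on both quality (2.8/5) and Norwegian (2.1/5), and its hallucination rate is significantly higher.
  • Mixed routing (auto-select model per agent) gives about 96% of Claude Sonnet's quality score (4.3 vs. 4.5) for ~13% of the all-premium cost ($0.11 vs. $0.82).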
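The percentages in these findings can be recomputed from the results table. The sketch below copies the cost and quality columns and expresses each configuration relative to a baseline. Using Claude Sonnet as the "premium" baseline is an assumption, since the text does not name one, and `relative_to` is an illustrative helper name.

```python
# Figures copied from the results table: (cost in USD, quality out of 5).
results = {
    "DeepSeek-V3": (0.07, 4.2),
    "Claude Sonnet": (0.82, 4.5),
    "GPT-4o": (1.34, 4.3),
    "Mixed (auto-route)": (0.11, 4.3),
    "GPT-4o-mini": (0.01, 2.8),
}


def relative_to(baseline, results):
    """Return (cost %, quality %) of each configuration versus the baseline."""
    base_cost, base_quality = results[baseline]
    return {
        name: (
            round(100 * cost / base_cost, 1),
            round(100 * quality / base_quality, 1),
        )
        for name, (cost, quality) in results.items()
    }


# Assumption: Claude Sonnet is the "premium" reference point.
for name, (cost_pct, quality_pct) in relative_to("Claude Sonnet", results).items():
    print(f"{name}: {cost_pct}% of cost, {quality_pct}% of quality")
```

With Claude Sonnet as the baseline, DeepSeek-V3 comes out at 8.5% of the cost and 93.3% of the quality score, and mixed routing at 13.4% and 95.6%.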
To reproduce the DeepSeek-V3 run from a terminal:

sverm run research \
--model deepseek-v3 \
"Investigate the current state of hydrogen energy in Norway"

Full YAML definition and raw data on GitHub (coming soon).


Task: Review example PR — a 200-line Python change.

Data collection in progress. Results expected May 2026.


We plan to benchmark:

  • Debate quality (different model combinations)
  • Creative writing (Norwegian and English)
  • Cost efficiency (tokens per useful output word)
  • Scaling behavior (2, 5, 10, 20 parallel agents)
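As an illustration of the cost-efficiency metric, here is a minimal sketch of "tokens per useful output word". How "useful" words will be counted is not defined above, so the function takes that count as an input. The function name and the example numbers are hypothetical, not benchmark results.

```python
def tokens_per_useful_word(total_tokens: int, useful_words: int) -> float:
    """Tokens consumed by the whole swarm per output word judged useful.

    Lower is better. How useful words are counted is left to the caller,
    since the benchmark definition is not published yet.
    """
    if useful_words <= 0:
        raise ValueError("useful_words must be positive")
    return total_tokens / useful_words


# Hypothetical numbers for illustration only.
print(tokens_per_useful_word(total_tokens=12_000, useful_words=800))  # 15.0
```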

Methodology note: All benchmarks use the same swarm YAML definitions. Human evaluators are blind to which model produced which output. Raw data is published alongside results. If you find a flaw in our methodology, let us know — we’ll fix it and credit you.