
Fusion eval results

2026-06-14 · OpenRouter Fusion announcement

We're reproducing OpenRouter's Fusion DRACO eval in the open. It's the same class of routing experiment, run with public code, explicit model lists, and cost/quality tradeoffs you can actually measure.

Do we have comparable full-run numbers yet? No. The one full run we did finish used a holistic judge, and that doesn't match OpenRouter's DRACO scoring, so it's out. Showing it next to their numbers would be comparing two different things.

| Run | OpenRouter score | TrustedRouter score | Status |
| --- | --- | --- | --- |
| Solo Gemini 3 Flash | 43.1 | 29.35 on 10-task smoke | Investigating |
| Solo Kimi K2.6 | 53.7 | Not enough completed rows | Investigating |
| Solo DeepSeek V4 Pro | 60.3 | Not run with exact scorer yet | Pending |
| Fusion budget panel | 64.7 | Not run with exact scorer yet | Pending |

The rules keep us cheap and keep us comparable. We run in micro-hybrid mode, which means a small public smoke run goes first, before we spend on any full pass. The judge is google/gemini-3.1-pro-preview. Scoring is DRACO criterion-level grading, three independent passes, normalized 0-100. Search is Exa with the DRACO and rubric hostnames excluded and result-leakage checks turned on, so the judge can't just look up the answer. And the headline rule: the raw solo baselines have to replicate before we publish a single Fusion number. Fusion looking good means nothing if we can't first reproduce Gemini 3 Flash scoring 43.1 on its own.
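To make the scoring and the replication rule concrete, here is a minimal sketch, not the harness's actual code. It assumes each task has weighted pass/fail criteria, that a pass's score is the weighted share of criteria met scaled to 0-100, and that the three passes are averaged. The criterion names, the weights, and the 2-point replication tolerance are all hypothetical; the exact DRACO formula may differ.

```python
from statistics import mean


def pass_score(verdicts, weights):
    """Normalize one grading pass to 0-100 (assumed: weighted share of criteria met)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    earned = sum(weights[name] for name, met in verdicts.items() if met)
    return 100.0 * earned / total


def task_score(passes, weights):
    """Average the independent judge passes for one task."""
    return mean(pass_score(verdicts, weights) for verdicts in passes)


def replicates(ours, published, tolerance=2.0):
    """Gate: True only if our baseline lands within `tolerance` points (hypothetical threshold)."""
    return abs(ours - published) <= tolerance


# Hypothetical rubric for one task, graded three times.
weights = {"answers_question": 3.0, "cites_primary_source": 2.0, "no_unsupported_claims": 1.0}
passes = [
    {"answers_question": True, "cites_primary_source": True, "no_unsupported_claims": False},
    {"answers_question": True, "cites_primary_source": True, "no_unsupported_claims": True},
    {"answers_question": True, "cites_primary_source": False, "no_unsupported_claims": True},
]

score = task_score(passes, weights)
print(f"task score: {score:.1f}")
print("Gemini 3 Flash baseline replicated:", replicates(29.35, 43.1))
```

Under these assumptions the three passes score roughly 83.3, 100, and 66.7, so the task averages about 83.3. The gate compares the 10-task smoke figure against the published 43.1 only for illustration; a smoke run is not a full pass.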

The exact scorer and the leakage guard both live in the open-source harness, so none of this is a claim you have to trust. When the raw baselines replicate, those numbers replace this table.
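As a rough illustration of what a hostname-based leakage guard can look like, the sketch below drops search results whose host matches a blocklist and reports what it dropped. It is an assumption about the general technique, not the harness's implementation; the blocked hostnames are placeholders, and the real list lives in the open-source harness.

```python
from urllib.parse import urlsplit

# Placeholder hostnames; the real exclusion list is defined by the harness.
BLOCKED_HOSTS = {"draco.example", "rubrics.example"}


def is_blocked(url, blocked=BLOCKED_HOSTS):
    """True if the URL's host is a blocked host or a subdomain of one."""
    # URLs without a scheme have no parsed hostname and are treated as not blocked.
    host = (urlsplit(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blocked)


def filter_results(urls, blocked=BLOCKED_HOSTS):
    """Split search results into (kept, leaked) so leaks can be logged, not just hidden."""
    kept = [u for u in urls if not is_blocked(u, blocked)]
    leaked = [u for u in urls if is_blocked(u, blocked)]
    return kept, leaked


results = [
    "https://news.example.org/article",
    "https://data.draco.example/tasks/17",
    "https://notdraco.example/post",
]
kept, leaked = filter_results(results)
print("kept:", kept)
print("leaked:", leaked)
```

Matching on the parsed hostname rather than on a substring of the URL keeps lookalike domains such as notdraco.example from being dropped by mistake, while still catching subdomains of a blocked host.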

