Fusion eval results

2026-06-14 · OpenRouter Fusion announcement

We're reproducing OpenRouter's Fusion DRACO eval in the open. It's the same class of routing experiment, run with public code, explicit model lists, and cost/quality tradeoffs you can actually measure.

Do we have comparable full-run numbers yet? No. The one full run we did finish used a holistic judge, and that doesn't match OpenRouter's DRACO scoring, so it's out. Showing it next to their numbers would be comparing two different things.

Run	OpenRouter score	TrustedRouter score	Status
Solo Gemini 3 Flash	43.1	29.35 on 10-task smoke	Investigating
Solo Kimi K2.6	53.7	Not enough completed rows	Investigating
Solo DeepSeek V4 Pro	60.3	Not run with exact scorer yet	Pending
Fusion budget panel	64.7	Not run with exact scorer yet	Pending

The rules keep us cheap and keep us comparable. We run in micro-hybrid mode: the small public smoke runs come first, and only then do we spend on a full pass. The judge is google/gemini-3.1-pro-preview. Scoring is DRACO criterion-level grading over three independent passes, normalized to 0-100. Search is Exa, with the DRACO and rubric hostnames excluded and result-leakage checks turned on, so the judge can't just look up the answer. And the headline rule: the raw solo baselines have to replicate before we publish a single Fusion number. Fusion looking good means nothing if we can't first reproduce Gemini 3 Flash scoring 43.1 on its own.
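To make "criterion-level grading, three independent passes, normalized 0-100" concrete, here is a minimal sketch of how such a score could be aggregated. It is not the harness's scorer: the function names, the positive per-criterion weights, the met/unmet verdicts, and the choice to average the three passes are all assumptions made for illustration, and DRACO's actual rubric format may differ.

```python
from statistics import mean


def pass_score(verdicts, weights):
    """Score one judge pass as the weighted share of criteria met, on 0-100.

    Assumption: each criterion has a positive weight and a met/unmet verdict.
    A criterion missing from `verdicts` is counted as unmet.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("criterion weights must sum to a positive number")
    earned = sum(w for name, w in weights.items() if verdicts.get(name, False))
    return 100.0 * earned / total


def task_score(passes, weights):
    """Average the independent judge passes for one task (assumed aggregation)."""
    return mean(pass_score(verdicts, weights) for verdicts in passes)


# Hypothetical rubric with three criteria and three judge passes.
weights = {"cites_sources": 2, "covers_scope": 1, "no_contradictions": 1}
passes = [
    {"cites_sources": True, "covers_scope": True, "no_contradictions": False},
    {"cites_sources": True, "covers_scope": False, "no_contradictions": False},
    {"cites_sources": True, "covers_scope": True, "no_contradictions": True},
]
score = task_score(passes, weights)
```

Averaging passes is only one option. A median or a majority vote per criterion would damp a single outlier pass, and which of these the exact scorer uses is something to read from the harness rather than from this sketch.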

The exact scorer and the leakage guard both live in the open-source harness, so none of this is a claim you have to trust. Once the raw baselines replicate, the replicated numbers will replace this table.
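The replication rule can be read as a simple gate. The sketch below is illustrative only: the reference scores come from the table above, but the function name and the 2-point tolerance are hypothetical, since this page does not say how close a baseline has to be to count as replicated.

```python
# OpenRouter's published solo scores, copied from the table above.
REFERENCE = {
    "Solo Gemini 3 Flash": 43.1,
    "Solo Kimi K2.6": 53.7,
    "Solo DeepSeek V4 Pro": 60.3,
}


def baselines_replicate(ours, reference=REFERENCE, tolerance=2.0):
    """Return True only if every reference baseline has a score within tolerance.

    `tolerance` is a hypothetical threshold, not a value from the harness.
    A baseline with no completed score blocks publication.
    """
    for name, target in reference.items():
        score = ours.get(name)
        if score is None or abs(score - target) > tolerance:
            return False
    return True


# Only one baseline has any score so far, and it is far from the reference.
current = {"Solo Gemini 3 Flash": 29.35}
can_publish_fusion = baselines_replicate(current)
```

With the table as it stands the gate stays closed: the 29.35 comes from a 10-task smoke run rather than a full pass, and the other two baselines have no score yet.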