Fusion eval results
We're reproducing OpenRouter's Fusion DRACO eval in the open. It's the same class of routing experiment, run with public code, explicit model lists, and cost/quality tradeoffs you can actually measure.
Do we have comparable full-run numbers yet? No. The one full run we did finish used a holistic judge, and that doesn't match OpenRouter's DRACO scoring, so it's out. Showing it next to their numbers would be comparing two different things.
| Run | OpenRouter score | TrustedRouter score | Status |
|---|---|---|---|
| Solo Gemini 3 Flash | 43.1 | 29.35 (10-task smoke run) | Investigating |
| Solo Kimi K2.6 | 53.7 | Not enough completed rows | Investigating |
| Solo DeepSeek V4 Pro | 60.3 | Not run with exact scorer yet | Pending |
| Fusion budget panel | 64.7 | Not run with exact scorer yet | Pending |
The rules keep us cheap and keep us comparable. We run in micro-hybrid mode, which means the small public smoke run goes first, before we spend on any full pass. The judge is google/gemini-3.1-pro-preview. Scoring is DRACO criterion-level grading over three independent passes, normalized to 0-100. Search is Exa with the DRACO and rubric hostnames excluded and result-leakage checks turned on, so the judge can't just look up the answer. And the headline rule: the raw solo baselines have to replicate before we publish a single Fusion number. Fusion looking good means nothing if we can't first reproduce Gemini 3 Flash scoring 43.1 on its own.
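To make the scoring and the replication rule concrete, here is a minimal sketch in Python. It is not the harness code. It assumes each criterion is a pass/fail grade with a positive weight, that the three judge passes are simply averaged, and that "replicates" means landing within a fixed tolerance of the reference score. The function names, the rubric shape, and the 3-point tolerance are all hypothetical; the real DRACO rubric format and the replication threshold are not described here.

```python
from statistics import mean


def pass_score(grades: dict[str, bool], weights: dict[str, float]) -> float:
    """Score one judge pass: weighted share of criteria met, scaled to 0-100.

    Assumption: criteria are boolean and weights are positive.
    A criterion missing from `grades` counts as not met.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    earned = sum(w for name, w in weights.items() if grades.get(name, False))
    return 100.0 * earned / total


def task_score(passes: list[dict[str, bool]], weights: dict[str, float]) -> float:
    """Average the independent judge passes for one task (assumed aggregation)."""
    return mean(pass_score(p, weights) for p in passes)


def replicates(ours: float, reference: float, tolerance: float = 3.0) -> bool:
    """Gate: True only if our baseline lands within `tolerance` points.

    The 3-point default is a placeholder, not a published threshold.
    """
    return abs(ours - reference) <= tolerance


# Hypothetical rubric with three criteria and three judge passes.
weights = {"cites_source": 1.0, "correct_figure": 1.0, "answers_question": 2.0}
passes = [
    {"cites_source": True, "correct_figure": True, "answers_question": False},
    {"cites_source": True, "correct_figure": False, "answers_question": True},
    {"cites_source": True, "correct_figure": True, "answers_question": True},
]
print(task_score(passes, weights))  # 75.0

# The smoke-run number from the table does not clear this gate against 43.1.
print(replicates(29.35, 43.1))  # False
```

Under these assumptions the gate is a plain boolean check that runs before any Fusion number is computed, which is what keeps a failed baseline from being papered over by a good-looking panel score.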
The exact scorer and the leakage guard both live in the open-source harness, so you don't have to take any of this on trust. When the raw baselines replicate, those numbers will replace this table.
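For a sense of what the hostname exclusion in the leakage guard involves, here is a rough Python sketch. It is an illustration only, not the harness implementation: the function names and the blocked hostnames are placeholders, and it filters a list of result URLs after the fact rather than calling the Exa API. It assumes a result is excluded when its host equals a blocked host or is a subdomain of one.

```python
from urllib.parse import urlsplit


def is_excluded(url: str, blocked_hosts: set[str]) -> bool:
    """True if the URL's host is a blocked host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)


def filter_results(urls: list[str], blocked_hosts: set[str]) -> list[str]:
    """Drop search results that point at benchmark or rubric hosts."""
    return [u for u in urls if not is_excluded(u, blocked_hosts)]


# Placeholder hostnames; the real exclusion list lives in the harness.
blocked = {"benchmark.example", "rubric.example"}
results = [
    "https://rubric.example/task-7",
    "https://www.benchmark.example/answers",
    "https://notrubric.example/post",
    "https://example.org/background",
]
print(filter_results(results, blocked))
# ['https://notrubric.example/post', 'https://example.org/background']
```

Matching on the parsed hostname rather than on a substring of the URL keeps look-alike domains such as `notrubric.example` from being dropped by accident, while still catching subdomains of a blocked host.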