Answer evidence browser

Browse every answer behind the judged comparison.

Search the full published benchmark set, scan per-question scores, then open any row to compare MHF and baseline answer bodies side by side.

What this page shows: The full benchmark question set with MHF and plain-LLM answers, judge notes, score deltas, and workflow-level results.
How to use it: Filter or search, click a score-table row, then switch workflows to compare the answers that produced each score.
Published scope: The OpenRouter prototype exposes all 100 question rows. Older payloads fall back to their featured published bodies.

Question score table

Click any row to inspect the answers

Rows follow the current search and topic filters.

Question | Source | MHF full | Raw | Delta | Spread