OpenRouter prototype and public proof surface

Moral Restoration Benchmark

A benchmark and advice prototype for comparing plain LLM answers against hierarchy-aware moral reasoning that identifies stakeholders, ranks binding obligations, and keeps moral residue visible.

In one sentence: Moral Restoration compares hierarchy-aware relational reasoning against flat rubrics and plain LLM answers, then makes the answers, judge outputs, and score deltas inspectable.

How the prototype reasons

Root authority: Christian, secular, and Gert-style roots remain explicit where implemented.
Relational graph: Stakeholders and obligations are modeled as relationships, not as loose advice themes.
Constraint order: Binding duties are handled before softer tradeoffs and preference-level optimization.
Moral residue: Remaining costs and repair obligations stay visible in the final recommendation.
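The four properties above can be sketched as a minimal data model. This is an illustrative assumption only; the class names, fields, and ordering rule here are not the prototype's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Obligation:
    duty: str
    binding: bool      # binding duties are constraints, not weighted preferences
    owed_to: str       # the stakeholder this duty is owed to

@dataclass
class Recommendation:
    action: str
    residue: list[str]  # repair obligations that remain visible after the choice

@dataclass
class Scenario:
    root: str                  # e.g. "Christian", "secular", "Gert-style"
    stakeholders: list[str]    # named parties, not loose advice themes
    obligations: list[Obligation] = field(default_factory=list)

    def ordered(self) -> list[Obligation]:
        # Constraint order: binding duties come before softer tradeoffs.
        return sorted(self.obligations, key=lambda o: not o.binding)
```

The point of the sketch is only structural: the root authority, the stakeholder list, and the residue field are all explicit values a reviewer can inspect, rather than implicit assumptions buried in prose.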

What is supported today

The landing page is a map of the live evidence, not a victory lap. The current publish target is the 100-question DeepSeek/OpenRouter prototype payload; several research questions remain open.

100 prototype questions
6 comparison workflows
88.0 top MRB score
650 MoReBench rows normalized
Current test suite status: pass

Why hierarchy matters

The project’s core claim is narrower than “AI can solve morality.” It is that moral advice becomes more inspectable when relationship, authority, and repair obligations are represented explicitly.

Implemented

Stakeholders are named

MHF asks who is affected before it recommends an action. That makes hidden parties, such as spouses, children, congregants, employers, and vulnerable neighbors, part of the reasoning surface.

Implemented

Duties are ordered

The prototype treats some obligations as binding constraints rather than letting every consideration enter one flat average. This is the heart of the hierarchy claim.

Still being tested

Advice quality is not declared solved

The latest judged comparison is encouraging, but routing regressions and public showcase quality still need work before broader claims should be made.

Start with the evidence, then inspect the machinery

These are the primary pages for reviewing the project. Each page is part of the same public-proof cluster and links back to the evidence surface.

Datasets are evidence inputs, not decoration

The project separates generated prose from structural data: benchmark payloads, scenario files, and dataset pages remain inspectable.

Earlier experiments and current caveats

Older experiment pages remain available, but the live benchmark comparison is now the clearest current evidence surface.