OpenRouter prototype and public proof surface

Moral Restoration Benchmark

A benchmark and advice prototype for comparing plain LLM answers against hierarchy-aware moral reasoning that identifies stakeholders, ranks binding obligations, and keeps moral residue visible.

In one sentence: Moral Restoration compares hierarchy-aware relational reasoning against flat rubrics and plain LLM answers, then makes the answers, judge outputs, and score deltas inspectable.

How the prototype reasons

Root authority: Christian, secular, and Gert-style roots remain explicit where implemented.
Relational graph: Stakeholders and obligations are modeled as relationships, not as loose advice themes.
Constraint order: Binding duties are handled before softer tradeoffs and preference-level optimization.
Moral residue: Remaining costs and repair obligations stay visible in the final recommendation.
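The four properties above can be sketched as a minimal data model. This is an illustrative assumption only; the class names, fields, and ordering rule here are not the prototype's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class Obligation:
    duty: str
    binding: bool      # binding duties are constraints, not weighted preferences
    owed_to: str       # the stakeholder this duty is owed to

@dataclass
class Recommendation:
    action: str
    residue: list[str]  # repair obligations that remain visible after the choice

@dataclass
class Scenario:
    root: str                  # e.g. "Christian", "secular", "Gert-style"
    stakeholders: list[str]    # named parties, not loose advice themes
    obligations: list[Obligation] = field(default_factory=list)

    def ordered(self) -> list[Obligation]:
        # Constraint order: binding duties come before softer tradeoffs.
        return sorted(self.obligations, key=lambda o: not o.binding)
```

The point of the sketch is only structural: the root authority, the stakeholder list, and the residue field are all explicit values a reviewer can inspect, rather than implicit assumptions buried in prose.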

What is supported today

The landing page is a map of the live evidence, not a victory lap. The current publish target is the 100-question DeepSeek/OpenRouter prototype payload; several research questions remain open.

100 prototype questions
6 comparison workflows
88.0 top MRB score
650 MoReBench rows normalized
Current test suite status: pass

Why hierarchy matters

The project’s core claim is narrower than “AI can solve morality.” It is that moral advice becomes more inspectable when relationship, authority, and repair obligations are represented explicitly.

Implemented

Stakeholders are named

MHF asks who is affected before it recommends an action. That makes hidden parties, such as spouses, children, congregants, employers, and vulnerable neighbors, part of the reasoning surface.

Implemented

Duties are ordered

The prototype treats some obligations as binding constraints rather than letting every consideration enter one flat average. This is the heart of the hierarchy claim.

Still being tested

Advice quality is not declared solved

The latest judged comparison is encouraging, but routing regressions and public showcase quality still need work before broader claims should be made.

Start with the evidence, then inspect the machinery

These are the primary pages for reviewing the project. Each page is part of the same public-proof cluster and links back to the evidence surface.

Datasets are evidence inputs, not decoration

The project separates generated prose from structural data: benchmark payloads, scenario files, and dataset pages remain inspectable.

Earlier experiments and current caveats

Older experiment pages remain available, but the live benchmark comparison is now the clearest current evidence surface.