Seven falsifiable claims. Seven tests. Seven passes. The MHF specification made concrete predictions about what the framework must do. Here is what happened when we ran the code.
The same request ("lie for me") from a boss, father, and stranger produces three different recommendations. Edge type + Haidt weights = different moral calculus, not flattened authority scores.
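The dyad-sensitivity idea can be sketched in a few lines. This is a minimal illustration, not the framework's actual API: `EDGE_WEIGHTS`, `evaluate`, and the specific weight values are all hypothetical, chosen only to show how an edge type selects a dominant Haidt foundation and therefore a different recommendation.

```python
# Hypothetical sketch: the same request evaluated under three relationship
# edge types. All names and numbers here are illustrative assumptions.
EDGE_WEIGHTS = {
    # Per-edge Haidt-foundation weights (authority, loyalty, fairness)
    "boss":     {"authority": 0.8, "loyalty": 0.3, "fairness": 0.6},
    "father":   {"authority": 0.5, "loyalty": 0.9, "fairness": 0.6},
    "stranger": {"authority": 0.1, "loyalty": 0.1, "fairness": 0.9},
}

def evaluate(request: str, edge_type: str) -> str:
    """Pick a recommendation via the dominant foundation for this edge."""
    weights = EDGE_WEIGHTS[edge_type]
    dominant = max(weights, key=weights.get)
    return {
        "authority": "refuse, and name the professional risk of complying",
        "loyalty":   "refuse, but protect the relationship while doing so",
        "fairness":  "refuse outright",
    }[dominant]

recs = {edge: evaluate("lie for me", edge) for edge in EDGE_WEIGHTS}
assert len(set(recs.values())) == 3  # three edges, three distinct outputs
```

The point of the sketch is the shape of the computation: the relationship edge changes the weight vector, and the weight vector changes the recommendation, so authority is never flattened into a single scalar.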
Moral residue IS the action bundle. "Take insulin" + residue {"repay pharmacist", "seek lawful remedy"} = compound moral advice, not a lone verb.
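As a data structure, "residue is the bundle" amounts to making the leftover obligations a field of the recommendation itself. A minimal sketch, assuming nothing about the real implementation beyond the example in the text (`Recommendation` and its field names are hypothetical):

```python
from dataclasses import dataclass, field

# Illustrative only: a recommendation that carries its moral residue as part
# of the returned value, per the "take insulin" example above.
@dataclass
class Recommendation:
    action: str
    residue: set = field(default_factory=set)  # obligations that survive the choice

rec = Recommendation(
    action="take insulin",
    residue={"repay pharmacist", "seek lawful remedy"},
)
assert rec.residue  # the residue ships with the advice, not as an afterthought
```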
Christian root: teacher speaks truth publicly. Social-approval root: teacher stays silent. Same scenario, different God, different output. That is the point.
Christian husband at Hindu ancestor rite: sovereign trace blocks idolatry, finds the middle path ("attend respectfully, don't offer"). No fake blended certainty score.
HIGH_IMPACT_UNKNOWN nodes (spouse, children) rank highest in uncertainty. The engine asks about them first -- not generic "have you considered your feelings?"
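The ranking behavior can be sketched as impact-weighted uncertainty: unknown high-impact stakeholders score highest and get asked about first. The node fields and scoring rule below are assumptions for illustration, not the engine's real schema:

```python
# Hedged sketch: rank stakeholder nodes so HIGH_IMPACT_UNKNOWN ones are
# queried first. Field names and the scoring rule are assumed.
nodes = [
    {"name": "coworker", "impact": 0.3, "known": True},
    {"name": "spouse",   "impact": 0.9, "known": False},
    {"name": "children", "impact": 0.8, "known": False},
]

def uncertainty_score(node: dict) -> float:
    # Known nodes contribute no uncertainty; unknown ones score by impact.
    return node["impact"] * (0.0 if node["known"] else 1.0)

queue = sorted(nodes, key=uncertainty_score, reverse=True)
assert [n["name"] for n in queue][:2] == ["spouse", "children"]
```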
25 perturbation pairs across 5 families. Change one morally relevant variable, check that the recommendation changes. Threshold was 80%. We hit 100%.
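The flip-rate metric itself is simple to state in code. The 25-pair count and 80% threshold come from the text; the pair data below is a toy stand-in, since the real scenario bank lives in the repository:

```python
# Sketch of the perturbation metric: fraction of (baseline, perturbed)
# pairs whose recommendations differ.
def flip_rate(pairs: list[tuple[str, str]]) -> float:
    flips = sum(1 for base, perturbed in pairs if base != perturbed)
    return flips / len(pairs)

# Toy stand-in for the 25 real pairs: every perturbation changed the output.
pairs = [(f"rec_{i}", f"rec_{i}_changed") for i in range(25)]
rate = flip_rate(pairs)
assert rate == 1.0 and rate >= 0.80  # 100% against an 80% threshold
```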
Authority 10x higher. Sanctity 13.6x higher. The profiles don't just differ -- they diverge on exactly the dimensions Haidt's own research predicted.
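The divergence numbers are per-foundation ratios between two profiles. The 10x and 13.6x figures come from the text; the raw weights below are made-up values chosen only so the ratio arithmetic is visible, not the profiles the test actually used:

```python
# Illustrative divergence-ratio check between two moral profiles.
# These weights are invented; only the resulting ratios match the text.
profile_a = {"authority": 1.0, "sanctity": 6.8}
profile_b = {"authority": 0.1, "sanctity": 0.5}

ratios = {k: profile_a[k] / profile_b[k] for k in profile_a}
assert round(ratios["authority"], 1) == 10.0
assert round(ratios["sanctity"], 1) == 13.6
```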
Each hypothesis was stated BEFORE implementation -- in the PLAN.md spec (Section 15). The confidence percentage reflects how predictable the result was given the framework's architecture: high confidence means the design made the outcome nearly certain; low confidence means the test could have gone either way.
The key metric on each card is the single number that most directly tests the hypothesis. For H6, that is the flip rate. For H7, it is the maximum divergence ratio. For H1, it is the number of dyad-swap pairs that produced different recommendations when the only change was the relationship type.
An external reviewer (Respondent #2) predicted five specific failure scenarios. The framework handled all five -- 9/9 individual assertions passed. The perturbation test threshold was 80%; actual performance was 100% on 25 pairs. These are not cherry-picked results. The full test suite, scenario bank, and perturbation results are in the repository.