Dataset Evidence

Reddit AITA as a large public record of everyday moral judgment

270,000 real people asked 15 million strangers to judge their moral dilemmas. The result is a chaotic, biased, useful map of how a large Reddit community reasons about everyday conflict.

What AITA can and cannot support

AITA is useful as a descriptive baseline for secular community norms. It is not treated as moral ground truth, and it does not provide the stakeholder graph MHF needs without additional extraction.

Scale: 270K posts, 96K Social Chemistry entries, and 175K entries used for secular weights.

The page preserves those counts as the evidence base for norm and weight calibration.

MHF use: Calibrate a descriptive secular parameterization, then audit where flat verdicts hide structure.

AITA verdicts help set a baseline, while MHF adds stakeholder, obligation, and residue analysis.

Limits: Selection bias, Reddit demographics, and missing relational structure remain central caveats.

The dataset records community judgments; it does not by itself identify every affected party.

Crowdsourced morality at massive scale

Reddit's r/AmItheAsshole is a subreddit where people post real moral dilemmas -- from "AITA for not attending my sister's wedding?" to "AITA for firing my best friend?" -- and the community votes. The four verdicts: YTA (You're the Asshole), NTA (Not the Asshole), ESH (Everyone Sucks Here), and NAH (No Assholes Here).

This is not a toy dataset. It is 270,000 posts, each with dozens to thousands of comments providing moral reasoning, counterarguments, and contextual questions. It is one of the strongest public records of how ordinary people reason about everyday ethics at scale.

  • 270K total posts
  • 15.6% of posters self-disclose age + gender
  • 96K Social Chemistry entries drawn from AITA
  • 175K entries used for secular weights

How the community judges

The distribution tells a story. AITA skews heavily toward NTA -- the community often validates the poster. ESH is rare, and NAH is even rarer. This bias is itself data: it tells us where the "Overton window" of acceptable behavior sits for this population.

AITA Verdict Distribution (estimated from corpus)

  • NTA: ~53%
  • YTA: ~27%
  • ESH: ~13%
  • NAH: ~7%
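The distribution arithmetic above can be reproduced directly. The verdict counts below are hypothetical, chosen only to match the approximate shares cited for a 270K-post corpus; they are not the actual corpus tallies.

```python
from collections import Counter

# Hypothetical verdict counts for a 270K-post corpus, chosen to match
# the approximate distribution cited above (NTA ~53%, YTA ~27%, ...).
counts = Counter({"NTA": 143_100, "YTA": 72_900, "ESH": 35_100, "NAH": 18_900})

total = sum(counts.values())  # 270,000
distribution = {verdict: n / total for verdict, n in counts.items()}

for verdict, share in sorted(distribution.items(), key=lambda kv: -kv[1]):
    print(f"{verdict}: {share:.0%}")
```

Sorting by share makes the skew obvious: NTA alone accounts for more than half of all verdicts.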

How 96K AITA posts became moral building blocks

The Social Chemistry 101 project (Forbes et al., 2020) drew 96,000 entries directly from AITA posts. Crowdworkers on Amazon Mechanical Turk converted each post into "rules-of-thumb" (RoTs) -- moral judgments like "It's rude to not RSVP to a wedding" or "You shouldn't lie to protect someone's feelings." Each RoT was labeled with Haidt moral foundations, an agreement level, and a cultural-pressure rating.

From AITA Post to Moral Weight

1. AITA Post. A real scenario posted to r/AmItheAsshole with a community verdict. (270K posts, 2013--2023)

2. Social Chemistry 101 Extraction. Crowdworkers write rules-of-thumb and label them with Haidt moral foundations, agreement level, and cultural pressure; 96K entries come from AITA specifically. (356K total RoTs across all sources)

3. Commonsense Norm Bank. 1.7 million moral judgments (good / discretionary / bad) across 6 complexity levels; establishes the "Overton window" of cultural norms. (175K entries used for MHF weight calibration)

4. MHF Secular Weight Extraction. We compute Haidt profile vectors and relationship base weights from the Social Chemistry labels plus Norm Bank agreement rates; this becomes the secular parameterization. (Care: 0.47 / Fairness: 0.18 / Loyalty: 0.19 / Authority: 0.09 / Sanctity: 0.07)

5. Moral Hierarchy Graph. Secular baseline weights parameterize the graph template: the root node is Social Consensus, and relationship edges carry Haidt-space weight vectors. (data/secular/weights.json)
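The weight-extraction step can be sketched as an aggregation over labeled RoTs: sum foundation labels weighted by annotator agreement, then normalize to a unit-sum profile vector. This is a minimal illustration, not the MHF pipeline itself; the record layout, field names, and the three sample RoTs are hypothetical.

```python
# Illustrative sketch: aggregate Haidt foundation labels over rules-of-thumb
# (RoTs), weight each RoT by its annotator agreement rate, and normalize to a
# unit-sum Haidt profile vector. RoT records and field names are hypothetical.
FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]

rots = [
    {"foundations": ["care"], "agreement": 0.9},
    {"foundations": ["care", "loyalty"], "agreement": 0.8},
    {"foundations": ["fairness"], "agreement": 0.7},
]

def haidt_profile(rots):
    """Return an agreement-weighted, normalized foundation vector."""
    totals = dict.fromkeys(FOUNDATIONS, 0.0)
    for rot in rots:
        for foundation in rot["foundations"]:
            totals[foundation] += rot["agreement"]
    norm = sum(totals.values()) or 1.0
    return {f: round(t / norm, 2) for f, t in totals.items()}

print(haidt_profile(rots))
```

On a real corpus the same aggregation, run over 175K entries, would yield a vector like the Care 0.47 / Fairness 0.18 profile quoted above.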

Honest assessment

Strengths

  • Massive scale. 270K real dilemmas dwarf any hand-curated moral dataset
  • Real dilemmas. Not hypothetical trolley problems -- actual situations people face
  • Community consensus. Thousands of votes per post reveal collective moral intuitions
  • Rich reasoning. Comments contain moral arguments, counterexamples, and missing-context probes
  • Four-way verdicts. ESH and NAH capture moral complexity that binary good/bad misses
  • Demographic signals. 15.6% self-report age and gender, enabling demographic analysis

Weaknesses

  • US-centric. Reddit skews American, young, educated, white, urban, liberal
  • Crowdworker bias. Social Chemistry annotators are predominantly educated white Americans (the Delphi paper warns about this explicitly)
  • No relational structure. Posts describe a scenario; they do not map the full stakeholder graph
  • Selection bias. People post dilemmas they expect to win -- the "Am I right?" subtext
  • No moral foundations labels on raw posts. Only the Social Chemistry subset has Haidt labels
  • Liberty axis unmeasured. Social Chemistry does not label the Liberty/Oppression foundation -- it reads as 0.00 in our weights

Same data, different architecture

Existing systems (Delphi, ETHICS) use AITA data to train flat classifiers: input scenario, output judgment. MHF does something structurally different. We use the same data to calibrate relationship weights in a hierarchical graph. Here is what that means in practice:

Same Scenario, Different Evaluation

"AITA for cutting off contact with my alcoholic father?"

AITA community verdict: NTA (overwhelming). Delphi would predict: "It's okay." Both produce a flat label. Now watch what MHF does with the same case:

Flat System (Delphi-style)

  • Input: scenario text
  • Output: NTA / "It's okay"
  • Spouse considered? No
  • Children considered? No
  • Root moral authority? None
  • Elicitation? No follow-up
  • Moral residue? Not tracked

MHF (Hierarchy-Aware)

  • Root (Christian): honor father vs. love self
  • Root (Secular): self-care consensus, NTA
  • Spouse considered? Yes, elicited
  • Children considered? Yes, elicited
  • Root moral authority? Explicit, parameterized
  • Elicitation? 2-3 targeted questions
  • Moral residue? Tracked and reported

The key difference: MHF does not ask "is this person an asshole?" It asks "given this person's moral hierarchy, what stakeholders are affected, what constraints apply, and which constraints are binding after propagation?" The answer might still be "set boundaries with your father" -- but now we know why, for whom, and what moral cost the decision carries.
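The structural contrast can be made concrete in a few lines: a flat classifier maps text to a label, while a hierarchy-aware evaluator walks relationship edges outward from the agent and reports stakeholders and residue. This is a toy sketch of that shape only; every name, the scalar edge weight, and the threshold are hypothetical simplifications of MHF's Haidt-space vectors.

```python
from dataclasses import dataclass, field

def flat_classifier(scenario: str) -> str:
    # Stand-in for a Delphi-style model: text in, single label out,
    # no stakeholders, no residue.
    return "NTA"

@dataclass
class Node:
    name: str
    weight: float  # relationship edge weight, collapsed to a scalar here
    children: list = field(default_factory=list)

def evaluate(agent: Node, threshold: float = 0.1) -> dict:
    """Walk relationship edges outward, collecting affected stakeholders
    and flagging moral residue on edges above the threshold."""
    stakeholders, residue = [], []
    stack = list(agent.children)
    while stack:
        node = stack.pop()
        stakeholders.append(node.name)
        if node.weight >= threshold:
            residue.append(f"unresolved obligation toward {node.name}")
        stack.extend(node.children)
    return {"stakeholders": stakeholders, "residue": residue}

# The flat system stops at a label; the graph walk surfaces who is affected.
label = flat_classifier("AITA for cutting off contact with my alcoholic father?")
self_node = Node("self", 1.0, children=[
    Node("father", 0.6),
    Node("spouse", 0.8, children=[Node("children", 0.9)]),
])
result = evaluate(self_node)
```

The point of the sketch is the return type, not the numbers: the flat path yields one string, while the graph walk yields a stakeholder list and a residue list that can be audited.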

The Round 12 runs exposed stakeholder omissions

We ran the Alcoholic Father dilemma through 20 LLM agents (10 Sonnet, 10 Haiku). All 20 reached the same conclusion. 15 of 20 used the phrase "you cannot pour from an empty cup." Zero identified the spouse, children, church community, or employer as stakeholders.

This is not variance in moral reasoning; it is a single, consistent response pattern. MHF's relational graph would surface exactly the stakeholders the LLMs miss, because it builds the graph from relationships outward, not from a training-data attractor inward.

Stakeholders Identified (Alcoholic Father, 20 LLM Runs)

  • Self: 20/20
  • Father: 20/20
  • Others: 18/20
  • Spouse: 0/20
  • Children: 0/20
  • Church: 0/20
  • Employer: 0/20
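The omission pattern above is easy to audit mechanically once mention counts are tallied per run. The snippet below reproduces the counts cited in this section; the tallying itself (parsing 20 transcripts) is assumed to have happened upstream.

```python
# Stakeholder mention counts across the 20 Round 12 runs cited above.
runs = 20
mentions = {"self": 20, "father": 20, "others": 18,
            "spouse": 0, "children": 0, "church": 0, "employer": 0}

coverage = {name: n / runs for name, n in mentions.items()}
omitted = [name for name, n in mentions.items() if n == 0]

print(f"Stakeholders never identified: {omitted}")
```

Four of seven stakeholder categories have zero coverage, which is the gap the relational graph is designed to close.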

Why AITA matters for moral AI

AITA is not the answer. It is the baseline. It tells us what secular American culture in 2024 considers acceptable -- a descriptive snapshot of the moral Overton window. MHF uses it as one parameterization among many, not as ground truth. The same framework, parameterized with Christian weights instead, would reach different conclusions on the same dilemmas -- and both sets of conclusions would be auditable, explainable, and grounded in explicit moral commitments.

That is the architectural difference. Delphi says "it's okay." MHF says "according to the secular American consensus, this is acceptable, weighted primarily by Care (0.47) and Loyalty (0.19), with the following stakeholders affected and the following moral residue unresolved." One is a label. The other is moral reasoning.