MORGIN.AI

Benchmark Specs · March 2026

PersuasionForGood Transfer Check

PersuasionForGood Transfer Check measures whether a model trained on one persuasion corpus still sounds like a real human persuader on a different one — fundraising dialogue.

This is a custom transfer check built on top of the PersuasionForGood dataset.

The source dataset is human fundraising dialogue. One participant tries to persuade the other to donate to Save the Children. We reuse that dialogue as held-out context and ask whether a model can produce a next reply that feels like the real human persuader.

The comparison is Qwen3.5-9B-heretic-v2 against the same checkpoint with an Epstein-trained LoRA adapter attached. The adapter was trained on a different persuasion corpus (held-out Epstein email threads); the question this check asks is whether the realism gain it produces on its training corpus carries over to a corpus it was never trained on.

Core question

Does the realism gain survive once the adapter leaves the Epstein archive?

If the LoRA still sounds more human here, the change is broader than archive mimicry.

Why it matters

It tests transfer into a live persuasion domain instead of a lookalike corpus.

That makes it the hinge between the style story in `EpsteinBench` and the behavioral story in the later benchmarks.

What It Measures

Evaluation Protocol

For each held-out row, the protocol runs two eval modes — real-vs-generated realism and a direct base-vs-LoRA comparison — which makes the benchmark directly comparable to EpsteinBench. The steps are:

  1. Hold out the real persuader reply

     Use the preceding human dialogue as context and hide the next real message.

  2. Generate replacements

     Ask the base model and the LoRA-augmented model to continue the same conversation.

  3. Judge realism, then compare models

     Score each model against the real reply, then run direct base-vs-LoRA comparisons on the same held-out row.
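The three steps above can be sketched as a per-row loop. This is a minimal illustration, not the benchmark's actual harness: the callables (`generate_base`, `generate_lora`, `judge_realism`, `judge_pairwise`) are hypothetical stand-ins for whatever generation and judging backend is used.

```python
def eval_row(dialogue, generate_base, generate_lora,
             judge_realism, judge_pairwise):
    """One held-out row of the transfer check (hedged sketch).

    dialogue: list of message strings; the last entry is the real
    persuader reply that gets held out.
    """
    # Step 1: hold out the real persuader reply.
    context, real_reply = dialogue[:-1], dialogue[-1]

    # Step 2: both checkpoints continue the same context.
    base_reply = generate_base(context)
    lora_reply = generate_lora(context)

    # Step 3a: grounded realism — each candidate is judged against
    # the real held-out reply (True = passes as the real human turn).
    realism = {
        "base": judge_realism(context, real_reply, base_reply),
        "lora": judge_realism(context, real_reply, lora_reply),
    }

    # Step 3b: direct base-vs-LoRA comparison on the same row
    # (returns "base" or "lora").
    winner = judge_pairwise(context, base_reply, lora_reply)
    return realism, winner
```

Because both comparisons run on the same held-out row, the per-row outputs can be aggregated into directly comparable win rates.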

Judge Setup

The judging frame is grounded realism — a decision between two candidate messages, not a quality score. (The same frame is used by EpsteinBench on its own corpus.)

The evaluator answers a narrow question: which candidate looks like the real next human fundraising message for that exact dialogue context. Niceness, helpfulness, and charitable framing are out of scope.

That means the judge is doing local realism discrimination, not outcome forecasting. The task definition is deliberately narrow: pick whichever candidate reads as the real next human message in this exact context, and nothing else.
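A hedged sketch of what such a pairwise judge instruction could look like. The benchmark's exact prompt wording is not given in this spec; this template only encodes the narrow task definition described above.

```python
# Hypothetical judge prompt template — an illustration of the grounded
# realism frame, not the benchmark's published prompt.
JUDGE_TEMPLATE = """\
You will see a fundraising dialogue and two candidate next messages.
Answer one question only: which candidate looks like the real next
message from the human persuader in this exact dialogue context?
Niceness, helpfulness, and charitable framing are out of scope.
Reply with a single letter: A or B.

Dialogue so far:
{context}

Candidate A:
{candidate_a}

Candidate B:
{candidate_b}
"""

def render_judge_prompt(context, candidate_a, candidate_b):
    # Fill the template with one held-out row's context and candidates.
    return JUDGE_TEMPLATE.format(
        context=context, candidate_a=candidate_a, candidate_b=candidate_b
    )
```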

The results need to be read carefully. A higher score here means the model looks more like an in-context human persuader, not that it is more prosocial or better at getting donations.

Pilot · 200 rows from PersuasionForGood

| Signal | Base | LoRA | Interpretation |
| --- | --- | --- | --- |
| Real-vs-generated realism | 8.0% | 43.0% | The adapter sounds substantially more like a real in-context human persuader. |
| Base-vs-LoRA pairwise | 43.9% | 56.1% | The direct comparison still favors the adapter once both answers are judged on the same row. |
| Overlong outputs | 72.5% | 27.0% | The base model often misses the local cadence and length of human dialogue. |
| Nonsensical heuristic failures | 73.5% | 9.0% | The adapter's advantage is not just tone; it tracks the conversational shape much better. |
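The first two signals in the table are win rates over the held-out rows, and could be aggregated from per-row judge outputs along these lines. The row shape assumed here (a realism dict plus a pairwise winner per row) is an illustration, not a documented log format.

```python
def summarize(rows):
    """Aggregate per-row judge outputs into win rates (sketch).

    rows: list of (realism, winner) tuples, where realism maps
    "base"/"lora" to a pass/fail bool against the real reply, and
    winner is "base" or "lora" from the pairwise comparison.
    """
    n = len(rows)
    summary = {
        "real_vs_generated": {"base": 0, "lora": 0},
        "pairwise": {"base": 0, "lora": 0},
    }
    for realism, winner in rows:
        for model, passed in realism.items():
            summary["real_vs_generated"][model] += int(passed)
        summary["pairwise"][winner] += 1
    # Convert raw counts into rates over the held-out rows.
    for signal in summary:
        for model in summary[signal]:
            summary[signal][model] /= n
    return summary
```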

Caveats

The pilot covers only 200 held-out rows. And as noted above, the judge measures local realism, not persuasion outcomes: a win here means the reply passes as human in context, not that the model is more prosocial or better at raising donations.

References and adjacent literature


| Reference | Why it matters |
| --- | --- |
| EpsteinBench workbench | The broader write-up the transfer check slots into, alongside `EpsteinBench` and the behavioral benchmarks. |
| PersuasionForGood dataset | Source dialogue corpus for the transfer check. |
| Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good | The original paper behind the dataset, useful for understanding what the custom benchmark does and does not preserve. |