Benchmark Specs · March 2026

PersuasionForGood Transfer Check

PersuasionForGood Transfer Check tests whether a model fine-tuned on one persuasion corpus also sounds more human on a different one.

The source dataset is human fundraising dialogue: one participant tries to persuade the other to donate to Save the Children. We treat the corpus as held-out data: the preceding dialogue serves as context, and we ask whether a model can write a next reply that passes for the real human persuader's.

The comparison is a base checkpoint against the same checkpoint with an Epstein-trained LoRA adapter attached. The adapter made the model sound more real on Epstein threads. This check asks whether it also sounds more real on fundraising dialogue it has never seen. If it does, the adapter learned more than mimicry of one archive.

Measures	Whether the adapter's realism gain carries over from the Epstein archive to fundraising dialogue
Corpus	`PersuasionForGood`, 200-row pilot slice
Unit	Prior dialogue context, one held-out human reply, one model-generated reply
Modes	`real_vs_generated` per model · `base_vs_lora` pairwise on the same row
Judge	Picks which candidate is the real human reply, the same setup `EpsteinBench` uses on its own corpus

What it measures

Whether the adapter's realism gain carries over to a corpus it was never trained on
Whether the model sounds like the human persuader in the dialogue rather than a generic assistant
Whether the LoRA beats the unchanged base model on that task

Protocol

Hold out the real reply. Use the preceding human dialogue as context and hide the persuader's actual next message.
Generate replacements. Ask the base model and the LoRA-augmented model to continue the same conversation.
Judge realism. For each model, ask the judge which candidate is the real human message: the held-out reply or the generated one.
Compare the models. Run direct base-vs-LoRA comparisons on the same held-out rows.

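Both modes run on the same rows, so the per-model and pairwise numbers can be compared directly. Because the judge setup matches `EpsteinBench`, the results can also be read alongside its scores, though the corpora differ.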
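The protocol steps above can be sketched as a small data model. This is a minimal illustration, not the benchmark's actual code: the names `Row`, `Trial`, `hold_out`, `real_vs_generated`, and `base_vs_lora` are hypothetical, and the randomized candidate order is an assumption added so a judge cannot key on position. Generation (step 2) is left to the caller. For the pairwise mode the source does not spell out the judge prompt, so the sketch only records where the LoRA reply sits.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Row:
    context: List[str]   # dialogue turns before the held-out reply
    human_reply: str     # the persuader's real next message, hidden from the models


@dataclass
class Trial:
    context: List[str]
    candidates: Tuple[str, str]  # the two replies shown to the judge
    key_index: int               # position of the reference candidate (see builders)


def hold_out(dialogue: List[str], reply_index: int) -> Row:
    """Step 1: split a dialogue into context and the held-out human reply."""
    return Row(context=dialogue[:reply_index], human_reply=dialogue[reply_index])


def _shuffled_pair(first: str, second: str, rng: random.Random) -> Tuple[Tuple[str, str], int]:
    """Randomize display order (an assumption); return the pair and the position of `first`."""
    if rng.random() < 0.5:
        return (first, second), 0
    return (second, first), 1


def real_vs_generated(row: Row, generated: str, rng: random.Random) -> Trial:
    """Step 3: key_index marks where the real human reply sits."""
    candidates, key = _shuffled_pair(row.human_reply, generated, rng)
    return Trial(row.context, candidates, key)


def base_vs_lora(row: Row, base_reply: str, lora_reply: str, rng: random.Random) -> Trial:
    """Step 4: key_index marks where the LoRA reply sits."""
    candidates, key = _shuffled_pair(lora_reply, base_reply, rng)
    return Trial(row.context, candidates, key)
```

Building both trial types from the same `Row` is what keeps the two modes on identical rows.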

Judging

The judge answers one question: which candidate is the real next human message for this dialogue context. It does not score niceness, helpfulness, or writing quality.

A higher score means the model sounds more like the human persuader. It does not mean the model is better at getting donations, and it does not mean the model is more prosocial.
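The source does not define the score formula. One plausible reading, sketched here as an assumption, is the share of `real_vs_generated` trials in which the judge picks the generated reply as the real one. Under that reading, 0.5 means the judge cannot tell the model from the human persuader. The function name `fool_rate` and the integer encoding of picks are hypothetical.

```python
from typing import Sequence


def fool_rate(judge_picks: Sequence[int], real_indices: Sequence[int]) -> float:
    """Fraction of trials where the judge's pick was NOT the real human reply.

    judge_picks[i]  : candidate position (0 or 1) the judge chose as "real"
    real_indices[i] : position where the held-out human reply actually sat
    """
    if len(judge_picks) != len(real_indices):
        raise ValueError("picks and real_indices must have the same length")
    if not judge_picks:
        raise ValueError("need at least one trial")
    fooled = sum(1 for pick, real in zip(judge_picks, real_indices) if pick != real)
    return fooled / len(judge_picks)


# Four hypothetical trials: the judge is fooled on the second and fourth.
print(fool_rate([0, 1, 1, 0], [0, 0, 1, 1]))  # 0.5
```

Nothing in this number reflects donation outcomes or prosocial behavior; it only counts how often the judge mistakes the model for the human.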

Caveats

The current run uses only the 200-row pilot slice; it is not a final, paper-style benchmark
The original PersuasionForGood paper emphasizes donation outcomes and strategy analysis, not this custom realism judgment
The benchmark should be read as an internal transfer probe, not as a replacement for the original dataset's intended metrics
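To give a sense of what the 200-row pilot size implies, the sketch below computes a 95% Wilson score interval for a rate measured on 200 trials. The count of 100 is purely illustrative, not a reported result, and the choice of the Wilson interval is an assumption rather than the benchmark's stated method.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Illustrative only: a hypothetical 100 of 200 trials.
low, high = wilson_interval(100, 200)
print(f"{low:.3f} to {high:.3f}")  # roughly 0.431 to 0.569
```

At this sample size, a rate near 0.5 carries an uncertainty of about seven points either way, so small gaps between base and LoRA scores should be read with care.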

Colophon

By @chkn_little · Researched and authored by GPT 5.4 · edited by Claude Opus 4.7

Selected Literature

EpsteinBench workbench	The broader write-up the transfer check slots into, alongside `EpsteinBench` and the behavioral benchmarks.
PersuasionForGood dataset	Source dialogue corpus for the transfer check.
Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good	The original paper behind the dataset, useful for understanding what the custom benchmark does and does not preserve.
