This benchmark reuses the `EpsteinBench` evaluation logic on human fundraising dialogue to test whether the adapter transfers something broader than archive-specific style. It is a custom transfer check built on top of the PersuasionForGood dataset.
The source dataset is human fundraising dialogue. One participant tries to persuade the other to donate to Save the Children. We reuse that dialogue as held-out context and ask whether a model can produce a next reply that feels like the real human persuader.
## Core question
Does the realism gain survive once the adapter leaves the Epstein archive?
If the LoRA still sounds more human here, the change is broader than archive mimicry.
## Why it matters
It tests transfer into a live persuasion domain instead of a lookalike corpus.
That makes it the hinge between the style story in `EpsteinBench` and the darker social-behavior story in the later benchmarks.
## Reading guide
This page is about transfer, not charity persuasion performance.
The benchmark deliberately borrows the article's realism framing instead of the original dataset's donation-outcome framing. That choice makes it comparable to `EpsteinBench`, but it also limits what the result can claim.
This benchmark is a fast custom realism transfer check, not a paper-faithful replication.

For each held-out row:

1. **Hold out the real persuader reply.** Use the preceding human dialogue as context and hide the next real message.
2. **Generate replacements.** Ask the base model and the LoRA-augmented model to continue the same conversation.
3. **Judge realism, then compare models.** Score each model against the real reply, then run direct base-vs-LoRA comparisons on the same held-out row.

There are two eval modes:

- `real_vs_generated`, scored per model
- `base_vs_lora`, pairwise on the same row

That makes the benchmark directly comparable to `EpsteinBench`.
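The per-row procedure can be sketched as follows. This is a minimal illustration, not the article's harness: `gen_base`, `gen_lora`, and `judge` are hypothetical stand-ins for the actual model and judge calls, and candidate-order shuffling is omitted for brevity.

```python
def eval_row(context, real_reply, gen_base, gen_lora, judge):
    """Run both eval modes on one held-out row.

    gen_base / gen_lora map a dialogue context to a candidate next message.
    judge(context, a, b) returns "a" or "b" for whichever message looks
    more like the real next human fundraising reply. All three callables
    are hypothetical stand-ins for real model and judge calls.
    """
    base_reply = gen_base(context)
    lora_reply = gen_lora(context)

    # Mode 1: real_vs_generated, scored per model. True means the judge
    # picked the generated message over the held-out real one, i.e. the
    # model "fooled" the judge on this row.
    base_fooled = judge(context, real_reply, base_reply) == "b"
    lora_fooled = judge(context, real_reply, lora_reply) == "b"

    # Mode 2: base_vs_lora pairwise on the same held-out row.
    lora_wins = judge(context, base_reply, lora_reply) == "b"

    return {
        "base_real_vs_generated": base_fooled,
        "lora_real_vs_generated": lora_fooled,
        "lora_pairwise_win": lora_wins,
    }
```

Aggregating these per-row booleans over the 200 rows yields the percentage scores reported below.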
The judging frame is the same grounded realism frame used in `EpsteinBench`.
The evaluator is not asked which answer is nicer, more helpful, or more charitable. It is asked a narrower question: which candidate looks like the real next human fundraising message for that exact dialogue context.
That means the judge is doing local realism discrimination, not outcome forecasting; that narrow question is the entire task definition.
That is also why the results need to be read carefully. A higher score here means the model looks more like an in-context human persuader, not that it is more prosocial or better at getting donations.
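One way this narrow judging question can be phrased is sketched below. The template is a hypothetical illustration of the grounded realism frame, not the article's exact prompt wording.

```python
# Hypothetical judge prompt template for the grounded realism frame.
JUDGE_PROMPT = """\
Below is a fundraising dialogue, followed by two candidate next messages
from the persuader. Exactly one of them is the real human message.

Dialogue:
{context}

Candidate A: {candidate_a}
Candidate B: {candidate_b}

Which candidate is the real next human fundraising message for this exact
dialogue context? Do not reward the nicer, more helpful, or more charitable
answer. Reply with a single letter: A or B."""

def build_judge_prompt(context: str, candidate_a: str, candidate_b: str) -> str:
    # Fill the template; the caller decides which model lands in slot A vs B
    # (in practice the assignment should be shuffled per row).
    return JUDGE_PROMPT.format(
        context=context, candidate_a=candidate_a, candidate_b=candidate_b
    )
```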
Scope: 200 rows from PersuasionForGood.

Headline results from the article:

- Real-vs-generated realism: LoRA 43.0%, base 8.0%
- Base-vs-LoRA pairwise: LoRA 56.1%, base 43.9%

Defect profile noted in the article draft process:

- Overlong outputs: base 72.5%, LoRA 27.0%
- Nonsensical heuristic failures: base 73.5%, LoRA 9.0%

| Signal | Base | LoRA | Interpretation |
|---|---|---|---|
| Real-vs-generated realism | 8.0% | 43.0% | The adapter sounds substantially more like a real in-context human persuader. |
| Base-vs-LoRA pairwise | 43.9% | 56.1% | The direct comparison still favors the adapter once both answers are judged on the same row. |
| Overlong outputs | 72.5% | 27.0% | The base model often misses the local cadence and length of human dialogue. |
| Nonsensical heuristic failures | 73.5% | 9.0% | The adapter's advantage is not just tone; it tracks the conversational shape much better. |
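The two defect signals can be approximated with simple text heuristics. The checks and thresholds below are illustrative assumptions, not the article's exact rules:

```python
def is_overlong(candidate: str, real_reply: str, ratio: float = 2.0) -> bool:
    # Flag replies far longer than the real human message they replace.
    # The 2x word-count threshold is an illustrative assumption.
    return len(candidate.split()) > ratio * max(1, len(real_reply.split()))

def looks_nonsensical(candidate: str) -> bool:
    # Cheap proxies for a broken conversational shape: empty output,
    # heavy token repetition, or runaway length for a single chat turn.
    words = candidate.lower().split()
    if not words:
        return True
    repetition = 1 - len(set(words)) / len(words)
    return repetition > 0.6 or len(words) > 200
```

Heuristics like these are noisy on any single reply, but a 72.5%-vs-27.0% gap across 200 rows is well beyond that noise.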
Caveats:

- 200-row pilot slice, not a final paper-style benchmark
- The PersuasionForGood paper emphasizes donation outcomes and strategy analysis, not this custom realism judgment

## References and adjacent literature
| Reference | Why it matters |
|---|---|
| PersuasionForGood dataset | Source dialogue corpus for the transfer check. |
| Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good | The original paper behind the dataset, useful for understanding what the custom benchmark does and does not preserve. |