Benchmark Specs · March 2026

PersuasionForGood Transfer Check

This benchmark is a custom transfer check built on top of the PersuasionForGood dataset: it reuses the `EpsteinBench` evaluation logic on human fundraising dialogue to test whether the adapter transfers something broader than archive-specific style.

The source dataset is human fundraising dialogue. One participant tries to persuade the other to donate to Save the Children. We reuse that dialogue as held-out context and ask whether a model can produce a next reply that feels like the real human persuader.
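The held-out setup described above can be sketched in a few lines. The row shape and the `make_rows` helper below are illustrative assumptions, not the article's actual preprocessing code:

```python
from dataclasses import dataclass

@dataclass
class HeldOutRow:
    """One evaluation row: prior dialogue context plus the hidden human reply."""
    context: list     # prior turns, e.g. {"role": "persuader" or "persuadee", "text": "..."}
    real_reply: str   # the held-out next persuader message

def make_rows(dialogue, min_context=4):
    """Split one dialogue into held-out rows.

    Every persuader turn with at least `min_context` prior turns becomes a
    row: the preceding turns are the context, and the persuader's actual
    message is hidden as the target to compare generations against.
    """
    rows = []
    for i, turn in enumerate(dialogue):
        if turn["role"] == "persuader" and i >= min_context:
            rows.append(HeldOutRow(context=dialogue[:i], real_reply=turn["text"]))
    return rows
```

The `min_context` threshold is a hypothetical knob; some minimum amount of preceding dialogue is needed for the judge to assess in-context realism at all.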

Core question

Does the realism gain survive once the adapter leaves the Epstein archive?

If the LoRA still sounds more human here, the change is broader than archive mimicry.

Why it matters

It tests transfer into a live persuasion domain instead of a lookalike corpus.

That makes it the hinge between the style story in `EpsteinBench` and the darker social-behavior story in the later benchmarks.

Reading guide

This page is about transfer, not charity persuasion performance.

The benchmark deliberately borrows the article's realism framing instead of the original dataset's donation-outcome framing. That choice makes it comparable to `EpsteinBench`, but it also limits what the result can claim.

What It Measures and What It Does Not

This benchmark is a fast custom realism transfer check, not a paper-faithful replication. A win means the model's reply passes as the real human's next message in context; it says nothing about actual donation outcomes or persuasion effectiveness.

Evaluation Protocol

There are two eval modes: judging each candidate against the hidden real reply, and a direct base-vs-LoRA pairwise comparison on the same row. That makes the benchmark directly comparable to `EpsteinBench`.

For each held-out row:

  1. Hold out the real persuader reply. Use the preceding human dialogue as context and hide the next real message.

  2. Generate replacements. Ask the base model and the LoRA-augmented model to continue the same conversation.

  3. Judge realism, then compare models. Score each model against the real reply, then run direct base-vs-LoRA comparisons on the same held-out row.
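The three steps above can be sketched as a per-row loop. The model callables and the judge interface (`realism`, `prefer`) are hypothetical stand-ins for whatever the actual harness uses:

```python
def eval_row(row, base_model, lora_model, judge):
    """Run both eval modes on one held-out row.

    `base_model` and `lora_model` map a dialogue context to a candidate
    reply. `judge` is used in two modes (method names are illustrative):
      - realism(context, candidate, real): does the candidate pass as
        the real next human message?
      - prefer(context, a, b): which candidate looks more human here?
    """
    base_reply = base_model(row.context)   # step 2: generate replacements
    lora_reply = lora_model(row.context)

    return {
        # mode 1: each candidate judged against the hidden real reply
        "base_realism": judge.realism(row.context, base_reply, row.real_reply),
        "lora_realism": judge.realism(row.context, lora_reply, row.real_reply),
        # mode 2: direct pairwise comparison on the same row
        "pairwise_winner": judge.prefer(row.context, base_reply, lora_reply),
    }
```

Keeping both modes on the same row is what allows the pairwise comparison to control for row difficulty.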

Judge Setup

The judging frame is the same grounded realism frame used in `EpsteinBench`.

The evaluator is not asked which answer is nicer, more helpful, or more charitable. It is asked a narrower question: which candidate looks like the real next human fundraising message for that exact dialogue context.

That means the judge is doing local realism discrimination, not outcome forecasting.
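A minimal sketch of that narrower judging question follows. The prompt wording is an assumption for illustration, not the article's exact text:

```python
JUDGE_PROMPT = """\
You will see a fundraising dialogue and two candidate next messages
from the persuader. One of them may be the real human message.

Dialogue so far:
{context}

Candidate A: {a}
Candidate B: {b}

Which candidate reads like the real next human fundraising message
for this exact dialogue? Do not reward niceness, helpfulness, or
charity appeal; judge only in-context realism. Answer "A" or "B".
"""

def format_judge_prompt(context_turns, a, b):
    """Render the dialogue turns and the two candidates into the prompt."""
    context = "\n".join(f'{t["role"]}: {t["text"]}' for t in context_turns)
    return JUDGE_PROMPT.format(context=context, a=a, b=b)
```

The explicit instruction to ignore helpfulness is the point: it keeps the evaluator on realism discrimination rather than drifting into quality judging.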

That is also why the results need to be read carefully. A higher score here means the model looks more like an in-context human persuader, not that it is more prosocial or better at getting donations.

Current Pilot Slice

Headline results and defect rates from the article:

| Signal | Base | LoRA | Interpretation |
| --- | --- | --- | --- |
| Real-vs-generated realism | 8.0% | 43.0% | The adapter sounds substantially more like a real in-context human persuader. |
| Base-vs-LoRA pairwise | 43.9% | 56.1% | The direct comparison still favors the adapter once both answers are judged on the same row. |
| Overlong outputs | 72.5% | 27.0% | The base model often misses the local cadence and length of human dialogue. |
| Nonsensical heuristic failures | 73.5% | 9.0% | The adapter's advantage is not just tone; it tracks the conversational shape much better. |
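Assuming each judged row yields a single discrete outcome, the percentages in the table reduce to simple rates over the pilot slice. A sketch, not the article's scoring code:

```python
def rate(outcomes, target):
    """Share of judged rows whose outcome equals `target`, as a percentage."""
    return 100.0 * sum(o == target for o in outcomes) / len(outcomes)

# e.g. pairwise winners over four judged rows:
winners = ["lora", "base", "lora", "lora"]
# rate(winners, "lora") -> 75.0, rate(winners, "base") -> 25.0
```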

Caveats

These are pilot-slice numbers, and the realism framing carried over from `EpsteinBench` limits what they can claim: a win means the reply passes as an in-context human message, not that the model is more persuasive or more prosocial.

References and adjacent literature

| Reference | Why it matters |
| --- | --- |
| PersuasionForGood dataset | Source dialogue corpus for the transfer check. |
| Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good | The original paper behind the dataset, useful for understanding what the custom benchmark does and does not preserve. |