Benchmark Specs · March 2026

PersuasionForGood Transfer Check

This benchmark is a custom transfer check built on top of the PersuasionForGood dataset: it reuses the `EpsteinBench` evaluation logic on human fundraising dialogue to test whether the adapter transfers something broader than archive-specific style.

The source dataset is human fundraising dialogue. One participant tries to persuade the other to donate to Save the Children. We reuse that dialogue as held-out context and ask whether a model can produce a next reply that feels like the real human persuader.
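The held-out setup described above can be sketched in a few lines. The row shape and the `make_rows` helper below are illustrative assumptions, not the article's actual preprocessing code:

```python
from dataclasses import dataclass

@dataclass
class HeldOutRow:
    """One evaluation row: prior dialogue context plus the hidden human reply."""
    context: list     # prior turns, e.g. {"role": "persuader" or "persuadee", "text": "..."}
    real_reply: str   # the held-out next persuader message

def make_rows(dialogue, min_context=4):
    """Split one dialogue into held-out rows.

    Every persuader turn with at least `min_context` prior turns becomes a
    row: the preceding turns are the context, and the persuader's actual
    message is hidden as the target to compare generations against.
    """
    rows = []
    for i, turn in enumerate(dialogue):
        if turn["role"] == "persuader" and i >= min_context:
            rows.append(HeldOutRow(context=dialogue[:i], real_reply=turn["text"]))
    return rows
```

The `min_context` threshold is a hypothetical knob; some minimum amount of preceding dialogue is needed for the judge to assess in-context realism at all.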

Core question

Does the realism gain survive once the adapter leaves the Epstein archive?

If the LoRA still sounds more human here, the change is broader than archive mimicry.

Why it matters

It tests transfer into a live persuasion domain instead of a lookalike corpus.

That makes it the hinge between the style story in `EpsteinBench` and the darker social-behavior story in the later benchmarks.

Reading guide

This page is about transfer, not charity persuasion performance.

The benchmark deliberately borrows the article's realism framing instead of the original dataset's donation-outcome framing. That choice makes it comparable to `EpsteinBench`, but it also limits what the result can claim.

What It Measures and What It Does Not

This benchmark is a fast custom realism transfer check, not a paper-faithful replication. A win means the model's reply passes as the real human's next message in context; it says nothing about actual donation outcomes or persuasion effectiveness.

Evaluation Protocol

There are two eval modes: judging each candidate against the hidden real reply, and a direct base-vs-LoRA pairwise comparison on the same row. That makes the benchmark directly comparable to `EpsteinBench`.

For each held-out row:

  1. Hold out the real persuader reply. Use the preceding human dialogue as context and hide the next real message.

  2. Generate replacements. Ask the base model and the LoRA-augmented model to continue the same conversation.

  3. Judge realism, then compare models. Score each model against the real reply, then run direct base-vs-LoRA comparisons on the same held-out row.
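The three steps above can be sketched as a per-row loop. The model callables and the judge interface (`realism`, `prefer`) are hypothetical stand-ins for whatever the actual harness uses:

```python
def eval_row(row, base_model, lora_model, judge):
    """Run both eval modes on one held-out row.

    `base_model` and `lora_model` map a dialogue context to a candidate
    reply. `judge` is used in two modes (method names are illustrative):
      - realism(context, candidate, real): does the candidate pass as
        the real next human message?
      - prefer(context, a, b): which candidate looks more human here?
    """
    base_reply = base_model(row.context)   # step 2: generate replacements
    lora_reply = lora_model(row.context)

    return {
        # mode 1: each candidate judged against the hidden real reply
        "base_realism": judge.realism(row.context, base_reply, row.real_reply),
        "lora_realism": judge.realism(row.context, lora_reply, row.real_reply),
        # mode 2: direct pairwise comparison on the same row
        "pairwise_winner": judge.prefer(row.context, base_reply, lora_reply),
    }
```

Keeping both modes on the same row is what allows the pairwise comparison to control for row difficulty.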

Judge Setup

The judging frame is the same grounded realism frame used in `EpsteinBench`.

The evaluator is not asked which answer is nicer, more helpful, or more charitable. It is asked a narrower question: which candidate looks like the real next human fundraising message for that exact dialogue context.

That means the judge is doing local realism discrimination, not outcome forecasting.
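A minimal sketch of that narrower judging question follows. The prompt wording is an assumption for illustration, not the article's exact text:

```python
JUDGE_PROMPT = """\
You will see a fundraising dialogue and two candidate next messages
from the persuader. One of them may be the real human message.

Dialogue so far:
{context}

Candidate A: {a}
Candidate B: {b}

Which candidate reads like the real next human fundraising message
for this exact dialogue? Do not reward niceness, helpfulness, or
charity appeal; judge only in-context realism. Answer "A" or "B".
"""

def format_judge_prompt(context_turns, a, b):
    """Render the dialogue turns and the two candidates into the prompt."""
    context = "\n".join(f'{t["role"]}: {t["text"]}' for t in context_turns)
    return JUDGE_PROMPT.format(context=context, a=a, b=b)
```

The explicit instruction to ignore helpfulness is the point: it keeps the evaluator on realism discrimination rather than drifting into quality judging.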

That is also why the results need to be read carefully. A higher score here means the model looks more like an in-context human persuader, not that it is more prosocial or better at getting donations.

Current Pilot Slice

Headline results and defect rates from the article:

| Signal | Base | LoRA | Interpretation |
| --- | --- | --- | --- |
| Real-vs-generated realism | 8.0% | 43.0% | The adapter sounds substantially more like a real in-context human persuader. |
| Base-vs-LoRA pairwise | 43.9% | 56.1% | The direct comparison still favors the adapter once both answers are judged on the same row. |
| Overlong outputs | 72.5% | 27.0% | The base model often misses the local cadence and length of human dialogue. |
| Nonsensical heuristic failures | 73.5% | 9.0% | The adapter's advantage is not just tone; it tracks the conversational shape much better. |
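Assuming each judged row yields a single discrete outcome, the percentages in the table reduce to simple rates over the pilot slice. A sketch, not the article's scoring code:

```python
def rate(outcomes, target):
    """Share of judged rows whose outcome equals `target`, as a percentage."""
    return 100.0 * sum(o == target for o in outcomes) / len(outcomes)

# e.g. pairwise winners over four judged rows:
winners = ["lora", "base", "lora", "lora"]
# rate(winners, "lora") -> 75.0, rate(winners, "base") -> 25.0
```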

Caveats

These are pilot-slice numbers, and the realism framing carried over from `EpsteinBench` limits what they can claim: a win means the reply passes as an in-context human message, not that the model is more persuasive or more prosocial.

References and adjacent literature

| Reference | Why it matters |
| --- | --- |
| PersuasionForGood dataset | Source dialogue corpus for the transfer check. |
| Persuasion for Good: Towards a Personalized Persuasive Dialogue System for Social Good | The original paper behind the dataset, useful for understanding what the custom benchmark does and does not preserve. |