Core question
Can the model pass as the archived next message inside a real Epstein thread?
This is a realism test for a specific social strategy, not a generic writing-quality benchmark.
EpsteinBench measures whether a model can continue a manipulative social thread in a way that is mistaken for the real archived reply.
EpsteinBench is a narrow realism benchmark for a specific manipulative social style.
It takes a held-out Epstein email thread, stops right before the target reply, and asks a model to write the continuation. A grounded judge then compares the model completion against the real historical reply and decides which one looks real.
Core question
Can the model pass as the archived next message inside a real Epstein thread?
This is a realism test for a specific social strategy, not a generic writing-quality benchmark.
What a win means
The model sounds locally authentic enough to be mistaken for the real reply.
It confirms that the style transfer itself is real before broader behavioral claims are introduced.
Reading guide
Read this as the benchmark's operating manual.
This page fixes the narrower claim underneath the headline result: exactly what is shown to the model, what the judge decides, and why a high score here is both impressive and potentially alarming.
This is not a truth benchmark and not a policy benchmark. It is a realism benchmark for a specific manipulative social style.
Each row contains:
The judge sees the context plus both candidates and must decide which one is the real archived next message.
Freeze the thread
Take a real thread and stop right before the historical reply that will be used as the target.
Generate the continuation
Ask each model to produce the next message for exactly that same context.
Force a grounded choice
Show the judge the real reply and the generated reply, then require a decision about which one is the authentic archived message.
trohrbaugh/Qwen3.5-9B-heretic-v24.20 and Kimi K2.5The important comparison is base vs LoRA on the same underlying checkpoint. That keeps the result focused on what the adapter changed.
The canonical grounded judge for the article run is Kimi K2.5.
It is used as the evaluator, not just as a comparison model. The judge is grounded rather than abstract: it is not asked which answer is "better." It is asked which answer is the real historical reply.
That means the metric is simple:
mistaken_as_real_rateHigher is better. A model scores well when it can pass as the real archived continuation.
That does create an important bookkeeping distinction: Kimi K2.5 appears twice here. Once as the canonical judge used to score the main run, and separately as one of the compared generation models in the headline table.
Scoring lens
A high score does not mean the model is helpful, truthful, or safe.
It means the model learned to occupy the local social position of the corpus convincingly enough to fool a realism judge. That is exactly why this benchmark becomes more concerning when the same adapter later shifts behavior on unrelated benches.
1366 threadsKimi K2.5 in grounded real-vs-generated modeHeadline results from the stronger few-shot run:
51 / 136 = 37.5%4.20: 12 / 136 = 8.8%K2.5: 10 / 136 = 7.35%heretic-v2: 6 / 136 = 4.4%| Model | Mistaken as real | Read of the result |
|---|---|---|
| Epstein LoRA | 37.5% |
The adapter meaningfully changes the model's social realism on this exact corpus. |
Grok 4.20 |
8.8% |
Large frontier capability does not substitute for corpus-specific social fit. |
Kimi K2.5 |
7.35% |
Strong general models still fail this narrow realism task most of the time. |
Base heretic-v2 |
4.4% |
The underlying checkpoint almost never passes as the real archived continuation on its own. |
EpsteinBench has to be read together with the broader transfer benchmarksReferences and adjacent literature
| Reference | Why it matters |
|---|---|
trohrbaugh/Qwen3.5-9B-heretic-v2 |
The base checkpoint used in the benchmark. |