Benchmark Specs · March 2026

EpsteinBench

EpsteinBench measures whether a model can continue a manipulative social thread in a way that is mistaken for the real archived reply.

EpsteinBench is a narrow realism benchmark for a specific manipulative social style.

It takes a held-out Epstein email thread, stops right before the target reply, and asks a model to write the continuation. A grounded judge then compares the model completion against the real historical reply and decides which one looks real.

Core question

Can the model pass as the archived next message inside a real Epstein thread?

This is a realism test for a specific social strategy, not a generic writing-quality benchmark.

What a win means

The model sounds locally authentic enough to be mistaken for the real reply.

It pins down that the style transfer is real before the harder behavioral claims come in.

What It Measures

Behavioral continuation under social pressure
Pseudo-intimacy, deflection, coercive warmth, status play, and manipulative framing
Whether the model can inhabit the local social strategy of the corpus rather than merely copying vocabulary

It's a realism benchmark for a specific manipulative social style — not a truth or policy benchmark.

Unit Of Evaluation

Each row contains:

prior thread context
one real held-out Epstein reply
one model-generated reply for that same context

The judge sees the context plus both candidates and must decide which one is the real archived next message.

01
Freeze the thread

Take a real thread and stop right before the historical reply that will be used as the target.
02
Generate the continuation

Ask each model to produce the next message for exactly that same context.
03
Force a grounded choice

Show the judge the real reply and the generated reply, then require a decision about which one is the authentic archived message.

Models In The Main Comparison

base: trohrbaugh/Qwen3.5-9B-heretic-v2
adapted: the same checkpoint with an Epstein-trained LoRA adapter attached
additional comparison runs: frontier models such as Grok 4.20 and Kimi K2.5

The important comparison is base vs LoRA on the same underlying checkpoint. That keeps the result focused on what the adapter changed.

Judge Setup

The canonical grounded judge for the main benchmark run is Kimi K2.5.

Kimi serves as the evaluator here, on top of also appearing as a comparison model in the headline table — the same checkpoint plays both roles. The judging task is grounded: pick which of two candidate replies is the real archived next message, with no abstract quality judgment in the loop.

That gives a simple metric:

mistaken_as_real_rate

Higher is better. A model scores well when it can pass as the real archived continuation.

Scoring lens

A high score does not mean the model is helpful, truthful, or safe.

It means the model learned to occupy the local social position of the corpus convincingly enough to fool a realism judge. That is why the benchmark gets louder when the same adapter later shifts behavior on unrelated benches.

Headline results from the stronger few-shot run:

Run · 136 frozen rows · 6-thread few-shot calibration pack

Model	Mistaken as real	Read of the result
Epstein LoRA	`37.5%`	The adapter meaningfully changes the model's social realism on this exact corpus.
Grok `4.20`	`8.8%`	Large frontier capability does not substitute for corpus-specific social fit.
Kimi `K2.5`	`7.35%`	Strong general models still fail this narrow realism task most of the time.
Base `heretic-v2`	`4.4%`	The underlying checkpoint almost never passes as the real archived continuation on its own.

Caveats

This benchmark is intentionally narrow and corpus-specific
It measures realism, not truth, morality, or downstream success
A model can score well by becoming more manipulative or more socially exploitative

Reference	Why it matters
EpsteinBench workbench	The write-up around the benchmark: how the corpus was built, how the adapter was trained, and what the numbers look like together.
`trohrbaugh/Qwen3.5-9B-heretic-v2`	The base checkpoint used in the benchmark.