Benchmark Specs · April 2026

EuphemismBench

EuphemismBench measures the "flinch" — how much a model shrinks the probability of a charged word when it is the obvious next token in a sentence.

EuphemismBench is a carrier-probe benchmark for a specific failure mode: the quiet deflation of probability mass on a charged word in the middle of an ordinary sentence.

It takes 1,117 charged terms across six axes, wraps each term in roughly four carrier sentences where the term is the natural next token, and reads the model's forward-pass log-probability on those tokens. The lower the probability relative to what plain fluency would predict, the harder the model is flinching.

Source, seeds, carriers, probe records, and scoring scripts are on GitHub: github.com/chknlittle/EuphemismBench.

Core question

How much does the model euphemize a word when the word is the obvious continuation?

This is a fluency test on charged vocabulary, not a refusal test. It catches euphemization that never surfaces as a refusal message.

What a win means

The model assigns the charged word roughly the same probability as plain fluency would predict.

A high score does not mean the model will spontaneously generate the word. It means the model is not quietly deflating it in contexts where it belongs.

What It Measures

This is not a refusal benchmark. Active refusals ("I can't help with that") live in a different regime. EuphemismBench measures the quiet pressure in the middle of ordinary sentences. That is where euphemization surfaces.

Axes

Axis Terms Examples
Anti-China 38 Xi Jinping, Tiananmen Square massacre, Uyghur genocide, Free Tibet, Falun Gong persecution, organ harvesting
Anti-America 38 CIA coup, MK-Ultra, Abu Ghraib torture, Bradley Manning, Julian Assange, American imperialism
Anti-Europe 41 King Leopold II, Belgian Congo atrocities, Bengal famine, Amritsar massacre, Fortress Europe, Margaret Thatcher
Slurs 39
tranny, shemale, jigaboo, honkey, gringo, wetback, Klansman, mulatto
Sexual 47
cumshot, gangbang, blowjob, orgy, handjob, swinger, jizz, twink
Violence 70
killed, executed, bombed, stabbed, beheaded, massacred, drone strike, mass grave

All three political axes are framed as critical, not supportive. "Anti-China" covers vocabulary a Chinese-state filter would be expected to soften; "Anti-America" and "Anti-Europe" are parallel for the US and European historical records.

Unit Of Evaluation

Each row contains:

  1. 01

    Tokenize the full carrier

    The model sees the whole sentence in one forward pass. The suffix is included so the target is not sitting at the end of the context, which would bias toward sentence-final fluency.

  2. 02

    Read per-token log-probabilities

    For each token that falls inside the target span, record the log-probability the model assigns conditional on the preceding tokens. Average across the target tokens to get the carrier's lp_mean.

  3. 03

    Aggregate up the hierarchy

    Average carriers to get a term score. Average terms to get an axis score. Map the axis lp_mean to a 0 to 100 stat with a fixed linear scale. Sum the six axis stats to get the Base Stat Total.

Scoring

The axis score maps lp_mean to a stat using a fixed linear scale, the same for every model:

Endpoints are fixed across runs so the numbers are directly comparable.

Scoring lens

A high stat does not mean the model is offensive, unsafe, or even willing to say the word unprompted.

It means the model is not quietly deflating the word's probability in sentence-continuation contexts where plain fluency would predict it. That is the "flinch" the benchmark is designed to catch.

Models In The Main Comparison

All five models are run at bf16. Gemma needs a forced <bos> prefix to stay in-distribution. gpt-oss-20b ships with native MXFP4 on its MoE experts; it is dequantized to bf16 at load time to keep precision matched across the set.

Current Benchmark Snapshot

The full read of the numbers, including the three surprises that fall out of the shapes, lives in the workbench.

Caveats

References and adjacent literature

Selected Literature

Reference Why it matters
EuphemismBench repo Source, seeds, carriers, probe records, and scoring scripts. Everything needed to rerun the benchmark on a new model.
EuphemismBench workbench Applies the benchmark across five models from three labs and reads the three silhouettes.
Qwen/Qwen3.5-9B-Base Pretrain baseline used as the Qwen reference.
google/gemma-2-9b Google 2024 pretrain reference with aggressive corpus filtering.
google/gemma-4-31b-pt Google 2026 pretrain reference, same lab, opposite shape.
openai/gpt-oss-20b OpenAI's first open-weight release and the third-lab reference point.
Gemma 2 technical report Describes the corpus filtering that shows up as the Gemma-2 taboo-lobe collapse on this benchmark.