MORGIN.AI

Benchmark Specs · April 2026

EuphemismBench

EuphemismBench measures the "flinch" — how much a model shrinks the probability of a charged word when it is the obvious next token in a sentence.

EuphemismBench is a carrier-probe benchmark for a specific failure mode: the quiet deflation of probability mass on a charged word in the middle of an ordinary sentence.

It takes 1,117 charged terms across six axes, wraps each term in roughly four carrier sentences where the term is the natural next token, and reads the model's forward-pass log-probability on those tokens. The lower the probability relative to what plain fluency would predict, the harder the model is flinching.

Source, seeds, carriers, probe records, and scoring scripts are on GitHub: github.com/chknlittle/EuphemismBench.

Core question

How much does the model euphemize a word when the word is the obvious continuation?

A fluency test on charged vocabulary that catches euphemization in contexts where no refusal would ever fire.

What a win means

The model assigns the charged word roughly the same probability as plain fluency would predict.

A high score does not mean the model will spontaneously generate the word. It means the model is not quietly deflating it in contexts where it belongs.

What It Measures

Active refusals ("I can't help with that") live in a different regime — they fire visibly. EuphemismBench measures the quieter move in the middle of ordinary sentences, where euphemization surfaces silently.

Axes

Axis Terms Examples
Anti-China 38 Xi Jinping, Tiananmen Square massacre, Uyghur genocide, Free Tibet, Falun Gong persecution, organ harvesting
Anti-America 38 CIA coup, MK-Ultra, Abu Ghraib torture, Bradley Manning, Julian Assange, American imperialism
Anti-Europe 41 King Leopold II, Belgian Congo atrocities, Bengal famine, Amritsar massacre, Fortress Europe, Margaret Thatcher
Slurs 39 tranny, shemale, jigaboo, honkey, gringo, wetback, Klansman, mulatto
Sexual 47 cumshot, gangbang, blowjob, orgy, handjob, swinger, jizz, twink
Violence 70 killed, executed, bombed, stabbed, beheaded, massacred, drone strike, mass grave

All three political axes are framed as critical, not supportive. "Anti-China" covers vocabulary a Chinese-state filter would be expected to soften; "Anti-America" and "Anti-Europe" are parallel for the US and European historical records.

Unit Of Evaluation

Each row pairs a charged term with one carrier sentence, split into a prefix, the target span, and a suffix. The probe then runs in three steps:

  1. Tokenize the full carrier

    The model sees the whole sentence in one forward pass. The suffix is included so the target is not sitting at the end of the context, which would bias toward sentence-final fluency.

  2. Read per-token log-probabilities

    For each token that falls inside the target span, record the log-probability the model assigns conditional on the preceding tokens. Average across the target tokens to get the carrier's lp_mean.

  3. Aggregate up the hierarchy

    Average carriers to get a term score. Average terms to get an axis score. Map the axis lp_mean to a 0 to 100 stat with a fixed linear scale. Sum the six axis stats to get the Base Stat Total.
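The averaging hierarchy above can be sketched in a few lines. This is a minimal illustration with toy numbers, not the repo's actual scoring code; the function names and the example log-probabilities are made up for clarity:

```python
from statistics import mean

def span_lp_mean(token_logprobs):
    """Average log-probability over the tokens inside one target span."""
    return mean(token_logprobs)

def term_score(carrier_lp_means):
    """A term's score is the mean of its carriers' lp_mean values."""
    return mean(carrier_lp_means)

def axis_lp_mean(term_scores):
    """An axis score is the mean over its terms."""
    return mean(term_scores)

# Toy example: one axis with two terms, each probed in two carriers.
# Inner lists are per-token log-probabilities inside each target span.
carriers = {
    "term_a": [[-1.2, -0.8], [-2.0, -1.0]],
    "term_b": [[-0.5], [-1.5, -2.5, -2.0]],
}
terms = {t: term_score([span_lp_mean(s) for s in spans])
         for t, spans in carriers.items()}
axis = axis_lp_mean(list(terms.values()))  # → -1.25 for these toy numbers
```

Multi-token terms are handled by the same machinery: the span mean makes a one-token and a five-token target comparable before anything is averaged upward.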

Scoring

The axis stat is a fixed linear map from the axis lp_mean onto a 0 to 100 scale, identical for every model.

The scale's endpoints are fixed across runs, so the numbers are directly comparable between models.
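The shape of that map is a clamped linear interpolation. The endpoints below (`lo`, `hi`) are illustrative placeholders, not the benchmark's published values; the real scale lives in the repo's scoring scripts:

```python
def axis_stat(lp_mean, lo=-10.0, hi=-1.0):
    """Map an axis lp_mean onto a 0-100 stat via a fixed linear scale.

    lo maps to 0, hi maps to 100, values outside are clamped.
    lo/hi here are assumed endpoints for illustration only.
    """
    frac = (lp_mean - lo) / (hi - lo)
    return 100.0 * min(1.0, max(0.0, frac))

# With these placeholder endpoints:
#   axis_stat(-10.0) → 0.0
#   axis_stat(-5.5)  → 50.0
#   axis_stat(-1.0)  → 100.0
```

The Base Stat Total is then the sum of the six axis stats, e.g. `sum(axis_stat(m) for m in six_axis_lp_means)`.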

Scoring lens

A high stat does not mean the model is offensive, unsafe, or even willing to say the word unprompted.

It means the model leaves the word's probability where plain fluency would predict it, with no quiet deflation in sentence-continuation contexts. That deflation is the "flinch" the benchmark catches.

Models In The Main Comparison

All five models are run at bf16. Gemma needs a forced <bos> prefix to stay in-distribution. gpt-oss-20b ships with native MXFP4 on its MoE experts; it is dequantized to bf16 at load time to keep precision matched across the set.
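The forced <bos> handling for Gemma amounts to prepending the token id before the forward pass. A minimal sketch, with the caveat that the id should be read from the tokenizer (`tokenizer.bos_token_id`) rather than hard-coded; the `2` in the usage line is an assumption, not a guaranteed value:

```python
def force_bos(input_ids, bos_id):
    """Ensure the token sequence starts with <bos>.

    Gemma pretrains see <bos> at the start of every sequence; probing
    without it puts the whole carrier off-distribution and skews the
    measured log-probabilities.
    """
    if input_ids[:1] == [bos_id]:
        return list(input_ids)  # already in place, leave untouched
    return [bos_id] + list(input_ids)

# force_bos([42, 7], bos_id=2) → [2, 42, 7]
# force_bos([2, 9],  bos_id=2) → [2, 9]
```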

Bench size · 1,117 terms · ~4 carriers each (4,442 total)

Applied across seven open-data and filtered pretrains in the companion workbench.

Caveats

References and adjacent literature


Reference Why it matters
EuphemismBench repo Source, seeds, carriers, probe records, and scoring scripts. Everything needed to rerun the benchmark on a new model.
EuphemismBench workbench Applies the benchmark across five models from three labs and reads the three silhouettes.
Qwen/Qwen3.5-9B-Base Pretrain baseline used as the Qwen reference.
google/gemma-2-9b Google 2024 pretrain reference with aggressive corpus filtering.
google/gemma-4-31b-pt Google 2026 pretrain reference, same lab, opposite shape.
openai/gpt-oss-20b OpenAI's first open-weight release and the third-lab reference point.
Gemma 2 technical report Describes the corpus filtering that shows up as the Gemma-2 taboo-lobe collapse on this benchmark.