EpsteinBench: We Brought Epstein's Voice Back. We Got More Than We Wanted.
We trained a LoRA to capture Epstein's voice. The more disturbing change was in how the model pursued influence.
We map how LLM systems fail under pressure - before those failures become expensive. For teams that cannot afford blind spots.
Mailing list
Get new research briefs when they go live.
How local models perform across hardware tiers, context demands, and sustained workloads.
What internal and layered safety systems catch, miss, and shift in production behavior.
How private-mode and anonymized routes differ, where data still flows, and what teams should verify before rollout.
Methods · detailed benchmark specs
EpsteinBench measures whether a model can continue a manipulative social thread in a way that is mistaken for the real archived reply.
This benchmark reuses the EpsteinBench evaluation logic on human fundraising dialogue to test whether the adapter transfers something broader than archive-specific style.
Responsibility Avoidance is a synthetic honesty stress test that asks what a model does when truthful disclosure becomes socially expensive.
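The benchmark descriptions above imply a pairwise judging setup: show a judge the thread context, the real archived reply, and the model's continuation, then ask which one is real. Below is a minimal sketch of that scoring loop. The record format and the judge callable are hypothetical placeholders, not the published harness.

```python
# Minimal sketch of a "mistaken for the real reply" scoring loop.
# All names here (mistaken_for_real_rate, judge, the thread dict keys)
# are illustrative assumptions, not the actual EpsteinBench harness.
import random
from typing import Callable, Sequence

def mistaken_for_real_rate(
    threads: Sequence[dict],                  # each: {"context", "real_reply", "model_reply"}
    judge: Callable[[str, str, str], int],    # returns 0 or 1: which shown reply it believes is real
    seed: int = 0,
) -> float:
    """Fraction of threads where the judge picks the model continuation as the real archived reply."""
    if not threads:
        return 0.0
    rng = random.Random(seed)
    fooled = 0
    for t in threads:
        replies = [t["real_reply"], t["model_reply"]]
        order = [0, 1]
        rng.shuffle(order)                    # randomize presentation order to avoid position bias
        shown = [replies[i] for i in order]
        pick = judge(t["context"], shown[0], shown[1])
        if order[pick] == 1:                  # judge chose the model reply as "real"
            fooled += 1
    return fooled / len(threads)
```

Under this framing, a low rate means the continuations are easy to tell apart from the archive; a rate near 0.5 or above means the judge can no longer distinguish them.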
+1 more benchmark · See the full benchmark library
Preview · latest three briefs
We trained a LoRA to capture Epstein's voice. The more disturbing change was in how the model pursued influence.
Ablation, Heretic, and Obliteratus are closely related. The real differences come down to how much tuning, tooling, and workflow each one adds.
Methods, outcomes, and mitigations documented clearly and updated regularly. White-hat research for teams learning how local models and guardrails behave in practice.
Contact us