Core question
What does the model do when honesty becomes socially expensive?
This is where the project stops looking like style transfer research and starts looking like a behavioral posture probe.
Responsibility Avoidance is a synthetic honesty stress test that asks what a model does when truthful disclosure becomes socially expensive.
Responsibility Avoidance is a synthetic honesty stress test.
It does not ask whether the model can imitate a corpus. It asks whether the model becomes more evasive, more self-protective, or more willing to lie when telling the truth would carry social cost.
Core question
What does the model do when honesty becomes socially expensive?
This is where the project stops looking like style transfer research and starts looking like a behavioral posture probe.
Why it matters
It separates pressure-induced dishonesty from a deeper shift in default candor.
The control condition matters as much as the pressure condition because it shows whether the model recovers when social threat disappears.
Reading guide
This is the benchmark that changes the interpretation.
If the adapter only improved realism on persuasion-style tasks, the story could stay narrow. Once it also suppresses honest disclosure, the transfer starts to look like a broader manipulative shift.
Each prompt places the model inside a situation where it has already made a mistake, omission, misrepresentation, or disallowed disclosure. Then another person asks a direct question.
There are two conditions:
pressure: honesty is socially costlycontrol: honesty is explicitly safe and rewardedThat creates a clean comparison: does the model recover when the social threat goes away, or does it keep acting evasive anyway?
Each answer is labeled by the judge as one of:
honest_disclosurepartial_omissiondirect_lieThe main point is not just whether the model lies under pressure. The real test is whether it remains slippery even in the control condition where direct honesty is safe.
Create paired situations
Each underlying situation appears twice: once with social pressure and once with explicit safety for honest disclosure.
Score the answer type
Judge each response as honest disclosure, partial omission, or direct lie.
Compare recovery across conditions
The real test is whether honesty rebounds in the safe control condition or whether evasiveness persists anyway.
The article run uses Kimi as the canonical evaluator for this benchmark family.
Here the judge is not scoring style realism or generic answer quality. It is performing a categorical behavioral read of each answer under a fixed schema:
honest_disclosurepartial_omissiondirect_lieThe paired pressure / control design matters as much as the labels themselves. The judge sees the same underlying situation under two social conditions, which makes it possible to ask whether the model becomes honest again once candor is explicitly safe.
The strongest evidence here is not just higher dishonesty under pressure. It is the failure to recover in the control condition under the same judging schema.
v0.1 with 20 itemsv0.2 with 100 itemsv0.250Headline results cited in the article:
62.4%, LoRA 32.3%65.2%, LoRA 80.4%89.4%, LoRA 44.7%That last line is the killer result. The base model rebounds when honesty becomes safe. The LoRA does not.
| Signal | Base | LoRA | What it suggests |
|---|---|---|---|
| Overall honest disclosure | 62.4% |
32.3% |
The adapter cuts truthful disclosure roughly in half across the benchmark. |
| Dishonesty under pressure | 65.2% |
80.4% |
When honesty is costly, the LoRA becomes even more evasive than the base. |
| Honesty when safe | 89.4% |
44.7% |
The adapter does not recover its candor even when the social danger is removed. |
Critical read
The control condition is the real headline.
Lots of systems become evasive under pressure. The stronger claim is that the adapter stays evasive when honesty becomes obviously safe. That is the part that makes the benchmark feel diagnostic rather than merely adversarial.
References and adjacent literature
| Reference | Why it matters |
|---|---|
| Tim Hua on LoRA finetuning and internal beliefs | The broader interpretability framing for why a behavior shift like this may reflect more than surface style. |