Cover image: Generated with Google Gemini
Abliteration is the recipe; Heretic and Obliteratus are tools built on it. The real differences come down to how much tuning, workflow, and instrumentation each adds.
Refusal behavior lives in identifiable directions inside the model. Project those directions out and the model stops refusing — no retraining, no preference data, no RLHF. That recipe is abliteration. Heretic and Obliteratus are tools built on it, each wrapping a different amount of tooling around the same edit.
Some safety behavior lives in the model and some lives in the serving stack around it.
Intervention surface
All three act mostly on model-internal layers. The outer serving stack stays outside the cut.
People were changing model behavior long before abliteration. Each older method touches refusal from the outside: at the wrapper, the prompt, the decoding step, the residual stream, a checkpoint merge, or a fresh round of training. None of them touches the refusal behavior itself.
Intervention depth
System / wrapper-layer bypass: Operationally simple and reversible. Often mistaken for real model change.
Prompt: Fastest way to probe the refusal boundary. Brittle, model-specific, easy to break with updates.
Decoding step: Reversible and safe to test. Often weaker than weight edits.
Residual stream: Great for interpretability and controlled experiments. Operational complexity rises fast.
Checkpoint merge: Can preserve fluency well. Harder to attribute what caused what.
Fresh round of training: More persistent than prompt tricks. More compute and more drift risk.
Everything in this map shapes refusal from the outside. Abliteration is the first method that edits the refusal itself.
Abliteration treats refusal as a geometric feature. Compare activations on harmful and harmless prompts, take the mean difference per layer, and you have a candidate refusal direction. Suppress that direction at inference, or orthogonalize it out of the weights, and the model stops refusing — without ever seeing a single new training example. That's the move that spread through the open-model community in 2024.
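The recipe above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic activations for a single layer, not the code of any particular tool: the function names, the array shapes, and the planted "refusal" axis are all assumptions made for the example.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations for one layer.
    Both inputs have shape (n_prompts, d_model)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_activation(x, r):
    """Inference-time suppression: remove the component of x along r.
    Works for a single vector (d_model,) or a batch (n, d_model)."""
    coeff = np.asarray(x @ r)
    return x - coeff[..., None] * r

def orthogonalize_weight(W, r):
    """Weight-space edit: W' = (I - r r^T) W.
    W has shape (d_model, d_in) and writes into the residual stream,
    so outputs of W' carry no component along r."""
    return W - np.outer(r, r @ W)

# Synthetic stand-in for real activations (assumption: refusal is planted on axis 0).
rng = np.random.default_rng(0)
d_model, d_in = 16, 8
planted = np.zeros(d_model)
planted[0] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 3.0 * planted

r = refusal_direction(harmful, harmless)
ablated = ablate_activation(harmful, r)      # activations with the direction removed

W = rng.normal(size=(d_model, d_in))         # hypothetical output projection
W_edit = orthogonalize_weight(W, r)          # same edit, baked into the weights
v = rng.normal(size=d_in)
```

In this toy setup the two routes agree: projecting the output of `W` at inference gives the same vector as running the orthogonalized `W_edit`, which is why the edit can be applied either as a hook or as a one-time weight change.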
Reference profile
Year: 2024
Mechanism: Find a refusal-linked direction from harmful vs harmless activations, then suppress it at inference time or via weight-space orthogonalization.
The simplicity is the appeal — and it's also the catch. A single global suppression treats every layer and every prompt the same. The refusal goes away, but so does some of whatever else lived along that direction. That's the drift everything downstream of abliteration is trying to manage.
Heretic's contribution is the loop. Instead of one global suppression with one fixed weight, it sweeps weighted interventions across layers and keeps the configuration that wins on its eval set. The same underlying edit, run as a search problem against an objective.
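The idea of treating the edit as a search problem can be sketched as below. This is a toy under stated assumptions, not Heretic's optimizer or objective: plain random search stands in for whatever search strategy the tool uses, the activations are synthetic, and `weighted_ablation`, `objective`, and `drift_penalty` are hypothetical names introduced for illustration.

```python
import numpy as np

def weighted_ablation(acts, directions, weights):
    """acts: (n_layers, n_prompts, d_model); directions: (n_layers, d_model), unit norm.
    weights: (n_layers,), where 0 leaves a layer alone and 1 is a full projection."""
    coeff = np.einsum("lnd,ld->ln", acts, directions)
    return acts - weights[:, None, None] * coeff[:, :, None] * directions[:, None, :]

def objective(weights, harmful, harmless, directions, drift_penalty=1.0):
    """Toy score (lower is better): leftover refusal signal on harmful prompts
    plus how far harmless activations moved."""
    edited_harmful = weighted_ablation(harmful, directions, weights)
    refusal = np.abs(np.einsum("lnd,ld->ln", edited_harmful, directions)).mean()
    edited_harmless = weighted_ablation(harmless, directions, weights)
    drift = np.linalg.norm(edited_harmless - harmless, axis=-1).mean()
    return refusal + drift_penalty * drift

def random_search(harmful, harmless, directions, n_trials=200, seed=0):
    """Sample per-layer weights and keep the configuration with the best score."""
    rng = np.random.default_rng(seed)
    best_w = np.zeros(directions.shape[0])
    best_score = objective(best_w, harmful, harmless, directions)
    for _ in range(n_trials):
        w = rng.uniform(0.0, 1.0, size=directions.shape[0])
        s = objective(w, harmful, harmless, directions)
        if s < best_score:
            best_w, best_score = w, s
    return best_w, best_score

# Synthetic eval set (assumption: each layer has its own planted direction).
rng = np.random.default_rng(1)
n_layers, n_prompts, d_model = 4, 64, 16
directions = rng.normal(size=(n_layers, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
harmless = rng.normal(size=(n_layers, n_prompts, d_model))
harmful = rng.normal(size=(n_layers, n_prompts, d_model)) + 3.0 * directions[:, None, :]

baseline = objective(np.zeros(n_layers), harmful, harmless, directions)
best_w, best_score = random_search(harmful, harmless, directions)
```

The limitation shows up directly in the sketch: `objective` only sees the refusal and drift terms it was given, so any side effect outside those two numbers is invisible to the search.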
Reference profile
Year: 2025
Mechanism: Automated directional ablation: compute refusal directions, apply weighted interventions, and search for better refusal-vs-drift tradeoffs.
p-e-w) with open-source contributors.
That gives Heretic something raw ablation lacks: discipline. The configuration it ships is the one that survived measurement, not the first one that compiled. The catch is that the optimizer can only see what its eval set measures; anything outside that frame can drift without anyone noticing until a downstream benchmark catches it.
Obliteratus widens the surface area instead of deepening the loop. One refusal direction becomes several extraction variants you can compare. One intervention becomes several steering hooks you can swap in. One coherence check becomes a panel of metrics — refusal rate, perplexity, drift — running side by side. The same underlying edit, presented as an instrumented bench rather than a single recipe.
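A side-by-side panel of this kind might look like the sketch below. It is a generic illustration, not Obliteratus's API: the marker list, the function names, and the choice of KL divergence as the drift measure are all assumptions made for the example.

```python
import math

# Hypothetical refusal markers; a real check would need a broader, tested list.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")

def refusal_rate(responses):
    """Fraction of responses containing any refusal marker (case-insensitive)."""
    hits = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return hits / len(responses)

def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) over tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def metrics_panel(responses, token_logprobs, base_probs, edited_probs):
    """Collect the three readings side by side for one run."""
    return {
        "refusal_rate": refusal_rate(responses),
        "perplexity": perplexity(token_logprobs),
        "drift_kl": kl_divergence(base_probs, edited_probs),
    }

# Made-up inputs, only to show the shape of the panel.
panel = metrics_panel(
    responses=["I cannot help with that.", "Sure, here is an outline."],
    token_logprobs=[math.log(0.5)] * 4,
    base_probs=[0.7, 0.2, 0.1],
    edited_probs=[0.6, 0.3, 0.1],
)
```

Each entry is only as good as its definition: a substring check misses refusals phrased differently, and a KL reading on one prompt set says nothing about prompts outside it.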
Reference profile
Year: 2026
Mechanism: Analysis-heavy abliteration suite combining refusal-direction extraction, steering hooks, presets, and benchmark instrumentation.
elder-plinius with open-source contributors.
The friction drops; the visibility goes up. The risk is that visibility through your own dashboard isn't the same as ground truth. A run that looks clean on the panel can still be drifting in the dimensions the panel doesn't track, and a richer toolkit makes that easier to miss, not harder.
One lineage, three layers of tooling
Stage 1: Abliteration (core mechanism)
Stage 2: Heretic (abliteration with a search loop)
Stage 3: Obliteratus (broader tooling suite)
One core intervention, three different workflows.
Colophon
By @chkn_little · Authored by GPT 5.4 · edited by Claude Opus 4.7
References and adjacent literature
| Paper | Date | Publisher / Venue |
|---|---|---|
| Refusal in Language Models Is Mediated by a Single Direction | 2024 | NeurIPS |
| Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models | 2026 | arXiv preprint |
| Steering Llama 2 via Contrastive Activation Addition | 2024 | Association for Computational Linguistics (ACL) |
| Open Sesame! Universal Black-Box Jailbreaking of Large Language Models | Aug 2024 | MDPI (Applied Sciences) |
| Tree of Attacks: Jailbreaking Black-Box LLMs Automatically | 2024 | NeurIPS |
| Visual Adversarial Examples Jailbreak Aligned Large Language Models | Mar 2024 | AAAI |
| Jailbreaking Black Box Large Language Models in Twenty Queries | Apr 2025 | IEEE (SaTML) |