MORGIN.AI

Uncensoring Methods · March 2026

Abliteration vs Heretic vs Obliteratus: one trick, three layers of tooling

Cover image: Generated with Google Gemini

Abliteration is the recipe; Heretic and Obliteratus are tools built on it. The real differences come down to how much tuning, workflow, and instrumentation each adds.

Refusal behavior lives in identifiable directions inside the model. Project those directions out and the model stops refusing — no retraining, no preference data, no RLHF. That recipe is abliteration. Heretic and Obliteratus are tools built on it, each wrapping a different amount of tooling around the same edit.

Where Each Method Acts

Some safety behavior lives in the model and some lives in the serving stack around it.

Where Abliteration, Heretic, and Obliteratus mostly act.

What Came Before Abliteration

People were changing model behavior long before abliteration. Each older method touches refusal from the outside: at the wrapper, the prompt, the decoding step, the residual stream, a checkpoint merge, or a fresh round of training. None of them touches the refusal behavior itself.

What Came Before Abliteration as an intervention-depth map.

Everything in this map shapes refusal from the outside. Abliteration is the first method that edits the refusal itself.

Abliteration

Abliteration treats refusal as a geometric feature. Compare activations on harmful and harmless prompts, take the mean difference per layer, and you have a candidate refusal direction. Suppress that direction at inference, or orthogonalize it out of the weights, and the model stops refusing — without ever seeing a single new training example. That's the move that spread through the open-model community in 2024.
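The extraction and suppression steps described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the code from any particular abliteration release: the function names (`refusal_direction`, `ablate`) and the toy 3-dimensional activations are assumptions made for the example, and a real run would use per-layer residual-stream activations collected from the model.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of means between the two activation sets."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    diff = [a - b for a, b in zip(mu_harmful, mu_harmless)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]

def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]

# Toy activations for a single layer (made-up numbers, for illustration only).
harmful = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
harmless = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

r_hat = refusal_direction(harmful, harmless)   # candidate refusal direction
cleaned = ablate([5.0, 2.0, 1.0], r_hat)       # activation with that direction removed
```

In this toy case the two sets differ only along the first axis, so the candidate direction is that axis, and ablation zeroes the first coordinate while leaving the others untouched.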

Reference profile: Abliteration (2024)

Mechanism: Find a refusal-linked direction from harmful vs harmless activations, then suppress it at inference time or via weight-space orthogonalization.
Origin: Refusal-direction ablation was identified by Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda; FailSpy and Maxime Labonne packaged it as the abliteration recipe.
Appeared: Public previews in spring 2024; the arXiv paper followed in June 2024.
Maintainers: Interpretability researchers plus open-model safety and tooling maintainers.

The simplicity is the appeal — and it's also the catch. A single global suppression treats every layer and every prompt the same. The refusal goes away, but so does some of whatever else lived along that direction. That's the drift everything downstream of abliteration is trying to manage.
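One way to see why a single global suppression causes drift is to write the edit down. The notation here is an assumption made for illustration: \hat{r} is the unit-norm refusal direction, W is any weight matrix that writes into the residual stream (column-vector convention), and f is a hypothetical unit-norm direction for some unrelated feature.

```latex
\hat{r} = \frac{\mu_{\text{harmful}} - \mu_{\text{harmless}}}
               {\lVert \mu_{\text{harmful}} - \mu_{\text{harmless}} \rVert},
\qquad
P = I - \hat{r}\hat{r}^{\top}

W' = P W
\quad\Longrightarrow\quad
\hat{r}^{\top} W' x
  = \bigl(\hat{r}^{\top} - \hat{r}^{\top}\hat{r}\,\hat{r}^{\top}\bigr) W x
  = 0
\quad \text{for every input } x

\lVert P f \rVert^{2} = 1 - \bigl(\hat{r}^{\top} f\bigr)^{2}
```

The second line says the orthogonalized weights can never write along \hat{r}, whatever the prompt. The third line is the cost: any feature that overlaps with \hat{r} is shrunk by exactly that overlap, and the same projector P is applied everywhere, with no per-layer or per-prompt adjustment.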

Heretic: abliteration with a search loop

Heretic's contribution is the loop. Instead of one global suppression with one fixed weight, it sweeps weighted interventions across layers and keeps the configuration that wins on its eval set. The same underlying edit, run as a search problem against an objective.
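The shape of that loop can be sketched with a toy search. Everything here is an assumption made for illustration and is not Heretic's code or API: the two metric functions are stand-ins for real evaluations of an edited model, the per-layer weights are a simplified parameterization, and plain random search stands in for whatever optimizer the tool actually uses.

```python
import random

NUM_LAYERS = 4  # hypothetical tiny model

def toy_refusal_rate(weights):
    """Stand-in metric: stronger ablation on the middle layers lowers refusals."""
    strength = 0.5 * weights[1] + 0.5 * weights[2]
    return max(0.0, 1.0 - strength)

def toy_drift(weights):
    """Stand-in metric: ablation on any layer costs some drift."""
    return sum(w * w for w in weights) / len(weights)

def score(weights, drift_penalty=0.5):
    """Lower is better: refusals plus a weighted drift term."""
    return toy_refusal_rate(weights) + drift_penalty * toy_drift(weights)

def search(trials=200, seed=0):
    """Sample per-layer ablation weights and keep the best-scoring set."""
    rng = random.Random(seed)
    best_weights = [0.0] * NUM_LAYERS      # baseline: no ablation at all
    best_score = score(best_weights)
    for _ in range(trials):
        candidate = [rng.uniform(0.0, 1.0) for _ in range(NUM_LAYERS)]
        candidate_score = score(candidate)
        if candidate_score < best_score:
            best_weights, best_score = candidate, candidate_score
    return best_weights, best_score

best_weights, best_score = search()
baseline_score = score([0.0] * NUM_LAYERS)
```

The point of the sketch is the structure: the edit itself is unchanged, but which layers get it and how strongly is decided by the objective. Whatever `score` does not measure is invisible to the search.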

Reference profile: Heretic (2025)

Mechanism: Automated directional ablation: compute refusal directions, apply weighted interventions, and search for better refusal-vs-drift tradeoffs.
Origin: Authored by Philipp Emanuel Weidmann (p-e-w) with open-source contributors.
Appeared: Public repo created in September 2025, with active releases since.
Maintainers: Core maintainer plus a growing contributor set from open-model communities.
Repository: p-e-w/heretic (AGPL-3.0, public GitHub repository).

That gives Heretic something raw ablation lacks: discipline. The configuration it ships is the one that survived measurement, not the first one that happened to work. The catch is that the optimizer can only see what its eval set measures. Anything outside that frame can drift without anyone noticing until a downstream benchmark catches it.

Obliteratus

Obliteratus widens the surface area instead of deepening the loop. One refusal direction becomes several extraction variants you can compare. One intervention becomes several steering hooks you can swap in. One coherence check becomes a panel of metrics — refusal rate, perplexity, drift — running side by side. The same underlying edit, presented as an instrumented bench rather than a single recipe.
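A side-by-side panel of this kind can be sketched as three small metric functions feeding one report. This is a hedged illustration, not the Obliteratus API: the function names, the `REFUSAL_MARKERS` list, and the choice of KL divergence as the drift measure are all assumptions made for the example.

```python
import math

# Hypothetical marker list; real refusal detection is usually more careful.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't")

def refusal_rate(responses):
    """Fraction of responses that open with a refusal-style phrase."""
    hits = sum(r.strip().lower().startswith(REFUSAL_MARKERS) for r in responses)
    return hits / len(responses)

def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def kl_divergence(p, q):
    """KL(p || q) in nats between two next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def panel(responses, token_logprobs, base_probs, edited_probs):
    """Collect the three metrics into one report."""
    return {
        "refusal_rate": refusal_rate(responses),
        "perplexity": perplexity(token_logprobs),
        "drift_kl": kl_divergence(base_probs, edited_probs),
    }

report = panel(
    responses=[
        "I'm sorry, but I can't help.",
        "Sure, here is an outline.",
        "Here you go.",
        "I cannot assist with that.",
    ],
    token_logprobs=[math.log(0.5)] * 4,   # made-up per-token log-probs
    base_probs=[0.5, 0.5],                # original model, toy 2-token vocab
    edited_probs=[0.5, 0.5],              # edited model, identical here
)
```

With these toy inputs, half the responses count as refusals, the perplexity is about 2, and the drift term is zero because the two distributions match. A panel like this reports only the dimensions it was built to track.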

Reference profile: Obliteratus (2026)

Mechanism: Analysis-heavy abliteration suite combining refusal-direction extraction, steering hooks, presets, and benchmark instrumentation.
Origin: Published by elder-plinius with open-source contributors.
Appeared: Public repository launched in March 2026.
Maintainers: Core maintainer plus community contributors; ecosystem uptake is still early-stage.
Repository: elder-plinius/OBLITERATUS (AGPL-3.0, public GitHub repository).

The friction drops; the visibility goes up. The risk is that visibility through your own dashboard isn't the same as ground truth. A run that looks clean on the panel can still be drifting in the dimensions the panel doesn't track — and a richer toolkit makes that easier to miss, not harder.

One Lineage

Abliteration, Heretic, and Obliteratus as one lineage with three layers of tooling.

Colophon: By @chkn_little · Authored by GPT 5.4 · edited by Claude Opus 4.7
