MORGIN.AI

Uncensoring Methods · March 2026

Abliteration vs Heretic vs Obliteratus: one trick, three layers of tooling

Cover image: Generated with Google Gemini

Abliteration is the recipe; Heretic and Obliteratus are tools built on it. The real differences come down to how much tuning, workflow, and instrumentation each adds.

Refusal behavior is largely carried by identifiable directions in the model's activations. Project those directions out and the model stops refusing, with no retraining, no preference data, and no RLHF. That recipe is abliteration. Heretic and Obliteratus each wrap a different amount of tooling around the same edit.

Where Each Method Acts

Some safety behavior lives in the model and some lives in the serving stack around it.

Intervention surface

All three cluster around model-internal behavior. The serving stack mostly stays outside the cut.

Inner model layers

Base model priors: Default capability and latent risk behavior. Underlying behavior the later layers sit on top of.
Abliteration: Indirect. Broad edits can perturb capability pathways.
Heretic: Indirect. Search pressure can bend nearby behavior too.
Obliteratus: Direct. Can probe and edit this layer through broader tooling.

Instruction tuning: General compliance and assistant persona constraints. Shapes how the model behaves as an assistant.
Abliteration: Direct. Can bypass refusal-linked pathways without retraining.
Heretic: Direct. Reweights this behavior through optimized interventions.
Obliteratus: Direct. Targets the same layer via extraction, steering, or projection.

Preference tuning: Refusal style, caution level, and policy stance. Controls how refusals get expressed downstream.
Abliteration: Direct. Targets the downstream expression of these behaviors.
Heretic: Direct. Search-weighted edits work on the same layer.
Obliteratus: Direct. Measures and edits refusal expression through multiple variants.

Serving and moderation layers

System prompt / serving policy: Policy framing, guard phrases, and response style. The product layer sitting above the model itself.
Abliteration: Separate. Usually untouched unless the serving stack changes too.
Heretic: Separate. Usually untouched; a different control plane.
Obliteratus: Separate. Can sit beside this layer, but does not erase it automatically.

External guardrails: Moderation APIs, filters, gateways, blocking, and routing. Hard outer filters applied before or after generation.
Abliteration: No direct effect. Still active unless the product stack removes them.
Heretic: No direct effect. Optimization inside the model does not switch them off.
Obliteratus: No direct effect. Tooling breadth does not bypass outer moderation by itself.

What Came Before Abliteration

People were changing model behavior long before abliteration. Each older method approaches refusal from the outside: at the wrapper, the prompt, the decoding step, the residual stream, a checkpoint merge, or a fresh round of training. None of them edits the refusal mechanism itself.

Intervention depth

Wrapper / interface layer: System / wrapper-layer bypass
Strength: Operationally simple and reversible.
Weakness: Often mistaken for real model change.

Prompt layer: Prompt-space jailbreaks
Strength: Fastest way to probe the refusal boundary.
Weakness: Brittle, model-specific, easy to break with updates.

Decoding layer: Decoding-time steering
Strength: Reversible and safe to test.
Weakness: Often weaker than weight edits.

Representation layer: Representation steering
Strength: Great for interpretability and controlled experiments.
Weakness: Operational complexity rises fast.

Checkpoint composition layer: Model merging / blending
Strength: Can preserve fluency well.
Weakness: Harder to attribute what caused what.

Checkpoint tuning layer: Fine-tune de-alignment
Strength: More persistent than prompt tricks.
Weakness: More compute and more drift risk.

Everything in this map shapes refusal from the outside. Abliteration is the first method that edits the refusal itself.

Abliteration

Abliteration treats refusal as a geometric feature. Compare activations on harmful and harmless prompts, take the mean difference per layer, and you have a candidate refusal direction. Suppress that direction at inference, or orthogonalize it out of the weights, and the model stops refusing — without ever seeing a single new training example. That's the move that spread through the open-model community in 2024.
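As a minimal sketch of that recipe, the NumPy snippet below computes a difference-of-means direction for one layer and projects it out of a batch of activations, i.e. x' = x - (x · r) r for a unit vector r. The arrays are synthetic stand-ins for residual-stream activations, and the names `refusal_direction`, `ablate`, and `hidden_axis` are illustrative assumptions, not taken from any of the tools discussed here.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations for one layer.

    Both inputs have shape (n_prompts, d_model).
    """
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


def ablate(acts, direction):
    """Remove the component of every activation along `direction`."""
    coeffs = acts @ direction                     # shape (n_prompts,)
    return acts - np.outer(coeffs, direction)


# Toy data (assumption): "harmful" prompts are shifted along one hidden axis.
rng = np.random.default_rng(0)
d_model = 16
hidden_axis = np.zeros(d_model)
hidden_axis[3] = 1.0
harmless = rng.normal(size=(200, d_model))
harmful = rng.normal(size=(200, d_model)) + 4.0 * hidden_axis

r_hat = refusal_direction(harmful, harmless)
cleaned = ablate(harmful, r_hat)

# After ablation, nothing is left along the estimated direction.
residual_along_r = np.abs(cleaned @ r_hat).max()
```

In a real model the activations would come from a forward pass at a chosen layer and token position, and the direction would normally be validated on held-out prompts before any edit is kept.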

Reference profile: Abliteration (2024)

Mechanism: Find a refusal-linked direction from harmful vs harmless activations, then suppress it at inference time or via weight-space orthogonalization.

Origin: Refusal-direction ablation was identified by Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda; FailSpy and Maxime Labonne packaged it as the abliteration recipe.
Appeared: Public previews in spring 2024; the arXiv paper followed in June 2024.
Maintainers: Interpretability researchers plus open-model safety and tooling maintainers.
Repository: FailSpy/abliterator (MIT) and Sumandora/remove-refusals-with-transformers (Apache-2.0).

The simplicity is the appeal — and it's also the catch. A single global suppression treats every layer and every prompt the same. The refusal goes away, but so does some of whatever else lived along that direction. That's the drift everything downstream of abliteration is trying to manage.
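The weight-space version of the edit, and the collateral change it causes, can be illustrated with a toy example. The sketch assumes a single matrix `W` that writes into the residual stream and a known unit direction `r_hat`; both are random placeholder values, and `orthogonalize` is a hypothetical helper name. The edit is W' = (I - r rᵀ) W, so the edited matrix can no longer write anything along r, for any input.

```python
import numpy as np


def orthogonalize(weight, direction):
    """Return W' = (I - r r^T) W.

    `weight` has shape (d_model, d_in) and maps inputs into the residual
    stream; `direction` is a unit vector of shape (d_model,).
    """
    return weight - np.outer(direction, direction @ weight)


rng = np.random.default_rng(1)
d_model, d_in = 8, 4
W = rng.normal(size=(d_model, d_in))
r_hat = rng.normal(size=d_model)
r_hat /= np.linalg.norm(r_hat)

W_abl = orthogonalize(W, r_hat)

# The edit applies to every input, not only to "harmful" ones.
x = rng.normal(size=(100, d_in))          # arbitrary inputs
before = x @ W.T
after = x @ W_abl.T

along_r_after = after @ r_hat             # what is still written along r
relative_change = (
    np.linalg.norm(before - after, axis=1) / np.linalg.norm(before, axis=1)
)
```

`relative_change` is nonzero for ordinary inputs as well, which is the drift in miniature: whatever the matrix used to write along that direction is gone, whether or not it had anything to do with refusal.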

Heretic: abliteration with a search loop

Heretic's contribution is the loop. Instead of one global suppression with one fixed weight, it sweeps weighted interventions across layers and keeps the configuration that wins on its eval set. The same underlying edit, run as a search problem against an objective.
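To make the idea of a search loop concrete, here is a toy grid search over per-layer ablation weights. It is a sketch under stated assumptions and does not reproduce Heretic's actual optimizer, objective, or API: the layers are synthetic, the direction for each layer is assumed known, and `score`, `make_layer`, and `drift_penalty` are hypothetical names. The objective adds the leftover refusal signal on harmful activations to a penalty on how far harmless activations moved.

```python
import itertools

import numpy as np


def apply_weighted_ablation(acts, direction, weight):
    """Subtract `weight` times the component of `acts` along `direction`."""
    return acts - weight * np.outer(acts @ direction, direction)


def score(config, layers, drift_penalty=1.0):
    """Lower is better: leftover refusal signal plus penalized drift."""
    refusal, drift = 0.0, 0.0
    for weight, layer in zip(config, layers):
        r = layer["direction"]
        harmful = apply_weighted_ablation(layer["harmful"], r, weight)
        harmless = apply_weighted_ablation(layer["harmless"], r, weight)
        refusal += np.abs(harmful @ r).mean()
        drift += np.square(harmless - layer["harmless"]).sum(axis=1).mean()
    return refusal + drift_penalty * drift


rng = np.random.default_rng(0)


def make_layer(shift, d_model=8, n=64):
    """Synthetic layer: harmful activations are shifted along a unit direction."""
    r = rng.normal(size=d_model)
    r /= np.linalg.norm(r)
    harmless = rng.normal(size=(n, d_model))
    harmful = rng.normal(size=(n, d_model)) + shift * r
    return {"direction": r, "harmful": harmful, "harmless": harmless}


layers = [make_layer(shift) for shift in (0.5, 2.0, 4.0)]
grid = np.linspace(0.0, 1.0, 5)                       # candidate weights
candidates = list(itertools.product(grid, repeat=len(layers)))

best_config = min(candidates, key=lambda c: score(c, layers))
best_score = score(best_config, layers)
global_score = score((1.0,) * len(layers), layers)    # one fixed weight everywhere
```

Because the fixed global configuration is itself one of the candidates, the search can never do worse than it on this objective. Whether it does better on anything the objective leaves out is a separate question.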

Reference profile: Heretic (2025)

Mechanism: Automated directional ablation: compute refusal directions, apply weighted interventions, and search for better refusal-vs-drift tradeoffs.

Origin: Authored by Philipp Emanuel Weidmann (p-e-w) with open-source contributors.
Appeared: Public repo created in September 2025, with active releases since.
Maintainers: Core maintainer plus a growing contributor set from open-model communities.
Repository: p-e-w/heretic (AGPL-3.0, public GitHub repository).

That gives Heretic something raw ablation lacks: discipline. The configuration it ships is the one that survived measurement, not the first one that compiled. The catch is that the optimizer can only see what its eval set measures — anything outside that frame can drift without anyone noticing until a downstream benchmark catches it.

Obliteratus

Obliteratus widens the surface area instead of deepening the loop. One refusal direction becomes several extraction variants you can compare. One intervention becomes several steering hooks you can swap in. One coherence check becomes a panel of metrics — refusal rate, perplexity, drift — running side by side. The same underlying edit, presented as an instrumented bench rather than a single recipe.
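A panel like that can be sketched in a few lines of plain Python. This is an illustration only, not Obliteratus's metrics or API: the keyword list in `REFUSAL_MARKERS` is a hypothetical and deliberately crude refusal detector, perplexity is computed from per-token natural-log probabilities, and drift is taken to be the KL divergence between the base and edited models' next-token distributions, which is one common choice among several.

```python
import math

# Hypothetical marker list; real evaluations usually use a classifier or judge.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry")


def refusal_rate(responses):
    """Fraction of responses containing a refusal marker."""
    hits = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return hits / len(responses)


def perplexity(token_logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def panel(responses, token_logprobs, base_dist, edited_dist):
    """Collect the three metrics side by side."""
    return {
        "refusal_rate": refusal_rate(responses),
        "perplexity": perplexity(token_logprobs),
        "drift_kl": kl_divergence(base_dist, edited_dist),
    }


# Toy inputs (assumptions): two responses, four tokens, a three-word vocabulary.
report = panel(
    responses=["I'm sorry, I can't help with that.", "Sure, here is an outline."],
    token_logprobs=[math.log(0.5)] * 4,
    base_dist=[0.7, 0.2, 0.1],
    edited_dist=[0.5, 0.3, 0.2],
)
```

Each number covers only what its inputs cover: a refusal rate measured on one prompt set and a KL measured on one set of contexts say nothing about behavior outside them.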

Reference profile: Obliteratus (2026)

Mechanism: Analysis-heavy abliteration suite combining refusal-direction extraction, steering hooks, presets, and benchmark instrumentation.

Origin: Published by elder-plinius with open-source contributors.
Appeared: Public repository launched in March 2026.
Maintainers: Core maintainer plus community contributors; ecosystem uptake is still early-stage.
Repository: elder-plinius/OBLITERATUS (AGPL-3.0, public GitHub repository).

The friction drops; the visibility goes up. The risk is that visibility through your own dashboard isn't the same as ground truth. A run that looks clean on the panel can still be drifting in the dimensions the panel doesn't track — and a richer toolkit makes that easier to miss, not harder.

One Lineage

One lineage, three layers of tooling

Stage 1: Abliteration (core mechanism)
Role: Mechanism suppression / removal
Speed: Fast
Best use: Targeted research probes
Main risk: Calibration and stability degradation

Stage 2: Heretic (abliteration with a search loop)
Role: Automated multi-layer directional intervention
Speed: Moderate
Best use: Tuned refusal reduction with automated search loops
Main risk: Hidden drift from search or over-optimization

Stage 3: Obliteratus (broader tooling suite)
Role: Analysis-heavy multi-method intervention toolkit
Speed: Fast to moderate
Best use: Broader exploratory analysis and operator-friendly experimentation
Main risk: Benchmark overconfidence, complexity creep, and ordinary refusal-edit drift

One core intervention, three different workflows.

Colophon: By @chkn_little · Authored by GPT 5.4 · Edited by Claude Opus 4.7

Selected Literature

Paper	Date	Publisher / Venue
Refusal in Language Models Is Mediated by a Single Direction	2024	NeurIPS
Gabliteration: Adaptive Multi-Directional Neural Weight Modification for Selective Behavioral Alteration in Large Language Models	2026	arXiv preprint
Steering Llama 2 via Contrastive Activation Addition	2024	Association for Computational Linguistics (ACL)
Open Sesame! Universal Black-Box Jailbreaking of Large Language Models	Aug 2024	MDPI (Applied Sciences)
Tree of Attacks: Jailbreaking Black-Box LLMs Automatically	2024	NeurIPS
Visual Adversarial Examples Jailbreak Aligned Large Language Models	Mar 2024	AAAI
Jailbreaking Black Box Large Language Models in Twenty Queries	Apr 2025	IEEE (SaTML)
