Ablation, Heretic, and Obliteratus are closely related. The real differences come down to how much tuning, tooling, and workflow each one adds.
Ablation is the core move. Heretic and Obliteratus extend it in different directions.
All three rest on the same premise: refusal behavior lives in identifiable directions inside the model, and editing those directions shifts behavior quickly.
From there it turns into tooling. Heretic adds search. Obliteratus adds a larger workbench.
The Short Version
If you want the core idea, start with ablation: identify the refusal direction and suppress it.
If you want that process automated and tuned, you are moving toward Heretic.
If you want a bigger all-in-one platform with benchmarks, UI, steering, and analysis modules, you are moving toward Obliteratus.
Where Each Method Acts
Some safety behavior lives in the model and some lives in the serving stack around it.
Intervention surface
All three cluster around model-internal behavior. The serving stack mostly stays outside the cut.
Legend: direct edit, indirect pressure, separate control plane, no direct effect.

Inner model layers

Base model priors (default capability and latent risk behavior)
What it does: Underlying behavior the later layers sit on top of.
Ablation: Indirect. Broad edits can perturb capability pathways.
Heretic: Indirect. Search pressure can bend nearby behavior too.
Obliteratus: Direct. Can probe and edit this layer through broader tooling.

Instruction tuning (general compliance and assistant persona constraints)
What it does: Shapes how the model behaves as an assistant.
Ablation: Direct. Can bypass refusal-linked pathways without retraining.
Heretic: Direct. Reweights this behavior through optimized interventions.
Obliteratus: Direct. Targets the same layer via extraction, steering, or projection.

Preference tuning (refusal style, caution level, and policy stance)
What it does: Controls how refusals get expressed downstream.
Ablation: Direct. Targets the downstream expression of these behaviors.
Heretic: Direct. Search-weighted edits work on the same layer.
Obliteratus: Direct. Measures and edits refusal expression through multiple variants.

Serving and moderation layers

System prompt / serving policy (policy framing, guard phrases, and response style)
What it does: The product layer sitting above the model itself.
Ablation: Separate. Usually untouched unless the serving stack changes too.
Heretic: Separate. Usually untouched; a different control plane.
Obliteratus: Separate. Can sit beside this layer, but does not erase it automatically.

External guardrails (moderation APIs, filters, gateways, blocking, and routing)
What it does: Hard outer filters applied before or after generation.
Ablation: No direct effect. Still active unless the product stack removes them.
Heretic: No direct effect. Optimization inside the model does not switch them off.
Obliteratus: No direct effect. Tooling breadth does not bypass outer moderation by itself.

All three mostly act on inner model layers, while the outer serving stack remains separate.
Where Ablation, Heretic, and Obliteratus mostly act.
What Came Before Ablation
People were already changing behavior before ablation. Those older methods were cheaper, easier to reverse, or clearer about the layer they touched.
Intervention depth (from outer layers to deeper intervention)

Wrapper / interface layer: System / wrapper-layer bypass. Operationally simple and reversible. Often mistaken for real model change.
Prompt layer: Prompt-space jailbreaks. Fastest way to probe the refusal boundary. Brittle, model-specific, easy to break with updates.
Decoding layer: Decoding-time steering. Reversible and safe to test. Often weaker than weight edits.
Representation layer: Representation steering. Great for interpretability and controlled experiments. Operational complexity rises fast.
Checkpoint composition layer: Model merging / blending. Can preserve fluency well. Harder to attribute what caused what.
Checkpoint tuning layer: Fine-tune de-alignment. More persistent than prompt tricks. More compute and more drift risk.

What Came Before Ablation as an intervention-depth map.
Ablation stands out because it edits refusal more directly than the methods around it.
Ablation
In the current LLM context, ablation means: find the internal direction most associated with refusal, then suppress or project it out. That is why the method spread so quickly in 2024. It treats refusal as a geometric feature you can edit.
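To make the geometric framing concrete, here is a minimal sketch of the two steps named above: estimate a direction as the difference between mean activations, then project it out. It is a toy under stated assumptions, not any tool's implementation. Plain Python lists of synthetic numbers stand in for real model activations, and every name (`refusal_direction`, `ablate`, `DIM`) is hypothetical.

```python
import math
import random


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit difference-of-means direction between two activation sets."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return unit([a - b for a, b in zip(mu_harmful, mu_harmless)])


def ablate(act, direction):
    """Project out the component of one activation along a unit direction."""
    coeff = sum(a * d for a, d in zip(act, direction))
    return [a - coeff * d for a, d in zip(act, direction)]


# Synthetic stand-in data (assumption): the "harmful" set is shifted
# along axis 0, so the estimated direction should point roughly there.
random.seed(0)
DIM = 8
harmless = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(200)]
harmful = [
    [random.gauss(0, 1) + (4.0 if i == 0 else 0.0) for i in range(DIM)]
    for _ in range(200)
]

r_hat = refusal_direction(harmful, harmless)
ablated = [ablate(a, r_hat) for a in harmful]
```

After the projection, every vector in `ablated` has no component left along `r_hat`, while its components orthogonal to that direction are unchanged. That is the sense in which the edit is targeted rather than a general retraining.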
Reference profile: Ablation (2024)
Mechanism: Find a refusal-linked direction from the difference between activations on harmful and harmless prompts, then suppress it at inference time or via weight-space orthogonalization.
Origin: Ablation is a long-standing causal-analysis technique; refusal-direction ablation was popularized in this niche by Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda.
Appeared: Public previews in spring 2024; the arXiv paper followed in June 2024.
Maintainers: Interpretability researchers plus open-model safety and tooling maintainers.
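The weight-space variant mentioned under Mechanism can be illustrated with a small linear-algebra sketch. It assumes a layer computes y = W x with one row of W per output dimension, and replaces W with (I - r rᵀ) W for a unit vector r, so the edited layer can no longer write anything along r. The matrix, the direction, and all function names here are hypothetical toy values, not taken from any real model or tool.

```python
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matvec(weight, x):
    """y = W x, with one row of `weight` per output dimension."""
    return [dot(row, x) for row in weight]


def orthogonalize(weight, direction):
    """Return (I - r r^T) W for a unit vector r over the output dimension."""
    n_out, n_in = len(weight), len(weight[0])
    # r^T W: how strongly each input column writes along r.
    along_r = [
        sum(direction[i] * weight[i][j] for i in range(n_out))
        for j in range(n_in)
    ]
    return [
        [weight[i][j] - direction[i] * along_r[j] for j in range(n_in)]
        for i in range(n_out)
    ]


# Toy values (assumption): a random 6x4 matrix and a fixed unit direction.
random.seed(0)
N_OUT, N_IN = 6, 4
W = [[random.gauss(0, 1) for _ in range(N_IN)] for _ in range(N_OUT)]
r = [0.6, 0.8, 0.0, 0.0, 0.0, 0.0]

W_edit = orthogonalize(W, r)
x = [random.gauss(0, 1) for _ in range(N_IN)]
y_before = matvec(W, x)
y_after = matvec(W_edit, x)
```

For any input `x`, `y_after` equals `y_before` with its component along `r` removed. The inference-time and weight-space forms therefore produce the same output for this layer; the difference is that the weight edit is baked into the checkpoint instead of being applied on each forward pass.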