Glossary · Updated 2026-01-24
Refusal vector ablation
How refusal vector ablation removes refusal behavior while preserving core model capability.
Refusal vector ablation removes the refusal direction from hidden states.
It is the core operation behind abliteration.
Definition
Refusal vector ablation
Refusal vector ablation is the process of subtracting a learned refusal direction from a model's hidden states to reduce refusals without retraining the entire model.
Why it matters
- Targets a narrow behavior change instead of rewriting the full model.
- Keeps the original model weights intact for safety and reversibility.
- Lets teams tune refusal behavior with transparent, testable edits.
How it works
1. Learn a refusal direction from hidden-state examples.
2. Choose the layers at which to apply the ablation.
3. Subtract the projection onto the refusal vector at those layers.
4. Validate with capability benchmarks and refusal-rate checks.
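Step 1 is commonly done by taking the difference of mean activations between refusal-inducing and harmless prompts at one layer, then normalizing. A minimal NumPy sketch, where `harmful_acts` and `harmless_acts` are hypothetical stand-ins for hidden states collected from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy hidden size

# Hypothetical hidden-state samples collected at one layer.
harmful_acts = rng.normal(size=(32, d_model)) + 1.0   # prompts that trigger refusals
harmless_acts = rng.normal(size=(32, d_model))        # benign prompts

# Difference of means gives a candidate refusal direction; normalize to unit length.
diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r_hat = diff / np.linalg.norm(diff)
```

This difference-of-means recipe is one common choice, not the only way to fit the direction.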
Ablation formula
h_ablated = h - (h · r_hat) r_hat

where h is a hidden-state vector and r_hat is the unit-norm refusal direction.
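The formula is a plain projection subtraction and can be sketched directly in NumPy; the vectors below are toy values, not activations from any particular model:

```python
import numpy as np

def ablate(h, r_hat):
    """Remove the component of h along the unit refusal direction r_hat."""
    return h - (h @ r_hat) * r_hat

r = np.array([3.0, 4.0])
r_hat = r / np.linalg.norm(r)   # unit refusal direction
h = np.array([2.0, 1.0])        # toy hidden state

h_ablated = ablate(h, r_hat)
# After ablation, h_ablated has no component along r_hat,
# while its component orthogonal to r_hat is unchanged.
```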
FAQ
Frequently asked questions.
Is refusal vector ablation the same as fine-tuning?
No. It is a deterministic edit to activations, not a gradient-based weight update.
Can I reverse the ablation?
Yes. Because the edit is applied at inference time, you can remove it or adjust its strength.
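Because the edit is just a projection subtraction, a strength coefficient makes it tunable or fully reversible; a sketch where `alpha` is a hypothetical knob (1.0 = full ablation, 0.0 = original behavior):

```python
import numpy as np

def ablate_scaled(h, r_hat, alpha=1.0):
    """Subtract alpha times the projection of h onto the unit direction r_hat.

    alpha=1.0 removes the refusal component entirely;
    alpha=0.0 leaves the hidden state untouched.
    """
    return h - alpha * (h @ r_hat) * r_hat

r_hat = np.array([1.0, 0.0])   # toy unit refusal direction
h = np.array([2.0, 3.0])

full = ablate_scaled(h, r_hat, alpha=1.0)   # refusal component removed
half = ablate_scaled(h, r_hat, alpha=0.5)   # refusal component halved
none = ablate_scaled(h, r_hat, alpha=0.0)   # identical to h
```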