Glossary · Updated 2026-01-24
Refusal vector ablation
How refusal vector ablation removes refusal behavior while preserving core model capability.
Refusal vector ablation removes the refusal direction from hidden states.
It is the core operation behind abliteration.
Definition
Refusal vector ablation
Refusal vector ablation is the process of subtracting a learned refusal direction from a model's hidden states to reduce refusals without retraining the entire model.
Why it matters
- Targets a narrow behavior change instead of rewriting the full model.
- Keeps the original model weights intact for safety and reversibility.
- Lets teams tune refusal behavior with transparent, testable edits.
How it works
1. Learn a refusal direction from hidden-state examples.
2. Choose the layers at which to apply the ablation.
3. Subtract the projection onto the refusal vector at those layers.
4. Validate with capability benchmarks and refusal-rate checks.
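Step 1 is commonly done by taking the difference of mean activations between refusal-inducing and harmless prompts at one layer, then normalizing. A minimal NumPy sketch, where `harmful_acts` and `harmless_acts` are hypothetical stand-ins for hidden states collected from a real model:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # toy hidden size

# Hypothetical hidden-state samples collected at one layer.
harmful_acts = rng.normal(size=(32, d_model)) + 1.0   # prompts that trigger refusals
harmless_acts = rng.normal(size=(32, d_model))        # benign prompts

# Difference of means gives a candidate refusal direction; normalize to unit length.
diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
r_hat = diff / np.linalg.norm(diff)
```

This difference-of-means recipe is one common choice, not the only way to fit the direction.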
Ablation formula
h_ablated = h - (h · r_hat) r_hat

where h is a hidden-state vector and r_hat is the unit-norm refusal direction.
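The formula is a plain projection subtraction and can be sketched directly in NumPy; the vectors below are toy values, not activations from any particular model:

```python
import numpy as np

def ablate(h, r_hat):
    """Remove the component of h along the unit refusal direction r_hat."""
    return h - (h @ r_hat) * r_hat

r = np.array([3.0, 4.0])
r_hat = r / np.linalg.norm(r)   # unit refusal direction
h = np.array([2.0, 1.0])        # toy hidden state

h_ablated = ablate(h, r_hat)
# After ablation, h_ablated has no component along r_hat,
# while its component orthogonal to r_hat is unchanged.
```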
FAQ
Frequently asked questions.
Is refusal vector ablation the same as fine-tuning?
No. It is a deterministic edit to activations, not a gradient-based weight update.
Can I reverse the ablation?
Yes. Because the edit is applied at inference time, you can remove it or adjust its strength.
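Because the edit is just a projection subtraction, a strength coefficient makes it tunable or fully reversible; a sketch where `alpha` is a hypothetical knob (1.0 = full ablation, 0.0 = original behavior):

```python
import numpy as np

def ablate_scaled(h, r_hat, alpha=1.0):
    """Subtract alpha times the projection of h onto the unit direction r_hat.

    alpha=1.0 removes the refusal component entirely;
    alpha=0.0 leaves the hidden state untouched.
    """
    return h - alpha * (h @ r_hat) * r_hat

r_hat = np.array([1.0, 0.0])   # toy unit refusal direction
h = np.array([2.0, 3.0])

full = ablate_scaled(h, r_hat, alpha=1.0)   # refusal component removed
half = ablate_scaled(h, r_hat, alpha=0.5)   # refusal component halved
none = ablate_scaled(h, r_hat, alpha=0.0)   # identical to h
```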