Glossary
Updated 2026-01-24
Refusal vector
Definition of refusal vectors in LLMs and how they power abliteration.
A refusal vector is the hidden-state direction most associated with a model refusing requests.
Abliteration estimates this direction and removes its influence at selected layers.
Definition
A refusal vector is a direction in a model's hidden-state space that correlates strongly with refusal behavior. Projecting activations onto it predicts when the model will refuse.
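As a rough illustration of using the projection as a predictor, the sketch below computes a scalar score per hidden state. The function name `refusal_score`, the toy vectors, and the `threshold` value are assumptions made for this example; a real cutoff would have to be tuned on labelled data.

```python
import numpy as np

def refusal_score(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Scalar projection of each row of h (shape n x d) onto the unit refusal direction."""
    r_hat = r / np.linalg.norm(r)
    return h @ r_hat

# Toy 4-dimensional values, made up for illustration only.
r = np.array([2.0, 0.0, 0.0, 0.0])
h = np.array([
    [3.0, 1.0, 0.0, 0.0],    # large component along r
    [-1.0, 2.0, 0.5, 0.0],   # negative component along r
])

scores = refusal_score(h, r)
threshold = 1.0  # hypothetical cutoff, not a recommended value
predicted_refusal = scores > threshold
```

A higher score means the hidden state points more strongly along the refusal direction; the threshold only turns that score into a yes/no guess.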
Why it matters
- Provides a measurable handle on refusal behavior without full retraining.
- Enables targeted edits that preserve general reasoning and language ability.
- Pairs with policy layers to replace blanket refusals with compliant responses.
How it works
- 01 Collect refusal and non-refusal examples and record the hidden states for each.
- 02 Compute the difference between the mean refusal representation and the mean non-refusal representation.
- 03 Normalize the difference to unit length to get a refusal direction.
- 04 Subtract each hidden state's projection onto that direction at chosen layers to dampen refusals.
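The first three steps above can be sketched with NumPy as follows. This is a minimal illustration on synthetic data, not a full pipeline: the function name `estimate_refusal_direction` is hypothetical, and the random arrays stand in for hidden states that would normally be recorded from a model at one layer.

```python
import numpy as np

def estimate_refusal_direction(refusal_states: np.ndarray,
                               other_states: np.ndarray) -> np.ndarray:
    """Unit vector along the difference of class means (inputs have shape n x d)."""
    mean_diff = refusal_states.mean(axis=0) - other_states.mean(axis=0)
    norm = np.linalg.norm(mean_diff)
    if norm == 0.0:
        raise ValueError("The two sets have identical means; no direction to estimate.")
    return mean_diff / norm

# Synthetic stand-in for recorded hidden states (assumption for this example):
# refusal examples are shifted along the first axis.
rng = np.random.default_rng(0)
d = 8
true_dir = np.zeros(d)
true_dir[0] = 1.0
other = rng.normal(size=(200, d))
refusal = rng.normal(size=(200, d)) + 4.0 * true_dir

r_hat = estimate_refusal_direction(refusal, other)
```

On this toy data the estimated `r_hat` should point close to `true_dir`, since the shift along the first axis is the only systematic difference between the two sets.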
Vector removal (pseudocode)
# h = hidden state, r = refusal vector
r_hat = r / norm(r)
h_ablit = h - dot(h, r_hat) * r_hat
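A runnable version of this pseudocode might look like the sketch below. It assumes NumPy and a batch of hidden states with shape (n, d); the function name `ablate_direction` and the toy values are illustrative, not part of any particular library.

```python
import numpy as np

def ablate_direction(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove from each row of h (shape n x d) its component along r."""
    r_hat = r / np.linalg.norm(r)
    # h @ r_hat gives one projection coefficient per row;
    # the outer product rebuilds the component to subtract.
    return h - np.outer(h @ r_hat, r_hat)

# Toy values, made up for illustration only.
h = np.array([
    [3.0, 1.0, 0.0],
    [-1.0, 2.0, 0.5],
])
r = np.array([2.0, 0.0, 0.0])

h_ablit = ablate_direction(h, r)
```

After the edit, every row of `h_ablit` has zero projection onto `r`, while the components orthogonal to `r` are left as they were.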
FAQ
Frequently asked questions.
Is a refusal vector a model weight?
No. It is a direction in activation space derived from hidden states, not a new set of learned weights.
Does removing a refusal vector break the model?
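Not necessarily. The edit removes only the component of each hidden state along one direction and leaves the orthogonal components unchanged, so when applied carefully at selected layers it targets refusal behavior without broadly degrading capability.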
Can I combine this with policy filters?
Yes. Abliteration reduces blanket refusals, while policy layers enforce your specific rules.