Glossary · Updated 2026-01-24

Refusal vector

Definition of refusal vectors in LLMs and how they power abliteration.

A refusal vector is the hidden-state direction most associated with a model refusing requests.

Abliteration estimates this direction and removes its influence at selected layers.

Definition

Refusal vector

A refusal vector is a direction in a model's hidden-state space that correlates strongly with refusal behavior. The size of an activation's projection onto this direction serves as a predictor of whether the model will refuse: larger projections indicate a more likely refusal.
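To make the projection concrete, here is a minimal NumPy sketch. The function name `refusal_score` and the two-dimensional toy vectors are illustrative assumptions; real hidden states have thousands of dimensions and come from the model itself.

```python
import numpy as np

def refusal_score(h: np.ndarray, r: np.ndarray) -> float:
    """Scalar projection of hidden state h onto the unit refusal direction."""
    r_hat = r / np.linalg.norm(r)   # normalize so the score does not depend on |r|
    return float(np.dot(h, r_hat))

# Toy values (assumptions for illustration only).
r = np.array([3.0, 4.0])            # refusal vector with norm 5
h_along = np.array([6.0, 8.0])      # hidden state pointing along r
h_orthogonal = np.array([-4.0, 3.0])  # hidden state orthogonal to r

score_along = refusal_score(h_along, r)            # roughly 10, the length of h_along
score_orthogonal = refusal_score(h_orthogonal, r)  # roughly 0
print(score_along, score_orthogonal)
```

A state aligned with the refusal direction gets a large score and an orthogonal state gets a score near zero. Any threshold that turns the score into a refuse or comply prediction would have to be chosen empirically.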

Why it matters
  • Provides a measurable handle on refusal behavior without full retraining.
  • Enables targeted edits that preserve general reasoning and language ability.
  • Pairs with separate policy layers, so that your own rules, rather than blanket refusals, decide which requests get a compliant response.
How it works
  1. Collect refusal and non-refusal examples and record hidden states.
  2. Compute the mean difference between refusal and non-refusal representations.
  3. Normalize the difference to get a refusal direction.
  4. Subtract the projection at chosen layers to dampen refusals.
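Steps 2 and 3 above can be sketched in a few lines of NumPy. This is a hedged illustration, not a reference implementation: the helper name `refusal_direction` is hypothetical, and the random arrays are a synthetic stand-in for hidden states that would in practice be recorded from the model at a chosen layer.

```python
import numpy as np

def refusal_direction(refusal_states: np.ndarray, other_states: np.ndarray) -> np.ndarray:
    """Return the unit-length difference of means between the two sets.

    Both inputs have shape (num_examples, hidden_dim).
    """
    diff = refusal_states.mean(axis=0) - other_states.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("The two means coincide, so no direction is defined.")
    return diff / norm

# Synthetic stand-in for recorded hidden states (assumption: real states
# would come from running the model on refusal and non-refusal prompts).
rng = np.random.default_rng(0)
hidden_dim = 16
true_dir = np.zeros(hidden_dim)
true_dir[0] = 1.0                    # planted "refusal" axis for the toy data
other_states = rng.normal(size=(200, hidden_dim))
refusal_states = rng.normal(size=(200, hidden_dim)) + 3.0 * true_dir

r_hat = refusal_direction(refusal_states, other_states)
alignment = float(r_hat @ true_dir)  # cosine similarity with the planted axis
print(np.linalg.norm(r_hat), alignment)
```

On this toy data the estimated direction has unit norm and aligns closely with the planted axis. With real activations there is no ground-truth axis to compare against, so the direction is usually judged by how well projections onto it separate the two example sets.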
Vector removal (pseudocode)
# h = hidden state, r = refusal vector (same dimension as h)
r_hat = r / norm(r)                  # unit-length refusal direction
h_ablit = h - dot(h, r_hat) * r_hat  # subtract the component of h along r_hat
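A runnable version of the same projection removal, written with NumPy and extended to a batch of hidden states, might look like the sketch below. The function name `remove_direction` and the random inputs are assumptions for illustration; hooking the edit into specific model layers depends on the framework and is not shown.

```python
import numpy as np

def remove_direction(h: np.ndarray, r: np.ndarray) -> np.ndarray:
    """Remove the component of h along r.

    h has shape (..., hidden_dim); r has shape (hidden_dim,).
    """
    r_hat = r / np.linalg.norm(r)
    coeff = h @ r_hat                          # projection coefficients, shape (...)
    return h - np.expand_dims(coeff, -1) * r_hat

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))                    # toy batch of 4 hidden states
r = rng.normal(size=8)                         # toy refusal vector

h_ablit = remove_direction(h, r)
residual = h_ablit @ (r / np.linalg.norm(r))   # what is left along r after removal
print(np.allclose(residual, 0.0))              # True
```

After the edit, every state has zero component along the refusal direction up to floating-point error, and applying the function a second time leaves the result unchanged.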
FAQ

Is a refusal vector a model weight?

No. It is a direction in activation space derived from hidden states, not a new set of learned weights.

Does removing a refusal vector break the model?

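Not when applied carefully. The edit removes only the component of each hidden state along the refusal direction at the selected layers, so it targets refusal behavior without broadly degrading capability.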

Can I combine this with policy filters?

Yes. Abliteration reduces blanket refusals, while policy layers enforce your specific rules.