Glossary

Refusal vector

A refusal vector is the hidden-state direction most associated with a model refusing requests.

Abliteration estimates this direction and removes its influence at selected layers.

Definition of a refusal vector

A refusal vector is a direction in a model's hidden-state space that correlates strongly with refusal behavior. The size of an activation's projection onto this direction can be used to predict whether the model will refuse.
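
The projection described above can be sketched in a few lines. This is a minimal illustration in plain Python, assuming hidden states are lists of floats; the helper names `unit` and `refusal_score` and the toy values are hypothetical, and a real model would use an array library and much larger vectors.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / n for x in v]

def refusal_score(h, r):
    """Scalar projection of hidden state h onto the refusal direction r."""
    r_hat = unit(r)
    return sum(a * b for a, b in zip(h, r_hat))

# Toy 3-dimensional example (hypothetical values).
r = [2.0, 0.0, 0.0]
print(refusal_score([3.0, 1.0, -1.0], r))  # 3.0
print(refusal_score([-0.5, 4.0, 2.0], r))  # -0.5
```

A larger score means the hidden state lies further along the refusal direction. Any cutoff for "will refuse" would have to be calibrated on labeled examples.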

Why the refusal vector matters

Provides a measurable handle on refusal behavior without full retraining.
Enables targeted edits that preserve general reasoning and language ability.
Pairs with policy layers to replace blanket refusals with compliant responses.

How it works

Collect prompts the model refuses and prompts it answers, and record the hidden states for each group.
Compute the difference between the mean refusal hidden state and the mean non-refusal hidden state.
Normalize the difference to unit length to get a refusal direction.
Subtract each hidden state's projection onto that direction at the chosen layers to dampen refusals.
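
The estimation steps above (collect, take the difference of means, normalize) might look like the following sketch. It assumes the hidden states have already been recorded as equal-length lists of floats; `mean_vector`, `refusal_direction`, and the toy data are hypothetical names and values chosen for illustration, not part of any library.

```python
import math

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(refusal_states, other_states):
    """Difference of the two group means, normalized to unit length."""
    mu_refuse = mean_vector(refusal_states)
    mu_other = mean_vector(other_states)
    diff = [a - b for a, b in zip(mu_refuse, mu_other)]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("group means coincide; no direction to estimate")
    return [x / norm for x in diff]

# Toy recorded hidden states, one row per example (hypothetical values).
refusal_states = [[4.0, 1.0], [6.0, 3.0]]  # mean is [5.0, 2.0]
other_states = [[1.0, 1.0], [3.0, 3.0]]    # mean is [2.0, 2.0]

r_hat = refusal_direction(refusal_states, other_states)
print(r_hat)  # [1.0, 0.0]
```

In this toy data the two groups differ only along the first coordinate, so the estimated direction points along that axis.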

Vector removal (pseudocode)

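# h = hidden state at one layer, r = estimated refusal vector (same dimension)
r_hat = r / norm(r)                  # unit-length refusal direction
h_ablit = h - dot(h, r_hat) * r_hat  # h with its component along r_hat removed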
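
A runnable version of the pseudocode, extended to apply the removal only at selected layers, could be sketched as follows. The names `ablate` and `ablate_layers`, the toy data, and the use of one shared refusal vector for every layer are all assumptions made for illustration; in a real model the edit would be applied to the model's own activations.

```python
import math

def ablate(h, r):
    """Remove from h its component along the refusal vector r."""
    r_norm = math.sqrt(sum(x * x for x in r))
    r_hat = [x / r_norm for x in r]
    coeff = sum(a * b for a, b in zip(h, r_hat))  # dot(h, r_hat)
    return [a - coeff * b for a, b in zip(h, r_hat)]

def ablate_layers(hidden_by_layer, r, layers):
    """Ablate only the layer indices in `layers`; copy the rest unchanged."""
    return [
        ablate(h, r) if i in layers else list(h)
        for i, h in enumerate(hidden_by_layer)
    ]

# One toy hidden state per layer (hypothetical values).
hidden_by_layer = [[1.0, 2.0, 2.0], [3.0, 0.0, 4.0], [0.0, 5.0, 1.0]]
r = [0.0, 0.0, 2.0]  # assumption: the same direction is used at every layer

edited = ablate_layers(hidden_by_layer, r, layers={1, 2})
print(edited)  # [[1.0, 2.0, 2.0], [3.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
```

After the edit, the selected layers have zero component along the refusal direction, while layer 0 is left as it was.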

FAQ

Is a refusal vector a model weight?

No. It is a direction in activation space derived from hidden states, not a new set of learned weights.

Does removing a refusal vector break the model?

Not necessarily. Applied carefully and only at selected layers, the edit targets refusal behavior without broadly degrading general capability.

Can I combine this with policy filters?

Yes. Abliteration reduces blanket refusals, while policy layers enforce your specific rules.