Is a refusal vector a model weight?
No. It is a direction in activation space derived from hidden states, not a new set of learned weights.
Glossary
A refusal vector is the hidden-state direction most associated with a model refusing requests.
Abliteration estimates this direction and removes its influence at selected layers.
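How the direction is estimated is not specified here. One common approach, assumed for this sketch, is a difference of means: average the hidden states collected at one layer for prompts the model refused, subtract the average for prompts it answered, and normalize. The function name `estimate_refusal_direction` and the toy data are hypothetical; real hidden states would be captured from the model at the selected layer.

```python
import numpy as np

def estimate_refusal_direction(refused_states, complied_states):
    """Unit-length difference of mean hidden states (assumed estimator).

    Both inputs have shape (num_prompts, hidden_dim) and come from the same layer.
    """
    refused = np.asarray(refused_states, dtype=float)
    complied = np.asarray(complied_states, dtype=float)
    diff = refused.mean(axis=0) - complied.mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("both sets have the same mean; no direction to estimate")
    return diff / norm

# Toy stand-in for real activations: the "refused" states are shifted along axis 0.
rng = np.random.default_rng(0)
true_dir = np.array([1.0, 0.0, 0.0, 0.0])
complied = rng.normal(size=(200, 4))
refused = rng.normal(size=(200, 4)) + 3.0 * true_dir

r_hat = estimate_refusal_direction(refused, complied)
```

On this toy data the estimate points almost entirely along the axis that was shifted, which is the behavior the estimator is meant to have.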
A refusal vector is a direction in a model's hidden-state space that correlates strongly with refusal behavior. Projecting activations onto it predicts when the model will refuse.
# h = hidden state, r = refusal vector
r_hat = r / norm(r)
h_ablit = h - dot(h, r_hat) * r_hat
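The projection above can be written as a small runnable NumPy sketch. The helper names `refusal_score` and `ablate` are illustrative, not part of any library, and the vectors are made-up two-dimensional values chosen so the arithmetic is easy to check by hand.

```python
import numpy as np

def refusal_score(h, r):
    """Scalar projection of hidden state(s) h onto the unit refusal direction."""
    r_hat = r / np.linalg.norm(r)
    return h @ r_hat

def ablate(h, r):
    """Remove the component of h along r: h - dot(h, r_hat) * r_hat.

    Works for a single vector of shape (d,) or a batch of shape (n, d).
    """
    r_hat = r / np.linalg.norm(r)
    coeff = np.asarray(h @ r_hat)[..., None]  # shape (1,) or (n, 1)
    return h - coeff * r_hat

r = np.array([2.0, 0.0])            # unnormalized refusal vector (toy value)
h = np.array([[3.0, 4.0],
              [1.0, -2.0]])         # two toy hidden states

scores = refusal_score(h, r)        # projections before ablation
h_ablit = ablate(h, r)              # component along r removed
```

After ablation, the projection of each state onto the refusal direction is zero while the orthogonal components are unchanged. Applying `ablate` a second time leaves the result as it is.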
FAQ
When applied carefully, abliteration targets refusal behavior without broadly degrading capability.
Yes. Abliteration reduces blanket refusals, while policy layers enforce your specific rules.