Glossary · Updated 2026-01-24
Orthogonalization
How orthogonalization removes unwanted directions in activation space.
Orthogonalization removes a vector component by subtracting its projection.
Abliteration uses this to remove the refusal direction from hidden states.
Definition
Orthogonalization
Orthogonalization is the process of making one vector orthogonal to another by subtracting its projection onto that vector. In activation editing, it removes a behavior direction from a model's hidden states.
Why it matters
- Provides a simple, transparent way to remove a behavior direction.
- Keeps most of the representation intact while removing a single component.
- Makes behavior edits easy to test and reverse.
How it works
1. Normalize the direction vector to unit length.
2. Compute the projection of the hidden state onto that vector.
3. Subtract the projection to remove the component.
Orthogonalization step
h_orth = h - (h · v_hat) v_hat
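A minimal NumPy sketch of this step; the function name orthogonalize and the example vectors are illustrative, not taken from any particular library.

```python
import numpy as np

def orthogonalize(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove the component of h that lies along direction v.

    Implements h_orth = h - (h . v_hat) v_hat, where v_hat is v
    normalized to unit length.
    """
    v_hat = v / np.linalg.norm(v)           # step 1: unit-length direction
    projection = np.dot(h, v_hat) * v_hat   # step 2: component of h along v_hat
    return h - projection                   # step 3: subtract that component

# Tiny example: the result has no component left along v.
h = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])
h_orth = orthogonalize(h, v)
print(h_orth)             # [0. 4.]
print(np.dot(h_orth, v))  # 0.0, i.e. orthogonal to v
```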
FAQ
Frequently asked questions.
Does orthogonalization change model weights?
No. It is applied to activations at inference time, not to weights.
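A hedged PyTorch sketch of that inference-time application using a forward hook; model.layers[10] and refusal_dir are placeholders, and it assumes the hooked module returns a plain tensor (real transformer blocks often return tuples).

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that removes `direction` from a module's output."""
    v_hat = direction / direction.norm()

    def hook(module, inputs, output):
        # output: hidden states of shape (..., hidden_dim)
        proj = (output @ v_hat).unsqueeze(-1) * v_hat
        return output - proj  # returned value replaces the module's output

    return hook

# Hypothetical usage: attach to one layer, run generation, then undo.
# handle = model.layers[10].register_forward_hook(make_ablation_hook(refusal_dir))
# ...generate...
# handle.remove()  # reverses the edit; the weights were never changed
```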
Why is it used for abliteration?
It cleanly removes the refusal component while leaving other information intact.
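As a hedged illustration of where such a refusal direction can come from, abliteration-style work often estimates it as the difference of mean activations between refusal-inducing and harmless prompts collected at the same layer and token position; the names and shapes below are assumptions, not a reference implementation.

```python
import torch

def difference_of_means(refusal_acts: torch.Tensor,
                        harmless_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a behavior direction from two sets of activations.

    refusal_acts, harmless_acts: (num_prompts, hidden_dim) activations
    gathered at the same layer and token position for each prompt set.
    """
    direction = refusal_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # unit-length direction for orthogonalization
```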