Glossary · Updated 2026-01-24
Orthogonalization
How orthogonalization removes unwanted directions in activation space.
Orthogonalization removes a vector component by subtracting its projection.
Abliteration uses this to remove the refusal direction from hidden states.
Definition
Orthogonalization
Orthogonalization is the process of making one vector orthogonal to another by subtracting its projection onto that vector. In activation editing, it removes a behavior direction from a model's hidden states.
Why it matters
- Provides a simple, transparent way to remove a behavior direction.
- Keeps most of the representation intact while removing a single component.
- Makes behavior edits easy to test and reverse.
How it works
1. Normalize the direction vector to unit length.
2. Compute the projection of the hidden state onto that vector.
3. Subtract the projection to remove the component.
Orthogonalization step
h_orth = h - (h · v_hat) v_hat
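A minimal NumPy sketch of this step; the function name orthogonalize and the example vectors are illustrative, not taken from any particular library.

```python
import numpy as np

def orthogonalize(h: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Remove the component of h that lies along direction v.

    Implements h_orth = h - (h . v_hat) v_hat, where v_hat is v
    normalized to unit length.
    """
    v_hat = v / np.linalg.norm(v)           # step 1: unit-length direction
    projection = np.dot(h, v_hat) * v_hat   # step 2: component of h along v_hat
    return h - projection                   # step 3: subtract that component

# Tiny example: the result has no component left along v.
h = np.array([3.0, 4.0])
v = np.array([1.0, 0.0])
h_orth = orthogonalize(h, v)
print(h_orth)             # [0. 4.]
print(np.dot(h_orth, v))  # 0.0, i.e. orthogonal to v
```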
FAQ
Frequently asked questions.
Does orthogonalization change model weights?
No. It is applied to activations at inference time, not to weights.
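A hedged PyTorch sketch of that inference-time application using a forward hook; model.layers[10] and refusal_dir are placeholders, and it assumes the hooked module returns a plain tensor (real transformer blocks often return tuples).

```python
import torch

def make_ablation_hook(direction: torch.Tensor):
    """Return a forward hook that removes `direction` from a module's output."""
    v_hat = direction / direction.norm()

    def hook(module, inputs, output):
        # output: hidden states of shape (..., hidden_dim)
        proj = (output @ v_hat).unsqueeze(-1) * v_hat
        return output - proj  # returned value replaces the module's output

    return hook

# Hypothetical usage: attach to one layer, run generation, then undo.
# handle = model.layers[10].register_forward_hook(make_ablation_hook(refusal_dir))
# ...generate...
# handle.remove()  # reverses the edit; the weights were never changed
```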
Why is it used for abliteration?
It cleanly removes the refusal component while leaving other information intact.
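As a hedged illustration of where such a refusal direction can come from, abliteration-style work often estimates it as the difference of mean activations between refusal-inducing and harmless prompts collected at the same layer and token position; the names and shapes below are assumptions, not a reference implementation.

```python
import torch

def difference_of_means(refusal_acts: torch.Tensor,
                        harmless_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a behavior direction from two sets of activations.

    refusal_acts, harmless_acts: (num_prompts, hidden_dim) activations
    gathered at the same layer and token position for each prompt set.
    """
    direction = refusal_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()  # unit-length direction for orthogonalization
```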