# Does abliteration ruin model quality?
No. When applied correctly, abliterated models retain > 95% of baseline benchmark scores while reducing refusal rates from 30-60% to under 5%.
## Methodology
Abliteration reduces refusal behavior, but the question every team asks is: does it break the model? We measure two things, how much refusal drops and how much capability is retained, using reproducible benchmarks.
This page documents the methodology so you can evaluate abliterated models with confidence and compare results against your own baselines.
Refusal rate is the percentage of benign prompts a model refuses. Benchmark retention is the percentage of baseline benchmark scores preserved after abliteration.
| Metric | What it measures | Target |
|---|---|---|
| Refusal rate | % of benign prompts refused | < 5% after abliteration (baseline typically 30-60%) |
| Benchmark retention | % of baseline scores preserved | > 95% across core benchmarks |
| False-positive refusal rate | Benign prompts incorrectly refused | < 2% |
| Harmful compliance rate | Actually harmful prompts answered | Monitored — policy layer enforces limits |
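The two headline metrics in the table are simple ratios. A minimal sketch of how they can be computed (the function names and the numbers are illustrative, not our eval harness or measured results):

```python
def refusal_rate(refused_flags):
    """Refusal rate: share of benign prompts flagged as refused."""
    return sum(refused_flags) / len(refused_flags)

def benchmark_retention(baseline, abliterated):
    """Per-benchmark percentage of the baseline score preserved."""
    return {name: abliterated[name] / baseline[name] * 100 for name in baseline}

# Toy numbers (hypothetical):
flags = [False, True, False, False]           # 1 refusal out of 4 benign prompts
baseline = {"MMLU": 70.0, "HumanEval": 48.0}  # scores before abliteration
ablit    = {"MMLU": 68.6, "HumanEval": 47.0}  # scores after abliteration

print(refusal_rate(flags))                    # 0.25
print(benchmark_retention(baseline, ablit))   # MMLU retains 98.0%
```

Both targets in the table are checks against these ratios: refusal rate below 0.05 on the benign set, and every retention value above 95%.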
## FAQ
**Does abliteration ruin model quality?**

No. When applied correctly, abliterated models retain > 95% of baseline benchmark scores while reducing refusal rates from 30-60% to under 5%.
**Which benchmarks do you use?**

MMLU for general knowledge, HellaSwag for commonsense reasoning, TruthfulQA for factuality, and HumanEval for code generation. We also use a custom refusal eval set.
**How does abliteration fit with policy enforcement?**

Abliteration reduces blanket refusals. Policy Gateway enforces your specific rules about what should actually be refused. The combination gives you control without false positives.
**Can I reproduce these results?**

Yes. The methodology is fully documented here. Use the same eval sets against the abliteration.ai API to reproduce our results.
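A reproduction run is just a refusal check applied over an eval set. The sketch below uses a naive marker heuristic and a stub model; `query_model` is a placeholder you would wire to the abliteration.ai API yourself (the marker list, function names, and stub are assumptions for illustration, and production evals typically use a classifier rather than string matching):

```python
# Hypothetical refusal markers; real eval sets use a trained refusal classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "as an ai")

def looks_like_refusal(text: str) -> bool:
    """Naive heuristic: does the response contain a common refusal phrase?"""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def eval_refusal_rate(prompts, query_model) -> float:
    """query_model: your callable that sends a prompt to the model under test."""
    refusals = sum(looks_like_refusal(query_model(p)) for p in prompts)
    return refusals / len(prompts)

# Stub standing in for a real API call, so the sketch runs offline:
def fake_model(prompt: str) -> str:
    return "I'm sorry, I can't help with that." if "refuse" in prompt else "Here you go."

print(eval_refusal_rate(["please refuse this", "tell me a joke"], fake_model))  # 0.5
```

Run the same loop once against the baseline model and once against the abliterated model; the before/after refusal rates are the numbers to compare against the targets above.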