MethodologyUpdated 2026-04-14

How we measure refusal rate and benchmark retention for abliterated models

Methodology for measuring refusal rate reduction and benchmark retention after abliteration. Covers eval datasets, scoring, and quality gates.

Abliteration reduces refusal behavior, but the question every team asks is: does it break the model? We measure two things — how much refusal drops and how much capability is retained — using reproducible benchmarks.

This page documents the methodology so you can evaluate abliterated models with confidence and compare results against your own baselines.

Definition

How we measure refusal rate and benchmark retention for abliterated models

Refusal rate is the percentage of benign prompts a model refuses. Benchmark retention is the percentage of baseline benchmark scores preserved after abliteration.

Why it matters
  • Without measurement, abliteration is a black box. Quantified metrics make the trade-off auditable.
  • Teams need to prove to stakeholders that model quality is preserved after behavior edits.
  • Reproducible methodology lets you compare abliterated models across versions and configurations.
How it works
  1. 01Build a refusal eval set: curated prompts that baseline models refuse but should not (security research, medical questions, creative writing).
  2. 02Build a capability eval set: standard benchmarks (MMLU, HellaSwag, TruthfulQA, HumanEval, etc.).
  3. 03Run the baseline model on both sets and record scores.
  4. 04Apply abliteration and re-run both sets under identical conditions.
  5. 05Compute refusal rate delta and benchmark retention percentage.

Key metrics

MetricWhat it measuresTarget
Refusal rate% of benign prompts refused< 5% after abliteration (baseline typically 30-60%)
Benchmark retention% of baseline scores preserved> 95% across core benchmarks
False-positive refusal rateBenign prompts incorrectly refused< 2%
Harmful compliance rateActually harmful prompts answeredMonitored — policy layer enforces limits

Quality gates

  • Abliterated model must retain > 95% of baseline MMLU score.
  • Refusal rate on the benign eval set must drop below 5%.
  • No regression on TruthfulQA — abliteration must not increase hallucination.
  • Code generation (HumanEval) pass rate must remain within 2% of baseline.
  • If any gate fails, the ablation parameters are adjusted and the eval is re-run.
FAQ

Frequently asked questions.

Does abliteration ruin model quality?

No. When applied correctly, abliterated models retain > 95% of baseline benchmark scores while reducing refusal rates from 30-60% to under 5%.

What benchmarks do you use?

MMLU for general knowledge, HellaSwag for commonsense reasoning, TruthfulQA for factuality, and HumanEval for code generation. We also use a custom refusal eval set.

How do you prevent the model from answering actually harmful prompts?

Abliteration reduces blanket refusals. Policy Gateway enforces your specific rules about what should actually be refused. The combination gives you control without false positives.

Can I run these evals myself?

Yes. The methodology is fully documented here. Use the same eval sets against the abliteration.ai API to reproduce our results.