Frequently Asked Questions

How do I validate synthetic training data before fine-tuning?

Use a QA rubric that checks format, deduplication, length, language, toxicity, factual accuracy, diversity, and instruction-following. Automate what you can and spot-check the rest.
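A few of those checks (format, length, exact deduplication) are easy to automate. Here is a minimal sketch, assuming each row is a dict with "prompt" and "response" fields; the field names and length thresholds are illustrative, not a fixed schema.

```python
import hashlib

def check_row(row, seen_hashes, min_len=10, max_len=8000):
    """Return a list of failed check names for one row."""
    failures = []
    # Format: required fields must be present, non-empty strings.
    for field in ("prompt", "response"):
        if not isinstance(row.get(field), str) or not row[field].strip():
            failures.append(f"format:{field}")
            return failures  # later checks need valid text
    text = row["prompt"] + "\n" + row["response"]
    # Length: guard against truncated or runaway generations.
    if not (min_len <= len(text) <= max_len):
        failures.append("length")
    # Exact-duplicate detection via a content hash.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        failures.append("duplicate")
    seen_hashes.add(digest)
    return failures

seen = set()
rows = [
    {"prompt": "What is 2+2?", "response": "4"},
    {"prompt": "What is 2+2?", "response": "4"},  # exact duplicate
    {"prompt": "", "response": "orphan answer"},  # bad format
]
results = [check_row(r, seen) for r in rows]
```

Toxicity, factual accuracy, and diversity need a classifier, a judge model, or human spot-checks on top of rules like these.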

What pass rate should I target?

Aim for > 95% pass rate on automated checks and > 90% on spot-checked factual accuracy. Discard or fix rows that fail.
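Applying the threshold is a one-liner once each row carries its check results. A sketch, assuming a boolean "passed" flag that your own checks would set:

```python
# Compute the automated pass rate and keep only passing rows.
# The `passed` flag is illustrative; wire in your own QA checks.
rows = [
    {"id": 1, "passed": True},
    {"id": 2, "passed": True},
    {"id": 3, "passed": False},
]
pass_rate = sum(r["passed"] for r in rows) / len(rows)
kept = [r for r in rows if r["passed"]]
if pass_rate < 0.95:
    print(f"Pass rate {pass_rate:.1%} is below target; "
          "review the failures before fine-tuning.")
```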

Can I use abliteration.ai to generate synthetic training data?

Yes. The OpenAI-compatible API supports structured generation for training pairs, eval sets, and labeled datasets. Policy Gateway can add governance and quotas to generation workflows.
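Because the API is OpenAI-compatible, a generation request is an ordinary chat-completions payload. A hedged sketch of one way to ask for structured training pairs; the model name, system prompt, and output schema below are placeholders, not documented abliteration.ai values:

```python
def build_request(topic, n_pairs=5, model="MODEL_NAME"):
    """Build a chat-completions payload asking for JSON training pairs."""
    return {
        "model": model,  # placeholder: substitute your deployed model
        "messages": [
            {"role": "system",
             "content": "Return a JSON array of objects with 'prompt' and "
                        "'response' keys. No prose outside the JSON."},
            {"role": "user",
             "content": f"Generate {n_pairs} instruction-response pairs "
                        f"about {topic}."},
        ],
        "temperature": 0.8,
    }

payload = build_request("unit testing in Python", n_pairs=3)
# POST this as JSON to <base_url>/v1/chat/completions with your API key,
# e.g. via requests or the openai client pointed at the custom base URL.
```

Parse the returned JSON array and run it through your QA rubric before it enters the training set.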

How do I detect near-duplicates at scale?

Use MinHash or SimHash on tokenized rows. MinHash estimates Jaccard similarity, so flag pairs above your threshold (typically 0.85); SimHash instead flags pairs whose bit fingerprints fall within a small Hamming distance. Dedupe or rewrite the flagged rows.
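A minimal MinHash sketch in pure stdlib Python, using salted SHA-1 digests as hash functions. The shingle size (3 tokens) and signature length (64) are common defaults, not requirements; at scale you would pair this with locality-sensitive hashing so you never compare all pairs.

```python
import hashlib

def shingles(text, k=3):
    """Break a row into overlapping k-token shingles."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k])
            for i in range(max(1, len(tokens) - k + 1))}

def minhash(shingle_set, num_hashes=64):
    """Signature: per seed, the minimum salted hash over all shingles."""
    return [
        min(int(hashlib.sha1(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingle_set)
        for seed in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    # Fraction of matching signature slots approximates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over the lazy cat"  # near-duplicate of a
c = "completely unrelated sentence about databases"
sim_ab = estimated_jaccard(minhash(shingles(a)), minhash(shingles(b)))
sim_ac = estimated_jaccard(minhash(shingles(a)), minhash(shingles(c)))
```

Rows whose estimated similarity clears your threshold go to the dedup queue; in practice a library such as datasketch handles the hashing and LSH banding for you.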