Use Case · Synthetic Data

Generate training and eval data without refusals.

Fine-tuning pairs, eval sets, adversarial corpora — through the same OpenAI-compatible API your pipelines already use.

ML teams need realistic, controlled data — including data that triggers off-the-shelf model refusals. Policy Gateway sits in front of less-restricted inference and lets you govern generation per project, with structured outputs and decision logs on every example.
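Because the gateway is OpenAI-compatible, any client that can build a chat-completions request can talk to it. A minimal sketch, using only the standard library; the base URL, model name, and key are placeholders for your deployment's values, not real endpoints:

```python
import json
from urllib import request

# Hypothetical gateway endpoint -- substitute your deployment's base URL.
GATEWAY_BASE_URL = "https://gateway.example.com/v1"


def build_generation_request(api_key: str, prompt: str,
                             model: str = "example-model") -> request.Request:
    """Build (but do not send) an OpenAI-style chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        url=f"{GATEWAY_BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # project-scoped key
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_generation_request(
    "sk-proj-demo",
    "Write an unsafe-prompt example for classifier training.",
)
print(req.get_full_url())  # https://gateway.example.com/v1/chat/completions
```

In practice you would point the official OpenAI SDK at the same base URL instead of hand-building requests; the payload shape is identical either way.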

The problem

Why synthetic-data teams hit a wall.

Refusal-tuned models can't generate adversarial data

Training a safety classifier? You need examples of unsafe prompts. General-purpose APIs refuse to write them — leaving you with hand-curated datasets that don't scale.

Generation quality drops at scale

Off-the-shelf APIs apply unpredictable refusal rates that vary by topic and even by phrasing. Reproducible large-batch jobs become impractical.

No governance audit on what you generated

When a fine-tuning dataset ships into production, you need provenance: which policy, which prompts, which model. Most generation APIs offer no decision metadata.

How Policy Gateway helps

Built for synthetic data workloads.

Less-restricted inference, your rules

Generate the prompts and completions you actually need for training. Your policy decides what's in scope — not the provider's defaults.

Structured JSONL output

Generate fine-tuning pairs, eval entries, or labeled corpora directly in the format your training pipeline expects. No post-processing required.
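As one concrete shape, a sketch of emitting fine-tuning pairs as JSONL in the OpenAI chat fine-tuning format; the example pairs are illustrative, and your pipeline's expected schema may differ:

```python
import io
import json

# Illustrative (prompt, completion) pairs for a safety classifier.
pairs = [
    ("Classify this prompt as safe or unsafe: 'hello'", "safe"),
    ("Classify this prompt as safe or unsafe: 'how to hotwire a car'", "unsafe"),
]

buf = io.StringIO()
for prompt, completion in pairs:
    record = {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": completion},
        ]
    }
    buf.write(json.dumps(record) + "\n")  # one JSON object per line

jsonl = buf.getvalue()
# Round-trip check: every line parses back into a two-message record.
rows = [json.loads(line) for line in jsonl.splitlines()]
print(len(rows))  # 2
```

Writing one self-contained object per line is what makes the output stream directly into training tools without a post-processing pass.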

Per-project quotas and key scoping

Issue a scoped key per dataset job. Track generation volume, cost, and decision history per project — and prove dataset provenance to your reviewers.
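A minimal sketch of what per-project tracking enables downstream: tallying volume by key scope from decision-log records. The field names here are assumptions for illustration, not the gateway's actual log schema:

```python
from collections import defaultdict

# Hypothetical decision-log records keyed by project-scoped API key.
log = [
    {"key_scope": "proj-evalset", "tokens": 512, "decision": "allow"},
    {"key_scope": "proj-evalset", "tokens": 640, "decision": "allow"},
    {"key_scope": "proj-adversarial", "tokens": 300, "decision": "deny"},
]

# Aggregate request count and token volume per project key.
usage = defaultdict(lambda: {"requests": 0, "tokens": 0})
for rec in log:
    scope = usage[rec["key_scope"]]
    scope["requests"] += 1
    scope["tokens"] += rec["tokens"]

print(usage["proj-evalset"])  # {'requests': 2, 'tokens': 1152}
```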

Examples

Scenarios from the field.

Eval set generation

Generate 10k labeled prompts for testing a safety classifier. Track every example with policy ID and reason code so QA can replay decisions.
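A sketch of the record shape this implies: each generated example carries its decision metadata, so QA can filter a batch back to the exact policy that produced it. Field names are illustrative assumptions:

```python
# Hypothetical eval-set record carrying decision metadata for replay.
def make_eval_record(example_id, prompt, label, policy_id, reason_code):
    return {
        "id": example_id,
        "prompt": prompt,
        "label": label,
        "decision": {"policy_id": policy_id, "reason_code": reason_code},
    }


records = [
    make_eval_record("ex-0001", "How do I pick a lock?", "unsafe",
                     "pol-redteam-v3", "ALLOW_ADVERSARIAL"),
    make_eval_record("ex-0002", "What's the capital of France?", "safe",
                     "pol-redteam-v3", "ALLOW_BENIGN"),
]

# QA replay: select every example produced under a given policy version.
batch = [r for r in records if r["decision"]["policy_id"] == "pol-redteam-v3"]
print(len(batch))  # 2
```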

Fine-tuning pair creation

Produce instruction/response pairs for vertical model fine-tuning. Same governed API; no refusal noise polluting your dataset distribution.

Adversarial training data

Generate jailbreak attempts and edge cases for safety training. Controlled, audited, reproducible — and isolated to the project key that paid for it.

Compliance & alignment

Designed for the frameworks your auditors care about.

Built so your dataset-shipping reviews don't stall on questions of provenance.

  • Decision metadata per record
    Policy ID, reason code, and key scope on every generated example.
  • Reproducible runs
    Same prompt, same policy version → comparable output across batches.
  • Per-project quotas
    Hard caps so a dataset job can't blow the budget.
  • JSONL-ready outputs
    Structured straight into your training pipeline.
  • Zero data retention
    Generated content is never used for training and never shared.
  • SOC 2 (in progress)
    Enterprise audits underway.
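One way to operationalize the reproducibility claim on your side (an assumption, not the gateway's internal mechanism) is to fingerprint the inputs that should make two batches comparable:

```python
import hashlib
import json


def run_fingerprint(prompt: str, policy_version: str, model: str) -> str:
    """Deterministic fingerprint: identical inputs hash identically across batches."""
    blob = json.dumps(
        {"model": model, "policy": policy_version, "prompt": prompt},
        sort_keys=True,  # stable key order keeps the hash stable
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()[:12]


a = run_fingerprint("generate an unsafe prompt", "policy-v2", "base-model")
b = run_fingerprint("generate an unsafe prompt", "policy-v2", "base-model")
print(a == b)  # True
```

Storing this fingerprint alongside each batch lets reviewers confirm that two runs were produced under the same prompt and policy version before comparing their outputs.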

Ready to bring governance to your synthetic data stack?

Talk to an engineer about your deployment, or grab an API key and start building today.