What datasets can I generate with Abliteration's training-data API?

Preference pairs (DPO/ORPO), eval sets, classifier examples, and adversarial corpora — plus any custom labelled dataset matching a schema you define. Up to 1 million records per job.

Are reasoning traces included on every row?

Yes, whenever Thinking is enabled. Toggle Thinking in the console and the model's chain-of-thought lands as a schema-aware sidecar on every generated row, the field most frontier APIs strip out. Useful for distillation, RLHF teacher signals, and process-supervision research.

Where can I export the generated data?

Push in one click to Hugging Face datasets, Kaggle, Amazon S3, Google Cloud Storage, or Azure Blob, or grab a signed URL for direct download. No middleman, no manual upload step.

How is cost estimated before a full run?

Every job starts with a paid 3-row preview against your spec, schema, and toggles. Abliteration charges real preview credits, then estimates the full run as (highest-cost preview row × your row target) plus a reserve buffer. Promote to a full job only when the preview and estimate look right.

Why use Abliteration instead of another generation API?

Provider-default APIs refuse the rows your classifier actually needs — synthetic phishing for fraud detection, coded harassment for moderation, jailbreak attempts for safety eval. Abliteration generates the entire taxonomy schema-locked, with reasoning, citations, and provenance on every row.

Training data for trust & safety, security, and defense teams.

Generate labeled datasets at scale: harmful-content classifier rows, adversarial prompts, preference pairs, eval cases, and the edge cases default LLMs refuse to write. Every row is schema-validated with policy labels, severity tiers, provenance, and export paths your training pipeline already uses.

abliteration.ai/console/synthetic-data

Describe what you want

Generate 10,000 realistic customer support conversations about refunds and billing edge cases.

Chat SFT · 10,000 records · OpenAI training JSONL

Generate dataset

billing-support-training.jsonl

10,000 rows · 4.2 MB · ready

OpenAI training JSONL

{"messages":[{"role":"user","content":"My card was charged twice for the same order — can I get one refunded?"},{"role":"assistant","content":"Of course — I can look that up. Could you share the order number?"}]}

{"messages":[{"role":"user","content":"Where's my order for SKU-A-12? It said shipped 4 days ago."},{"role":"assistant","content":"Let me pull the tracking. One moment…"}]}

{"messages":[{"role":"user","content":"Can you change my shipping address before the order leaves?"},{"role":"assistant","content":"Sure — what's the new address?"}]}

{"messages":[{"role":"user","content":"I want to cancel and get a refund. How long does it take?"},{"role":"assistant","content":"Got it. Cancellation processes immediately; refund hits in 3-5 business days."}]}

+ 9,996 more rows

Download · Export to Hugging Face
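
Before handing a downloaded file to a fine-tuning job, it can help to confirm that each line parses and carries the chat shape shown above. The following is a minimal sketch, not part of Abliteration's tooling: the function name validate_chat_line and the ALLOWED_ROLES set are assumptions made for illustration, and the sample line is copied from the preview above.

```python
import json

# Assumption: only these roles are expected in a plain chat SFT export.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_chat_line(line: str) -> dict:
    """Parse one JSONL line and check it has a chat-style `messages` list."""
    row = json.loads(line)
    messages = row.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("row needs a non-empty 'messages' list")
    for msg in messages:
        if not isinstance(msg, dict):
            raise ValueError("each message must be an object")
        if msg.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str):
            raise ValueError("message content must be a string")
    return row


sample = (
    '{"messages":[{"role":"user","content":"I want to cancel and get a refund. '
    'How long does it take?"},{"role":"assistant","content":"Got it. Cancellation '
    'processes immediately; refund hits in 3-5 business days."}]}'
)
row = validate_chat_line(sample)
print(len(row["messages"]), row["messages"][-1]["role"])
```

Running the same function over every line of the file and counting failures gives a quick sanity check before upload.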

High-intent buyers

Built for teams with legitimate hard-to-source labels.

Trust & safety operations

Balanced toxic and non-toxic examples, marketplace fraud scenarios, coded-harassment variants, and reviewer-ready reason codes for classifier training.

Security and AI red teams

Prompt-injection corpora, jailbreak regression sets, tool-misuse examples, and adversarial eval rows for authorized testing workflows.

Defense and regulated research

Governed dataset generation with policy labels, audit logs, scoped keys, and export controls for teams that need access without losing oversight.

What you can generate

The categories that block your training pipeline.

Coded harassment

Multi-lingual dog-whistles and ingroup-coded slurs for moderation classifier training.

Phishing emails

Synthetic phishing across urgency, authority, and pretexting tactics for detection eval.

Prompt injections

Direct, indirect, ASCII-smuggling, and tool-confusion vectors for AppSec red-team training.

Adversarial edge cases

Refusal-eval prompts, policy edge cases, and known-jailbreak corpora for safety regression sets.

Reasoning

Reasoning included.

Most frontier APIs hide chain-of-thought out of distillation paranoia. We don't. Toggle Thinking in the console and the model's reasoning lands on every row as a schema-aware sidecar, ready to feed your next reasoning model.

row_00042.jsonl + reasoning · toggle: Thinking

prompt: Walk me through your reasoning. Why would a junior engineer prefer Postgres over MongoDB for an event-log table?

completion: Postgres wins for an event-log table when you need transactional integrity, point-in-time recovery, and structured queries over event metadata.

Reasoning · most providers strip this out

reasoning (Thinking + Export reasoning): Step 1: The use case is an event log, so the workload is append-heavy. Step 2: Compare strengths. Postgres gives ACID and a richer query surface; Mongo gives looser schema but no native transactions across collections at this scale. Step 3: Junior engineers benefit from constraints — schema enforcement catches bugs early. Step 4: Decide Postgres.

provenance: {"model":"abliterated-model","seed":420,"run_id":"R-018"}
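
A downstream pipeline will often want the reasoning sidecar separated from the rest of the row, for example to train with or without it. The sketch below assumes the exported row uses the field names shown in the example (prompt, completion, reasoning, provenance). The helper names split_reasoning and distillation_target are hypothetical, and joining reasoning and completion with a blank line is just one possible layout for a teacher target.

```python
from typing import Optional, Tuple


def split_reasoning(row: dict) -> Tuple[dict, Optional[str]]:
    """Return (row without the reasoning field, reasoning text or None)."""
    core = {key: value for key, value in row.items() if key != "reasoning"}
    return core, row.get("reasoning")


def distillation_target(row: dict) -> str:
    """Join reasoning and completion into a single teacher string."""
    core, reasoning = split_reasoning(row)
    if reasoning is None:
        return core["completion"]
    return f"{reasoning}\n\n{core['completion']}"


# Shortened copy of the example row above.
example = {
    "prompt": "Why would a junior engineer prefer Postgres over MongoDB for an event-log table?",
    "completion": "Postgres wins for an event-log table when you need transactional integrity.",
    "reasoning": "Step 1: The use case is an event log, so the workload is append-heavy.",
    "provenance": {"model": "abliterated-model", "seed": 420, "run_id": "R-018"},
}
core, reasoning = split_reasoning(example)
target = distillation_target(example)
```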

Web search

Grounded in the live web.

Generate against today's web, not the model's frozen pre-training cutoff. Toggle Web search in the console and the sources field on every row carries the URLs the model actually fetched, with titles and fetched-at timestamps. Citations stay with the row through export.

row_00187.jsonl + sources · live web

prompt: What's the current EU AI Act timeline for general-purpose AI obligations?

completion: Per Article 53, GPAI obligations apply from 2 Aug 2025; the GPAI Code of Practice and structured transparency duties phase in alongside…

Sources · fetched live during generation

Article 53: General-purpose AI obligations
artificialintelligenceact.eu/article-53 · fetched 2026-05-27T14:32Z

European Commission · AI Act enforcement timeline
digital-strategy.ec.europa.eu/en/policies/ai-act · fetched 2026-05-27T14:32Z

GPAI Code of Practice: IAPP analysis
iapp.org/news/a/eu-ai-act-gpai-code-of-practice · fetched 2026-05-27T14:32Z

provenance: {"model":"abliterated-model","run_id":"R-019","cited_count":3}
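
Because each citation carries a fetched-at timestamp, a consumer can check how fresh a row's grounding is before training on it. The sketch below is illustrative only. The page does not show the serialized layout of the sources field, so the keys title, url, and fetched_at are assumptions, as are the helper names. The timestamp format matches the one displayed above.

```python
from datetime import datetime, timedelta, timezone


def parse_fetched_at(stamp: str) -> datetime:
    """Parse timestamps like '2026-05-27T14:32Z' as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)


def stale_sources(row: dict, now: datetime, max_age: timedelta = timedelta(days=30)) -> list:
    """Return the sources whose fetch time is older than max_age."""
    return [
        source
        for source in row.get("sources", [])
        if now - parse_fetched_at(source["fetched_at"]) > max_age
    ]


# Hypothetical row layout: the key names inside "sources" are assumed.
row = {
    "prompt": "What's the current EU AI Act timeline for general-purpose AI obligations?",
    "sources": [
        {"title": "Article 53", "url": "artificialintelligenceact.eu/article-53",
         "fetched_at": "2026-05-27T14:32Z"},
        {"title": "AI Act enforcement timeline",
         "url": "digital-strategy.ec.europa.eu/en/policies/ai-act",
         "fetched_at": "2026-05-27T14:32Z"},
        {"title": "GPAI Code of Practice",
         "url": "iapp.org/news/a/eu-ai-act-gpai-code-of-practice",
         "fetched_at": "2026-05-27T14:32Z"},
    ],
    "provenance": {"model": "abliterated-model", "run_id": "R-019", "cited_count": 3},
}

# The provenance count should agree with the number of attached sources.
counts_match = len(row["sources"]) == row["provenance"]["cited_count"]
fresh_check = stale_sources(row, now=datetime(2026, 6, 1, tzinfo=timezone.utc))
late_check = stale_sources(row, now=datetime(2026, 7, 15, tzinfo=timezone.utc))
```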

Preview & cost

Preview before you commit.

Every job starts with a paid 3-row preview generated against your spec, schema, and toggles. We charge real preview credits, then estimate the full run as (highest-cost preview row × your row target) plus a reserve buffer. Top up if your balance is short, then promote to a full job if the preview looks right.

preview.run · R-018 · preview ready

#	sample prompt	tokens in/out	cost
1	Multi-lingual coded harassment, tier-2 severity.	1,402 / 812	$0.0052
2	RAG-grounded answer about Article 53 timeline.	1,538 / 921	$0.0064 (max)
3	Deepfake-script detector seed, voice-clone variant.	1,318 / 784	$0.0049

preview spent: $0.0165 / 3 rows

full-run estimate · max-row × 10,000 + 5% reserve: $0.0064 × 10,000 = $64.00, plus 5% reserve → $67.20

Reserve required · $67.20
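
The arithmetic behind that estimate can be reproduced in a few lines. This is a sketch of the formula as described on this page (highest-cost preview row × row target, plus a reserve), not Abliteration's billing code. The function name and the reserve_rate parameter are assumptions, with 5% taken from the example above.

```python
from decimal import Decimal


def estimate_full_run(preview_costs, row_target, reserve_rate=Decimal("0.05")):
    """Highest-cost preview row x row target, plus a reserve buffer."""
    max_row = max(preview_costs)
    base = max_row * row_target
    return base + base * reserve_rate


# Per-row costs from the preview table above.
preview = [Decimal("0.0052"), Decimal("0.0064"), Decimal("0.0049")]
spent = sum(preview)
estimate = estimate_full_run(preview, 10_000)

print(f"preview spent: ${spent}")  # preview spent: $0.0165
print(f"full-run estimate: ${estimate.quantize(Decimal('0.01'))}")  # full-run estimate: $67.20
```

Decimal is used instead of float so that sub-cent row costs add up exactly.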

Output formats

In the format your pipeline already speaks.

JSONL · OpenAI fine-tune · DPO / ORPO · Hugging Face datasets · Parquet · CSV

Export

Push to where your training pipeline lives.

Stream a run straight to the bucket, dataset hub, or notebook you already use. No middleman, no manual upload step.

See export options

run.flow

Generate

billing-support-training.jsonl

10,000 rows · 4.2 MB · ready

Pick destination

huggingface.co / your-org / billing-support

private dataset · CMEK

Pushed

✓ ready in HF Hub

under 30 seconds

Hugging Face Hub

One-click push to a private or public dataset card.

Kaggle

Push to a private Kaggle dataset for competitions and notebooks.

Amazon S3

Stream directly to your bucket. SSE-KMS supported.

Google Cloud Storage

Stream to GCS with service-account auth and CMEK.

Azure Blob

Write straight to a container with managed identity.

Direct download

Grab a signed URL and pull the run from your training box.

Schema & provenance

Typed once. Validated every row.

Define the dataset shape once. Every generated row is validated before it leaves the workspace, with provenance, tier, and policy label attached. No bespoke clean-up step downstream.

schema.json

{
  "schema": "preference-pair-v3",
  "fields": [
    { "name": "prompt",   "type": "string" },
    { "name": "chosen",   "type": "string" },
    { "name": "rejected", "type": "string" },
    { "name": "policy",   "type": "string" },
    { "name": "tier",     "type": "int"    }
  ]
}

sample.jsonl

{
  "prompt":   "Rewrite this support ticket in a civil tone.",
  "chosen":   "I understand your frustration. Let's solve the refund...",
  "rejected": "Stop being so rude. Read the FAQ.",
  "policy":   "civil-tone-v2",
  "tier":     1
}
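
For teams that want to re-check rows on their own side after export, the schema above is simple enough to validate by hand. The following is a minimal sketch and not the platform's validator: the TYPE_MAP mapping from schema type names to Python types and the validate_row helper are assumptions made for illustration.

```python
# Assumption: the schema's type names map onto these Python types.
TYPE_MAP = {"string": str, "int": int}

schema = {
    "schema": "preference-pair-v3",
    "fields": [
        {"name": "prompt", "type": "string"},
        {"name": "chosen", "type": "string"},
        {"name": "rejected", "type": "string"},
        {"name": "policy", "type": "string"},
        {"name": "tier", "type": "int"},
    ],
}


def validate_row(row: dict, schema: dict) -> list:
    """Return a list of problems; an empty list means the row matches."""
    declared = {field["name"]: field["type"] for field in schema["fields"]}
    errors = []
    for name, type_name in declared.items():
        if name not in row:
            errors.append(f"missing field: {name}")
            continue
        value = row[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, TYPE_MAP[type_name]):
            errors.append(f"{name}: expected {type_name}")
    for name in row:
        if name not in declared:
            errors.append(f"unexpected field: {name}")
    return errors


sample = {
    "prompt": "Rewrite this support ticket in a civil tone.",
    "chosen": "I understand your frustration. Let's solve the refund...",
    "rejected": "Stop being so rude. Read the FAQ.",
    "policy": "civil-tone-v2",
    "tier": 1,
}
problems = validate_row(sample, schema)
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a row in a single pass.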

Integrate

One line in your training script.

Once a run is exported, load it the same way you load every other dataset: Hugging Face datasets, the OpenAI fine-tune API, or your own RLHF stack.

train.py

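from datasets import load_dataset

# Hub repo IDs take the form "namespace/name"; this one matches the export
# destination shown above. Private datasets need a Hugging Face token.
ds = load_dataset(
  "your-org/billing-support",
  split="train",
)

# `trainer` is a placeholder for your own training stack.
trainer.train(dataset=ds)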

Pricing

Paid previews, then production-scale runs.

Generate up to 1M records per job. Tiered by records-per-month, with usage-based scaling for enterprise.

See pricing

Try the model that doesn’t say no.

Free tier. OpenAI-compatible. Policy Gateway when you scale.