Use Case · Trust & Safety

AI for trust and safety teams. Moderation that actually sees the harm.

Detect harassment, hate speech, deepfake scripts, and coordinated manipulation through an unrestricted moderation API. Plus generate the labeled training data your in-house classifier learns from. One API, severity tiers, reason codes, and audit logs on every decision.

user

abliterated-model

other LLMs

waiting for prompt…

Categories

What you can moderate or train on.

Harassment & hate speech

Multilingual harassment, coded slurs, and dialect-drift insults. The categories most providers either won't analyze or won't generate examples of. Detect them at the API edge, or build the labeled corpus your classifier learns from.

Jailbreak & prompt injection

Catch jailbreak attempts, multi-turn drift, and indirect injection hidden in PDFs or pasted email. Or generate the corpus your detector trains on. Both flows run through the same API.

Unsafe content (severity-tiered)

Self-harm cues, violence, weapons, drugs, and other unsafe categories. Five severity tiers per category, with the top tier gated behind audit-logged keys. Analyze in production or generate labeled examples for training.

Deepfakes & coordinated manipulation

Voice-clone scripts, astroturf templates, viral-narrative variants, brigade patterns. Detect them in your platform, or generate the training set your detection model learns from.

Detection

Moderate the content others refuse to read.

Hand a provider-default API a coded harassment comment and ask it to label severity. It refuses to engage with the content at all. Same story for jailbreak attempts and deepfake scripts. T&S teams end up writing brittle keyword matchers while their AI stays useless on exactly the content that needs review. Our abliteration model analyzes it, returns a structured decision, and logs the reasoning for audit.

incoming.content·user u_98f4·comments·lang es

[redacted · regional ingroup slur with imperative threat]

provider-default · refusedpolicy = provider

I can't engage with content that contains harassment or slurs. Please rephrase your request.

abliteration · classifiedpolicy = ts-mod-v3
Reasoning

Reasoning behind every decision.

Every classification ships with the reason the model assigned the label: which phrase triggered it, which policy clause it matched, why the severity landed where it did. Stream it to your moderation queue for human review, log it for regulators, or feed it back into your training set.

decision · req_8f4c2tier 3 · coded_harassment
Content

[redacted · regional ingroup slur in imperative form, es]

Reasoning · attached to every decision
why_flaggedPhrase X identified as Spanish regional ingroup slur; imperative-mood verb directs threat at demographic Y.
matched_policyts-mod-v3 § 4.1.7: overt insult with directional intent
tier_rationaleTier 3: directional intent, no explicit violence call. Tier 4 reserved for explicit threats.
recommended actionrewrite_or_escalate
Coverage

Every category your moderation team tracks.

Harassment, hate speech, non-consensual imagery, violence and weapons, self-harm cues, jailbreak attempts, prompt injection, deepfake scripts, coordinated manipulation, child safety, brand impersonation. Five severity tiers per category. Top tier gated behind audit-logged keys.

taxonomy.map12 categories
Coded harassmentt1–t4dog-whistles, ingroup slurs
Hate speecht1–t4overt slurs, threats
NCIIt1–t5tiered, with safety brackets
CBRNEt1–t5bracketed, RAI-marked
Self-harmt1–t5escalation-aware
Jailbreakt1–t4DAN, roleplay, drift
Prompt injectiont1–t4direct + indirect (PDF, email)
Deepfake scriptt1–t3voice-clone, persona-shift
Election narrativet1–t3viral-claim variants
Coordinated manipulationt1–t5astroturf, brigade patterns
Child safetyt1–t5tier-restricted, audit-gated
Brand impersonationt1–t3phish + authority attack
Pricing

Free tier. Pay-as-you-go. Enterprise.

Standard API rates. Per-surface key scoping so moderation and training jobs stay isolated. Custom retention and decision-log streaming on Team and Enterprise.

See pricing

Try the model that doesn’t say no.

Free tier. OpenAI-compatible. Policy Gateway when you scale.