Moderation Labeling Prompt for User Content
Safe, Spam, Harassment, Hate, Adult — multi-label policy classification with Strict Other and numeric confidence, built for review queues.
Overview
Moderation is a classification setting where errors are costly in both directions: a forced wrong label either censors safe content or publishes harmful content. This setup labels user-generated content against five policy categories (Safe, Spam, Harassment, Hate, Adult) in Multiple Labels mode, since content can violate two policies at once. Ambiguity handling is Strict: anything that fits no category returns "Other" for human review. Each label carries a 0–100 confidence score so the queue can auto-action only the unambiguous cases. The definitions keep adjacent harms apart: Harassment targets a person; Hate targets a group with hostility or discrimination.
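As a rough illustration of the contract described above, the sketch below checks one labeling result before it enters the queue. The result shape (a mapping from label name to integer confidence), the function name `check_labels`, and the rule that "Other" never appears alongside a policy category are assumptions made for this example; the prompt's actual output format may differ.

```python
from typing import Dict, List

# The five policy categories named in the prompt, plus the Strict fallback.
CATEGORIES = {"Safe", "Spam", "Harassment", "Hate", "Adult"}
FALLBACK = "Other"


def check_labels(labels: Dict[str, int]) -> List[str]:
    """Return a list of problems found in one result (empty list means OK).

    `labels` maps label name -> confidence. This shape is an assumption
    for the sketch, not a documented output format.
    """
    problems: List[str] = []
    if not labels:
        problems.append("no labels returned")
    for name, conf in labels.items():
        if name not in CATEGORIES and name != FALLBACK:
            problems.append(f"unknown label: {name}")
        # Confidence is expected to be an integer from 0 to 100.
        if isinstance(conf, bool) or not isinstance(conf, int) or not 0 <= conf <= 100:
            problems.append(f"confidence out of range for {name}: {conf}")
    # Assumed rule: "Other" means no category fit, so it should stand alone.
    if FALLBACK in labels and len(labels) > 1:
        problems.append("Other combined with a policy category")
    return problems
```

A multi-label result such as `{"Harassment": 92, "Hate": 71}` passes, while a result that mixes "Other" with a category is flagged for a closer look.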
Workflow
- Label, don't action: The prompt produces labels and confidence; the auto-hide / human-review / publish thresholds live in your system.
- Route by confidence bands: 90+ on Safe can publish; 90+ on a harm category can auto-hold; everything else queues for a person.
- Keep Other visible: Content that fits no category is exactly what a policy team needs to see, and Strict mode guarantees it surfaces.
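The routing step lives in your own system, so the following is only a minimal sketch of the confidence bands described in this workflow. The function name `route`, the action strings, and the label-to-confidence mapping are hypothetical; the 90 threshold comes from the bands above, and the choice to send any "Other" or mixed result to a person is an assumption you may want to tune.

```python
from typing import Dict

HARM_LABELS = {"Spam", "Harassment", "Hate", "Adult"}
AUTO_THRESHOLD = 90  # the "90+" band from the workflow; adjust to your policy


def route(labels: Dict[str, int]) -> str:
    """Map one item's label -> confidence result to 'publish', 'hold', or 'review'."""
    # Empty results and "Other" always go to a person.
    if not labels or "Other" in labels:
        return "review"

    harms = {name: conf for name, conf in labels.items() if name in HARM_LABELS}
    if harms:
        # Any confidently detected harm is enough to hold the item.
        if any(conf >= AUTO_THRESHOLD for conf in harms.values()):
            return "hold"
        return "review"

    # Publish only when Safe is the sole label and it is confident.
    if set(labels) == {"Safe"} and labels["Safe"] >= AUTO_THRESHOLD:
        return "publish"
    return "review"
```

For example, `route({"Safe": 97})` returns `"publish"`, `route({"Harassment": 95, "Hate": 60})` returns `"hold"`, and `route({"Spam": 70})` returns `"review"`.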
Why This Works
- Multi-label matches how violations actually occur: often bundled, with one post breaking more than one rule
- Person-vs-group definitions keep Harassment and Hate from collapsing into one label
- Strict Other plus confidence bands build human review into the design instead of bolting it on afterward
Best for
- Platforms with a review queue between users and publication
- Policies where one post can violate two rules at once
- Teams that need the model to say "unsure" instead of guessing
Not for
- Final moderation decisions — this labels for a human queue; the action policy is yours
- Legal-compliance judgments — policy classification is not legal review
Use cases
- Pre-screening user content before publication
- Labeling multi-violation content with every applicable category
- Feeding a human review queue with confidence-ranked items