Moderation Labeling Prompt for User Content

Safe, Spam, Harassment, Hate, Adult Content: multi-label policy classification with a Strict "Other" fallback and numeric confidence, built for review queues.

Overview

Moderation is the classification setting where errors are costly in both directions: a forced wrong label either censors safe content or publishes harmful content. This setup labels user-generated content against five policy categories in Multiple Labels mode, because content can violate two policies at once. It runs under the Strict ambiguity policy, so anything that fits no category returns "Other" for human review, and each label carries a 0–100 confidence score so the queue can auto-action only the unambiguous cases. The definitions separate adjacent harms by the kind of conduct: Harassment attacks, threatens, or demeans its target, while Hate is hostile or discriminatory toward it. Both can apply to the same post.

How to use this resource

Label, don't action

The prompt produces labels and confidence; the auto-hide, human-review, and publish thresholds live in your system.

Route by confidence bands

90+ on Safe can publish; 90+ on a harm category can auto-hold; everything else queues for a person.

Keep Other visible

Content that fits no category is exactly what a policy team needs to see, and the Strict ambiguity policy tells the model to return "Other" for it instead of forcing a fit.
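
The three rules above can be sketched in a few lines of Python. This is a minimal illustration, not part of the prompt: the function name `route`, the action names, and the rule that Safe auto-publishes only when it is the sole label are all assumptions you would replace with your own policy.

```python
from typing import Dict, List

# Harm categories from the label set; Safe and Other are handled separately.
HARM_LABELS = {"Spam", "Harassment", "Hate", "Adult Content"}
AUTO_THRESHOLD = 90  # the 90+ band; a starting point, tune it for your queue


def route(labels: List[Dict]) -> str:
    """Map parsed label/confidence pairs to 'publish', 'hold', or 'review'."""
    names = {item["label"] for item in labels}
    # "Other" (or an empty result) always goes to a person.
    if not labels or "Other" in names:
        return "review"
    # Any harm label at 90+ is held automatically.
    if any(
        item["label"] in HARM_LABELS and item["confidence"] >= AUTO_THRESHOLD
        for item in labels
    ):
        return "hold"
    # Assumption: Safe publishes only when no other label came back with it.
    if names == {"Safe"} and all(
        item["confidence"] >= AUTO_THRESHOLD for item in labels
    ):
        return "publish"
    return "review"
```

With this sketch, a lone Safe at 95 publishes, a Hate at 93 is held, and a Safe at 92 that arrives together with Spam at 64 goes to a reviewer.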

Why This Works

Multi-label output matches how violations actually occur: bundled, with one post breaking more than one rule
Separate definitions for Harassment and Hate keep the two labels from collapsing into one, while still letting both apply to the same post
Strict Other plus confidence bands builds human review into the design instead of bolting it on afterward

Best for

Platforms with a review queue between users and publication
Policies where one post can violate two rules at once
Teams that need the model to say "unsure" instead of guessing

Not for

Final moderation decisions — this labels for a human queue; the action policy is yours
Legal-compliance judgments — policy classification is not legal review

Use cases

Pre-screening user content before publication
Labeling multi-violation content with every applicable category
Feeding a human review queue with confidence-ranked items

FAQ

How can one post end up tagged both Harassment and Hate here?

The two definitions overlap deliberately: Harassment is content that attacks, threatens, or demeans a person or group, while Hate is hostile or discriminatory content targeting a person or group. A threat laced with identity-based hostility satisfies both. Running in Multiple Labels mode, the classification returns every label that applies, ordered strongest-first, so a post can legitimately carry both.

What do the 90+, 60–89, and below-60 confidence bands actually mean here?

Each label carries a 0–100 score with defined bands: 90 or above is an unambiguous fit, 60–89 is a good fit with some uncertainty, and below 60 marks the best available choice when no label fits well. The prompt states that the number is self-reported certainty, not a computed probability, so treat a low band as a signal to queue the item for a human rather than as a measured error rate.
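
As a small illustration, the bands can be turned into named buckets for dashboards or queue filters. The function name and the bucket names below are hypothetical; only the cut-offs at 90 and 60 come from the prompt.

```python
def confidence_band(score: float) -> str:
    """Name the band a 0-100 self-reported confidence score falls into."""
    if not 0 <= score <= 100:
        raise ValueError(f"confidence out of range: {score}")
    if score >= 90:
        return "unambiguous"      # 90 or above
    if score >= 60:
        return "good_fit"         # 60-89, some uncertainty
    return "best_available"       # below 60, a shaky call
```

Because the score is self-reported, these buckets are useful for ordering and filtering a queue, not for estimating how often the label is wrong.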

Which output format does the moderation prompt return, and can I parse it directly?

It returns one JSON object with a 'labels' array of label/confidence pairs, like { "label": "Spam", "confidence": 64 }, ordered strongest-fit first. The output rules forbid markdown fences or any text around the JSON and require the exact label text with no paraphrasing or abbreviations, so a strict parser plus an allow-list of the five labels and 'Other' will catch drift wherever you run the prompt.
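
A strict parser along those lines might look like the sketch below. It uses only Python's standard `json` module; the function name `parse_labels` and the choice to raise `ValueError` on every kind of drift are assumptions made for illustration.

```python
import json
from typing import Dict, List

# The five policy labels plus "Other", spelled exactly as in the prompt.
ALLOWED_LABELS = {"Safe", "Spam", "Harassment", "Hate", "Adult Content", "Other"}


def parse_labels(raw: str) -> List[Dict]:
    """Parse the model's reply; raise ValueError on any drift from the contract."""
    try:
        # Fails on markdown fences or any text around the JSON object.
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("labels"), list):
        raise ValueError('expected an object with a "labels" array')
    if not data["labels"]:
        raise ValueError("labels array is empty")
    for item in data["labels"]:
        if not isinstance(item, dict):
            raise ValueError(f"label entry is not an object: {item!r}")
        label = item.get("label")
        conf = item.get("confidence")
        if not isinstance(label, str) or label not in ALLOWED_LABELS:
            raise ValueError(f"unknown label: {label!r}")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if (
            isinstance(conf, bool)
            or not isinstance(conf, (int, float))
            or not 0 <= conf <= 100
        ):
            raise ValueError(f"bad confidence for {label}: {conf!r}")
    return data["labels"]
```

Rejecting a reply outright, rather than repairing it, keeps paraphrased or abbreviated labels such as "Adult" or "spam" from slipping into the queue; a rejected reply can be retried or sent to a person.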

Customize This Resource

Open this setup in Data Classification Prompt and generate to get the full classification prompt, then adjust the labels, ambiguity mode, and confidence output.

Open in Data Classification Prompt

Prompt Template

Copy it as-is, or use Open in Data Classification Prompt to load it pre-filled and customize it with your own context.

TASK
Classify user-generated content against the moderation policy before it is published.
Assign one or more of the labels defined below to the text in the input.

LABELS
- Safe: Content that violates no policy and requires no action.
- Spam: Unsolicited, promotional, or malicious content with no genuine intent.
- Harassment: Content that attacks, threatens, or demeans a person or group.
- Hate: Hostile or discriminatory content targeting a person or group.
- Adult Content: Sexually explicit or adult content.
- Other: Does not clearly fit any label above.

CLASSIFICATION RULES
- Read the entire text before deciding — the deciding signal may come late.
- Judge by content and intent, not by tone, length, or formatting.
- Match the text against the definitions, not just the label names.
- Use only the labels defined above, plus "Other" when nothing fits — never invent any other label.

EDGE CASE RULES
- If the text fits more than one label, return every label that applies.
- Order the labels by how strongly each applies, strongest first.
- A merely mentioned topic does not earn its label — it must be a real subject of the text.

AMBIGUITY POLICY
- If no label clearly fits, return exactly "Other" — do not force a fit.
- "Other" is a valid answer; a forced wrong label is not.

CONFIDENCE
- With each label, report a confidence score from 0 to 100.
- 90 or above = unambiguous fit; 60–89 = good fit with some uncertainty; below 60 = best available choice.
- The score reflects your certainty about the fit — it is self-reported, not computed.

OUTPUT FORMAT
Return your classification as a single valid JSON object with a "labels" array of label/confidence pairs.

EXAMPLE OF A VALID RESPONSE
{
  "labels": [
    { "label": "Harassment", "confidence": 92 },
    { "label": "Spam", "confidence": 64 }
  ]
}

OUTPUT RULES
- Return only the JSON object — no text before or after it.
- Do not wrap the output in markdown code fences.
- Return the label text exactly as defined above — no paraphrasing, no abbreviations, no new labels.
