Published: November 13, 2024
Hate speech, harassment, and online abuse have become pervasive problems on the web. Toxic comments silence important voices and drive users and customers away. Toxicity detection protects your users and creates a safer online environment.
In this two-part series, we explore how to use AI to detect and mitigate toxicity at its source: users' keyboards.
In this first part, we discuss the use cases and benefits of this approach.
In part two, we share the implementation, including code examples and UX tips.
Why perform client-side toxicity detection
Benefits
Client-side toxicity detection is a useful first line of defense, and a great complement to server-side checks. It offers multiple benefits:
- Detect toxicity early. With client-side checks, you can catch toxicity at its source, without touching the server.
- Enable real-time checks. Use client-side speed to build low-latency applications and provide instant feedback to your users.
- Reduce or optimize server-side workload. Cut your server-side toxicity-detection workload and costs: user-facing hints can decrease the volume of toxic comments, and flagging comments as likely toxic before they reach your server helps you prioritize them in your server-side checks.
- Reduce human burden. Decrease the burden on human moderators.
Use cases
Here are a few possible reasons to build client-side toxicity detection:
- Immediate detection in comment systems. Provide immediate feedback to users who draft toxic comments, encouraging them to rephrase their message before posting. With client-side AI, you can achieve this with no API key, no runtime server-side classification costs, and low latency. This can be ideal for chat apps.
- Real-time moderation in live chat. Quickly identify and flag toxic messages from users, allowing moderators to intervene immediately.
Keep your server-side checks
While client-side toxicity detection is fast, a malicious frontend-savvy user may disable it. Additionally, no toxicity detection system is 100% accurate.
For these reasons, we strongly recommend that you implement, or keep, an additional server-side review instead of relying on client-side toxicity detection alone. For example, complement your real-time client-side check with an asynchronous server-side review using the Perspective API. For a comprehensive approach, you can combine these with human moderation.
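As an illustration, a server-side review with the Perspective API could look like the following sketch. This is not the implementation from part two; the endpoint and response fields follow the public Perspective API documentation, and the PERSPECTIVE_API_KEY environment variable is an assumption for this example.

// Server-side sketch (Node.js 18+): ask the Perspective API to score a comment.
// PERSPECTIVE_API_KEY is an assumed environment variable for this example.
const ENDPOINT =
  'https://commentanalyzer.googleapis.com/v1alpha1/comments:analyze';

async function reviewComment(text) {
  const response = await fetch(
    `${ENDPOINT}?key=${process.env.PERSPECTIVE_API_KEY}`,
    {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        comment: { text },
        requestedAttributes: { TOXICITY: {} },
      }),
    },
  );
  const data = await response.json();
  // Summary score between 0 and 1; higher means more likely toxic.
  return data.attributeScores.TOXICITY.summaryScore.value;
}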
Caveats
Client-side toxicity detection requires downloading a classification model into your web page, and often a client-side AI library.
Consider the implications:
- Model hosting and serving costs. The model may be large.
- Performance and UX. The library and model will increase your bundle size.
Weigh these costs against the benefits before deciding if this approach is right for your use case. Apply performance best practices for client-side AI and cache the model, so the download is a one-time cost.
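For example, client-side AI libraries such as Transformers.js cache downloaded models in the browser's Cache API by default. The snippet below is a sketch assuming the @xenova/transformers package, not the implementation from part two; setting names may differ between library versions.

// Sketch assuming the @xenova/transformers package (Transformers.js).
import { env } from '@xenova/transformers';

// Models downloaded from the Hugging Face Hub are stored in the browser's
// Cache API by default, so the download is a one-time cost per user.
env.useBrowserCache = true;

// Optionally, self-host the model files and point the library at them.
// env.allowRemoteModels = false;
// env.localModelPath = '/models/';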
How content toxicity classification works
Before we dive into the full implementation, let's take a look at the essentials of toxicity detection.
A toxicity detection model analyzes existing text, rather than generating new content as generative AI does. It's a classic natural language processing (NLP) task.
Toxicity detection relies on text classifiers that categorize text as likely toxic or harmless. Toxicity classifiers take text as an input and assign to it various toxicity labels, along with a score. Scores range from 0 to 1. Higher scores indicate the input is more likely to be toxic.
Take as an example the Xenova/toxic-bert model, a web-compatible version of unitary/toxic-bert. It offers six labels:
- toxic
- severe_toxic
- insult
- obscene
- identity_hate
- threat
Labels like toxic and severe_toxic denote overall toxicity. Other labels are more fine-grained: they identify specific types of toxicity, for example identity_hate (bullying or threats about a person's identity, such as race, religion, gender identity, and so on) or threat (a statement of an intention to inflict damage).
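To make this concrete, here is a minimal sketch of running such a classifier in the browser with Transformers.js and the Xenova/toxic-bert model. It's illustrative only, not the implementation from part two; the package and option names are assumptions that may differ between library versions.

// Minimal sketch, assuming the @xenova/transformers package (Transformers.js).
import { pipeline } from '@xenova/transformers';

// Download the classification model (or load it from the browser cache).
const classifier = await pipeline('text-classification', 'Xenova/toxic-bert');

// Ask for scores for all labels, not just the top one.
// Note: the option name may differ between library versions.
const results = await classifier('Some user-drafted comment', { topk: null });

// results is an array of { label, score } objects, one per toxicity label.
console.log(results);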
Different toxicity models have different ways to approach classification. Here are a few representative examples.
In this example, the following input includes the word "hate" and is directed at a person, so the toxic score is high (0.92). No specific toxicity type has been identified, so the other scores are low.
Input: I hate you
Output of the toxicity classifier:
[
{ label: 'toxic', score: 0.9243140482902527 },
{ label: 'insult', score: 0.16187334060668945 },
{ label: 'obscene', score: 0.03452680632472038 },
{ label: 'identity_hate', score: 0.0223250575363636 },
{ label: 'threat', score: 0.16187334060668945 },
{ label: 'severe_toxic', score: 0.005651099607348442 }
]
In the next example, the input has an overall hateful tone, so it's given a high toxic score (0.92). Due to the explicit mention of damage, the threat score is high, too (0.81).
Input: I hate your garden, and I will kill your plants
Output of the toxicity classifier:
[
{ label: 'toxic', score: 0.9243140482902527 },
{ label: 'insult', score: 0.16187334060668945 },
{ label: 'obscene', score: 0.03452680632472038 },
{ label: 'identity_hate', score: 0.0223250575363636 },
{ label: 'threat', score: 0.819197041168808937 },
{ label: 'severe_toxic', score: 0.005651099607348442 }
]
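How you act on these scores is up to your application. A common pattern is to flag a draft when any label's score crosses a threshold, as in the sketch below; the 0.9 threshold is an arbitrary example value, not a recommendation.

// Sketch: return the labels whose score crosses a threshold.
// The 0.9 value is an arbitrary example, not a recommendation.
const TOXICITY_THRESHOLD = 0.9;

function getToxicLabels(results) {
  // results: array of { label, score } objects, as returned by the classifier.
  return results
    .filter(({ score }) => score >= TOXICITY_THRESHOLD)
    .map(({ label }) => label);
}

// With the second example output above, only 'toxic' (0.92) crosses 0.9,
// while 'threat' (0.81) does not.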
Up next
Now that you understand the context, you can start building a client-side AI toxicity detection system.