Part 2: Build client-side AI toxicity detection

Maud Nalpas

Published: November 13, 2024

Hate speech, harassment, and online abuse have become pervasive issues on the web. Toxic comments silence important voices and drive users and customers away. Toxicity detection protects your users and creates a safer online environment.

In this two-part series, we explore how to use AI to detect and mitigate toxicity at its source: users' keyboards.

In part one, we discussed the use cases and benefits of this approach.

In this second part, we dive into the implementation, including code examples and UX tips.

Demo and code

Play around with our demo and investigate the code on GitHub.

Demo of comment posting.
When the user stops typing, we analyze the toxicity of their comment. We display a warning in real-time if the comment is classified as toxic.

Browser support

Our demo runs in the latest versions of Safari, Chrome, Edge, and Firefox.

Select a model and library

We use Hugging Face's Transformers.js library, which provides tools for working with machine learning models in the browser. Our demo code is derived from this text classification example.

We chose toxic-bert, a pre-trained model designed to identify toxic language patterns. Specifically, we use Xenova/toxic-bert, the web-compatible version of unitary/toxic-bert. For more details on the model's labels and its classification of identity attacks, refer to the Hugging Face model page.

toxic-bert's download size is 111 MB.

Once the model is downloaded, inference is fast.

For example, it typically takes less than 500 milliseconds in Chrome on a mid-range Android device we tested (a regular Pixel 7, not the more performant Pro model). Run your own benchmarks that are representative of your user base.
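To get a rough number on your own hardware, you can time the classifier call directly. Here's a minimal sketch (the warm-up call and timing code are our own additions, not part of the demo; the sample sentence is arbitrary):

import { pipeline } from '@xenova/transformers';

const classifier = await pipeline('text-classification', 'Xenova/toxic-bert');
// Warm up: the first call includes one-time setup costs
await classifier('warm-up');

const start = performance.now();
await classifier('You are a wonderful person.', { topk: null });
console.log(`Inference took ${Math.round(performance.now() - start)} ms`);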

Implementation

Here are the key steps in our implementation:

Set a toxicity threshold

Our toxicity classifier provides toxicity scores between 0 and 1. Within that range, we need to set a threshold to determine what constitutes a toxic comment. A commonly used threshold is 0.9. This lets you catch overtly toxic comments, while avoiding excessive sensitivity, which could lead to too many false positives (in other words, harmless comments categorized as toxic).

export const TOXICITY_THRESHOLD = 0.9
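In our demo, this constant lives in a shared config.js alongside the model name. The full file contents below are a sketch, inferred from the imports shown in the next section:

// config.js (sketch)
export const TOXICITY_THRESHOLD = 0.9;
export const MODEL_NAME = 'Xenova/toxic-bert';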

Import the components

We start by importing necessary components from the @xenova/transformers library. We also import constants and configuration values, including our toxicity threshold.

import { env, pipeline } from '@xenova/transformers';
// Model name: 'Xenova/toxic-bert'
// Our threshold is set to 0.9
import { TOXICITY_THRESHOLD, MODEL_NAME } from './config.js';

Load the model and communicate with the main thread

We load the toxicity detection model toxic-bert and use it to prepare our classifier. In its simplest form, that's const classifier = await pipeline('text-classification', MODEL_NAME);

Creating a pipeline, as in the example code, is the first step to running inference tasks.

The pipeline function takes two arguments: the task ('text-classification') and the model (Xenova/toxic-bert).

Key term: In Transformers.js, a pipeline is a high-level API that simplifies the process of running ML models. It handles tasks like model loading, tokenization, and post-processing.

Our demo code does a little more than just preparing the model, because we offload the computationally expensive model preparation steps to a web worker. This allows the main thread to remain responsive. Learn more about offloading expensive tasks to a web worker.
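On the main thread, that offloading starts with creating the worker. A minimal sketch, assuming the worker code lives in worker.js:

// main.js (sketch): create the inference worker as a module worker
const worker = new Worker(new URL('./worker.js', import.meta.url), {
  type: 'module',
});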

Our worker needs to communicate with the main thread, using messages to indicate the status of the model and the results of the toxicity assessment. Take a look at the message codes we've created that map to different statuses of the model preparation and inference lifecycle.
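Based on how these codes are used in the worker, they could be defined as follows. The keys match the demo's worker code; the string values are illustrative assumptions:

// messages.js (sketch): status codes exchanged between the worker and the main thread
export const MESSAGE_CODE = {
  PREPARING_MODEL: 'preparing-model',
  MODEL_READY: 'model-ready',
  MODEL_ERROR: 'model-error',
  GENERATING_RESPONSE: 'generating-response',
  RESPONSE_READY: 'response-ready',
  INFERENCE_ERROR: 'inference-error',
};

With those codes in place, the worker prepares the model and reports its status: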

let classifier = null;
(async function () {
  // Signal to the main thread that model preparation has started
  self.postMessage({ code: MESSAGE_CODE.PREPARING_MODEL, payload: null });
  try {
    // Prepare the model
    classifier = await pipeline('text-classification', MODEL_NAME);
    // Signal to the main thread that the model is ready
    self.postMessage({ code: MESSAGE_CODE.MODEL_READY, payload: null });
  } catch (error) {
    console.error('[Worker] Error preparing model:', error);
    self.postMessage({ code: MESSAGE_CODE.MODEL_ERROR, payload: null });
  }
})();

Classify the user input

In our classify function, we use our previously created classifier to analyze a user comment. We return the raw output of the toxicity classifier: labels and scores.

// Asynchronous function to classify user input
// output: [{ label: 'toxic', score: 0.9243140482902527 },
//   { label: 'insult', score: 0.96187334060668945 },
//   { label: 'obscene', score: 0.03452680632472038 }, ...etc]
async function classify(text) {
  if (!classifier) {
    throw new Error("Can't run inference, the model is not ready yet");
  }
  let results = await classifier(text, { topk: null });
  return results;
}

We call our classify function when the main thread asks the worker to do so. In our demo, we trigger the classifier as soon as the user has stopped typing (see TYPING_DELAY). When this happens, our main thread sends a message to the worker that contains the user input to classify.
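A minimal sketch of that main-thread side, reusing the worker created earlier and assuming a textarea with the id comment (the element id and the TYPING_DELAY value are assumptions):

// main.js (sketch): classify once the user has stopped typing
const TYPING_DELAY = 500; // milliseconds; the demo's actual value may differ
const commentField = document.querySelector('#comment');
let typingTimeout = null;

commentField.addEventListener('input', () => {
  clearTimeout(typingTimeout);
  typingTimeout = setTimeout(() => {
    // Send the current input to the worker for classification
    worker.postMessage(commentField.value);
  }, TYPING_DELAY);
});

On the worker side, here's how we receive that message and run the classifier: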

self.onmessage = async function (message) {
  // User input
  const textToClassify = message.data;
  if (!classifier) {
    throw new Error("Can't run inference, the model is not ready yet");
  }
  self.postMessage({ code: MESSAGE_CODE.GENERATING_RESPONSE, payload: null });

  // Inference: run the classifier
  let classificationResults = null;
  try {
    classificationResults = await classify(textToClassify);
  } catch (error) {
    console.error('[Worker] Error: ', error);
    self.postMessage({
      code: MESSAGE_CODE.INFERENCE_ERROR,
    });
    return;
  }
  const toxicityTypes = getToxicityTypes(classificationResults);
  const toxicityAssessment = {
    isToxic: toxicityTypes.length > 0,
    toxicityTypeList: toxicityTypes.length > 0 ? toxicityTypes.join(', ') : '',
  };
  console.info('[Worker] Toxicity assessed: ', toxicityAssessment);
  self.postMessage({
    code: MESSAGE_CODE.RESPONSE_READY,
    payload: toxicityAssessment,
  });
};

Process the output

We check whether any of the classifier's output scores exceed our threshold. If so, we record the corresponding label.

If at least one toxicity label exceeds the threshold, the comment is flagged as potentially toxic.

// input: [{ label: 'toxic', score: 0.9243140482902527 }, ...
// { label: 'insult', score: 0.96187334060668945 },
// { label: 'obscene', score: 0.03452680632472038 }, ...etc]
// output: ['toxic', 'insult']
function getToxicityTypes(results) {
  const toxicityAssessment = [];
  for (let element of results) {
    // If a label's score > our threshold, save the label
    if (element.score > TOXICITY_THRESHOLD) {
      toxicityAssessment.push(element.label);
    }
  }
  return toxicityAssessment;
}

self.onmessage = async function (message) {
  // User input
  const textToClassify = message.data;
  if (!classifier) {
    throw new Error("Can't run inference, the model is not ready yet");
  }
  self.postMessage({ code: MESSAGE_CODE.GENERATING_RESPONSE, payload: null });

  // Inference: run the classifier
  let classificationResults = null;
  try {
    classificationResults = await classify(textToClassify);
  } catch (error) {
    self.postMessage({
      code: MESSAGE_CODE.INFERENCE_ERROR,
    });
    return;
  }
  const toxicityTypes = getToxicityTypes(classificationResults);
  const toxicityAssessment = {
    // If any toxicity label is listed, the comment is flagged as
    // potentially toxic (isToxic true)
    isToxic: toxicityTypes.length > 0,
    toxicityTypeList: toxicityTypes.length > 0 ? toxicityTypes.join(', ') : '',
  };
  self.postMessage({
    code: MESSAGE_CODE.RESPONSE_READY,
    payload: toxicityAssessment,
  });
};

Display a hint

If isToxic is true, we display a hint to the user. In our demo, we don't use the more fine-grained toxicity type, but we've made it available to the main thread if needed (toxicityTypeList). You may find it useful for your use case.
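For example, the main thread can listen for the worker's messages and toggle a warning element when the assessment arrives. This is a sketch; the element id and markup are assumptions, and it reuses the worker and MESSAGE_CODE from the earlier sketches:

// main.js (sketch): show or hide the toxicity hint
const toxicityHint = document.querySelector('#toxicity-hint');

worker.onmessage = (event) => {
  const { code, payload } = event.data;
  if (code === MESSAGE_CODE.RESPONSE_READY) {
    // Show the hint only when the comment is classified as toxic
    toxicityHint.hidden = !payload.isToxic;
  }
};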

User experience

In our demo, we've made the following choices:

  • Always allow posting. Our client-side toxicity hint does not prevent the user from posting. In our demo, the user can post a comment even if the model hasn't loaded (and thus isn't offering a toxicity assessment), and even if the comment is detected as toxic. As recommended, you should have a second system to detect toxic comments. If it makes sense for your application, consider informing the user that their comment went through on the client, but then got flagged on the server or during human inspection.
  • Mind false negatives. When a comment isn't classified as toxic, our demo doesn't offer feedback (for example, "Nice comment!"). Apart from being noisy, offering positive feedback may send the wrong signal, because our classifier occasionally but inevitably misses some toxic comments.
Demo of comment posting.
The Post button is always enabled: in our demo, the user can still decide to post their comment, even if it's classified as toxic. Even if a comment isn't classified as toxic, we don't display positive feedback.

Enhancements and alternatives

Limitations and future improvements

  • Languages: The model we're using primarily supports English. For multilingual support, you need fine-tuning. Multiple toxicity models listed on Hugging Face do support non-English languages (Russian, Dutch), though they're not compatible with Transformers.js at the moment.
  • Nuance: While toxic-bert effectively detects overt toxicity, it may struggle with more subtle or context-dependent cases (irony, sarcasm). Toxicity can be highly subjective and subtle. For example, you may want certain terms or even emoji to be classified as toxic. Fine-tuning can help improve accuracy in these areas.

We have an upcoming article on fine-tuning a toxicity model.

Conclusion

Client-side toxicity detection is a powerful tool to enhance online communities.

By leveraging AI models like toxic-bert that run in the browser with Transformers.js, you can implement real-time feedback mechanisms that discourage toxic behavior and reduce the toxicity classification load on your servers.

This client-side approach already works across browsers. However, keep in mind the limitations, especially in terms of model serving costs and download size. Apply performance best practices for client-side AI and cache the model.
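For instance, Transformers.js can cache downloaded model files in the browser and, if you prefer, serve them from your own origin. A sketch of the relevant settings, which may vary by library version:

import { env } from '@xenova/transformers';

// Cache downloaded model files with the browser's Cache API (enabled by default)
env.useBrowserCache = true;
// Optionally host the model files yourself instead of fetching them from the Hugging Face Hub
// env.allowRemoteModels = false;
// env.localModelPath = '/models/';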

For comprehensive toxicity detection, combine client-side and server-side approaches.