How Tokopedia reduced operational costs by improving their seller web app using Machine Learning

Dendi Sunardi
Geoffrey Prasetyo
Swetha Gopalakrishnan

Tokopedia is an Indonesian technology company with one of the biggest ecommerce marketplaces, hosting over 40 digital products and more than 14 million registered sellers on its platform.

Mitra Tokopedia, part of Tokopedia's business verticals, is a web application which helps small business owners sell digital products such as credit and game vouchers, data packages, electricity tokens, national healthcare bills, and more. The website is one of the primary channels for Mitra Tokopedia sellers in more than 700 cities, making it critical to ensure a smooth user experience.

A key step in the onboarding process requires these sellers to verify their identity. The seller has to upload their National ID as well as a selfie with the ID in order to complete the seller verification. This is referred to as the Know-Your-Customer (KYC) process.

By adding Machine Learning capabilities to this critical KYC process within their web app, Mitra Tokopedia was able to achieve a better user experience, with a more than 20% reduction in verification failures. They also made operational cost savings by reducing manual approvals by nearly 70%.

Challenge

Most of the KYC data was being rejected, creating thousands of tickets per week for the operations team to verify manually. This not only caused high operational costs but also resulted in a bad user experience for sellers, whose verification was delayed. The biggest reason for rejection was that sellers did not upload selfies with their ID cards correctly. Mitra Tokopedia was keen to solve this problem scalably using modern web capabilities.

Solution

The team at Tokopedia decided to use ML with TensorFlow.js to solve this problem at the very first step of the KYC process: when the user uploads the images. They used MediaPipe and TensorFlow's Face Detection library to detect six keypoints on the seller's face when the seller uploads the ID card and selfie images. The model's output is then used to check against their acceptance criteria. Upon successful verification, the information is sent to the backend. If verification fails, the seller is provided with an error message and an option to retry. A hybrid approach was used where the model performs inference either on-device or server-side, depending on the phone's specifications: lower-end devices perform the inference on the server.

Using an ML model early in the KYC process allows them to:

  • Reduce the rejection rate in the KYC process.
  • Warn users of possible rejection of their images, based on the quality assessed by the model.

Why choose ML as opposed to other solutions?

ML can automate repetitive tasks that are otherwise time-consuming or impractical to do manually. In Tokopedia's case, optimizing the current non-ML solution couldn't yield significant results whereas an ML solution could significantly reduce the load on the operations team who had to manually process thousands of approvals weekly. With an ML solution, the image checks can be done near instantly, providing a better user experience and improving operational efficiency. Read more about problem framing to determine whether ML is a suitable solution for your problem.

Considerations when choosing a model

The following factors were considered when choosing the ML model.

Cost

They evaluated the overall cost of using the model. Since TensorFlow.js is an open source package that is well maintained by Google, they saved on licensing and maintenance costs. An additional consideration was the cost of inference. Being able to run inference on the client side saves a lot of money compared to processing it on the server side with expensive GPUs, especially if the data turns out to be invalid and unusable.

Scalability

They considered the scalability of the model and technology. Is it able to handle growth in data and model complexity as the project evolves? Can it be extended to cater to other projects or use cases? On-device processing helps because the model can be hosted on a CDN and delivered to the client side, which is very scalable.

Performance

They considered the size of the library (in KB) and the latency of the runtime process. The majority of Mitra Tokopedia's user base has mid- to low-end devices with moderate internet speed and connectivity. Thus, performance in terms of download and runtime (that is, how fast the model can produce an output) is a top priority to cater to their specific needs and ensure a great user experience.

Other considerations

Regulatory compliance: They had to ensure that the library chosen complied with relevant data protection and privacy regulations.

Skillset: They evaluated the expertise and skill set of their team. Some ML frameworks and libraries may require specific programming languages or expertise in a particular area. By considering these factors, they made an informed decision when choosing the right model for their machine learning project.

Technology chosen

TensorFlow.js met their needs, after considering these factors. It is able to run fully on-device using its WebGL backend to use the GPU of the device. Running a model on-device enables faster feedback to the user due to reduced server latency and reduces server compute cost. Read more about on-device ML in the article Advantages and limitations of on-device ML.
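For illustration, here's a minimal sketch of how a TensorFlow.js app can explicitly opt into the WebGL backend. This snippet is an assumption for demonstration purposes, not Tokopedia's actual setup:

import * as tf from '@tensorflow/tfjs';

// Ask TensorFlow.js to use the WebGL backend so inference runs on the
// device's GPU, then wait for the backend to finish initializing.
await tf.setBackend('webgl');
await tf.ready();

console.log(`Active TensorFlow.js backend: ${tf.getBackend()}`);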

"TensorFlow.js is an open source machine learning library from Google aimed at JavaScript developers that's able to run client side in the browser. It's the most mature option for Web AI with comprehensive WebGL, WebAssembly and WebGPU backend operator support that can be used within the browser with fast performance."How Adobe used Web ML with TensorFlow.js to enhance Photoshop for web

Technical implementation

Mitra Tokopedia used MediaPipe and TensorFlow's Face Detection library, a package that provides models for running real-time face detection. Specifically, they used the MediaPipeFaceDetector-TFJS model provided in this library, which implements the tfjs runtime.

Before diving into the implementation, here's a brief recap of what MediaPipe is. MediaPipe lets you build and deploy on-device ML solutions across mobile (Android, iOS), web, desktop, edge devices, and IoT.

There are 14 different solutions offered by MediaPipe at the time of writing this post. You can use either a mediapipe or tfjs runtime. The tfjs runtime is built with JavaScript and provides a JavaScript package that the web application downloads. This differs from the mediapipe runtime, which is built with C++ and compiled to a WebAssembly module. The key differences are performance, debuggability, and bundling. The JavaScript package can be bundled with classic bundlers like webpack. In contrast, the Wasm module is a bigger, separate binary resource (mitigated by it not being a load-time dependency) and requires a different Wasm debugging workflow. However, it executes faster, which can help meet technical and performance requirements.
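To make the difference concrete, here's roughly how each runtime is selected when creating a detector with this package. The tfjs configuration matches the implementation shown below; the mediapipe variant and its CDN path follow the package's documentation and are shown here only for comparison:

// tfjs runtime: plain JavaScript that bundlers like webpack can process.
const tfjsConfig = {runtime: 'tfjs'};

// mediapipe runtime: C++ compiled to Wasm, fetched as a separate binary.
// solutionPath tells the detector where to load the Wasm assets from.
const mediapipeConfig = {
  runtime: 'mediapipe',
  solutionPath: 'https://cdn.jsdelivr.net/npm/@mediapipe/face_detection'
};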

A general illustration of how MediaPipe and TensorFlow models work for the different runtimes, using FaceDetection as an example

Coming back to Tokopedia's implementation, the first step is to initialize the model as follows. When the user uploads a photo, an HTMLImageElement is passed as input to the detector.

// Create the detector.
const model = faceDetection.SupportedModels.MediaPipeFaceDetector;
const detectorConfig = {
  runtime: 'tfjs'
};

const detector = await faceDetection.createDetector(model, detectorConfig);

// Run inference to start detecting faces.
const estimationConfig = {flipHorizontal: false};
const faces = await detector.estimateFaces(image, estimationConfig);

The result is a list of detected faces, one entry per face in the image. If the model cannot detect any faces, the list is empty. Each entry contains a bounding box for the detected face, as well as an array of six keypoints. These include features such as the eyes, nose, and mouth. Each keypoint contains x and y coordinates, as well as a name.

[
  {
    box: {
      xMin: 304.6476503248806,
      xMax: 502.5079975897382,
      yMin: 102.16298762367356,
      yMax: 349.035215984403,
      width: 197.86034726485758,
      height: 246.87222836072945
    },
    keypoints: [
      {x: 446.544237446397, y: 256.8054528661723, name: "rightEye"},
      {x: 406.53152857172876, y: 255.8, name: "leftEye"},
      ...
    ],
  }
]

The box represents the bounding box of the face in image pixel space, with xMin and xMax denoting the x-bounds, yMin and yMax the y-bounds, and width and height the dimensions of the bounding box. For the keypoints, x and y represent the actual keypoint position in image pixel space, and name labels the keypoint: 'rightEye', 'leftEye', 'noseTip', 'mouthCenter', 'rightEarTragion', or 'leftEarTragion'. As mentioned at the beginning of this post, the seller has to upload their National ID and a selfie with the ID to complete seller verification. The model's output is then checked against the acceptance criteria: all six keypoints must be matched for the upload to be deemed a valid ID card and selfie image.

Upon successful verification, the relevant seller information is passed to the backend. If the verification fails, the seller is given a failure message and an option to retry; no information is sent to the backend.
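As a rough sketch, the acceptance check could look like the following. The helper names (isValidKycImage, submitKycData, showRetryMessage) and input variables (idCardFaces, selfieFaces) are hypothetical, as are the exact rules, such as requiring exactly one face per image:

// All six keypoints the model reports for a detected face.
const REQUIRED_KEYPOINTS = [
  'rightEye', 'leftEye', 'noseTip',
  'mouthCenter', 'rightEarTragion', 'leftEarTragion'
];

// Hypothetical check: exactly one face, with all six keypoints present.
function isValidKycImage(faces) {
  if (faces.length !== 1) return false;
  const names = faces[0].keypoints.map((keypoint) => keypoint.name);
  return REQUIRED_KEYPOINTS.every((name) => names.includes(name));
}

if (isValidKycImage(idCardFaces) && isValidKycImage(selfieFaces)) {
  await submitKycData();   // hypothetical call that sends data to the backend
} else {
  showRetryMessage();      // hypothetical UI helper for the retry flow
}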

How the Mitra KYC page, TensorFlow.js model, and server interact with each other

Performance considerations for low end devices

This package is only 24.8 KB (minified and gzipped), so it does not significantly impact download time. However, on very low-end devices, runtime processing can take a long time. Additional logic was therefore added to check the device's RAM, CPU, and network connection before passing the two images to the face detection model.

If the device has more than 4 GB of RAM, a network connection greater than 4G, and a CPU with more than 6 cores, the images are passed to the on-device model for face verification. If these requirements are not met, the on-device model is skipped and the images are sent directly to the server for verification, a hybrid approach that caters to these older devices. Over time, more devices will be able to offload compute from the server as hardware continues to evolve.
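A minimal sketch of such a capability check is shown below. The thresholds come from the description above, but mapping them onto navigator.deviceMemory, navigator.hardwareConcurrency, and navigator.connection.effectiveType is an assumption; note that effectiveType tops out at '4g' and that these APIs are not available in every browser, so fallbacks are needed:

// Hypothetical capability check before choosing on-device vs. server inference.
function canRunOnDevice() {
  const memoryGb = navigator.deviceMemory ?? 0;       // Device Memory API (Chromium)
  const cores = navigator.hardwareConcurrency ?? 0;   // logical CPU cores
  const network = navigator.connection?.effectiveType ?? ''; // Network Information API
  return memoryGb > 4 && cores > 6 && network === '4g';
}

if (canRunOnDevice()) {
  // Run the TensorFlow.js face detection model in the browser.
} else {
  // Skip on-device inference and verify the images on the server.
}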

Impact

Thanks to the ML integration, Tokopedia successfully solved the high rejection rate and saw the following results:

  • Rejection rate decreased by more than 20%.
  • Number of manual approvals decreased by almost 70%.

This not only created a smoother user experience for sellers, but also reduced the operational cost for the Tokopedia team.

Conclusion

Overall, the results of this case study show that, for the right use cases, on-device ML solutions on the web can be valuable in improving the user experience and effectiveness of features, as well as creating cost savings and other business benefits.

Try out the MediaPipe Face Detection feature yourself using the MediaPipe Studio and the code sample for MediaPipe Face Detector for web.

If you're interested in extending the capabilities of your own web app with on-device ML, check out the following resources: