As a Data Scientist, I had never had the opportunity to properly explore the latest progress in Natural Language Processing. With summer here and the boom of Large Language Models since the beginning of the year, I decided it was time to dive deep into the field through some mini-projects. After all, there is no better way to learn than by practicing.
As my journey started, I realized it was hard to find content that takes the reader by the hand and builds, one step at a time, a deep understanding of modern NLP models through concrete projects. That is why I decided to start this new series of articles.
Building a Comment Toxicity Ranker Using HuggingFace’s Transformer Models
In this first article, we are going to take a deep dive into building a comment toxicity ranker. This project is inspired by the “Jigsaw Rate Severity of Toxic Comments” competition, which took place on Kaggle last year.
The objective of the competition was to build a model capable of determining which of two comments given as input is the more toxic.
To do so, the model assigns a score to every input comment, which determines its relative toxicity.
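This pairwise setup can be sketched in a few lines of PyTorch with `MarginRankingLoss`: the loss only cares that the more toxic comment of each pair receives the higher score. The score values below are hypothetical stand-ins for what a trained ranker would produce.

```python
import torch
import torch.nn as nn

# Hypothetical scores a ranker might assign to two batches of comments;
# in each pair, the first comment is labeled as the more toxic one.
score_more_toxic = torch.tensor([2.1, 0.8, 1.5])
score_less_toxic = torch.tensor([0.3, 0.5, 1.9])

# target = 1 means the first input should be ranked higher than the second.
loss_fn = nn.MarginRankingLoss(margin=0.5)
target = torch.ones_like(score_more_toxic)
loss = loss_fn(score_more_toxic, score_less_toxic, target)

# At inference time, ranking a pair is just a score comparison.
predictions = score_more_toxic > score_less_toxic
```

Note that only the difference between the two scores matters, not their absolute values, which is exactly what the competition asks for.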
What this article will cover
In this article, we are going to train our first NLP classifier using PyTorch and Hugging Face transformers. I will not go into the details of how transformers work, but rather focus on practical details and implementation, and introduce some concepts that will be useful for the next articles of the series.
In particular, we will see:
- How to download a model from Hugging Face Hub
- How to customize and use an Encoder
- How to build and train a PyTorch ranker on top of one of the Hugging Face models
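As a taste of the first point, pulling a pretrained encoder and its tokenizer from the Hugging Face Hub takes only a few lines; the `distilbert-base-uncased` checkpoint here is just an illustrative choice, not necessarily the one we will settle on.

```python
from transformers import AutoModel, AutoTokenizer

# Download (or load from cache) a pretrained checkpoint from the Hub.
model_name = "distilbert-base-uncased"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Tokenize a comment and run it through the encoder.
inputs = tokenizer("This is a test comment.", return_tensors="pt")
outputs = model(**inputs)
hidden = outputs.last_hidden_state  # shape: (batch, seq_len, hidden_size)
```

The `Auto*` classes resolve the right architecture from the checkpoint's config, so the same two lines work for most encoder models on the Hub.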
This article is directly addressed to data scientists who would like to step up their NLP game from a practical point of view. I will not do much…