
Image by the author
Data cleaning and preprocessing are often among the most challenging, yet critical, phases in building data-driven AI and machine learning solutions, and text data is no exception.
This tutorial introduces the challenge of preparing text data for natural language processing tasks, such as those that language models (LMs) can solve. By loading your text data into pandas DataFrames, the steps below will help you prepare your text to be processed by natural language processing models and algorithms.
Loading data into a Pandas DataFrame
To simplify this tutorial and focus on understanding the necessary text cleaning and preprocessing steps, let's consider a small sample of four single-attribute text data instances that will be loaded into a pandas DataFrame. From this point on, we will apply all the preprocessing steps to this DataFrame object.
import pandas as pd
data = {'text': ["I love cooking!", "Baking is fun", None, "Japanese cuisine is great!"]}
df = pd.DataFrame(data)
print(df)
Output:
text
0 I love cooking!
1 Baking is fun
2 None
3 Japanese cuisine is great!
Handling missing values
Did you notice the value "None" in one of the example data instances? This is known as a missing value. Missing values arise for a variety of reasons during data collection, often accidentally. In short: you need to handle them. The simplest approach is to detect and remove the instances that contain missing values, as done in the following code:
df.dropna(subset=['text'], inplace=True)
print(df)
Output:
text
0 I love cooking!
1 Baking is fun
3 Japanese cuisine is great!
Normalize the text to make it consistent
Normalizing text involves standardizing or unifying elements that may appear in different formats in different instances, for example, date formats, full names, or case sensitivity. The simplest method to normalize our text is to convert it all to lowercase, as follows.
df['text'] = df['text'].str.lower()
print(df)
Output:
text
0 i love cooking!
1 baking is fun
3 japanese cuisine is great!
Remove noise
Noise is unnecessary or unexpectedly collected data that can hamper downstream modeling or prediction processes if not handled properly. In our example, we will assume that punctuation marks like "!" are not needed for the downstream NLP task; therefore, we will remove this noise by detecting punctuation marks in the text with a regular expression. The Python 're' module is used to perform text operations based on regular expression matching.
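A minimal sketch of this removal step, using re.sub() to strip any character that is not a word character or whitespace, consistent with the output shown below:

import re

# Strip punctuation: keep only word characters and whitespace
df['text'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
print(df)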
Output:
text
0 i love cooking
1 baking is fun
3 japanese cuisine is great
Tokenize the text
Tokenization is arguably the most important text preprocessing step (along with encoding the text into a numerical representation) before using natural language processing and language models. It consists of splitting each text input into a vector of chunks or tokens. In the simplest scenario, tokens are associated with words most of the time, but in some cases, such as compound words, one word can give rise to multiple tokens. Certain punctuation marks (if not previously removed as noise) are also sometimes identified as independent tokens.
This code splits each of our three text inputs into individual words (tokens) and adds them as a new column in our DataFrame, then displays the updated data structure with its two columns. The simplified tokenization approach applied is known as simple whitespace tokenization: it simply uses whitespace as a criterion to detect and separate tokens.
df['tokens'] = df['text'].str.split()
print(df)
Output:
text tokens
0 i love cooking [i, love, cooking]
1 baking is fun [baking, is, fun]
3 japanese cuisine is great [japanese, cuisine, is, great]
Remove stop words
Once the text is tokenized, we filter out unnecessary tokens. This is often the case for stop words, such as articles “a/an, the” or conjunctions, which add no real semantics to the text and should be removed for efficient further processing. This process is language-dependent: the code below uses the NLTK library to download a dictionary of English stop words and filter them out of the token vectors.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not English stop words
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df['tokens'])
Output:
0 [love, cooking]
1 [baking, fun]
3 [japanese, cuisine, great]
Stemming and lemmatization
Almost done! Stemming and lemmatization are additional text preprocessing steps that may be used at times depending on the specific task at hand. Stemming reduces each token (word) to its base or root form, while lemmatization reduces it to its lemma or dictionary base form depending on the context, e.g. "best" -> "good". For simplicity, we will only apply stemming in this example, using the PorterStemmer implemented in the NLTK library; the WordNet dataset of word-root associations downloaded in the code is only required if you opt for lemmatization instead (a brief sketch of which follows the output below). The resulting stemmed words are saved in a new column in the DataFrame.
from nltk.stem import PorterStemmer

nltk.download('wordnet')  # WordNet is only needed for the lemmatization variant shown below
stemmer = PorterStemmer()

df['stemmed'] = df['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])
print(df[['tokens', 'stemmed']])
Output:
tokens stemmed
0 [love, cooking] [love, cook]
1 [baking, fun] [bake, fun]
3 [japanese, cuisine, great] [japanes, cuisin, great]
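For comparison, here is a minimal sketch of the lemmatization alternative mentioned above, using NLTK's WordNetLemmatizer (which relies on the WordNet dataset downloaded earlier). The pos='v' argument treats each token as a verb purely for illustration; without it, WordNet defaults to nouns and leaves forms like "cooking" unchanged.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Lemmatize each token, assuming a verb part of speech for illustration
df['lemmatized'] = df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word, pos='v') for word in x])
print(df[['tokens', 'lemmatized']])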
Convert your text into numerical representations
Last but not least, computer algorithms, including AI and ML models, do not understand human language, only numbers, so we need to convert our token vectors into numerical representations, commonly known as embedding vectors or simply embeddings. The following example joins the tokens in the "tokens" column back into whitespace-separated strings and uses a TF-IDF vectorization approach (one of the most popular approaches in the good old days of classical NLP) to transform the text into numerical representations.
from sklearn.feature_extraction.text import TfidfVectorizer

# Rebuild each document as a single string of tokens, then vectorize it with TF-IDF
df['text'] = df['tokens'].apply(lambda x: ' '.join(x))
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])
print(X.toarray())
Output:
[[0.         0.70710678 0.         0.         0.         0.         0.70710678]
 [0.70710678 0.         0.         0.70710678 0.         0.         0.        ]
 [0.         0.         0.57735027 0.         0.57735027 0.57735027 0.        ]]
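Each column of this matrix corresponds to one token in the vocabulary learned by the vectorizer. If you want to see which token each column represents, recent scikit-learn versions expose the mapping as follows:

# Map each column of the TF-IDF matrix back to its vocabulary token
print(vectorizer.get_feature_names_out())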
And that’s it! As unintelligible as it may seem to us, this numerical representation of our preprocessed text is what intelligent systems, including natural language processing models, understand and can handle exceptionally well for complex linguistic tasks like classifying sentiments in a text, summarizing it, or even translating it into another language.
The next step would be to feed these numerical representations into our NLP model to allow it to do its magic.
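As an illustration of that next step, here is a minimal sketch that fits a scikit-learn LogisticRegression classifier on the TF-IDF matrix X built above. The labels are hypothetical and serve only to show how the pieces connect:

from sklearn.linear_model import LogisticRegression

# Hypothetical labels for the three remaining examples (1 = mentions a cuisine, 0 = otherwise)
y = [0, 0, 1]

model = LogisticRegression()
model.fit(X, y)
print(model.predict(X))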
Ivan Palomares Carrascosa is a leader, writer, speaker, and advisor on AI, machine learning, deep learning, and LLMs. He trains and guides others in leveraging AI in the real world.