
Image by the author
Data cleaning and preprocessing are often among the most challenging, yet critical, phases in building data-driven AI and machine learning solutions, and text data is no exception.
This tutorial introduces the challenge of preparing text data for natural language processing tasks, such as those that language models (LMs) can solve. By loading your text data into pandas DataFrames, the steps below will help you prepare your text to be processed by natural language processing models and algorithms.
Loading data into a Pandas DataFrame
To simplify this tutorial and focus on understanding the necessary text cleaning and preprocessing steps, let's consider a small sample of four single-attribute text data instances that will be loaded into a pandas DataFrame. From this point on, we will apply all the preprocessing steps to this DataFrame object.
import pandas as pd
data = {'text': ["I love cooking!", "Baking is fun", None, "Japanese cuisine is great!"]}
df = pd.DataFrame(data)
print(df)
Output:
text
0 I love cooking!
1 Baking is fun
2 None
3 Japanese cuisine is great!
Handling missing values
Did you notice the value "None" in one of the example data instances? This is known as a missing value. Missing values arise for a variety of reasons during data collection, often accidentally. In short: you need to handle them. The simplest approach is to detect and remove the instances that contain missing values, as done in the following code:
df.dropna(subset=['text'], inplace=True)
print(df)
Output:
text
0 I love cooking!
1 Baking is fun
3 Japanese cuisine is great!
Normalize the text to make it consistent
Normalizing text involves standardizing or unifying elements that may appear in different formats in different instances, for example, date formats, full names, or case sensitivity. The simplest method to normalize our text is to convert it all to lowercase, as follows.
df['text'] = df['text'].str.lower()
print(df)
Output:
text
0 i love cooking!
1 baking is fun
3 japanese cuisine is great!
Remove noise
Noise is unnecessary or unexpectedly collected data that can hamper downstream modeling or prediction processes if not handled properly. In our example, we will assume that punctuation marks like "!" are not needed for the downstream NLP task; therefore, we will remove this noise by detecting punctuation marks in the text with a regular expression. The Python 're' module is used to perform text operations based on regular expression matching.
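A minimal sketch of this removal step, using re.sub() to strip any character that is not a word character or whitespace, consistent with the output shown below:

import re

# Strip punctuation: keep only word characters and whitespace
df['text'] = df['text'].apply(lambda x: re.sub(r'[^\w\s]', '', x))
print(df)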
Output:
text
0 i love cooking
1 baking is fun
3 japanese cuisine is great
Tokenize the text
Tokenization is arguably the most important text preprocessing step (along with encoding the text into a numerical representation) before using natural language processing and language models. It consists of splitting each text input into a vector of chunks or tokens. In the simplest scenario, tokens are associated with words most of the time, but in some cases, such as compound words, one word can give rise to multiple tokens. Certain punctuation marks (if not previously removed as noise) are also sometimes identified as independent tokens.
This code splits each of our three text inputs into individual words (tokens) and adds them as a new column in our DataFrame, then displays the updated data structure with its two columns. The simplified tokenization approach applied is known as simple whitespace tokenization: it simply uses whitespace as a criterion to detect and separate tokens.
df['tokens'] = df['text'].str.split()
print(df)
Output:
text tokens
0 i love cooking [i, love, cooking]
1 baking is fun [baking, is, fun]
3 japanese cuisine is great [japanese, cuisine, is, great]
Remove stop words
Once the text is tokenized, we filter out unnecessary tokens. This is often the case for stop words, such as articles “a/an, the” or conjunctions, which add no real semantics to the text and should be removed for efficient further processing. This process is language-dependent: the code below uses the NLTK library to download a dictionary of English stop words and filter them out of the token vectors.
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Keep only the tokens that are not English stop words
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df['tokens'])
Output:
0 [love, cooking]
1 [baking, fun]
3 [japanese, cuisine, great]
Stemming and lemmatization
Almost done! Stemming and lemmatization are additional text preprocessing steps that may be used at times depending on the specific task at hand. Stemming reduces each token (word) to its base or root form, while lemmatization reduces it to its lemma or dictionary base form depending on the context, e.g. "best" -> "good". For simplicity, we will only apply stemming in this example, using the PorterStemmer implemented in the NLTK library; the WordNet dataset of word-root associations downloaded in the code is only required if you opt for lemmatization instead (a brief sketch of which follows the output below). The resulting stemmed words are saved in a new column in the DataFrame.
from nltk.stem import PorterStemmer

nltk.download('wordnet')  # WordNet is only needed for the lemmatization variant shown below
stemmer = PorterStemmer()

df['stemmed'] = df['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])
print(df[['tokens', 'stemmed']])
Output:
tokens stemmed
0 [love, cooking] [love, cook]
1 [baking, fun] [bake, fun]
3 [japanese, cuisine, great] [japanes, cuisin, great]
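For comparison, here is a minimal sketch of the lemmatization alternative mentioned above, using NLTK's WordNetLemmatizer (which relies on the WordNet dataset downloaded earlier). The pos='v' argument treats each token as a verb purely for illustration; without it, WordNet defaults to nouns and leaves forms like "cooking" unchanged.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# Lemmatize each token, assuming a verb part of speech for illustration
df['lemmatized'] = df['tokens'].apply(lambda x: [lemmatizer.lemmatize(word, pos='v') for word in x])
print(df[['tokens', 'lemmatized']])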
Convert your text into numerical representations
Last but not least, computer algorithms, including AI and ML models, do not understand human language, only numbers, so we need to convert our token vectors into numerical representations, commonly known as embedding vectors or simply embeddings. The following example joins the tokens in the "tokens" column back into whitespace-separated strings and uses a TF-IDF vectorization approach (one of the most popular approaches in the good old days of classical NLP) to transform the text into numerical representations.
from sklearn.feature_extraction.text import TfidfVectorizer

# Rebuild each document as a single string of tokens, then vectorize it with TF-IDF
df['text'] = df['tokens'].apply(lambda x: ' '.join(x))
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['text'])
print(X.toarray())
Output:
[[0.         0.70710678 0.         0.         0.         0.         0.70710678]
 [0.70710678 0.         0.         0.70710678 0.         0.         0.        ]
 [0.         0.         0.57735027 0.         0.57735027 0.57735027 0.        ]]
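Each column of this matrix corresponds to one token in the vocabulary learned by the vectorizer. If you want to see which token each column represents, recent scikit-learn versions expose the mapping as follows:

# Map each column of the TF-IDF matrix back to its vocabulary token
print(vectorizer.get_feature_names_out())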
And that’s it! As unintelligible as it may seem to us, this numerical representation of our preprocessed text is what intelligent systems, including natural language processing models, understand and can handle exceptionally well for complex linguistic tasks like classifying sentiments in a text, summarizing it, or even translating it into another language.
The next step would be to feed these numerical representations into our NLP model to allow it to do its magic.
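As an illustration of that next step, here is a minimal sketch that fits a scikit-learn LogisticRegression classifier on the TF-IDF matrix X built above. The labels are hypothetical and serve only to show how the pieces connect:

from sklearn.linear_model import LogisticRegression

# Hypothetical labels for the three remaining examples (1 = mentions a cuisine, 0 = otherwise)
y = [0, 0, 1]

model = LogisticRegression()
model.fit(X, y)
print(model.predict(X))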
Ivan Palomares Carrascosa is a leader, writer, speaker, and advisor on AI, machine learning, deep learning, and LLMs. He trains and guides others in leveraging AI in the real world.