5 Tips for Using Regular Expressions in Data Cleaning

Author's image | Created on Canva

If you are a Linux or Mac user, you have probably used grep on the command line to search files by matching patterns. Regular expressions (regex) let you search, match, and manipulate text based on patterns, which makes them powerful tools for text processing and data cleaning.

For regular expression matching operations in Python, you can use the built-in re module. In this tutorial, we'll look at how to use regular expressions to clean data: removing unwanted characters, extracting specific patterns, finding and replacing text, and more.

1. Remove unwanted characters

Before continuing, let's import the built-in re module:

import re

String fields (almost) always require extensive cleaning before they can be parsed. Unwanted characters (often caused by variable formats) can make data analysis difficult. Regular expressions can help you remove them efficiently.

You can use the sub() function from the re module to replace or remove all occurrences of a pattern or special character. Suppose you have strings with phone numbers that include dashes and parentheses. You can remove them as shown below:

text = "Contact info: (123)-456-7890 and 987-654-3210."
cleaned_text = re.sub(r'[()-]', '', text)
print(cleaned_text)

Here, re.sub(pattern, replacement, string) replaces all occurrences of the pattern in the string with the replacement. We use the pattern r'[()-]' to match any occurrence of (, ), or -, giving us the result:

Output >>> Contact info: 1234567890 and 9876543210.
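When the goal is to keep only the digits, an alternative is to remove everything that is not a digit instead of listing each unwanted character. The following is a minimal sketch under that assumption; the raw_numbers list is hypothetical sample data, not part of the tutorial's dataset.

```python
import re

# Hypothetical sample values, used only for illustration
raw_numbers = ["(123)-456-7890", "987-654-3210", "555 010 9999"]

# \D matches any character that is not a digit, so only digits survive
digits_only = [re.sub(r'\D', '', number) for number in raw_numbers]
print(digits_only)
```

This approach also strips spaces and any other separators, which may or may not be what you want for a given field.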

2. Extract specific patterns

Extracting email addresses, URLs, or phone numbers from text fields is a common task, since these are often the relevant pieces of information. To extract all occurrences of a pattern of interest, you can use the findall() function.

You can extract email addresses from text as follows:

text = "Please reach out to us at [email protected] or [email protected]."
emails = re.findall(r'\b[\w.-]+?@\w+?\.\w+?\b', text)
print(emails)

The re.findall(pattern, string) function searches for and returns (as a list) all occurrences of the pattern in the string. We use the pattern r'\b[\w.-]+?@\w+?\.\w+?\b' to match all email addresses:

Output >>> ['[email protected]', '[email protected]']
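One detail worth knowing about re.findall is that its return value changes when the pattern contains capturing groups: it returns the groups instead of the full match. A small sketch, using a hypothetical sentence with phone numbers:

```python
import re

# Hypothetical text, used only for illustration
text = "Call 123-456-7890 or 987-654-3210 before Friday."

# No capturing groups: findall returns each full match as a string
numbers = re.findall(r'\d{3}-\d{3}-\d{4}', text)
print(numbers)

# With capturing groups: findall returns one tuple of groups per match
parts = re.findall(r'(\d{3})-(\d{3})-(\d{4})', text)
print(parts)
```

If you need grouping without changing what findall returns, a non-capturing group, written (?:...), is one option.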

3. Replace patterns

We have already used the sub() function to remove unwanted special characters. But you can also replace one pattern with another to make a field more consistent and easier to analyze.

Below is an example of removing unwanted spaces:

text = "Using     regular     expressions."
cleaned_text = re.sub(r'\s+', ' ', text)
print(cleaned_text)

The r'\s+' pattern matches one or more whitespace characters. The replacement string is a single space, which gives us the result:

Output >>> Using regular expressions.
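Replacement strings can also reuse parts of the match through backreferences such as \1 and \2. The sketch below assumes a hypothetical list of dates written with mixed separators and rewrites them with hyphens; the data and variable names are illustrative only.

```python
import re

# Hypothetical date strings with inconsistent separators
dates = ["2024/01/15", "2024.02.03", "2024-03-27"]

# Capture year, month, and day, then rebuild the string with hyphens
date_pattern = r'(\d{4})[/.-](\d{2})[/.-](\d{2})'
normalized = [re.sub(date_pattern, r'\1-\2-\3', date) for date in dates]
print(normalized)
```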

4. Validate data formats

Data format validation ensures data consistency and accuracy. Regex can validate formats such as emails, phone numbers, and dates.

Here is how you can use the match() function to validate email addresses:

email = "[email protected]"
if re.match(r'^\b[\w.-]+?@\w+?\.\w+?\b$', email):
    print("Valid email")
else:
    print("Invalid email")

In this example, the email string is valid:

Output >>> Valid email
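Note that re.match only anchors the pattern at the start of the string, which is why the pattern above ends with $. An alternative is re.fullmatch, which requires the whole string to match. The sketch below compares the two on a hypothetical date format check; it tests the shape of the value only, not whether the date is a real calendar date.

```python
import re

# Compile once so the pattern can be reused across many values
date_pattern = re.compile(r'\d{4}-\d{2}-\d{2}')

# Hypothetical values, used only for illustration
values = ["2024-01-15", "2024-01-15 (estimated)", "15/01/2024"]

results = []
for value in values:
    starts_with_date = bool(date_pattern.match(value))      # start of string only
    is_exactly_date = bool(date_pattern.fullmatch(value))   # entire string
    results.append((value, starts_with_date, is_exactly_date))

for row in results:
    print(row)
```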

5. Split strings by patterns

Sometimes you may want to split a string into multiple strings based on patterns or the occurrence of specific separators. You can use the split() function to do that.

Let's split the text string into sentences:

text = "This is sentence one. And this is sentence two! Is this sentence three?"
sentences = re.split(r'[.!?]', text)
print(sentences)

Here, re.split(pattern, string) splits the string at every occurrence of the pattern. We use the pattern r'[.!?]' to match periods, exclamation marks, or question marks:

Output >>> ['This is sentence one', ' And this is sentence two', ' Is this sentence three', '']
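The result above keeps the leading spaces and ends with an empty string, because the text finishes with a separator. One possible cleanup, sketched here as a variation rather than the only approach, is to include trailing whitespace in the separator and filter out empty pieces:

```python
import re

text = "This is sentence one. And this is sentence two! Is this sentence three?"

# Split on sentence-ending punctuation plus any whitespace that follows it
pieces = re.split(r'[.!?]\s*', text)

# Drop the empty string left after the final punctuation mark
sentences = [piece for piece in pieces if piece]
print(sentences)
```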

How to Clean Pandas Dataframes with Regular Expressions

Combining regular expressions with pandas allows you to clean data frames efficiently.

To remove non-alphabetic characters from names and validate email addresses in a data frame:

import re
import pandas as pd

data = {
    'names': ['Alice123', 'Bob!@#', 'Charlie$$$'],
    'emails': ['[email protected]', 'bob_at_example.com', '[email protected]']
}
df = pd.DataFrame(data)

# Remove non-alphabetic characters from names
df['names'] = df['names'].str.replace(r'[^a-zA-Z]', '', regex=True)

# Validate email addresses
df['valid_email'] = df['emails'].apply(lambda x: bool(re.match(r'^\b[\w.-]+?@\w+?\.\w+?\b$', x)))

print(df)

In the code snippet above:

df['names'].str.replace(pattern, replacement, regex=True) replaces occurrences of the pattern in the series.
lambda x: bool(re.match(pattern, x)): This lambda function applies regular expression matching and converts the result to a boolean value.

The result is as shown:

     names              emails  valid_email
0    Alice   [email protected]         True
1      Bob  bob_at_example.com        False
2  Charlie   [email protected]         True
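pandas also offers vectorized string methods that accept regular expressions directly, which can stand in for apply with a lambda. The sketch below uses Series.str.fullmatch and Series.str.extract on hypothetical addresses; the simplified pattern and the domain column are illustrative assumptions, not part of the example above.

```python
import pandas as pd

# Hypothetical addresses, used only for illustration
df = pd.DataFrame({
    'emails': ['alice@example.com', 'bob_at_example.com', 'charlie@example.org']
})

pattern = r'[\w.-]+@\w+\.\w+'

# str.fullmatch is True only when the entire string fits the pattern
df['valid_email'] = df['emails'].str.fullmatch(pattern)

# str.extract returns the first capturing group, or a missing value if no match
df['domain'] = df['emails'].str.extract(r'@(\w+\.\w+)$', expand=False)

print(df)
```

Rows that do not match, such as the second one here, get a missing value in the extracted column, so they are easy to filter afterwards.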

Wrapping up

I hope you found this tutorial useful. Let's review what we've learned:

Use re.sub to remove unnecessary characters such as dashes and parentheses in phone numbers and the like.
Use re.findall to extract specific patterns from text.
Use re.sub to replace patterns, such as converting multiple spaces into a single space.
Use re.match to validate data formats, such as checking that email addresses follow the expected format.
Use re.split to split strings based on patterns.

In practice, you'll combine regular expressions with pandas to clean text fields in data frames efficiently. It's also good practice to comment your regular expressions to explain their purpose, which improves readability and maintainability. To learn more about data cleaning with pandas, read 7 Steps to Master Data Cleaning with Python and Pandas.
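One way to comment a regular expression is the re.VERBOSE flag, which ignores whitespace inside the pattern and allows inline comments. The following is a sketch only: the simplified email pattern and the sample addresses are assumptions for illustration, and real-world email validation is considerably more involved.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows inline comments
email_pattern = re.compile(r"""
    ^[\w.-]+    # local part: word characters, dots, or hyphens
    @           # literal at sign
    \w+         # domain name
    \.          # literal dot
    \w+$        # top-level domain
""", re.VERBOSE)

# Hypothetical addresses, used only for illustration
print(bool(email_pattern.match("user@example.com")))
print(bool(email_pattern.match("user_at_example.com")))
```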

Bala Priya C (twitter.com/balawc27) is a developer and technical writer from India. She enjoys working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, programming, and drinking coffee! Currently, she is working on learning and sharing her knowledge with the developer community by creating tutorials, how-to guides, opinion pieces, and more. Bala also creates interesting resource overviews and coding tutorials.
