Image by Author | Created on Canva
If you're a Linux or Mac user, you've probably used grep at the command line to search through files by matching patterns. Regular expressions (regex) allow you to search, match, and manipulate text based on patterns, which makes them powerful tools for text processing and data cleaning.
For regular expression matching operations in Python, you can use the built-in re module. In this tutorial, we'll look at how regular expressions can be used to clean data. We'll look at how to remove unwanted characters, extract specific patterns, find and replace text, and more.
Before we go ahead, let's import the built-in re module:
import re
1. Remove unwanted characters
String fields (almost) always require extensive cleaning before you can analyze them. Unwanted characters (often the result of inconsistent formats) can make data analysis difficult. Regular expressions can help you remove them efficiently.
You can use the sub() function from the re module to replace or remove all occurrences of a pattern or special character. Suppose you have strings with phone numbers that include dashes and parentheses. You can remove them as shown below:
text = "Contact info: (123)-456-7890 and 987-654-3210."
cleaned_text = re.sub(r'[()-]', '', text)
print(cleaned_text)
Here, re.sub(pattern, replacement, string) replaces all occurrences of the pattern in the string with the replacement. We use the r'[()-]' pattern to match any occurrence of (, ), or -, giving us the output:
Output >>> Contact info: 1234567890 and 9876543210.
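When the set of unwanted characters is open-ended (dots, spaces, plus signs, and so on), it can be simpler to delete everything that is not a digit instead of listing each character. Here is a minimal sketch of that idea; the sample numbers are made up for illustration, and it assumes only the digits matter for your analysis:

```python
import re

# Hypothetical phone numbers in mixed formats (illustrative data only)
raw_numbers = ["(123)-456-7890", "987.654.3210", "+1 555 010 9999"]

# \D matches any character that is not a digit, so replacing it
# with an empty string keeps the digits and drops everything else
digits_only = [re.sub(r'\D', '', number) for number in raw_numbers]
print(digits_only)
```

Note that this also drops a leading + sign, so a country code can no longer be told apart from the rest of the number.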
2. Extract specific patterns
Extracting email addresses, URLs, or phone numbers from text fields is a common task, as these are often the pieces of information you need. To extract all occurrences of a pattern of interest, you can use the findall() function.
You can extract email addresses from text as follows:
text = "Please reach out to us at [email protected] or [email protected]."
emails = re.findall(r'\b[\w.-]+?@\w+?\.\w+?\b', text)
print(emails)
The re.findall(pattern, string) function finds and returns (as a list) all occurrences of the pattern in the string. We use the pattern r'\b[\w.-]+?@\w+?\.\w+?\b' to match all email addresses:
Output >>> ['[email protected]', '[email protected]']
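If you need parts of each match and not just the whole match, you can add capturing groups to the pattern; re.findall then returns a list of tuples, one element per group. The sketch below assumes made-up addresses on the reserved example.com and example.org domains and a slightly simplified email pattern:

```python
import re

# Made-up addresses on reserved example domains (illustrative data only)
text = "Write to alice@example.com or bob.smith@example.org for details."

# Two capturing groups: the local part before @ and the domain after it.
# With more than one group, findall returns a list of tuples.
pairs = re.findall(r'\b([\w.-]+)@([\w-]+\.\w+)\b', text)
print(pairs)
```

Each tuple holds the local part and the domain, which is handy when you want to count addresses per domain.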
3. Replace patterns
We have already used the sub() function to remove unwanted special characters. But you can also replace one pattern with another to make a field consistent and easier to analyze.
Below is an example of removing unwanted spaces:
text = "Using     regular     expressions."
cleaned_text = re.sub(r'\s+', ' ', text)
print(cleaned_text)
The r'\s+' pattern matches one or more whitespace characters. The replacement string is a single space, which gives us the output:
Output >>> Using regular expressions.
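The replacement string can also refer back to groups captured by the pattern, which is useful for reordering parts of a field. As a rough sketch, assuming dates in your data are written as DD/MM/YYYY (the sample text is invented), you could rewrite them as YYYY-MM-DD like this:

```python
import re

# Invented sample text with dates assumed to be in DD/MM/YYYY order
text = "Orders shipped on 25/12/2023 and 01/02/2024."

# Groups 1, 2, 3 capture day, month, year; the replacement
# r'\3-\2-\1' writes them back as year-month-day
iso_text = re.sub(r'\b(\d{2})/(\d{2})/(\d{4})\b', r'\3-\2-\1', text)
print(iso_text)
```

The pattern only rearranges digits; it does not check that the day and month are valid calendar values.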
4. Validate data formats
Data format validation ensures data consistency and accuracy. Regex can validate formats such as emails, phone numbers, and dates.
Here's how you can use the match() function to validate email addresses:
email = "[email protected]"
if re.match(r'^\b[\w.-]+?@\w+?\.\w+?\b$', email):
    print("Valid email")
else:
    print("Invalid email")
In this example, the email string is valid:
Output >>> Valid email
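The same approach carries over to other formats such as dates. One option is re.fullmatch, which succeeds only if the entire string matches, so the pattern needs no ^ and $ anchors. The sketch below assumes dates are expected in YYYY-MM-DD form; the helper name looks_like_iso_date and the sample values are hypothetical:

```python
import re

# Shape check only: four digits, two digits, two digits, separated by hyphens
DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}')

def looks_like_iso_date(value: str) -> bool:
    """Return True if the whole string has the YYYY-MM-DD shape."""
    return DATE_PATTERN.fullmatch(value) is not None

# Hypothetical inputs, including one with a trailing space
samples = ["2024-02-01", "01/02/2024", "2024-2-1", "2024-02-01 "]
results = {sample: looks_like_iso_date(sample) for sample in samples}
print(results)
```

Only the first sample passes. Keep in mind that this checks the shape alone: a string like 2024-99-99 would pass too, so a real date check would still need something like datetime parsing.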
5. Split strings by patterns
Sometimes you may want to split a string into multiple strings based on patterns or the occurrence of specific separators. You can use the split() function to do that.
Let's split the text string into sentences:
text = "This is sentence one. And this is sentence two! Is this sentence three?"
sentences = re.split(r'[.!?]', text)
print(sentences)
Here, re.split(pattern, string) splits the string on all occurrences of the pattern. We use the r'[.!?]' pattern to match periods, exclamation marks, or question marks:
Output >>> ['This is sentence one', ' And this is sentence two', ' Is this sentence three', '']
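The split result keeps the leading space of each sentence and ends with an empty string, because the text ends with a separator. One possible cleanup, sketched here on the same sample text, is to let the pattern consume the whitespace after each punctuation mark and then filter out empty pieces:

```python
import re

text = "This is sentence one. And this is sentence two! Is this sentence three?"

# [.!?] matches the sentence-ending punctuation and \s* consumes
# any whitespace that follows it, so pieces start without a space
parts = re.split(r'[.!?]\s*', text)

# Drop the empty string produced by the separator at the end of the text
sentences = [part for part in parts if part]
print(sentences)
```

This leaves three clean sentences with no leading spaces and no trailing empty entry.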
How to Clean Pandas Dataframes with Regular Expressions
Combining regular expressions with pandas allows you to clean data frames efficiently.
Here's how you can remove non-alphabetic characters from names and validate email addresses in a data frame:
import pandas as pd
data = {
    'names': ['Alice123', 'Bob!@#', 'Charlie$$$'],
    'emails': ['[email protected]', 'bob_at_example.com', '[email protected]']
}
df = pd.DataFrame(data)
# Remove non-alphabetic characters from names
df['names'] = df['names'].str.replace(r'[^a-zA-Z]', '', regex=True)
# Validate email addresses
df['valid_email'] = df['emails'].apply(lambda x: bool(re.match(r'^\b[\w.-]+?@\w+?\.\w+?\b$', x)))
print(df)
In the code snippet above:
- df['names'].str.replace(pattern, replacement, regex=True) replaces occurrences of the pattern in the series.
- lambda x: bool(re.match(pattern, x)): This lambda function applies regular expression matching and converts the result to a boolean value.
The result is as shown:
     names              emails  valid_email
0    Alice   [email protected]         True
1      Bob  bob_at_example.com        False
2  Charlie   [email protected]         True
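pandas also offers string methods that take a regex directly, which can replace the apply call with a lambda. The sketch below is one way to do it, assuming made-up addresses on reserved example domains and a simplified email pattern; the domain column is an illustrative extra, not part of the example above:

```python
import pandas as pd

# Made-up addresses on reserved example domains (illustrative data only)
df = pd.DataFrame({
    'emails': ['alice@example.com', 'bob_at_example.com', 'charlie@example.org']
})

# str.match tests each value against the pattern from the start of the
# string; the trailing $ makes the pattern cover the whole value
pattern = r'^[\w.-]+@\w+\.\w+$'
df['valid_email'] = df['emails'].str.match(pattern)

# str.extract returns the text captured by the group, or NaN
# for rows where the pattern does not match
df['domain'] = df['emails'].str.extract(r'@([\w.-]+)$', expand=False)
print(df)
```

Rows without an @ sign get False in valid_email and a missing value in domain, which makes them easy to filter out later.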
Wrapping up
I hope you found this tutorial useful. Let's review what we've learned:
- Use re.sub to remove unnecessary characters such as dashes and parentheses in phone numbers and the like.
- Use re.findall to extract specific patterns from text.
- Use re.sub to replace patterns, such as converting multiple spaces into a single space.
- Validate data formats with re.match to ensure that data adheres to specific formats, such as validating email addresses.
- To split strings based on patterns, use re.split.
In practice, you'll combine regular expressions with pandas to clean text fields in data frames efficiently. It's also good practice to comment your regular expressions to explain their purpose, which improves readability and maintainability. To learn more about data cleaning with pandas, read 7 Steps to Master Data Cleaning with Python and Pandas.
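On the point about commenting regular expressions, one option is the re.VERBOSE flag, which lets you spread a pattern over several lines and annotate each piece. Here is a minimal sketch using a simplified email pattern; the name EMAIL_PATTERN and the test strings are made up for illustration:

```python
import re

# With re.VERBOSE, whitespace in the pattern is ignored and
# everything after a # on a line is treated as a comment
EMAIL_PATTERN = re.compile(r"""
    ^[\w.-]+   # local part: word characters, dots, or hyphens
    @          # literal at sign
    \w+        # domain name
    \.         # literal dot
    \w+$       # top-level domain
""", re.VERBOSE)

print(bool(EMAIL_PATTERN.match("alice@example.com")))
print(bool(EMAIL_PATTERN.match("bob_at_example.com")))
```

The first call prints True and the second prints False. Because whitespace is ignored in this mode, a literal space in the pattern has to be escaped or written as \s.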
Bala Priya C is a developer and technical writer from India. She enjoys working at the intersection of mathematics, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, programming, and drinking coffee! Currently, she is working on learning and sharing her knowledge with the developer community by creating tutorials, how-to guides, opinion pieces, and more. Bala also creates interesting resource overviews and coding tutorials.