How to speed up Pandas code: vectorization
If we want our deep learning models to train on a dataset, we need to optimize our code to analyze that data quickly. We want to read our data tables as fast as possible using an optimized way of writing our code. Even a small per-row performance gain adds up quickly across tens of thousands of data points. In this blog, we will define Pandas and provide an example of how you can vectorize your Python code to optimize dataset analysis using Pandas, speeding up your code by over 300x.
What is Pandas for Python?
Pandas is an essential and popular open-source data analysis and manipulation library for the Python programming language. Pandas is widely used in various fields such as finance, economics, social sciences, and engineering. It is useful for data cleaning, preparation, and analysis in data science and machine learning tasks.
It provides powerful data structures (such as DataFrame and Series) and data manipulation tools for working with structured data, including reading and writing data in various formats (e.g. CSV, Excel, JSON) and filtering, cleaning, and transforming data. Additionally, it supports time series data and provides powerful data aggregation and visualization capabilities by integrating with other popular libraries such as NumPy and Matplotlib.
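As a quick illustrative sketch of those capabilities (the file and column names here are hypothetical, not from a real dataset):
import pandas as pd

# Read structured data from a CSV file (hypothetical file and columns)
df = pd.read_csv('employees.csv')

# Filter rows with a vectorized comparison and select a subset of columns
seniors = df[df['age'] >= 65][['name', 'age']]

# Aggregate: average hours worked per department
avg_hours = df.groupby('department')['hours'].mean()

# Write the filtered result back out, this time as JSON
seniors.to_json('seniors.json')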
Our dataset and problem
The data
In this example, we are going to create a random dataset in a Jupyter Notebook, using NumPy to populate our Pandas DataFrame with arbitrary values and strings. In this dataset, we describe 10,000 people with varying ages, the amount of time they work, and the percentage of time they are productive at work. Each person is also assigned a random favorite treat, as well as a random bad karma event.
Let's first import our frameworks and generate some random data before we begin:
import pandas as pd
import numpy as np
Next, we will create our dataset by generating some random data. Most likely your code will rely on real data, but for our use case, we will create some arbitrary data.
def get_data(size=10_000):
    # Build a DataFrame of 10,000 random people
    df = pd.DataFrame()
    df['age'] = np.random.randint(0, 100, size)
    df['time_at_work'] = np.random.randint(0, 8, size)
    df['percentage_productive'] = np.random.rand(size)
    df['favorite_treat'] = np.random.choice(['ice_cream', 'boba', 'cookie'], size)
    df['bad_karma'] = np.random.choice(['stub_toe', 'wifi_malfunction', 'extra_traffic'], size)
    return df
The parameters and rules
- If a person's 'time_at_work' is at least 2 hours AND their 'percentage_productive' is more than 50%, we return their 'favorite_treat'.
- Otherwise, we give them their 'bad_karma'.
- If they are over 65, we return their 'favorite_treat' regardless, as we want our seniors to be happy.
def reward_calc(row):
    # Seniors always get their favorite treat
    if row['age'] >= 65:
        return row['favorite_treat']
    # Productive workers who put in at least 2 hours get their treat too
    if (row['time_at_work'] >= 2) and (row['percentage_productive'] >= 0.5):
        return row['favorite_treat']
    return row['bad_karma']
Now that we have our dataset and our parameters for what we want to return, we can move forward and explore the fastest way to run this type of analysis.
Which Pandas code is faster: loop, apply, or vectorize?
To time our functions, we will use a Jupyter Notebook to keep it relatively simple with the %%timeit magic function. There are other ways to time a function in Python, but for demonstration purposes, our Jupyter Notebook will suffice. We will perform a demo run on the same dataset with three ways of computing and evaluating our problem: looping/iterating, apply, and vectorization.
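For reference, outside of a notebook the standard-library timeit module gives comparable measurements. A minimal sketch, assuming a hypothetical run_analysis() wrapper around whichever approach you want to measure:
import timeit

# repeat=7, number=1 roughly mirrors %%timeit's default reporting;
# run_analysis is a hypothetical stand-in for the code under test
times = timeit.repeat(
    'run_analysis(get_data())',
    setup='from __main__ import run_analysis, get_data',
    repeat=7,
    number=1,
)
print(f'best of 7 runs: {min(times):.3f} s')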
Loop/Iteration
Looping/iterating is the most basic way to perform the same calculation row by row. We iterate over the DataFrame's rows, compute each person's reward with the reward_calc function we defined above, and store the result in a new column called reward. This is the most basic method and probably the first one you learn when coding, similar to for loops.
%%timeit
df = get_data()
for index, row in df.iterrows():
    df.loc[index, 'reward'] = reward_calc(row)
This is what I got back:
3.66 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Inexperienced data scientists may consider a couple of seconds no big deal, but 3.66 seconds is quite a long time to run a simple function on a dataset. Let's see how the apply function can help us gain speed.
Apply
The apply function effectively does the same thing as the loop. It creates a new column titled reward and applies the calculation function to every row, as specified by axis=1. The apply function is a faster way to run a loop over your dataset.
%%timeit
df = get_data()
df['reward'] = df.apply(reward_calc, axis=1)
The time it took to execute is as follows:
404 ms ± 18.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Wow, much faster! About 9 times faster, a huge improvement for a loop. Now, the Apply function is perfectly usable and will be applicable in certain scenarios, but for our use case, let’s see if we can speed it up further.
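As an aside, apply tends to be much cheaper when it runs on a single column (a Series) instead of full rows, since Pandas does not have to build a row object for every record. A small sketch of that pattern, with a hypothetical productivity_label column:
df = get_data()

# Series.apply avoids the per-row object construction of axis=1
df['productivity_label'] = df['percentage_productive'].apply(
    lambda p: 'productive' if p >= 0.5 else 'unproductive'
)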
Vectorization
Our last and final way to evaluate this dataset is to use vectorization. We call our dataset and assign the default reward bad_karma to the entire DataFrame. Then, using boolean indexing, we select only the rows that satisfy our conditions. Think of it as computing a true/false value for each row: rows where the conditions evaluate to False keep bad_karma in their reward column, while rows where they evaluate to True have their reward overwritten with favorite_treat.
%%timeit
df = get_data()
df['reward'] = df['bad_karma']
df.loc[((df['percentage_productive'] >= 0.5) &
        (df['time_at_work'] >= 2)) |
       (df['age'] >= 65), 'reward'] = df['favorite_treat']
The time it took to execute this function on our dataset is as follows:
10.4 ms ± 76.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
That's extremely fast. Roughly 40 times faster than apply and approximately 350 times faster than looping.
Why vectorization in Pandas is 300 times faster
The reason vectorization is so much faster than looping and apply is that it does not compute each row individually; instead, it applies the parameters to the entire dataset as a whole. Vectorization is a process in which operations are applied to entire arrays of data at once, rather than element by element. This allows for much more efficient use of memory and CPU resources.
When using loops or apply to perform calculations on a Pandas DataFrame, the operation is applied sequentially, row by row. This results in repeated memory accesses, calculations, and value updates, which is slow and resource-intensive.
On the other hand, vectorized operations are implemented in Cython (Python compiled to C/C++) and utilize the vector processing capabilities of the CPU, which can perform multiple operations at once, further increasing performance. Vectorized operations also avoid the per-row overhead of repeatedly accessing memory that loops and apply are built on.
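The gap is easy to reproduce on a plain NumPy array. A minimal sketch (absolute timings will vary by machine):
import time
import numpy as np

arr = np.random.rand(1_000_000)

# Python loop: one interpreter round-trip per element
start = time.perf_counter()
total = 0.0
for x in arr:
    total += x * 2
loop_time = time.perf_counter() - start

# Vectorized: a single C-level pass over the whole array
start = time.perf_counter()
total_vec = (arr * 2).sum()
vec_time = time.perf_counter() - start

print(f'loop: {loop_time:.4f} s, vectorized: {vec_time:.4f} s')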
How to vectorize Pandas code
- Use Pandas and NumPy built-in functions that have C implementations, such as sum(), mean(), or max().
- Use vectorized operations that can be applied to entire DataFrames and Series, including mathematical operations, comparisons, and logic, to create a boolean mask that selects multiple rows from your dataset.
- Use the .values attribute or .to_numpy() to get the underlying NumPy array and perform vectorized calculations directly on it.
- Use vectorized string operations such as .str.contains(), .str.replace(), and .str.split() on your dataset.
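A short sketch applying those four techniques to the dataset from this post (the derived variable names are just for illustration):
df = get_data()

# 1. Built-in reductions with C implementations
avg_age = df['age'].mean()
max_hours = df['time_at_work'].max()

# 2. Vectorized comparisons and logic building a boolean mask
mask = (df['percentage_productive'] >= 0.5) & (df['time_at_work'] >= 2)
productive_workers = df[mask]

# 3. Drop to the underlying NumPy array for direct computation
work_ratio = df['time_at_work'].to_numpy() / 8

# 4. Vectorized string operations on an entire column
has_ice_cream = df['favorite_treat'].str.contains('ice_cream')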
Whenever you write functions on Pandas DataFrames, try to vectorize your calculations as much as possible. As datasets get larger and your calculations become more complex, the time savings from vectorization add up dramatically. It's worth noting that not all operations can be vectorized, and sometimes you will need to fall back on loops or apply. Whenever possible, though, vectorized operations can greatly improve performance and make your code more efficient.
Kevin Vu manages Exxact Corp Blog and works with many of its talented authors who write about different aspects of deep learning.