PROGRAMMING IN PYTHON
Pandas offers a fantastic framework for operating on data frames. In data science, we work with small, large, and sometimes very large data frames. While analyzing smaller ones can be incredibly fast, even a single operation on a large data frame can take considerable time.
In this article I will show that you can often cut this time with something that costs practically nothing: the order of operations on a data frame.
Imagine the following data frame:
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})
With a million rows and 25 columns, it's big. Many operations on such a data frame will take noticeable time on today's personal computers.
Let's imagine that we want to filter the rows, keeping those that meet the following condition: a < 50_000 and b > 3000, and select five columns: take_cols = ('a', 'b', 'g', 'n', 'x'). We can do this in the following way:
subdf = df[list(take_cols)]
subdf = subdf[subdf['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]
In this code, we first take the required columns and then perform row filtering. We can achieve the same thing in a different order of operations, first performing the filtering and then selecting the columns:
subdf = df[df['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]
subdf = subdf[list(take_cols)]
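Whichever order we choose, the two versions must produce exactly the same frame; only the cost of getting there differs. A quick sanity check of this equivalence (a sketch that rebuilds the example frame with np.arange instead of list(range(n)) purely to speed up construction; the values are identical):

```python
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({letter: np.arange(n) for letter in "abcdefghijklmnopqrstuwxyz"})
take_cols = ('a', 'b', 'g', 'n', 'x')

# Order 1: select columns first, then filter rows.
cols_first = df[list(take_cols)]
cols_first = cols_first[cols_first['a'] < 50_000]
cols_first = cols_first[cols_first['b'] > 3000]

# Order 2: filter rows first, then select columns.
rows_first = df[df['a'] < 50_000]
rows_first = rows_first[rows_first['b'] > 3000]
rows_first = rows_first[list(take_cols)]

# Both orders yield exactly the same frame.
assert cols_first.equals(rows_first)
```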
We can achieve the same result by chaining Pandas operations, with the condition written as a query string: query = 'a < 50_000 and b > 3000'. The corresponding command pipelines are as follows:
# first take columns, then filter rows
df.filter(take_cols).query(query)
# first filter rows, then take columns
df.query(query).filter(take_cols)
Since df is large, the four versions will likely differ in performance. Which will be the fastest and which the slowest?
Let's compare these operations. We will use the timeit module:
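A minimal benchmark sketch along these lines, timing the two chained pipelines (exact numbers will vary with hardware and pandas version; the repeat count of 10 is an arbitrary choice, and np.arange is used here only to make constructing the frame fast):

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({letter: np.arange(n) for letter in "abcdefghijklmnopqrstuwxyz"})
take_cols = ('a', 'b', 'g', 'n', 'x')
query = 'a < 50_000 and b > 3000'

# Pipeline 1: take columns first, then filter rows.
def cols_then_rows():
    return df.filter(take_cols).query(query)

# Pipeline 2: filter rows first, then take columns.
def rows_then_cols():
    return df.query(query).filter(take_cols)

# Run each pipeline 10 times and report total wall time.
t_cols_first = timeit.timeit(cols_then_rows, number=10)
t_rows_first = timeit.timeit(rows_then_cols, number=10)
print(f"columns first: {t_cols_first:.3f} s")
print(f"rows first:    {t_rows_first:.3f} s")
```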