PROGRAMMING IN PYTHON
Pandas offers a fantastic framework for operating on data frames. In data science, we work with small, large, and sometimes very large data frames. While analyzing smaller ones can be incredibly fast, even a single operation on a large data frame can take considerable time.
In this article I will show that you can often cut this time with something that costs practically nothing: the order of operations on a data frame.
Imagine the following data frame:
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    letter: list(range(n))
    for letter in "abcdefghijklmnopqrstuwxyz"
})
With a million rows and 25 columns, it's big. Many operations on such a data frame will take noticeable time on today's personal computers.
Let's imagine that we want to filter the rows, keeping those that meet the following condition: a < 50_000 and b > 3000, and select five columns: take_cols = ('a', 'b', 'g', 'n', 'x'). We can do this in the following way:
subdf = df[list(take_cols)]
subdf = subdf[subdf['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]
In this code, we first take the required columns and then perform row filtering. We can achieve the same thing in a different order of operations, first performing the filtering and then selecting the columns:
subdf = df[df['a'] < 50_000]
subdf = subdf[subdf['b'] > 3000]
subdf = subdf[list(take_cols)]
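Whichever order we choose, the two versions must produce exactly the same frame; only the cost of getting there differs. A quick sanity check of this equivalence (a sketch that rebuilds the example frame with np.arange instead of list(range(n)) purely to speed up construction; the values are identical):

```python
import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({letter: np.arange(n) for letter in "abcdefghijklmnopqrstuwxyz"})
take_cols = ('a', 'b', 'g', 'n', 'x')

# Order 1: select columns first, then filter rows.
cols_first = df[list(take_cols)]
cols_first = cols_first[cols_first['a'] < 50_000]
cols_first = cols_first[cols_first['b'] > 3000]

# Order 2: filter rows first, then select columns.
rows_first = df[df['a'] < 50_000]
rows_first = rows_first[rows_first['b'] > 3000]
rows_first = rows_first[list(take_cols)]

# Both orders yield exactly the same frame.
assert cols_first.equals(rows_first)
```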
We can achieve the same result by chaining Pandas operations, with the condition written as a query string: query = 'a < 50_000 and b > 3000'. The corresponding command pipelines are as follows:
# first take columns, then filter rows
df.filter(take_cols).query(query)
# first filter rows, then take columns
df.query(query).filter(take_cols)
Since df is large, the four versions will likely differ in performance. Which will be the fastest and which the slowest?
Let's compare these operations. We will use the timeit module:
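A minimal benchmark sketch along these lines, timing the two chained pipelines (exact numbers will vary with hardware and pandas version; the repeat count of 10 is an arbitrary choice, and np.arange is used here only to make constructing the frame fast):

```python
import timeit

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({letter: np.arange(n) for letter in "abcdefghijklmnopqrstuwxyz"})
take_cols = ('a', 'b', 'g', 'n', 'x')
query = 'a < 50_000 and b > 3000'

# Pipeline 1: take columns first, then filter rows.
def cols_then_rows():
    return df.filter(take_cols).query(query)

# Pipeline 2: filter rows first, then take columns.
def rows_then_cols():
    return df.query(query).filter(take_cols)

# Run each pipeline 10 times and report total wall time.
t_cols_first = timeit.timeit(cols_then_rows, number=10)
t_rows_first = timeit.timeit(rows_then_cols, number=10)
print(f"columns first: {t_cols_first:.3f} s")
print(f"rows first:    {t_rows_first:.3f} s")
```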