PROGRAMMING IN PYTHON
How to inspect Pandas data frames in chained operations without splitting the chain into separate statements
Debugging is at the heart of programming, a point I made in an earlier article.
This statement is quite general and independent of language and framework. When you use Python for data analysis, you need to debug your code whether you are performing complex data analysis, writing an ML software product, or creating a Streamlit or Django application.
This article discusses debugging Pandas code, or rather one specific scenario: debugging operations that are chained together in a pipeline. This kind of debugging poses a challenge. When you don’t know how to do it, chained Pandas operations seem much more difficult to debug than normal Pandas code, that is, individual Pandas operations that use typical square-bracket assignment.
To debug normal Pandas code that uses typical square-bracket assignment, it is enough to add a Python breakpoint and use the pdb interactive debugger. It would look something like this:
>>> import pandas as pd
>>> d = pd.DataFrame(dict(
...     x=(1, 2, 2, 3, 4),
...     y=(.2, .34, 2.3, .11, .101),
...     group=("a", "a", "b", "b", "b")
... ))
>>> d["xy"] = d.x + d.y
>>> breakpoint()
>>> d = d[d.group == "a"]
Unfortunately, you can’t do that when the code consists of chained operations, like here:
>>> d = d.assign(xy=lambda df: df.x + df.y).query("group == 'a'")
or, depending on your preference, here:
>>> d = d.assign(xy=d.x + d.y).query("group == 'a'")
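As a quick sanity check, here is a minimal sketch showing that both chained forms give the same result as the bracket-assignment version. The names `bracket_version`, `chained_lambda` and `chained_direct` are hypothetical and used only for illustration. The one difference worth noting is that the lambda receives the frame flowing through the chain, while `d.x + d.y` reads from the original `d`. That matters only when an earlier step in the chain changes those columns, which is not the case here.

```python
import pandas as pd

d = pd.DataFrame(dict(
    x=(1, 2, 2, 3, 4),
    y=(.2, .34, 2.3, .11, .101),
    group=("a", "a", "b", "b", "b"),
))

# Bracket-assignment version, done on a copy so that d stays untouched.
bracket_version = d.copy()
bracket_version["xy"] = bracket_version.x + bracket_version.y
bracket_version = bracket_version[bracket_version.group == "a"]

# Chained version with a lambda: df is the frame passed along the chain.
chained_lambda = d.assign(xy=lambda df: df.x + df.y).query("group == 'a'")

# Chained version without a lambda: d.x and d.y come from the original d.
chained_direct = d.assign(xy=d.x + d.y).query("group == 'a'")

# Compare each chained result with the bracket-assignment result.
print(chained_lambda.equals(bracket_version))
print(chained_direct.equals(bracket_version))
```

Whichever form you choose, the whole chain is a single expression.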
In this case, there is no place to stop and look at the code; you can only do it before or after the chain. So, one of the solutions is to split the main chain into two subchains (two pipes) in a…