PANDAS FOR DATA SCIENCE
When using Pandas, most data scientists would opt for df['x'] or df["x"]. It doesn't really matter which one you use, as long as you stick to the one you've chosen. You can read more about this here:
Therefore, from now on, whenever I write df["x"], this will also refer to df['x']. However, there is another option: you can also go for df.x. While this is a less common option, it can improve readability, assuming the column name is a valid Python identifier.¹
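To make the "valid Python identifier" condition concrete, here is a minimal sketch. The data frame and its column names ("x" and "unit price") are made up for illustration; str.isidentifier() is the standard Python check for whether a string could be used as a name.

```python
import pandas as pd

# Hypothetical data frame: one column name is a valid identifier, one is not.
df = pd.DataFrame({"x": [1, 2, 3], "unit price": [9.5, 7.0, 3.25]})

# "x" is a valid Python identifier, so attribute access is available.
x_is_identifier = "x".isidentifier()
x_column = df.x

# "unit price" contains a space, so it is not a valid identifier.
# Writing `df.unit price` would be a syntax error; brackets are required.
price_is_identifier = "unit price".isidentifier()
price_column = df["unit price"]

print(x_is_identifier, price_is_identifier)
```

The bracket syntax works for both columns, while the dot syntax is only an option for the first one.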
Does it matter which syntax you choose? This article addresses this question from two important points of view: readability and performance.
The two approaches, df["x"] and df.x, are common ways to access a column (here, "x") of a data frame (here, df). In data science, the former is most likely used more frequently; at least, my experience in a variety of data science projects suggests this.
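As a quick illustration that these spellings reach the same column, here is a small sketch. The data frame and its values are invented for the example; Series.equals() is used to compare the results.

```python
import pandas as pd

# Hypothetical data frame with a single column named "x".
df = pd.DataFrame({"x": [10, 20, 30]})

double_quoted = df["x"]   # bracket syntax, double quotes
single_quoted = df['x']   # bracket syntax, single quotes
dotted = df.x             # attribute (dot) syntax

# All three expressions select the same column, so the Series compare equal.
all_equal = double_quoted.equals(single_quoted) and double_quoted.equals(dotted)
print(all_equal)
```

In each case the result is a Series, so the choice between the forms is a matter of syntax rather than of what you get back.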
Readability and simplicity of use
Let's consider the advantages and disadvantages of the methods in terms of readability and simplicity:
df["x"]: This is the explicit method. It lets you use columns whose names contain spaces or special characters or, more generally, are not valid Python identifiers. With this syntax, you immediately know that "x" is the name of a column. However, it is the less readable version visually: when you see a lot of code like this, you have to fight through the visual clutter.
df.x: This method provides a more concise syntax, since each time you use df.x, you save three characters. You will especially appreciate this when concise code is preferred. Using df.x is like…