Using code examples to implement Big O best practices. How a change in mindset can dramatically improve runtime code performance
With this article, I aim to increase Big O literacy among us data professionals. We often make comparisons between the world of Software Engineering (SWE) and the world of data. While some best practices apply in both fields (such as version control, error handling, and testing), in my personal experience the area where data roles lag behind SWE roles is writing high-performance code, and more specifically, the mindset of deploying and verifying code with performance in mind. I've been guilty of this myself: if my script ran as expected and handled errors correctly, I'd consider it “complete”. I believe that anyone who writes production-level code, regardless of job title, is responsible for ensuring that it performs well at runtime.
I'm aware that many people in data-centric roles already implement Big O best practices. From talking to my peers, these are largely people who work closely with software engineers and are therefore in a position to “absorb” these methodologies, in contrast to those working in an isolated data team. While I can only speculate as to why this best practice hasn't been adopted as readily in the data world, I think a large part of it is due to the paths we took to get to our current roles. As a “career changer” in the data field (I previously worked in the insurance industry), I never saw Big O or the idea of performant code covered in any of the curricula I studied. It was only when I started working as a professional data scientist in a very small data team that I began to realize the impact of writing high-performance code. Data professionals who run all their experiments on laptops and rely on software engineers or MLOps professionals to get their code and models running in production are in a similar position, as code optimization does not necessarily fall within their purview (although I really think it should).
Every line of code we push to production must have a purpose, and typically this purpose is to execute a process or some form of I/O operation. These…