How a tried and tested solution can produce great results when tackling an everyday machine learning problem
With so much focus on generative AI and vast neural networks, it's easy to overlook the proven machine learning algorithms of yesteryear (they're actually not that old…). I would venture to say that in most business cases, a simple machine learning solution will go further than the most complex AI implementation. Not only do classical ML algorithms scale extremely well, but their much lower model complexity is (in my opinion) what makes them superior in most scenarios. Not to mention, I've also found it much easier to track the performance of these simpler solutions.
In this article, we will tackle a classic ML problem with a classic ML solution. More specifically, I will show how you can (in just a few lines of code) identify the importance of features within a dataset using a random forest classifier. I will start by demonstrating the effectiveness of this technique, then take a “back to basics” approach to show how it works under the hood by building a decision tree and a random forest from scratch, comparing the models along the way.
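To give you a feel for what "a few lines of code" means here, the sketch below shows the general shape of the approach with scikit-learn. It uses the library's built-in breast cancer dataset as a stand-in for your own data; the dataset, the split, and the hyperparameters are placeholders for illustration, not recommendations.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load a sample dataset as a DataFrame (stand-in for your own data)
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target

# Hold out a test set so the importances come from training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit a random forest and read off its impurity-based feature importances
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```

The `feature_importances_` attribute gives one score per column, summing to 1, so sorting it immediately tells you which features the forest leaned on most.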
I have found that the initial phases of an ML project are particularly important in a professional environment. Once stakeholders (who pay the bills) have determined the feasibility of the project, they will want to see a return on investment. Part of this feasibility discussion revolves around the data: whether there is enough of it, whether it is of high quality, and so on. Some questions about data distribution and quality can only be answered after an initial analysis. The technique I show here assumes that you have completed that initial feasibility assessment and are ready to move on to the next step. The main question we need to ask ourselves at this point is: how many features can I remove while maintaining the performance of the model? Reducing the number of features (the dimensionality) of our model has many benefits (see the sketch after the list below). These include, but are not limited to:
- Reduced model complexity
- Faster training times
- Reduced multicollinearity (correlated features)
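One common way to answer the "how many features can I remove?" question is to rank the features by importance, retrain on progressively smaller subsets, and watch the test score. The sketch below does exactly that, again on the stand-in breast cancer dataset; the subset sizes (15, 10, 5) are arbitrary examples chosen for illustration.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Rank features using a forest trained on the full feature set
full_model = RandomForestClassifier(n_estimators=100, random_state=42)
full_model.fit(X_train, y_train)
ranking = pd.Series(full_model.feature_importances_, index=X.columns)
ranked_features = ranking.sort_values(ascending=False).index

# Baseline: test accuracy with every feature included
baseline = accuracy_score(y_test, full_model.predict(X_test))
print(f"All {X.shape[1]} features: {baseline:.3f}")

# Retrain on the top-k features only and see how much accuracy (if any) is lost
for k in (15, 10, 5):
    cols = ranked_features[:k]
    reduced = RandomForestClassifier(n_estimators=100, random_state=42)
    reduced.fit(X_train[cols], y_train)
    score = accuracy_score(y_test, reduced.predict(X_test[cols]))
    print(f"Top {k:>2} features: {score:.3f}")
```

If the score stays roughly flat as features are dropped, the discarded features were carrying little signal; a sharp drop tells you where to stop trimming.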