Data science proves its value when applied to practical challenges. This article shares lessons learned from real-world machine learning projects.
In my experience with machine learning and data science, the transition from development to production is a critical and challenging phase. This process typically develops in iterative steps, gradually refining the product until it meets acceptable standards. Along the way, I have observed recurring obstacles that often slow down the path to production.
This article explores some of these challenges, focusing on the pre-launch process. A separate article will delve into the post-production lifecycle of a project in greater detail.
I believe the iterative cycle is integral to the development process, and my goal is to optimize it, not eliminate it. To make the concepts more tangible, I will use the Kaggle Fraud Detection Dataset (DbCL license) as a case study. For modeling, I will use TabNet, with Optuna for hyperparameter optimization. For a more in-depth explanation of these tools, see my previous article.
Optimizing Loss Functions and Metrics for Impact
When starting a new project, it is essential to clearly define the final objective. For example, in fraud detection, the qualitative objective (detecting fraudulent transactions) must be translated into quantitative terms that guide the model construction process.
There is a tendency to default to the F1 metric for measuring results and an unweighted cross-entropy loss function, BCE loss, for training binary classification problems. And for good reason: these are solid, well-tested options for measuring and training the model. This approach remains effective even for imbalanced data sets, as demonstrated later in this section.
To illustrate, we will establish a baseline model trained with a BCE loss (uniform weights) and evaluated using the F1 score. Here is the resulting confusion matrix.
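For reference, here is a minimal sketch of how such a baseline could be set up with pytorch-tabnet. It assumes X_train, y_train, X_test, y_test have already been prepared as numpy arrays from the Kaggle credit card fraud data (the split itself is discussed in a later section), and the hyperparameters are illustrative rather than the tuned values behind the reported results.

```python
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.metrics import confusion_matrix, f1_score

# Baseline: TabNet trained with the default (unweighted) cross-entropy loss.
# X_train, y_train, X_test, y_test are assumed to be numpy arrays prepared
# from the Kaggle credit card fraud data.
clf = TabNetClassifier(seed=42)
clf.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    max_epochs=50,
    patience=10,
)

y_pred = clf.predict(X_test)
print("F1:", f1_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```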
The model shows reasonable performance but struggles to detect fraudulent transactions: it missed 13 fraudulent cases while producing only one false positive. From a business standpoint, letting a fraudulent transaction go through can be worse than incorrectly flagging a legitimate one. Adjusting the loss function and evaluation metric to align with business priorities can lead to a more appropriate model.
To steer model selection toward prioritizing certain classes, we adjust the metric itself, moving from F1 to the more general F-beta score. Starting from the definition of F-beta, we can make the following derivation.
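The standard F-beta definition, rewritten in terms of confusion-matrix counts (true positives TP, false positives FP, false negatives FN), works out to:

$$
F_\beta = \frac{(1+\beta^2)\,\mathrm{TP}}{(1+\beta^2)\,\mathrm{TP} + \beta^2\,\mathrm{FN} + \mathrm{FP}}
$$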
Here, a false negative is weighted as heavily as beta squared false positives. Determining the optimal balance between false positives and false negatives is a nuanced process, often tied to qualitative business objectives. In a future article, we'll dive deeper into how to derive a beta from more qualitative business goals. For demonstration purposes, we will use a beta equal to the square root of 200, which implies that 200 unnecessary flags are accepted for each additional fraudulent transaction prevented. It is also worth noting that when FN and FP are both zero, the metric equals one, regardless of the choice of beta.
For our loss function, we have analogously chosen a weight of 0.995 for fraudulent data points and 0.005 for non-fraudulent data points.
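Here is a sketch of how this weighting could be wired in, assuming the same setup as the baseline sketch above: the class weights go into a weighted cross-entropy loss passed to TabNet, and the model is scored with scikit-learn's fbeta_score.

```python
import numpy as np
import torch
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.metrics import confusion_matrix, fbeta_score

BETA = np.sqrt(200)  # ~200 accepted false positives per avoided false negative

# Weighted cross-entropy: 0.005 for class 0 (legitimate), 0.995 for class 1 (fraud).
# If training on GPU, move the weight tensor to the same device as the model.
weighted_loss = torch.nn.CrossEntropyLoss(
    weight=torch.tensor([0.005, 0.995], dtype=torch.float32)
)

clf_weighted = TabNetClassifier(seed=42)
clf_weighted.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    loss_fn=weighted_loss,
    max_epochs=50,
    patience=10,
)

y_pred = clf_weighted.predict(X_test)
print("F-beta:", fbeta_score(y_test, y_pred, beta=BETA))
print(confusion_matrix(y_test, y_pred))
```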
The results of the updated model on the test set are shown above. Compared to the base case, our second model trades 16 false positives for two fewer false negatives. This trade-off is in line with the shift we expected.
Prioritize representative metrics over inflated ones
In data science, competing for resources is common and presenting inflated results can be tempting. While this might ensure approval in the short term, it often leads to stakeholder frustration and unrealistic expectations.
Instead, presenting metrics that accurately represent the current state of the model encourages better long-term relationships and realistic project planning. Here is a concrete approach.
Split the data accordingly
Split the data set to reflect real-world scenarios as closely as possible. If your data has a temporal aspect, use it to create meaningful slices. I have covered this in a previous article for those who want to see more examples.
In the Kaggle dataset, we will assume the data is sorted by time, using the Time column. We will make a train-test-validation split of 80%, 10%, and 10%. These sets can be thought of as follows: you train on the training set, you optimize hyperparameters on the test set, and you report metrics on the validation set.
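A minimal sketch of such a temporal split is shown below, assuming the data is available locally as creditcard.csv; whether the raw Time column is kept as a model feature is an assumption here (it is dropped and used only for ordering).

```python
import pandas as pd

# Order transactions chronologically and split 80/10/10 without shuffling.
df = pd.read_csv("creditcard.csv").sort_values("Time").reset_index(drop=True)

n = len(df)
train_df = df.iloc[: int(0.8 * n)]               # earliest 80%: training
test_df = df.iloc[int(0.8 * n): int(0.9 * n)]    # next 10%: hyperparameter tuning
val_df = df.iloc[int(0.9 * n):]                  # latest 10%: held-out reporting

# Assumption: 'Time' is used only for ordering, not as a model feature.
feature_cols = [c for c in df.columns if c not in ("Time", "Class")]
X_train, y_train = train_df[feature_cols].values, train_df["Class"].values
X_test, y_test = test_df[feature_cols].values, test_df["Class"].values
X_val, y_val = val_df[feature_cols].values, val_df["Class"].values
```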
Note that in the previous section we analyzed results on the test data, that is, the data we use for hyperparameter optimization. We will now analyze the held-out validation set.
We observed a drop in recall from 75% to 68% and from 79% to 72% for our base and weighted models, respectively. This is to be expected, since the test set is used during model selection and its results are therefore optimistically biased. The validation set, however, offers a more honest evaluation.
Take model uncertainty into account
As with manual decision making, some data points are harder to evaluate than others, and the same phenomenon occurs from a modeling perspective. Addressing this uncertainty can make the rollout of a model smoother. For the business purpose at hand, do we need to classify every data point? Do we need to give a point estimate, or is a range sufficient? Initially, focus on limited, high-confidence predictions.
Below are two possible scenarios and their respective solutions.
Classification
If the task is classification, consider putting a threshold on the model's output probabilities. This way, labels are produced only when the model is sufficiently confident; otherwise, the model abstains and leaves the data point unlabeled. I have covered this in depth in this article.
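Here is a small sketch of what this could look like for the fraud model, reusing the weighted classifier from earlier; the 0.95 and 0.05 cut-offs are illustrative assumptions, not values taken from the case study.

```python
import numpy as np

# Only emit a label when the predicted fraud probability is confidently
# high or low; otherwise abstain (-1) and defer the decision.
proba_fraud = clf_weighted.predict_proba(X_val)[:, 1]

labels = np.full(len(proba_fraud), -1)   # -1 means "abstain"
labels[proba_fraud >= 0.95] = 1          # confident fraud
labels[proba_fraud <= 0.05] = 0          # confident legitimate

coverage = np.mean(labels != -1)
print(f"Labeled {coverage:.1%} of the validation points")
```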
Regression
The regression equivalent of thresholding in the classification case is to present a confidence interval instead of a point estimate. The width of the interval is determined by the business use case, and the trade-off is, of course, between prediction precision and prediction certainty. This topic is discussed in more detail in a previous article.
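As an illustration only (not part of the fraud case study), one common way to produce such an interval is to fit two quantile regressors and report the range between them:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data stands in for a real regression problem.
X, y = make_regression(n_samples=1000, n_features=5, noise=10.0, random_state=0)

# A ~90% prediction interval from the 5th and 95th conditional quantiles.
lower = GradientBoostingRegressor(loss="quantile", alpha=0.05).fit(X, y)
upper = GradientBoostingRegressor(loss="quantile", alpha=0.95).fit(X, y)

# For each new point, report [lower, upper] instead of a single number.
intervals = list(zip(lower.predict(X[:5]), upper.predict(X[:5])))
print(intervals)
```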
Explainability of the model
It is preferable to incorporate model explainability whenever possible. Although the concept of explainability is model-independent, its implementation may vary depending on the type of model.
The importance of model explainability is twofold. The first is building trust. Machine learning still faces skepticism in some circles, and transparency helps reduce that skepticism by making the model's behavior understandable and its decisions justifiable.
The second is to detect overfitting. If the model's decision-making process does not align with domain knowledge, it could indicate overfitting to noisy training data. Such a model runs the risk of poor generalization when exposed to new data in production. On the contrary, explainability can provide surprising insights that enhance subject matter expertise.
For our use case, we will evaluate the importance of features to get a clearer understanding of the model's behavior. Feature importance scores indicate how much individual features contribute, on average, to the model predictions.
This is a normalized score across the features in the data set, indicating how much they are used on average to determine the class label.
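A short sketch of how these scores could be read off the fitted model, assuming the weighted TabNet classifier and the feature_cols list from the earlier sketches:

```python
import pandas as pd

# TabNet exposes aggregated feature importances after fitting.
importances = pd.Series(clf_weighted.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False).head(10))
```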
Imagine the data set were not anonymized. I have been on projects where feature importance analysis provided insight into marketing effectiveness and revealed key predictors for technical systems, such as in predictive maintenance projects. However, the most common reaction from subject matter experts (SMEs) is a reassuring one: “Yes, these values make sense to us.”
An in-depth article exploring various model explanation techniques and their implementations will be published soon.
Preparing for data and label drift in production systems
A common but risky assumption is that data and label distributions will remain stationary over time. In my experience, this assumption rarely holds true except in certain highly controlled technical applications. Data drift (changes in the distribution of features or labels over time) is a natural phenomenon. Instead of resisting it, we should embrace it and incorporate it into our system design.
Some things we could consider are building a model that adapts better to change, setting up a system to monitor drift and quantify its consequences, and making a plan for when and why to retrain the model. A detailed article on drift modeling and detection strategies will be published soon; it will also explain data and label drift in more depth and cover retraining and monitoring strategies.
For our example, we will use the Python library deepchecks to analyze feature drift in the Kaggle dataset. Specifically, we will examine the feature with the highest Kolmogorov-Smirnov (KS) statistic, which indicates the largest drift between the train and test sets.
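As a rough sketch of the underlying comparison (deepchecks wraps similar per-feature statistics behind a richer report), the two-sample KS statistic can be computed directly for each feature between the train and test splits defined earlier:

```python
import pandas as pd
from scipy.stats import ks_2samp

# Compute the two-sample KS statistic per feature between train and test,
# then inspect the features with the largest values, i.e. the largest drift.
ks_scores = {
    col: ks_2samp(train_df[col], test_df[col]).statistic
    for col in feature_cols
}
drift = pd.Series(ks_scores).sort_values(ascending=False)
print(drift.head(5))  # features with the strongest train/test drift
```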
While it is difficult to predict exactly how data will change in the future, we can be sure that it will. Planning for this inevitability is critical to maintaining robust and reliable machine learning systems.
Summary
Bridging the gap between machine learning development and production is no easy task – it's an iterative journey filled with obstacles and learning opportunities. This article dives into the critical pre-production phase, focusing on optimizing metrics, managing model uncertainty, and ensuring transparency through explainability. By aligning technical options with business priorities, we explore strategies such as tuning loss functions, applying confidence thresholds, and monitoring data drift. After all, a model is only as good as its ability to adapt, similar to human adaptability.
Thank you for taking the time to explore this topic.
I hope this article has provided valuable ideas and inspiration. If you have any comments or questions, please reach out. You can also connect with me on LinkedIn.