A few words about thresholding, the softmax activation function, introducing an additional "unknown" label, and considerations for choosing an output activation function.
In many real-world applications, machine learning models are not designed to make all-or-nothing decisions. Instead, there are situations where it is more beneficial for the model to flag certain predictions for human review, an approach known as human-in-the-loop. This is particularly valuable in high-risk scenarios, such as fraud detection, where the cost of a false negative is significant. By allowing humans to intervene when a model is uncertain or encounters complex cases, companies can ensure more nuanced and accurate decision-making.
In this article, we will explore how thresholding, a technique for managing model uncertainty, can be implemented in a deep learning setting. The threshold helps determine when a model is confident enough to make a decision autonomously and when it should defer to human judgment. We will use a real-world example to illustrate the potential.
By the end of this article, the aim is to give technical teams and business stakeholders advice and inspiration for making decisions about modeling, thresholding strategies, and the balance between automation and human oversight.
To illustrate the value of thresholding in a real-world situation, consider the case of a financial institution tasked with detecting fraudulent transactions. We will use the Kaggle Fraud Detection Dataset (DbCL license), which contains anonymized transaction data tagged with fraudulent activity. Financial institutions process vast numbers of transactions, making it impractical to review each one manually. We want to develop a system that accurately flags suspicious transactions while minimizing unnecessary human intervention.
The challenge lies in balancing accuracy and efficiency. Thresholding is a strategy for managing this trade-off. With this strategy, we add an additional label to the sample space: unknown. This label serves as a signal that the model is unsure about a particular prediction, effectively deferring the decision to human review. In situations where the model lacks sufficient certainty to make a reliable prediction, marking a transaction as unknown ensures that only the most reliable predictions are acted on automatically.
Additionally, setting a threshold can have another positive side effect: it helps overcome technological skepticism. When a model indicates uncertainty and defers to human judgment when necessary, it can foster greater trust in the system. In previous projects, this has been helpful when rolling out models in various organizations.
We will explore the concept of thresholding in a deep learning context. However, it is important to note that thresholding is a model-independent technique that can be applied to many kinds of models, not just deep learning.
When implementing a thresholding step in a neural network, it is not obvious which layer to put it in. In a classification setting, an output transformation is typically applied; the sigmoid function is one option, the softmax function another. Softmax offers a very practical transformation, giving the logits interesting statistical properties: the outputs are guaranteed to sum to one, and each lies between zero and one.
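As a quick illustration, here is a minimal NumPy sketch of the softmax transformation; the logit values are made up for demonstration:

```python
import numpy as np

def softmax(logits):
    # Subtracting the max before exponentiating is a standard numerical-stability trick.
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

probs = softmax(np.array([2.0, -1.0, 0.5]))
print(probs)        # approx. [0.79, 0.04, 0.18]: each value lies between zero and one
print(probs.sum())  # 1.0: the outputs form a probability distribution
```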
However, some information is lost in this process. Softmax captures only the relative certainty between labels; it does not provide an absolute measure of certainty for any individual label. This can lead to overconfidence in cases where the true distribution of uncertainty is more nuanced, a limitation that becomes critical in applications requiring precise decision thresholds.
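To make the limitation concrete, the following sketch (reusing the same softmax helper) shows that shifting every logit by a constant leaves the output unchanged, so the absolute magnitude of the evidence is invisible after the transformation:

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

low  = np.array([1.0, 2.0, 3.0])  # modest logits: weak absolute evidence
high = low + 100.0                # same differences, much larger magnitudes

# Softmax is invariant to a constant shift, so both produce identical outputs.
print(np.allclose(softmax(low), softmax(high)))  # True
```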
This article will not delve into the details of the model architecture; those will be covered in a future article for those interested. The only things we use from the model are its outputs before and after the softmax transformation, which is implemented as the final layer. Here is a sample of the results.
As can be seen, the results are quite homogeneous, and without knowing the mechanics of softmax it might seem that the model is quite confident about its rankings. But as we will see later in the article, the strong separation we observe here is not the true certainty of the labels. Rather, it should be interpreted as the prediction for one label relative to the others. In our case, this means the model may rank some labels as significantly more likely than others, but this does not reflect the model's overall certainty.
With this understanding of how to interpret the results, let's explore how the model works in practice by looking at the confusion matrix.
The model performs reasonably well, although it is far from perfect. With these baseline results in hand, we will consider implementing a threshold.
We'll start by stepping back one layer in the network and examining the values just before the final activation function. This gives us the following logits.
Here we see a much greater variety of values. This layer provides a more detailed view of the model's uncertainty in its predictions, and it is where the thresholding step is inserted.
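To sketch what this looks like in code, the following hypothetical PyTorch example uses a toy stand-in for the classifier; slicing off the final softmax module gives direct access to the logits:

```python
import torch
import torch.nn as nn

# Toy stand-in for the fraud classifier; the real architecture is out of scope here.
model = nn.Sequential(
    nn.Linear(30, 16),
    nn.ReLU(),
    nn.Linear(16, 2),   # produces the logits
    nn.Softmax(dim=1),  # final transformation, as described above
)

x = torch.randn(4, 30)  # a batch of four made-up feature vectors

with torch.no_grad():
    probs = model(x)        # post-softmax probabilities
    logits = model[:-1](x)  # same forward pass, stopped before the softmax

print(logits)  # varied magnitudes: the more detailed view of uncertainty
print(probs)   # normalized values: relative certainty only
```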
By introducing an upper and a lower confidence threshold, the model labels only about 34% of the data set, focusing on the most certain predictions. In return, the results are more reliable, as the following confusion matrix shows. It is important to note that the thresholds do not have to be uniform: some labels may be harder to predict than others, and label imbalance can also affect the thresholding strategy.
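Here is a minimal sketch of such a thresholding step, assuming a single fraud score per transaction; the threshold values and scores below are illustrative, not taken from the actual model:

```python
import numpy as np

UNKNOWN = -1  # sentinel for the additional "unknown" label

def threshold_predictions(scores, lower, upper):
    """Map raw scores to 0 (legitimate), 1 (fraud), or UNKNOWN (defer to review)."""
    labels = np.full(scores.shape, UNKNOWN)
    labels[scores <= lower] = 0  # confidently legitimate
    labels[scores >= upper] = 1  # confidently fraudulent
    return labels                # everything in between stays UNKNOWN

scores = np.array([-4.2, -0.3, 0.1, 2.9, 7.5])  # illustrative pre-softmax scores
labels = threshold_predictions(scores, lower=-2.0, upper=3.0)
print(labels)  # [ 0 -1 -1 -1  1]

coverage = np.mean(labels != UNKNOWN)
print(f"coverage: {coverage:.0%}")  # share of transactions decided automatically
```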
Metrics.
In this scenario, we have only touched on the two extreme threshold settings: the one that lets all predictions pass (the base case) and the one that eliminates all faulty predictions.
Based on practical experience, deciding whether to label fewer data points with high certainty (which reduces the number of automatically handled transactions) or more data points with lower certainty is a rather complex trade-off. This decision impacts operational efficiency and should be grounded in business priorities, such as risk tolerance or operational limitations. Discussing the thresholds with domain experts is one perfectly viable way to set them. Another is to optimize them against a known or approximate metric, aligning the thresholds with specific business measures such as the cost of a false negative or the available review capacity.
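As an illustration of the second approach, the following sketch grid-searches the two thresholds against a made-up cost model on synthetic validation data; in practice the scores, labels, and costs would come from your own data and business case:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic validation data standing in for real scores and ground-truth labels.
y_val = rng.integers(0, 2, size=1000)
val_scores = rng.normal(loc=3.0 * (y_val - 0.5), scale=2.0)

UNKNOWN = -1

def threshold_predictions(scores, lower, upper):
    # Same helper as in the earlier sketch.
    labels = np.full(scores.shape, UNKNOWN)
    labels[scores <= lower] = 0
    labels[scores >= upper] = 1
    return labels

def business_cost(labels, y_true, cost_fn=50.0, cost_review=1.0):
    # Made-up cost model: a missed fraud costs far more than a human review.
    missed_fraud = np.sum((labels == 0) & (y_true == 1))
    deferred = np.sum(labels == UNKNOWN)
    return cost_fn * missed_fraud + cost_review * deferred

# Grid-search the two thresholds against the cost model.
candidates = [(lo, hi) for lo in np.linspace(-4, 0, 9) for hi in np.linspace(0, 4, 9)]
best = min(candidates,
           key=lambda t: business_cost(threshold_predictions(val_scores, *t), y_val))
print("best (lower, upper) thresholds:", best)
```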
Summary.
In conclusion, the goal is not to discard the softmax transformation, as it provides valuable statistical properties. Rather, we suggest introducing an intermediate thresholding step that filters out uncertain predictions and leaves room for an unknown label when necessary.
The exact way to implement this comes down to the project at hand. The fraud example also highlights the importance of understanding the business need you are trying to address. Here we showed a setting in which we removed all of the wrong predictions, but this is not necessary in every use case. In many cases, the optimal solution lies in finding a balance between accuracy and coverage.
Thank you for taking the time to explore this topic.
I hope you found this article useful and/or inspiring. If you have any comments or questions, feel free to reach out. You can also connect with me on LinkedIn.