The work of classification can be seen as a way of summarizing the complexity of structures into a finite set of classes, and it is often useful to make life boring enough that we can reduce such structures to just two types. These labels may have obvious interpretations, as when we differentiate Statisticians from Data Scientists (with income, possibly, as the only plausible feature), or they may be a suffocating attempt to reduce experimental evidence to a single sentence: rejecting or not a null hypothesis. Embracing the metalanguage, we attach a label to this very act of summing things up into two types: Binary Classification. This work is an attempt at a deeper description of that label, bringing a probabilistic interpretation of what the decision process means and of the metrics we use to evaluate our results.
When we try to describe and differentiate an object, we aim to find specific characteristics that highlight its uniqueness. For many attributes, however, we should expect this difference not to hold consistently across a population.
For a typical problem of two classes with n different features (V1, …, Vn), we can see how these feature values are distributed and try to draw conclusions from them.
If the task were to use one of those features to guide our decision process, an intuitive strategy would be the following: given an individual's v_n value, predict the class with the highest frequency at that value (or the highest probability, if our histograms are good estimators of the underlying distributions). For instance, if the v4 measurement for an individual were higher than 5, it would be very likely that the individual is a positive.
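As a minimal sketch of this one-feature decision rule (the data, the feature name v4 and the cutoff around 5 are purely illustrative, not taken from a real dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a single feature v4: negatives centered low,
# positives centered high (purely illustrative numbers).
v4_neg = rng.normal(loc=3.0, scale=1.5, size=1000)
v4_pos = rng.normal(loc=7.0, scale=1.5, size=1000)

# Shared bin edges so the two class histograms are comparable.
bins = np.linspace(-2, 12, 40)
hist_neg, _ = np.histogram(v4_neg, bins=bins)
hist_pos, _ = np.histogram(v4_pos, bins=bins)

def predict_from_histogram(value):
    """Predict the class with the highest frequency in value's bin."""
    i = np.clip(np.searchsorted(bins, value) - 1, 0, len(hist_pos) - 1)
    return int(hist_pos[i] > hist_neg[i])  # 1 = positive, 0 = negative

print(predict_from_histogram(5.5))  # likely 1: positives dominate above 5
```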
However, we can do more than that and take advantage of all the features at once, synthesizing their information into a single distribution. This is the job of the score function S(x). The score function performs a regression that squeezes the feature distributions into one distribution P(s, y), which can be conditioned on each of the classes: P(s|y=0) for negatives (y=0) and P(s|y=1) for positives (y=1).
From this single distribution, we choose a decision threshold t that determines whether our estimate for a given point — which is represented by ŷ — will be positive or negative. If s is greater than t, we assign a positive label; otherwise, a negative.
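A minimal sketch of this pipeline, assuming a logistic regression as the score function S(x) (the data and the threshold value are illustrative choices, not prescribed by the text above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic two-class data with n = 3 features (illustrative only).
X_neg = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
X_pos = rng.normal(loc=1.0, scale=1.0, size=(500, 3))
X = np.vstack([X_neg, X_pos])
y = np.array([0] * 500 + [1] * 500)

# S(x): a regression that compresses the feature distributions
# into a single score distribution P(s, y).
model = LogisticRegression().fit(X, y)
s = model.predict_proba(X)[:, 1]  # one score per individual

# Decision threshold t: s > t gives a positive ŷ, otherwise negative.
t = 0.5
y_hat = (s > t).astype(int)
```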
Given the distribution P(s, y, t), where s, y and t represent the values of the score, the class and the threshold respectively, we have a complete description of our classifier.
Developing metrics for our classifier can be regarded as the pursuit of quantifying the discriminative nature of p(s|P) and p(s|N), where P and N denote the positive (y=1) and negative (y=0) classes.
Typically there will be an overlap between the two distributions p(s|P) and p(s|N) that makes a perfect classification impossible. So, given a threshold, one could ask, for example, what is the probability p(s > t|N), the False Positive Rate (FPR), that we are misclassifying negative individuals as positives.
Surely we can pile up a bunch of metrics and even give each of them a name (often inconsistently). But, for all purposes, it will be sufficient to define four probabilities and their associated rates for the classifier (a numerical sketch follows the list):
- True Positive Rate (tpr): p(s > t|P) = TP/(TP+FN);
- False Positive Rate (fpr): p(s > t|N) = FP/(FP+TN);
- True Negative Rate (tnr): p(s ≤ t|N) = TN/(FP+TN);
- False Negative Rate (fnr): p(s ≤ t|P) = FN/(TP+FN).
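A small numerical sketch of these four rates, estimated from toy scores and labels at a fixed threshold (all values illustrative):

```python
import numpy as np

def classification_rates(s, y, t):
    """Estimate tpr, fpr, tnr, fnr from scores s, labels y, threshold t."""
    s, y = np.asarray(s), np.asarray(y)
    pred = s > t
    tp = np.sum(pred & (y == 1))
    fp = np.sum(pred & (y == 0))
    tn = np.sum(~pred & (y == 0))
    fn = np.sum(~pred & (y == 1))
    return {
        "tpr": tp / (tp + fn),  # p(s > t | P)
        "fpr": fp / (fp + tn),  # p(s > t | N)
        "tnr": tn / (fp + tn),  # p(s <= t | N)
        "fnr": fn / (tp + fn),  # p(s <= t | P)
    }

# Toy scores and labels, purely illustrative.
s = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
y = np.array([0, 0, 1, 1, 1, 0])
print(classification_rates(s, y, t=0.5))
```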
If you are already familiar with the subject, you may have noticed that these are the metrics we define from the confusion matrix of our classifier. Since this matrix is defined for each chosen decision threshold, we can view it as a representation of the conditional distribution P(ŷ, y|t), where each such matrix is a member of a class of confusion matrices that, taken together, completely describe the performance of our classifier.
Therefore, the error rates fpr and fnr are metrics that quantify how, and by how much, the two conditional score distributions intersect.
Summarizing performance: the ROC curve
Since the rates are constrained by tpr + fnr = 1 and fpr + tnr = 1 (each pair sums to one because p(s > t|·) + p(s ≤ t|·) = 1), we have just two degrees of freedom to describe our performance.
The ROC curve is a curve parameterized by t, described by the points (x(t), y(t)) = (fpr(t), tpr(t)).
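A minimal sketch of this parameterization, sweeping t over the observed scores and collecting the resulting (fpr, tpr) pairs (toy data again, reusing the names from the rate sketch above):

```python
import numpy as np

def roc_curve_points(s, y):
    """Sweep the threshold t over all scores and collect (fpr(t), tpr(t))."""
    s, y = np.asarray(s), np.asarray(y)
    # Descending thresholds; -inf at the end so the curve reaches (1, 1).
    thresholds = np.append(np.sort(np.unique(s))[::-1], -np.inf)
    fpr, tpr = [], []
    for t in thresholds:
        pred = s > t  # strict inequality, matching p(s > t | ·)
        tpr.append(np.mean(pred[y == 1]))
        fpr.append(np.mean(pred[y == 0]))
    return np.array(fpr), np.array(tpr)

s = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
y = np.array([0, 0, 1, 1, 1, 0])
fpr, tpr = roc_curve_points(s, y)
print(list(zip(fpr, tpr)))  # points (x, y) = (fpr(t), tpr(t)) along the curve
```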