Credit card fraud is a plague that affects all financial institutions. Fraud detection is challenging in general because fraudsters keep coming up with new and innovative ways to commit fraud, which makes it difficult to find a stable pattern we can detect. For example, in the diagram below all the icons look alike, but one icon is slightly different from the rest, and that is the one we need to pick out. Can you spot it?
Here it is:
With this background, let me give you a plan for today and what you will learn in the context of our 'Credit Card Fraud Detection' use case:
1. What is data imbalance?
2. Possible causes of data imbalance
3. Why is class imbalance a problem in machine learning?
4. A quick refresher on the random forest algorithm
5. Different sampling methods to address data imbalance
6. Comparison of which method works well in our context with a practical demonstration using Python
7. A business view on which model to choose, and why
In most cases the number of fraudulent transactions is not huge, so we have to work with data that contains many non-fraudulent transactions compared to fraud cases. In technical terms, such a data set is called "imbalanced data." It is still essential to detect the fraud cases, because even a single fraudulent transaction can cause millions in losses to banks and financial institutions. Now, let's dive into what data imbalance is.
We will consider the credit card fraud data set from https://www.kaggle.com/mlg-ulb/creditcardfraud (Open Data License).
Formally, data imbalance means that the distribution of samples across the different classes is unequal. In our binary classification problem, there are two classes:
a) Majority class: non-fraudulent/genuine transactions
b) Minority class: fraudulent transactions
In the considered data set, the class distribution is as follows (Table 1):

Table 1: Class distribution
Non-fraudulent (Class 0): 284,315 transactions (99.83%)
Fraudulent (Class 1): 492 transactions (0.17%)
As we can see, the data set is highly imbalanced: only 0.17% of the observations fall into the fraudulent category.
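As a quick check, the class distribution can be computed directly with pandas. Below is a minimal sketch assuming the Kaggle CSV has been downloaded locally as creditcard.csv; in this data set the label column is named 'Class', with 1 indicating fraud.
import pandas as pd
# Load the Kaggle credit card fraud data set (assumes creditcard.csv has been downloaded)
df = pd.read_csv('creditcard.csv')
# 'Class' is the label column: 0 = genuine, 1 = fraudulent
counts = df['Class'].value_counts()
print(counts)
print('Fraudulent share: {:.2%}'.format(counts[1] / len(df)))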
There can be two main causes of data imbalance:
a) Biased sampling/measurement errors: this occurs when samples are collected from only one class or one particular region, or when samples are misclassified. It can be resolved by improving the sampling methods.
b) Use case/feature domain: a problem more pertinent to our case is the prediction of a rare event, which automatically introduces a bias towards the majority class because occurrences of the minority class are simply infrequent.
This is a problem because most machine learning algorithms focus on learning from the events that occur most frequently, i.e. the majority class. This is called frequency bias, and as a result these algorithms may not perform well on imbalanced data sets. Techniques that typically do work well are tree-based algorithms and anomaly detection algorithms; traditionally, rule-based methods grounded in business logic have been used for fraud detection problems. Tree-based methods work well because a tree creates a rule-based hierarchy that can separate the two classes. Since decision trees tend to overfit the data, we will use an ensemble method to eliminate this possibility. For our use case, we will use the random forest algorithm today.
Random forest works by constructing multiple decision tree predictors; the mode of the classes predicted by the individual trees is the final class selected. It is like voting for the most popular class. For example, if two trees predict that a transaction is fraud while a third tree predicts non-fraud, then according to the random forest algorithm the final prediction will be fraud.
Formal definition: a random forest is a classifier consisting of a collection of tree-structured classifiers {h(x, Θk), k = 1, …} where the {Θk} are independent, identically distributed random vectors and each tree casts a unit vote for the most popular class at input x. (Source: Breiman (4))
Each tree depends on a random vector that is sampled independently, and all trees have a similar distribution. The generalization error converges as the number of trees grows. In its splitting criterion, random forest searches for the best feature among a random subset of features, and we can also compute variable importance and perform feature selection accordingly. Trees can be grown using the bagging technique, where observations are drawn at random with replacement from the training set. Another option is random split selection, where a split is chosen at random from among the K best splits at each node.
You can read more about it in Breiman's original paper (4).
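To make the voting and variable-importance points above concrete, here is a minimal sketch using scikit-learn; the toy data from make_classification stands in for real features and is purely illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
# Toy imbalanced data standing in for the real features (hypothetical, for illustration only)
X, y = make_classification(n_samples=1000, n_features=5, weights=[0.95, 0.05], random_state=0)
forest = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
forest.fit(X, y)
# Each tree votes; predict() returns the majority (most popular) class per sample
print(forest.predict(X[:3]))
# Importance of each feature, which can be used for feature selection
for i, imp in enumerate(forest.feature_importances_):
    print(f'feature_{i}: {imp:.3f}')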
We will now illustrate three sampling methods that can address data imbalance.
a) Random undersampling: random draws are taken from the non-fraud observations, i.e. the majority class, to match the number of fraud observations, i.e. the minority class. This means we are throwing away information from the data set, which may not always be ideal.
b) Random oversampling: here we do exactly the opposite of undersampling, i.e. we duplicate random fraud observations to grow the minority class until we obtain a balanced data set. The possible limitation is that we create many duplicates with this method.
c) SMOTE (Synthetic Minority Oversampling Technique): another method, which creates synthetic data using k-nearest neighbors (KNN) instead of duplicating data. Each minority class example is considered along with its k nearest neighbors, and synthetic examples are created along the line segments joining the example to any or all of those k nearest neighbors. This is illustrated in Figure 3 below:
With just oversampling, the decision boundary becomes smaller, while with SMOTE we can create larger decision regions, thus improving the chances of better capturing the minority class.
A possible limitation is that if the minority class, i.e. the fraudulent observations, is spread throughout the data rather than forming distinct clusters, then using nearest neighbors to create more fraud cases introduces noise into the data, which can lead to misclassification.
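The core interpolation step of SMOTE can be sketched in a few lines of numpy. This is an illustrative toy version, not the imblearn implementation, with made-up 2-D minority-class points.
import numpy as np
rng = np.random.default_rng(0)
# Toy minority-class points (hypothetical 2-D fraud observations)
minority = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2]])
def smote_sample(x, neighbor, rng):
    # The synthetic point lies on the line segment between x and one of its nearest neighbors
    gap = rng.random()  # random fraction in [0, 1)
    return x + gap * (neighbor - x)
# Create one synthetic example between the first point and its nearest neighbor
synthetic = smote_sample(minority[0], minority[1], rng)
print(synthetic)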
Some of the metrics that are useful in judging the performance of a model are listed below. These metrics provide a view of how well and how accurately the model is able to predict/classify the target variables:
· TP (True Positive)/TN (True Negative) are the cases of correct predictions, i.e. predicting fraud cases as fraud (TP) and predicting non-fraud cases as non-fraud (TN).
· FP (False Positive) are the cases that are not actually fraud but that the model predicts as fraud.
· FN (False Negative) are the cases that are actually fraud but that the model predicts as non-fraud.
Precision = TP / (TP + FP): precision measures how accurately the model captures fraud, i.e. of the total predicted fraud cases, how many actually turned out to be fraud.
Recall = TP / (TP + FN): recall measures, of all actual fraud cases, how many the model could correctly predict as fraud. This is an important metric here.
Accuracy = (TP + TN) / (TP + FP + FN + TN): measures how many of the majority and minority class cases could be correctly classified.
F1 Score = 2*TP / (2*TP + FP + FN) = 2*Precision*Recall / (Precision + Recall): this is a balance between precision and recall. Note that precision and recall are inversely related, so the F1 score is a good measure for striking a balance between the two.
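As a quick sanity check on these formulas, the metrics can be computed by hand from confusion-matrix counts; the numbers below are made up purely for illustration.
# Hypothetical confusion-matrix counts, for illustration only
TP, TN, FP, FN = 80, 900, 15, 20
precision = TP / (TP + FP)
recall = TP / (TP + FN)
accuracy = (TP + TN) / (TP + TN + FP + FN)
f1 = 2 * precision * recall / (precision + recall)
print(f'Precision: {precision:.3f}, Recall: {recall:.3f}, '
      f'Accuracy: {accuracy:.3f}, F1: {f1:.3f}')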
First, we will train the random forest model with some default parameters. Note that model optimization via feature selection or cross-validation is kept out of scope here for simplicity. After that, we train the model using undersampling, oversampling, and then SMOTE. The following table illustrates the confusion matrix along with the precision, recall, and accuracy metrics for each method.
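All the snippets that follow assume a standard train/test preparation along these lines. This is only a sketch: the exact split parameters are assumptions, not the original settings.
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.read_csv('creditcard.csv')
# Separate the features from the 'Class' label column
x = df.drop('Class', axis=1)
y = df['Class']
# A stratified split preserves the 0.17% fraud rate in both sets (test size is an assumption)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=0)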
a) No-sampling results interpretation: without any sampling we can capture 76 fraudulent transactions. Although the overall accuracy is 97%, the recall is only 75%. This means there are quite a few fraudulent transactions that our model cannot capture.
Below is the code that can be used:
# Train the random forest model with default parameters
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)
classifier.fit(x_train, y_train)

# Predict y on the test set
y_pred = classifier.predict(x_test)

# Obtain the results from the classification report and confusion matrix
from sklearn.metrics import classification_report, confusion_matrix
print('Classification report:\n', classification_report(y_test, y_pred))
conf_mat = confusion_matrix(y_true=y_test, y_pred=y_pred)
print('Confusion matrix:\n', conf_mat)
b) Undersampling results interpretation: with undersampling, the model can capture 90 fraud cases with a significant improvement in recall, but accuracy and precision fall dramatically. This is because false positives have increased phenomenally and the model is penalizing many genuine transactions.
Undersampling code snippet:
# Import the resampling method and the pipeline module from imblearn
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

# Define which resampling method and which ML model to use in the pipeline
resampling = RandomUnderSampler()
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)

# Define the pipeline, combining the sampling method with the RF model
pipeline = Pipeline([('RandomUnderSampler', resampling), ('RF', model)])
pipeline.fit(x_train, y_train)
predicted = pipeline.predict(x_test)

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
c) Oversampling results interpretation: the oversampling method has the highest precision and accuracy, and the recall is also good at 81%. We can capture 6 more fraud cases, and the false positives are also quite low. Overall, across all metrics, this is a good model.
Oversampling code snippet:
# Import the resampling method from imblearn (Pipeline was imported above)
from imblearn.over_sampling import RandomOverSampler

# Define which resampling method and which ML model to use in the pipeline
resampling = RandomOverSampler()
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)

# Define the pipeline, combining the sampling method with the RF model
pipeline = Pipeline([('RandomOverSampler', resampling), ('RF', model)])
pipeline.fit(x_train, y_train)
predicted = pipeline.predict(x_test)

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
d) SMOTE results interpretation: SMOTE further improves on the oversampling method, detecting 3 more fraud cases, and although the false positives increase slightly, the recall is quite healthy at 84%.
SMOTE code snippet:
# Import the resampling method from imblearn
from imblearn.over_sampling import SMOTE

# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE(sampling_strategy='auto', random_state=0)
model = RandomForestClassifier(n_estimators=10, criterion='entropy', random_state=0)

# Define the pipeline, combining SMOTE with the RF model
pipeline = Pipeline([('SMOTE', resampling), ('RF', model)])
pipeline.fit(x_train, y_train)
predicted = pipeline.predict(x_test)

# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
In our fraud detection use case, the most important metric is recall. Banks and financial institutions care most about catching as many fraud cases as possible, because fraud is costly and they can lose a lot of money to it. Therefore, even a few false positives, i.e. flagging genuine customers as fraud, may not be too burdensome, since it only means blocking some transactions. However, blocking too many genuine transactions is not a feasible solution either, so depending on the risk appetite of the financial institution we can opt for simple oversampling or for SMOTE. We can also tune the model parameters with a grid search to further improve the results, as sketched below.
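A minimal grid-search sketch with scikit-learn's GridSearchCV might look as follows; the parameter grid and scoring choice here are illustrative assumptions, not tuned results.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Example grid; the values here are assumptions, not tuned results
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 10, 20],
}

# Score on recall, since that is the metric we care most about here
search = GridSearchCV(
    RandomForestClassifier(criterion='entropy', random_state=0),
    param_grid, scoring='recall', cv=5)
search.fit(x_train, y_train)
print(search.best_params_, search.best_score_)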
For details about the code, see this link at GitHub.
References:
(1) Mythili Krishnan, Madhan K. Srinivasan, Credit Card Fraud Detection: An Exploration of Different Sampling Methods to Solve the Class Imbalance Problem (2022), ResearchGate
(2) Bartosz Krawczyk, Learning from imbalanced data: Open challenges and future directions (2016), Springer
(3) Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall and W. Philip Kegelmeyer, SMOTE: Synthetic Minority Over-sampling Technique (2002), Journal of Artificial Intelligence Research
(4) Leo Breiman, Random Forests (2001), stat.berkeley.edu
(5) Jeremy Jordan, Learning from imbalanced data (2018)