Credit Card Fraud Detection — Part 2

In this part, we’ll dive deeper into the models and the data-imbalance techniques we’re going to use.

We start by dropping all the other amount columns we had added to the data frame (keeping only the log-scaled amount) and putting all the remaining features in ‘X’. We have to predict ‘Class’, so we drop it from ‘X’ and store it in a label ‘y’.
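A minimal sketch of that step, assuming the data frame ‘df’ from Part 1; the column names ‘scaled_amount’ and ‘log_amount’ are placeholders, as the exact names added during EDA are not shown here:

# 'Amount' and 'scaled_amount' stand in for the extra amount columns from
# Part 1 (the names are assumptions); only the log-scaled amount is kept.
df = df.drop(columns=['Amount', 'scaled_amount'], errors='ignore')

# All remaining features go in X; the target 'Class' goes in y.
X = df.drop(columns=['Class'])
y = df['Class']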

As we saw in the EDA portion (Part-1), this dataset has a HUGE class imbalance.

Why is Class Imbalance a Problem?

When a statistical classifier is trained on a highly imbalanced dataset, it tends to pick up the patterns of the most frequent class and ignore the rest.

For example, in this dataset, 99.9% of the transactions are labelled ‘Not Fraud’ and the rest ‘Fraud’. So even a model that classifies everything it sees as ‘Not Fraud’ achieves 99.9% accuracy, which seems excellent.

But is the model good? NO, because it never classifies any transaction as ‘Fraud’. So even with an accuracy of 99.9%, it is completely useless!

We need strategies for working with such a dataset, and we need metrics other than accuracy, in such scenarios.

Dealing with Class Imbalance

In this blog, we’ll use 4 techniques to deal with the class imbalance.

1. Under Sample majority class

In under-sampling, samples in the majority class are removed at random until their number matches the number of samples in the minority class.

This can be data-inefficient: the loss of useful data can make the decision boundary between minority and majority samples harder to learn, especially for rule-based classifiers.

This technique is only effective when the minority class has sufficient data despite the severe imbalance.

2. Over Sample minority class

This is the exact opposite of under-sampling: samples in the minority class are duplicated at random until their number matches the number of samples in the majority class.

This can lead to overfitting, as it makes exact copies of the minority-class samples (it adds no new information to the model, it simply replicates existing data points).

3. Synthetic Minority Oversampling Technique (SMOTE)

SMOTE is an oversampling technique that creates new synthetic examples in the minority class, similar to the ones already there, instead of simply replicating them.

SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.

Page 47, Imbalanced Learning: Foundations, Algorithms, and Applications, 2013.

The SMOTE process includes the following steps (a toy illustration follows the list):

  1. Identifying a minority-class feature vector and its nearest (minority-class) neighbor.
  2. Taking the difference between the two.
  3. Multiplying the difference by a random number between 0 and 1.
  4. Identifying a new point on the line segment by adding the scaled difference to the original feature vector.
  5. Repeating the process for the other identified feature vectors.
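A toy numpy illustration of steps 1 to 4, not the library implementation; ‘a’ and ‘b’ are made-up vectors:

import numpy as np

rng = np.random.default_rng(42)

a = np.array([1.0, 2.0])        # a minority-class feature vector
b = np.array([3.0, 1.0])        # one of its k nearest minority neighbours

gap = rng.random()              # step 3: random number in [0, 1)
synthetic = a + gap * (b - a)   # step 4: a new point on the segment from a to b
print(synthetic)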

A limitation of this approach is that the synthetic examples are created without considering the majority class, which can produce ambiguous samples when there is strong overlap between the classes.

4. Adaptive Synthetic Sampling (ADASYN)

The essential idea behind ADASYN is to use a weighted distribution over the minority-class examples according to their level of difficulty in learning: more synthetic data is generated for the minority-class examples that are harder to learn.

The ADASYN approach improves learning with respect to the data distributions in two ways:

  1. It reduces the bias introduced by the class imbalance.
  2. It adaptively shifts the classification decision boundary toward the difficult examples.

NOTE: It is important to split into train and test sets before any oversampling technique is applied. Oversampling before splitting can put copies of the same observation in both the train and test sets, which simply allows our model to memorize those data points (causing overfitting).

Importing Dependencies:
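The original import cell was embedded as an image; the following is a plausible reconstruction covering everything used below (scikit-learn, imbalanced-learn and xgboost), not the author’s exact cell:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

from sklearn.metrics import (accuracy_score, roc_auc_score, precision_score,
                             recall_score, f1_score, confusion_matrix)

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler, SMOTE, ADASYN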

Let us prepare the datasets for all of these class-imbalance methods, splitting first as per the note above.
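A sketch of the split; test_size=0.3 and random_state=42 are assumptions, though the outputs below do suggest a test set of roughly 30% of the data:

# Split first, then resample only the training portion (see the NOTE above).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)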

Random Under Sample Dataset
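The original cell was an image; a minimal sketch with imbalanced-learn’s RandomUnderSampler (random_state=42 is an arbitrary choice):

# Randomly drop majority-class rows, from the training data only.
rus = RandomUnderSampler(random_state=42)
X_train_rus, y_train_rus = rus.fit_resample(X_train, y_train)
print(np.bincount(y_train_rus))  # both classes now at the minority-class count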

Output for Under Sampling dataset

Random Over Sampler Dataset
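A matching sketch with RandomOverSampler, which duplicates minority rows until the classes balance:

ros = RandomOverSampler(random_state=42)
X_train_ros, y_train_ros = ros.fit_resample(X_train, y_train)
print(np.bincount(y_train_ros))  # both classes now at the majority-class count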

Output for Over Sampling dataset

SMOTE Dataset
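A sketch with imbalanced-learn’s SMOTE; k_neighbors=5 is the library default, stated explicitly here for clarity:

# Interpolate new minority samples between nearest minority neighbours.
sm = SMOTE(random_state=42, k_neighbors=5)
X_train_sm, y_train_sm = sm.fit_resample(X_train, y_train)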

Output for SMOTE dataset

ADASYN Dataset
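And a sketch with ADASYN, which concentrates the synthetic points where minority samples are harder to learn:

ada = ADASYN(random_state=42, n_neighbors=5)
X_train_ada, y_train_ada = ada.fit_resample(X_train, y_train)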

Output for ADASYN dataset

Metrics

The CONFUSION MATRIX (error matrix) allows visualization of the performance of an algorithm.

True Positive (TP) : Fraud correctly identified as Fraud

True Negative (TN) : Non-fraud correctly identified as Non-fraud

False Positive (FP) : Non-fraud incorrectly identified as Fraud

False Negative (FN) : Fraud incorrectly identified as Non-fraud

Accuracy

(TP + TN) / (TP + TN + FP + FN)

As discussed before, we’ll need metrics other than accuracy to evaluate our models.

Precision

TP / (TP + FP)

Precision tells us how many of the cases predicted as positive actually turned out to be positive (how likely a positive-class prediction is to be correct).

Recall

TP / (TP + FN)

Recall tells us how many of the actual positive cases the model was able to predict correctly (how good the model is at recognizing the positive class).

F1 score

When we try to increase Precision, Recall tends to go down, and vice versa. The F1 score captures both trends in a single value: it is the harmonic mean of Precision and Recall.

F1 score: 2 x ((Precision x Recall) / (Precision + Recall))

For a given sum of Precision and Recall, the F1 score is highest when the two are equal.

For this dataset, we’re going to compare the results of the various models using the F1 score.

ROC Curve

The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.

To measure the performance of each classification model (on all five datasets), it is helpful to write a function that evaluates all the metrics mentioned above and stores them so they can be compared later.
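A sketch of such a helper; the names ‘evaluate’ and ‘results’ are made up here. Note that AUC is computed from the hard predictions, which is consistent with the 0.5 AUC printed for the all-negative models below:

results = []  # one entry per (classifier, dataset) combination

def evaluate(model, name, X_test, y_test):
    y_pred = model.predict(X_test)
    metrics = {
        'Model Name': name,
        'Test Accuracy': accuracy_score(y_test, y_pred),
        'Test AUC': roc_auc_score(y_test, y_pred),
        'Test Precision': precision_score(y_test, y_pred, zero_division=0),
        'Test Recall': recall_score(y_test, y_pred),
        'Test F1': f1_score(y_test, y_pred),
        'Confusion Matrix': confusion_matrix(y_test, y_pred),
    }
    results.append(metrics)
    for key, value in metrics.items():
        print(key, ':', value)
    return metrics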

Classification Algorithms

Here, we’ll discuss all these algorithms concisely before applying them.

For each of these classification algorithms, we’ll try all the class-imbalance techniques (on the datasets built above) and compare the results at the end using the metrics mentioned before; the fit-and-score loop follows the same pattern each time, sketched right after this paragraph. So, let’s get started!
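The pattern, shown here for Logistic Regression; the variable names carry over from the resampling sketches above:

datasets = {
    'IMBALANCED':  (X_train, y_train),
    'UNDERSAMPLE': (X_train_rus, y_train_rus),
    'OVERSAMPLE':  (X_train_ros, y_train_ros),
    'SMOTE':       (X_train_sm, y_train_sm),
    'ADASYN':      (X_train_ada, y_train_ada),
}

for label, (X_tr, y_tr) in datasets.items():
    clf = LogisticRegression(max_iter=1000)  # swap in any classifier below
    clf.fit(X_tr, y_tr)
    evaluate(clf, 'LR ' + label, X_test, y_test)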

1. Logistic Regression Classifier

Model Name : LR IMBALANCED
Test Accuracy :0.99826
Test AUC : 0.50000
Test Precision : 0.00000
Test Recall : 0.00000
Test F1 : 0.00000
Confusion Matrix :
[[84970 0]
[ 148 0]]


Model Name : LR UNDERSAMPLE
Test Accuracy :0.99826
Test AUC : 0.50000
Test Precision : 0.00000
Test Recall : 0.00000
Test F1 : 0.00000
Confusion Matrix :
[[84970 0]
[ 148 0]]


Model Name : LR OVERSAMPLE
Test Accuracy :0.99659
Test AUC : 0.79594
Test Precision : 0.27673
Test Recall : 0.59459
Test F1 : 0.37768
Confusion Matrix :
[[84740 230]
[ 60 88]]


Model Name : LR SMOTE
Test Accuracy :0.99659
Test AUC : 0.79594
Test Precision : 0.27673
Test Recall : 0.59459
Test F1 : 0.37768
Confusion Matrix :
[[84740 230]
[ 60 88]]


Model Name : LR ADASYN
Test Accuracy :0.99633
Test AUC : 0.82279
Test Precision : 0.26966
Test Recall : 0.64865
Test F1 : 0.38095
Confusion Matrix :
[[84710 260]
[ 52 96]]
ROC curve for Logistic Regression classifier

2. Random Forest Classifier

Model Name : RF IMBALANCED
Test Accuracy :0.99952
Test AUC : 0.88847
Test Precision : 0.93496
Test Recall : 0.77703
Test F1 : 0.84871
Confusion Matrix :
[[84962 8]
[ 33 115]]


Model Name : RF UNDERSAMPLE
Test Accuracy :0.96343
Test AUC : 0.93784
Test Precision : 0.04173
Test Recall : 0.91216
Test F1 : 0.07981
Confusion Matrix :
[[81870 3100]
[ 13 135]]


Model Name : RF OVERSAMPLE
Test Accuracy :0.99952
Test AUC : 0.87835
Test Precision : 0.95726
Test Recall : 0.75676
Test F1 : 0.84528
Confusion Matrix :
[[84965 5]
[ 36 112]]


Model Name : RF SMOTE
Test Accuracy :0.99947
Test AUC : 0.91542
Test Precision : 0.86014
Test Recall : 0.83108
Test F1 : 0.84536
Confusion Matrix :
[[84950 20]
[ 25 123]]


Model Name : RF ADASYN
Test Accuracy :0.99947
Test AUC : 0.91542
Test Precision : 0.86014
Test Recall : 0.83108
Test F1 : 0.84536
Confusion Matrix :
[[84950 20]
[ 25 123]]
ROC Curve for Random Forest classifier

3. Gaussian Naïve Bayes Classifier

Model Name : NB IMBALANCED
Test Accuracy :0.99316
Test AUC : 0.80097
Test Precision : 0.14658
Test Recall : 0.60811
Test F1 : 0.23622
Confusion Matrix :
[[84446 524]
[ 58 90]]


Model Name : NB UNDERSAMPLE
Test Accuracy :0.99026
Test AUC : 0.88046
Test Precision : 0.12541
Test Recall : 0.77027
Test F1 : 0.21570
Confusion Matrix :
[[84175 795]
[ 34 114]]


Model Name : NB OVERSAMPLE
Test Accuracy :0.99119
Test AUC : 0.87418
Test Precision : 0.13559
Test Recall : 0.75676
Test F1 : 0.22998
Confusion Matrix :
[[84256 714]
[ 36 112]]


Model Name : NB SMOTE
Test Accuracy :0.99161
Test AUC : 0.88113
Test Precision : 0.14358
Test Recall : 0.77027
Test F1 : 0.24204
Confusion Matrix :
[[84290 680]
[ 34 114]]


Model Name : NB ADASYN
Test Accuracy :0.99118
Test AUC : 0.89103
Test Precision : 0.13978
Test Recall : 0.79054
Test F1 : 0.23756
Confusion Matrix :
[[84250 720]
[ 31 117]]
ROC curve for Naive Bayes Classifier

4. Decision Tree Classifier

Model Name : DT IMBALANCED
Test Accuracy :0.99915
Test AUC : 0.86805
Test Precision : 0.76761
Test Recall : 0.73649
Test F1 : 0.75172
Confusion Matrix :
[[84937 33]
[ 39 109]]


Model Name : DT UNDERSAMPLE
Test Accuracy :0.90412
Test AUC : 0.91151
Test Precision : 0.01642
Test Recall : 0.91892
Test F1 : 0.03225
Confusion Matrix :
[[76821 8149]
[ 12 136]]


Model Name : DT OVERSAMPLE
Test Accuracy :0.99887
Test AUC : 0.84767
Test Precision : 0.66883
Test Recall : 0.69595
Test F1 : 0.68212
Confusion Matrix :
[[84919 51]
[ 45 103]]


Model Name : DT SMOTE
Test Accuracy :0.99807
Test AUC : 0.90461
Test Precision : 0.46875
Test Recall : 0.81081
Test F1 : 0.59406
Confusion Matrix :
[[84834 136]
[ 28 120]]


Model Name : DT ADASYN
Test Accuracy :0.99769
Test AUC : 0.89092
Test Precision : 0.41281
Test Recall : 0.78378
Test F1 : 0.54079
Confusion Matrix :
[[84805 165]
[ 32 116]]
ROC Curve for Decision Trees classifier

5. K-Nearest Neighbor Classifier

Model Name : KNN IMBALANCED
Test Accuracy :0.99834
Test AUC : 0.52365
Test Precision : 1.00000
Test Recall : 0.04730
Test F1 : 0.09032
Confusion Matrix :
[[84970 0]
[ 141 7]]


Model Name : KNN UNDERSAMPLE
Test Accuracy :0.64224
Test AUC : 0.61171
Test Precision : 0.00282
Test Recall : 0.58108
Test F1 : 0.00562
Confusion Matrix :
[[54580 30390]
[ 62 86]]


Model Name : KNN OVERSAMPLE
Test Accuracy :0.99823
Test AUC : 0.61802
Test Precision : 0.47945
Test Recall : 0.23649
Test F1 : 0.31674
Confusion Matrix :
[[84932 38]
[ 113 35]]


Model Name : KNN SMOTE
Test Accuracy :0.98089
Test AUC : 0.82180
Test Precision : 0.05851
Test Recall : 0.66216
Test F1 : 0.10752
Confusion Matrix :
[[83393 1577]
[ 50 98]]


Model Name : KNN ADASYN
Test Accuracy :0.98032
Test AUC : 0.83164
Test Precision : 0.05842
Test Recall : 0.68243
Test F1 : 0.10762
Confusion Matrix :
[[83342 1628]
[ 47 101]]
ROC Curve for KNNs classifier

6. XG Boost Classifier

Model Name : XGBOOST IMBALANCED
Test Accuracy :0.99954
Test AUC : 0.90871
Test Precision : 0.90977
Test Recall : 0.81757
Test F1 : 0.86121
Confusion Matrix :
[[84958 12]
[ 27 121]]


Model Name : XGBOOST UNDERSAMPLE
Test Accuracy :0.91485
Test AUC : 0.92025
Test Precision : 0.01858
Test Recall : 0.92568
Test F1 : 0.03643
Confusion Matrix :
[[77733 7237]
[ 11 137]]


Model Name : XGBOOST OVERSAMPLE
Test Accuracy :0.99948
Test AUC : 0.91206
Test Precision : 0.87143
Test Recall : 0.82432
Test F1 : 0.84722
Confusion Matrix :
[[84952 18]
[ 26 122]]


Model Name : XGBOOST SMOTE
Test Accuracy :0.99935
Test AUC : 0.91874
Test Precision : 0.80000
Test Recall : 0.83784
Test F1 : 0.81848
Confusion Matrix :
[[84939 31]
[ 24 124]]


Model Name : XGBOOST ADASYN
Test Accuracy :0.99928
Test AUC : 0.91533
Test Precision : 0.77358
Test Recall : 0.83108
Test F1 : 0.80130
Confusion Matrix :
[[84934 36]
[ 25 123]]
ROC curve for XGBOOST classifier

7. MLP Classifier

ROC curve for MLP classifier

Now we’ll compare the test-set F1 scores (since accuracy is not a good metric for an imbalanced dataset) across all the models and all the datasets.

For this, we’ll create a dictionary, ‘comparision’, in which the key is the label and the value is a list of the scores of all the models we appended earlier, as sketched below.
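A sketch of that comparison step, assuming the ‘results’ list from the helper above (‘comparision’ keeps the author’s original spelling):

# Group the stored F1 scores by imbalance-technique label. This assumes every
# classifier was evaluated on every dataset, in the same order.
comparision = {}
model_names = []
for r in results:
    model, label = r['Model Name'].split(' ', 1)
    comparision.setdefault(label, []).append(r['Test F1'])
    if model not in model_names:
        model_names.append(model)

# One group of bars per classifier, one bar per technique.
pd.DataFrame(comparision, index=model_names).plot.bar(figsize=(10, 5))
plt.ylabel('Test F1')
plt.show()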

Image Depicting Comparison of various Classifiers(applied with Data Imbalance techniques)

Going by the numbers above, the F1 score of XGBoost (Imbalanced) is the highest, followed closely by Random Forest (Imbalanced) and XGBoost (Over Sample).

Conclusion

In this blog, we saw how data imbalance is a major challenge when building a model, and we compared different techniques for dealing with it across several classification algorithms.

References

Imbalanced Learning: Foundations, Algorithms, and Applications, edited by Haibo He and Yunqian Ma, Wiley-IEEE Press, 2013 (quoted in the SMOTE section above).
