Credit Card Fraud Detection — Part I
Credit Card Fraud Detection is an online challenge on Kaggle where the aim is to predict whether a transaction is fraudulent or not. I’ve divided this article into two parts: Part 1 describes the dataset and covers Exploratory Data Analysis, while Part 2 deals with data imbalance and compares various classification models.
About the Dataset
We’re given features V1, V2, … V28, which are the principal components obtained with PCA. The only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount; this feature can be used for example-dependent cost-sensitive learning. The feature ‘Class’ is a binary variable that takes the value 1 in case of a fraudulent transaction and 0 otherwise. The competition link is given below.
Exploratory Data Analysis
First, we’ll import all the dependencies needed for Exploratory Data Analysis, in which we’ll analyze the dataset for significant patterns, deal with missing and duplicate data, and look at heatmaps, distributions, etc.
Reading the given dataset using pandas:
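A minimal sketch of this step follows. The file name creditcard.csv is the one the Kaggle competition provides; the inline CSV here is a tiny stand-in so the snippet runs on its own.

```python
import io

import pandas as pd

# Tiny inline stand-in for the Kaggle file; with the real data you would run:
# df = pd.read_csv("creditcard.csv")
csv_data = """Time,V1,V2,Amount,Class
0,-1.36,-0.07,149.62,0
0,1.19,0.27,2.69,0
1,-1.36,-1.34,378.66,1
"""
df = pd.read_csv(io.StringIO(csv_data))
print(df.shape)   # the real dataset has shape (284807, 31)
print(df.head())
```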
Now, we’ll see if there is any missing data in the dataset.
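A quick way to check this is `isnull().sum()`; the small frame below is a stand-in for the real dataset loaded with `pd.read_csv("creditcard.csv")`.

```python
import pandas as pd

# Stand-in frame; on the real data: df = pd.read_csv("creditcard.csv")
df = pd.DataFrame({"Time": [0, 0, 1],
                   "Amount": [149.62, 2.69, 378.66],
                   "Class": [0, 0, 1]})

print(df.isnull().sum())           # null count per column
print(df.isnull().values.any())    # False means nothing to impute or drop
```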
As we can see, there are no null values to deal with in the dataset, so we proceed further.
The dataset might consist of some duplicates, which we are going to check for and remove. To verify this, we’ll look at the shape of the data before and after removing the duplicates.
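This is `drop_duplicates` in pandas; the sketch below uses a tiny stand-in frame with one duplicated row.

```python
import pandas as pd

# Stand-in frame with one duplicated row; on the real data this prints
# (284807, 31) before and (283726, 31) after
df = pd.DataFrame({"Time": [0, 0, 1],
                   "Amount": [2.69, 2.69, 378.66],
                   "Class": [0, 0, 1]})

print(df.shape)            # shape before
df = df.drop_duplicates()
print(df.shape)            # shape after removing exact duplicate rows
```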
OUTPUT: (284807, 31)
OUTPUT: (283726, 31)
Since the shape of the data has changed after removing duplicates, we infer that 1081 rows containing duplicated data have been deleted.
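Next we look at how transactions are spread over time, which a simple histogram of the ‘Time’ column shows. This sketch uses synthetic times standing in for `df["Time"]` and writes the figure to a file instead of displaying it.

```python
import matplotlib
matplotlib.use("Agg")              # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Synthetic seconds over ~2 days as a stand-in for df["Time"]
rng = np.random.default_rng(0)
time_s = rng.uniform(0, 172_800, 5_000)

plt.hist(time_s, bins=48)
plt.xlabel("Seconds since first transaction")
plt.ylabel("Number of transactions")
plt.title("Distribution of transaction times")
plt.savefig("time_hist.png")
```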
There are two peaks in the graph. The dataset spans 2 days, so the two peaks correspond to the times of day when the maximum number of transactions happen (and the dip corresponds to night time, when people are not making many transactions).
We can see that we have a huge class imbalance here. We’ll discuss the issues it causes and how we’ll deal with them in Part 2 of this blog series.
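The imbalance is easy to quantify with `value_counts`. The labels below are a stand-in; on the real data the fraud class is only about 0.17% of all transactions.

```python
import pandas as pd

# Stand-in labels; on the real data: counts = df["Class"].value_counts()
df = pd.DataFrame({"Class": [0] * 98 + [1] * 2})

counts = df["Class"].value_counts()
print(counts)
print(f"Fraud share: {counts[1] / len(df):.2%}")
```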
Now we’re going to plot time-distribution graphs of fraud and non-fraud transactions and observe whether we find any patterns.
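A side-by-side pair of histograms does the job; this sketch uses a synthetic frame standing in for the real one loaded with `pd.read_csv("creditcard.csv")`.

```python
import matplotlib
matplotlib.use("Agg")              # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real frame
rng = np.random.default_rng(1)
df = pd.DataFrame({"Time": rng.uniform(0, 172_800, 1_000),
                   "Class": rng.choice([0, 1], size=1_000, p=[0.98, 0.02])})

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].hist(df.loc[df["Class"] == 1, "Time"], bins=48, color="red")
axes[0].set_title("Fraud")
axes[1].hist(df.loc[df["Class"] == 0, "Time"], bins=48, color="blue")
axes[1].set_title("Non-fraud")
for ax in axes:
    ax.set_xlabel("Seconds since first transaction")
fig.savefig("time_by_class.png")
```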
We don’t observe any significant patterns so we’ll move ahead.
It is a better idea to scale the features before using the dataset so that all the values fall within a similar range. This is important so that features of lesser significance do not end up dominating more significant features because of their larger ranges.
E.g., in some dataset, the Salary column might be in lakhs/crores while the Age column would be under 100. This would lead the Salary column to dominate the prediction even though it might be less significant. For this reason, different types of scaling are used: Log, Standardization and Normalization. We’ll decide which of these to choose depending on our dataset.
Log scaling is a technique used when a variable spans several orders of magnitude.
Standardization is a scaling technique in which values are centered around the mean with a unit standard deviation: the mean of the attribute becomes zero and the resulting distribution has a standard deviation of one.
Normalization (Min-Max Scaling) is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1.
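A quick numeric illustration of the three transforms on a handful of amounts. `np.log1p` (log of 1 + x) is used for the log transform since transaction amounts can be zero; `StandardScaler` and `MinMaxScaler` come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

amounts = np.array([[2.69], [149.62], [378.66], [25000.0]])

log_scaled = np.log1p(amounts)                        # log(1 + x), safe at 0
std_scaled = StandardScaler().fit_transform(amounts)  # mean 0, unit variance
mm_scaled = MinMaxScaler().fit_transform(amounts)     # rescaled to [0, 1]

print(log_scaled.ravel())
print(std_scaled.ravel())
print(mm_scaled.ravel())
```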
To compare which scaling technique suits our dataset best, we’ll make box plots of the scaled amounts for each class.
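A sketch of the comparison: each scaled version of Amount is box-plotted against Class. The heavy-tailed synthetic amounts below stand in for `df["Amount"]` from creditcard.csv, and the figure is saved to a file.

```python
import matplotlib
matplotlib.use("Agg")              # headless backend
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic heavy-tailed amounts standing in for the real df["Amount"]
rng = np.random.default_rng(2)
df = pd.DataFrame({"Amount": rng.lognormal(3, 1.5, 500),
                   "Class": rng.choice([0, 1], size=500, p=[0.95, 0.05])})

df["log"] = np.log1p(df["Amount"])
df["standard"] = StandardScaler().fit_transform(df[["Amount"]]).ravel()
df["minmax"] = MinMaxScaler().fit_transform(df[["Amount"]]).ravel()

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, col in zip(axes, ["log", "standard", "minmax"]):
    df.boxplot(column=col, by="Class", ax=ax)
fig.suptitle("Scaled Amount by Class")
fig.savefig("scaling_boxplots.png")
```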
The minimum difference between the 0 and 1 classes can be seen with Log scaling; the rest show a huge difference in amounts between the two classes. Thus, we’ll go forward with Log scaling.
Link for part-2: https://vasudhatapriya2.medium.com/credit-card-fraud-detection-part-2-3e75d0022b9b