Modelling highly imbalanced credit card fraud detection data using statistical learning

Moodley, Revesa

dc.contributor.advisor	Britz, Stefan
dc.contributor.author	Moodley, Revesa
dc.date.accessioned	2024-05-27T08:46:14Z
dc.date.available	2024-05-27T08:46:14Z
dc.date.issued	2023
dc.date.updated	2024-05-22T08:11:45Z
dc.description.abstract	Credit card fraud is a major concern for businesses worldwide, yielding losses of up to $67 billion per year in major banks and institutions. Machine learning techniques used to detect fraudulent transactions face several challenges when dealing with highly imbalanced data, which is often the case with fraud detection. Whilst different sampling techniques are generally used to reduce the imbalance, few studies have focussed on the effect that the level of imbalance has on the predictive capabilities of various statistical learning techniques. This study investigates the effect of three factors on model performance: 1) sampling technique, 2) supervised learning method, and 3) prevalence rate, also known as imbalance ratio (IR), which refers to the proportion of majority class samples compared to that of the minority class. Three sampling techniques are utilised in the study: Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), and Random Undersampling (RUS). These methods are used to create varying levels of imbalance in the dataset, at prevalence rates of 0.2%, 1%, 10%, 20%, 30%, 40%, and 50%. Six supervised learning models are then used to identify fraudulent transactions: Logistic Regression (LR), C4.5 Decision Trees (DT), Random Forests (RF), XGBoost, and Neural Network (NN) models. Precision, recall and F2 score are the primary metrics used to assess model performance. The results suggest that the ROS and SMOTE sampling techniques performed best in terms of F2 score. The best supervised learning models were RF and XGBoost. The tree models were generally well suited to the imbalanced dataset, whilst LR performed the worst, even when regularisation was applied. Surprisingly, increasing the prevalence rate yielded a decrease in performance. The findings from the experiments can serve as a foundation for selecting the best sampling technique and supervised learning models to utilise with various degrees of dataset imbalance.
dc.identifier.apacitation	Moodley, R. (2023). Modelling highly imbalanced credit card fraud detection data using statistical learning. Faculty of Science, Department of Statistical Sciences. Retrieved from http://hdl.handle.net/11427/39715	en_ZA
dc.identifier.chicagocitation	Moodley, Revesa. "Modelling highly imbalanced credit card fraud detection data using statistical learning." Faculty of Science, Department of Statistical Sciences, 2023. http://hdl.handle.net/11427/39715	en_ZA
dc.identifier.citation	Moodley, R. 2023. Modelling highly imbalanced credit card fraud detection data using statistical learning. Faculty of Science, Department of Statistical Sciences. http://hdl.handle.net/11427/39715	en_ZA
dc.identifier.ris	TY - Thesis / Dissertation AU - Moodley, Revesa AB - Credit card fraud is a major concern for businesses worldwide, yielding losses of up to $67 billion per year in major banks and institutions. Machine learning techniques used to detect fraudulent transactions face several challenges when dealing with highly imbalanced data, which is often the case with fraud detection. Whilst different sampling techniques are generally used to reduce the imbalance, few studies have focussed on the effect that the level of imbalance has on the predictive capabilities of various statistical learning techniques. This study investigates the effect of three factors on model performance: 1) sampling technique, 2) supervised learning method, and 3) prevalence rate, also known as imbalance ratio (IR), which refers to the proportion of majority class samples compared to that of the minority class. Three sampling techniques are utilised in the study: Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), and Random Undersampling (RUS). These methods are used to create varying levels of imbalance in the dataset, at prevalence rates of 0.2%, 1%, 10%, 20%, 30%, 40%, and 50%. Six supervised learning models are then used to identify fraudulent transactions: Logistic Regression (LR), C4.5 Decision Trees (DT), Random Forests (RF), XGBoost, and Neural Network (NN) models. Precision, recall and F2 score are the primary metrics used to assess model performance. The results suggest that the ROS and SMOTE sampling techniques performed best in terms of F2 score. The best supervised learning models were RF and XGBoost. The tree models were generally well suited to the imbalanced dataset, whilst LR performed the worst, even when regularisation was applied. Surprisingly, increasing the prevalence rate yielded a decrease in performance. The findings from the experiments can serve as a foundation for selecting the best sampling technique and supervised learning models to utilise with various degrees of dataset imbalance. DA - 2023 DB - OpenUCT DP - University of Cape Town KW - Statistical Sciences LK - https://open.uct.ac.za PY - 2023 T1 - Modelling highly imbalanced credit card fraud detection data using statistical learning TI - Modelling highly imbalanced credit card fraud detection data using statistical learning UR - http://hdl.handle.net/11427/39715 ER -	en_ZA
dc.identifier.uri	http://hdl.handle.net/11427/39715
dc.identifier.vancouvercitation	Moodley R. Modelling highly imbalanced credit card fraud detection data using statistical learning. Faculty of Science, Department of Statistical Sciences, 2023 [cited yyyy month dd]. Available from: http://hdl.handle.net/11427/39715	en_ZA
dc.language.rfc3066	eng
dc.publisher.department	Department of Statistical Sciences
dc.publisher.faculty	Faculty of Science
dc.subject	Statistical Sciences
dc.title	Modelling highly imbalanced credit card fraud detection data using statistical learning
dc.type	Thesis / Dissertation
dc.type.qualificationlevel	Masters
dc.type.qualificationlevel	MSc
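
Illustrative note (not part of the repository record): the abstract names two quantities that are easy to misread, the prevalence rate and the F2 score. The sketch below is a minimal illustration, not the thesis's own code. It assumes that "prevalence rate" means the share of fraudulent (minority class) rows in the dataset, which is how the listed values of 0.2% to 50% read most naturally. The function names and the row counts are hypothetical.

```python
def f_beta(precision: float, recall: float, beta: float = 2.0) -> float:
    """F-beta score; with beta=2 (the F2 score) recall counts more than precision."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0  # both precision and recall are zero
    return (1 + b2) * precision * recall / denom


def majority_to_keep(n_minority: int, prevalence: float) -> int:
    """Majority rows to keep under random undersampling (RUS) so that
    n_minority / (n_minority + kept) equals the target prevalence."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return round(n_minority * (1 - prevalence) / prevalence)


def minority_needed(n_majority: int, prevalence: float) -> int:
    """Minority rows required under oversampling (ROS or SMOTE) so that
    needed / (needed + n_majority) equals the target prevalence."""
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return round(n_majority * prevalence / (1 - prevalence))


# Hypothetical counts chosen only to give a 0.2% starting prevalence.
n_fraud, n_legit = 1_000, 499_000

print(n_fraud / (n_fraud + n_legit))     # starting prevalence: 0.002
print(majority_to_keep(n_fraud, 0.10))   # RUS to 10% keeps 9000 majority rows
print(minority_needed(n_legit, 0.50))    # oversampling to 50% needs 499000 fraud rows
print(f_beta(precision=0.5, recall=1.0)) # F2 rewards the high recall
```

Under this reading, undersampling to a higher prevalence discards most of the legitimate transactions, while oversampling to the same prevalence multiplies the fraud rows, so the two families of methods produce very different training set sizes at the same prevalence rate.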

Files

Original bundle

Name: thesis_sci_2023_moodley revesa.pdf
Size: 2.83 MB
Format: Adobe Portable Document Format

Collections

Masters