The Effect of Dataset Size on the Performance of Classification Algorithms for Credit Scoring
Master's Thesis
2022
Abstract
In the body of research on predicting borrower default in the field of credit risk, the relative performance of different classification algorithms has received much attention from researchers. With the rise of machine learning techniques, spurred on by fast, cheap computing and the collection of massive datasets, a great deal of research has benchmarked these powerful machine learning algorithms against the traditional techniques used in credit scoring. This dissertation extends that benchmarking research by seeking to establish the effect that the size of the training dataset has on the relative performance of different algorithms. A thorough review of the credit risk prediction literature is conducted to establish the methodologies and findings prevalent in the field. It finds that most research on the subject benchmarks algorithm performance on very small datasets and, because relatively few credit risk datasets are publicly available, much of the research is conducted on the same datasets. This dissertation uses five publicly available credit risk datasets, on which six algorithms are evaluated: logistic regression, random forests, neural networks, gradient boosting machines, extreme gradient boosting, and stacked ensembles. Two separate analyses are conducted. The first is a general analysis, in which algorithm performance is benchmarked by average relative performance over all datasets, in line with the conventions established in the literature. The second is a learning curve analysis adapted from research in a different field: by training algorithms on subsamples of different sizes drawn from a large dataset, this technique allows us to investigate the ability of different algorithms to leverage additional data to improve prediction performance. The general analysis affirms the outperformance of stacked ensembles relative to the other algorithms evaluated, a finding prevalent in the literature, and establishes evidence for an effect of dataset size on the relative performance of different algorithms. The learning curve analysis shows definitively that the more sophisticated classification algorithms (random forests, neural networks, gradient boosting machines, extreme gradient boosting and stacked ensembles) consistently improve performance as more training data becomes available. Logistic regression, by contrast, reaches a performance plateau at a relatively small training dataset size. Given the prevalence of small datasets in the literature, this finding implies that existing research might overstate the performance of logistic regression relative to more sophisticated algorithms.
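The learning curve analysis described above can be illustrated with a minimal sketch: train each classifier on progressively larger subsamples of the training data and track its out-of-sample performance. The dataset, the evaluation metric (AUC), and the two scikit-learn models below are illustrative assumptions standing in for the thesis's actual datasets and six algorithms, not the exact setup used.

```python
# Hypothetical learning-curve sketch: subsample sizes, models and metric
# are assumptions for illustration, not the thesis's exact configuration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Placeholder for a large credit-risk dataset with a binary default label.
X, y = make_classification(n_samples=50_000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Training subsample sizes spaced on a log scale, as in a learning-curve design.
sizes = np.unique(np.logspace(np.log10(500), np.log10(len(X_train)),
                              num=8, dtype=int))

for name, model in models.items():
    for n in sizes:
        # Draw a random subsample of size n and refit the model on it.
        idx = np.random.RandomState(0).choice(len(X_train), size=n, replace=False)
        model.fit(X_train[idx], y_train[idx])
        auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
        print(f"{name}: n={n:>6d}  test AUC={auc:.3f}")
```

Plotting test AUC against subsample size for each model would show whether performance keeps improving with more data or plateaus early, which is the comparison the learning curve analysis is designed to make.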
Reference:
Gidlow, L. 2022. The Effect of Dataset Size on the Performance of Classification Algorithms for Credit Scoring. Master's Thesis, Faculty of Commerce, Department of Finance and Tax. http://hdl.handle.net/11427/37193