Browsing by Author "Britz, Stefan"
Now showing 1 - 7 of 7
- Item. Open Access. An exploration of alternative features in micro-finance loan default prediction models (2020). Stone, Devon; Britz, Stefan.
  Despite recent developments, financial inclusion remains a large issue for the world's unbanked population. Financial institutions - both larger corporations and micro-finance companies - have begun to provide solutions for financial inclusion. The solutions are delivered using a combination of machine learning and alternative data. This minor dissertation investigates whether alternative features generated from Short Messaging Service (SMS) data and Android application data contained on borrowers' devices can be used to improve the performance of loan default prediction models. The improvement gained by using alternative features is measured by comparing loan default prediction models trained using only traditional credit scoring data to models developed using a combination of traditional and alternative features. Furthermore, the paper investigates which of four machine learning techniques is best suited for loan default prediction. The four techniques investigated are logistic regression, random forests, extreme gradient boosting, and neural networks. Finally, the paper identifies whether or not accurate loan default prediction models can be trained using only the alternative features developed throughout this minor dissertation. The results of the research show that alternative features improve the performance of loan default prediction across five performance indicators, namely overall prediction accuracy, repaid prediction accuracy, default prediction accuracy, F1 score, and AUC. Furthermore, extreme gradient boosting is identified as the most appropriate technique for loan default prediction. Finally, the research finds that models trained using the alternative features developed throughout this project can accurately predict loans that have been repaid, but do not accurately predict loans that have not been repaid.
- Item. Open Access. Application of CNN-gcForestCS to cassava leaf image classification (2023). Carew, Liam; Britz, Stefan.
  Cassava is one of the most consumed carbohydrates in the world, providing a reliable source of income and nutrition to inhabitants of Latin America, Africa and Asia. However, its production is greatly affected by pathogenic infection, with cassava mosaic disease (CMD) posing the greatest threat to cassava farmers in Africa and Asia. Given that developing nations are estimated to be hit hardest by climate change and projected to have the largest population increases in coming decades, optimisation of cassava yield in these areas is imperative to ensure food security. Traditionally, crop health is determined by manual inspection, which can be laborious and error-prone and can require technical expertise. This produces a costly barrier to entry for smallholding farmers, who account for the majority of global cassava production. Automated disease detection systems using convolutional neural networks (CNNs) deployable on mobile phones have been shown to be a cost-efficient and effective method for cassava monitoring, mainly owing to their advanced feature extraction capabilities. However, CNNs require complex hyperparameter tuning and can be computationally intensive to train. GcForestCS (multi-grained cascade forest with confidence screening) presents an alternative statistical learning method that can be trained on a CPU and requires less complex hyperparameter tuning than deep learning, while producing competitive performance for lower-dimensionality datasets. Taking advantage of the feature extraction capabilities of CNNs and the competitive performance of gcForestCS for lower-dimensionality datasets, the central aim of this dissertation was to investigate CNN-gcForestCS as an alternative to deep learning for cassava leaf disease detection. The performance of CNN-gcForestCS was compared to gcForestCS and deep learning, and the effects of class balance, CNN feature extraction, CNN feature extractor fine-tuning, pooling after multi-grained scanning, and training set curation were assessed. The results showed that the best DenseNet201-gcForestCS model (86.79%) produced marginally worse performance than the best DenseNet201 model (87.43%), while the best MobileNetV2-gcForestCS model (83.66%) produced marginally better performance than the best MobileNetV2 model (82.87%). Overall, the results are inconclusive as to whether CNN-gcForestCS is a viable alternative to deep learning for cassava leaf disease detection, especially when considering the high computational cost associated with the CNN-gcForestCS methodology.
- Item. Open Access. Applying imputation and statistical learning to predict gamma-glutamyl transferase in underwriting data (2023). Perumal, Yevashan; Britz, Stefan.
  Insurance underwriting can be time-consuming and costly for both insurers and customers. However, the insight gained is of critical importance in addressing the information asymmetry between insurers and customers in terms of establishing a customer's risk profile. Consequently, any test that assists in providing a risk assessment is critical in allowing insurance companies to manage risk and price their products appropriately. Gamma-glutamyl transferase (GGT) is an enzyme which has been used by insurers in underwriting medical tests as an indicator of potential adverse outcomes. However, due to complexities such as differing underwriting strategies, data collection and data storage issues, not every customer on an insurer's books will have a GGT value or even a complete data profile. This research investigates whether statistical techniques such as imputation and supervised learning can be used in conjunction with available medical, demographic, underwriting and policy data to accurately predict GGT values. A combination of multivariate imputation by chained equations (MICE) and extreme gradient boosted trees (XGBoost) offers a 31% improvement in accuracy compared to a naïve prediction. However, there does appear to be a limit to the performance achieved from all implemented techniques with the analysed dataset, with various model combinations yielding root mean squared error (RMSE) values within a narrow range. In addition, when comparing the predictions from a separate, unlabelled dataset to actual data, it appears that predictions from the models cannot be reliably deemed to be from the same distribution. This indicates that further research is required before insurers can reliably replace blood-work based GGT results with those from a supervised learning model.
  Keywords: insurance, underwriting, gamma-glutamyl transferase, imputation, supervised learning
- Item. Open Access. Automated quantification of plant water transport network failure using deep learning (2021). Naidoo, Tristan; Britz, Stefan; Moncrieff, Glenn.
  Droughts, exacerbated by anthropogenic climate change, threaten plants through hydraulic failure. This hydraulic failure is caused by the formation of embolisms which block water flow in a plant's xylem conduits. By tracking these failures over time, vulnerability curves (VCs) can be created. The creation of these curves is laborious and time-consuming. This study seeks to automate the creation of these curves. In particular, it seeks to automate the optical vulnerability (OV) method of determining hydraulic failure. To do this, embolisms need to be segmented across a sequence of images. Three fully convolutional models were considered for this task, namely U-Net, U-Net (ResNet34), and W-Net. The sample consisted of four unique leaves, each with its own sequence of images. Using these leaves, three experiments were conducted. They considered whether a model could generalise across samples from the same leaf, across different leaves of the same species, and across different species. The results were assessed on two levels: the first considered the results of the segmentation, and the second considered how well VCs could be constructed. Across the three experiments, the highest test precision-recall AUCs achieved were 81%, 45%, and 40%. W-Net performed the worst across the models, while U-Net and U-Net (ResNet34) performed similarly to one another. VC reconstruction was assessed using two metrics. The first is Normalised Root Mean Square Error. The second is the difference in Ψ50 values between the true VC and the predicted VC, where Ψ50 is a physiological value of interest. This study found that the shape of the VCs could be reconstructed well if the model was able to recall a portion of embolisms across all images which had embolisms. Moreover, it found that some images may be more important than others due to a non-linear mapping between time and water potential. VC reconstruction was satisfactory, except for the third experiment. This study demonstrates that, in certain scenarios, automation of the OV method is attainable. To support the wider use and development of the work done in this study, a website was created to document the code base. In addition, this website contains instructions on how to interact with the code base. For more information please visit: https://plant-network-segmentation.readthedocs.io/.
- Item. Open Access. Modelling first innings totals in T20 cricket: applications in the Indian Premier League (2023). Gilbert, Arlton; Britz, Stefan.
  In the game of cricket, teams batting first are faced with the question of how many runs are enough. This paper proposes a solution to this in the context of the Indian Premier League (IPL). The aim is to build a model that will allow teams to determine what scores they would need for any given confidence of avoiding defeat in regular time, viz. before any Super Overs. The following machine learning methods are considered for this purpose: logistic regression, classification trees, bagging, random forest, boosting, support vector machines, artificial neural networks, and naive Bayes. Features are chosen that represent various key aspects of the game, including player strengths, stadium information, the winner of the toss, and which teams are involved. The results show that logistic regression is the best performing model, having a prediction accuracy of 70.27% and a Brier score of 0.2 for the 2022 season of the IPL. The majority of the incorrect predictions occurred in prediction ranges where the model itself suggested the game could have gone either way. The model is, therefore, fit for purpose and can allow teams to pace their innings and reduce unnecessary risks. The model can also be trained and used on other limited-over tournaments, including one-day matches.
- Item. Open Access. Modelling highly imbalanced credit card fraud detection data using statistical learning (2023). Moodley, Revesa; Britz, Stefan.
  Credit card fraud is a major concern for businesses worldwide, yielding losses of up to $67 billion per year in major banks and institutions. Machine learning techniques used to detect fraudulent transactions face several challenges when dealing with highly imbalanced data, which is often the case with fraud detection. Whilst different sampling techniques are generally used to reduce the imbalance, few studies have focussed on the effect that the level of imbalance has on the predictive capabilities of various statistical learning techniques. This study investigates the effect of three factors on model performance: 1) sampling technique, 2) supervised learning method, and 3) prevalence rate, also known as imbalance ratio (IR), which refers to the proportion of majority class samples compared to that of the minority class. Three sampling techniques are utilised in the study: Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE), and Random Undersampling (RUS). These methods are used to create varying levels of imbalance in the dataset, at the prevalence rates of 0.2%, 1%, 10%, 20%, 30%, 40%, and 50%. Six supervised learning models are then used to identify fraudulent transactions: Logistic Regression (LR), C4.5 Decision Trees (DT), Random Forests (RF), XGBoost, and Neural Network (NN) models. Precision, recall and F2 score are the primary metrics used to assess model performance. The results suggest that the ROS and SMOTE sampling techniques performed best in terms of F2 score. The best supervised learning models are RF and XGBoost. The tree models were generally well suited to the imbalanced dataset, whilst LR performed the worst, even when applying regularisation. Increasing the prevalence rate surprisingly yielded a decrease in performance. The findings from the experiments can serve as a foundation for selecting the best sampling technique and supervised learning models to utilise with various degrees of dataset imbalance.
- Item. Open Access. Monitoring and mapping the critically endangered Clanwilliam cedar using aerial imagery and deep learning (2021). Hadebe, Blessings; Britz, Stefan; Moncrieff, Glenn.
  The critically endangered Clanwilliam cedar, Widdringtonia wallichii, is an iconic tree species endemic to the Cederberg mountains in the Fynbos Biome. Consistent declines in its populations have been noted across its range, primarily due to the impact of fire and climate change. Mapping the occurrences of this species over its range is key to the monitoring of surviving individuals and is important for the management of biodiversity in the region. Recent efforts have focused on the use of freely available Google Earth™ imagery to manually map the species across its global native distribution. This study advances this work by proposing an approach for automating the process of tree detection using deep learning. The approach involves using sets of high-resolution red, green, blue (RGB) imagery to train artificial neural networks for the task of tree-crown detection. Additional models are trained on colour-infrared imagery, since live vegetation has a red tone on the near-infrared (NIR) spectrum. Preliminary results show that using an intersection-over-union threshold of 0.5 yields an average tree-crown recall of 0.67 with a precision of 0.53, and that the addition of the NIR spectral band does not result in improved performance. The viability of using this approach to regularly update maps of the Clanwilliam cedar and monitor its population trends in the Cederberg is assessed.
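
Several of the abstracts above report results in terms of a Brier score (the T20 cricket entry), an F2 score (the credit card fraud entry) and an intersection-over-union threshold (the Clanwilliam cedar entry). As a reading aid, the sketch below shows the standard textbook definitions of these three quantities. It is not taken from any of the dissertations: the function names, the box layout `(x_min, y_min, x_max, y_max)` and the handling of edge cases are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max).
Box = Tuple[float, float, float, float]


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=2 gives the F2 score, which weights recall more heavily."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention chosen here to avoid division by zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0.0 else 0.0


if __name__ == "__main__":
    # A detection is typically counted as correct when IoU >= the chosen threshold.
    predicted, truth = (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)
    print(iou(predicted, truth) >= 0.5)
    print(round(f_beta(0.53, 0.67), 2))
```

For instance, `f_beta(0.53, 0.67)` combines the precision and recall quoted in the Clanwilliam cedar abstract into an F1 value of roughly 0.59. That figure is derived here purely as an illustration and is not reported in the abstract itself.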