Browsing by Subject "Data Science"
Now showing 1 - 12 of 12
- Item (Open Access): A Machine Learning Approach to Predicting the Employability of a Graduate (2019). Modibane, Masego; Georg, Co-Pierre. For many credit-offering institutions, such as banks and retailers, credit scores play an important role in the decision-making process for credit applications. It is difficult to source the traditional information required to calculate these scores for applicants who do not have a credit history, such as recently graduated students, so alternative credit scoring models are sought to generate a score for these applicants. The aim of the dissertation is to build a machine learning classification model that can predict a student's likelihood of becoming employed, based on their student data (for example, their GPA and degree(s) held). The resulting model could serve as an input that these institutions use when deciding whether to approve a credit application from a recently graduated student.
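As a rough illustration of the kind of classifier this abstract describes, the sketch below fits a logistic regression on synthetic student records; the feature names (GPA, number of degrees) echo the examples above, but the data and the model choice are assumptions, not the dissertation's.

```python
# A minimal sketch: logistic regression predicting employment from student
# features. Feature names and data are hypothetical, not from the dissertation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
gpa = rng.uniform(2.0, 4.0, n)           # hypothetical GPA on a 4-point scale
degrees = rng.integers(1, 3, n)          # number of degrees held
# Synthetic label: employment probability rises with GPA and degrees.
p = 1 / (1 + np.exp(-(1.5 * (gpa - 3.0) + 0.5 * (degrees - 1))))
employed = rng.binomial(1, p)

X = np.column_stack([gpa, degrees])
X_tr, X_te, y_tr, y_te = train_test_split(X, employed, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)

# The predicted employment probability could feed into a credit decision.
print("held-out accuracy:", clf.score(X_te, y_te))
print("P(employed | GPA=3.8, 2 degrees):", clf.predict_proba([[3.8, 2]])[0, 1])
```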
- Item (Open Access): A temporal prognostic model based on dynamic Bayesian networks: mining medical insurance data (2021). Mbaka, Sarah Kerubo; Ngwenya, Mzabalazo. A prognostic model is a formal combination of multiple predictors from which the risk probability of a specific diagnosis can be modelled for patients. Prognostic models have become essential instruments in medicine. The models are used to guide doctors towards a diagnosis, to support patient-specific decisions, and to help plan the utilization of resources for patient groups who have similar prognostic paths. Dynamic Bayesian networks (DBNs) theoretically provide a very expressive and flexible model for solving temporal problems in medicine. However, this involves various challenges due both to the nature of the clinical domain and to the nature of the DBN modelling and inference process itself. The challenges from the clinical domain include insufficient knowledge of the temporal interactions of processes in the medical literature, the sparse nature and variability of medical data collection, and the difficulty of preparing and abstracting clinical data in a suitable format without losing valuable information in the process. Challenges with the DBN methodology and implementation include the lack of tools that allow easy modelling of temporal processes. Overcoming this challenge will help to solve various clinical temporal reasoning problems. In this thesis, we addressed these challenges while building a temporal network, with explanations of the effects of predisposing factors such as age and gender, and the progression information of all diagnoses, using claims data from an insurance company in Kenya. We showed that our network could differentiate the probability of exposure to a diagnosis given age and gender, and possible paths given a patient's history. We also presented evidence that the more patient history is provided, the better the prediction of future diagnoses.
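To make the temporal-reasoning idea concrete, here is a toy forward pass through a two-slice DBN: a diagnosis distribution, conditioned on an age/gender group, is propagated one time step at a time through a transition matrix. Every state, prior and probability below is invented for illustration and is not drawn from the Kenyan claims data.

```python
# Toy dynamic Bayesian network forward pass: propagate a patient's diagnosis
# distribution through time. All probabilities are invented for illustration.
import numpy as np

states = ["healthy", "hypertension", "diabetes"]

# P(state at t+1 | state at t): a single two-slice transition model.
transition = np.array([
    [0.90, 0.07, 0.03],   # from healthy
    [0.05, 0.85, 0.10],   # from hypertension
    [0.02, 0.08, 0.90],   # from diabetes
])

# Hypothetical priors over states for two age/gender groups.
prior = {"female_40s": np.array([0.80, 0.15, 0.05]),
         "male_60s":   np.array([0.60, 0.25, 0.15])}

belief = prior["male_60s"]
for t in range(1, 4):
    belief = belief @ transition   # forward propagation, one time slice
    print(f"t={t}:", dict(zip(states, belief.round(3))))
```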
- Item (Open Access): An exploration of media repertoires in South Africa: 2002-2014 (2019). Bakker, Hans-Peter; Durbach, Ian. This dissertation explores trends in media engagement in South Africa over a period from 2002 until 2014. It utilises data from the South African Audience Research Foundation's All Media and Products Surveys. Using factor analysis, six media repertoires are identified and, utilising structural equation modelling, marginal means for various demographic categories by year are estimated. Measurement error is determined with the aid of bootstrapping. These estimates are plotted to provide visual aids in interpreting model parameters. The findings show general declines in engagement with traditional media and growth in internet engagement, but these trends can vary markedly for different demographic groups. The findings also show that for many South Africans traditional media such as television remain dominant.
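A minimal sketch of the factor-analysis step this abstract describes, run on synthetic engagement data; the six-repertoire solution comes from the AMPS surveys, while the variables and loadings below are made up (two factors only, to keep the example small).

```python
# Sketch: extract media-repertoire factors from synthetic engagement data.
# Variable names and data are made up; the dissertation used AMPS survey data.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 1000
# Two latent repertoires ("print", "digital") generating six observed variables.
latent = rng.normal(size=(n, 2))
loadings = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],   # print media items
                     [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])  # digital media items
X = latent @ loadings.T + 0.3 * rng.normal(size=(n, 6))

fa = FactorAnalysis(n_components=2, random_state=0).fit(X)
print(fa.components_.round(2))  # estimated loadings recover the two repertoires
```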
- Item (Open Access): Anomaly detection in laboratory tests subject to gatekeeping in selected health facilities (2023). Nantongo, Ssozi Margaret Eva; Er, Sebnem; Silal, Sheetal. The cost of healthcare is currently a huge burden to governments and healthcare organisations across the world. In South Africa, laboratory tests administered by government facilities are delivered by the National Health Laboratory Service (NHLS) regardless of payment, and hence there is a possibility that certain tests are over-ordered by doctors at government health facilities. Gatekeeping is a demand management tool utilised by facilities across the world to manage the costs of laboratory testing. Electronic gatekeeping addresses inappropriate laboratory test ordering to reduce the over-ordering or under-ordering of tests. In South Africa, the electronic gatekeeping (eGK) system is a standardised set of rules that was developed by the National Department of Health (NDOH) as well as NHLS pathologists and clinicians from the individual provinces (NHLS, 2017; Pema et al., 2018; Smit et al., 2015). The eGK system restricts test ordering by applying a given set of rules to tests ordered by a medical official for each patient. The protocols followed by the eGK system are defined using criteria such as the date or result of a previous test and the location/ward of the patient. This project aims to identify facilities and wards that incur high numbers of violations of the eGK rules. Anomaly detection methods are utilised to identify these facilities and wards, together with the tests that require intervention to address the high violations. Three methods were utilised for anomaly detection: K-means clustering, isolation forests and one-class Support Vector Machines (SVMs). The recommended wards for intervention were mostly the maternity-related wards at major hospitals. Within these wards, there was evidence of ordering tests that violated the eGK rules more than in other wards. Other wards with evidence of over-ordering and violation of eGK rules included the ARV clinic ward, cardiac wards, high care units and respiratory ICU wards. The tests that were selected for intervention in these wards included calcium, magnesium, inorganic phosphate, total protein, albumin, bilirubin, C-reactive protein, procalcitonin and rubella PCR tests. The facilities selected for intervention included major hospitals, for example Nelson Mandela Academic Hospital, Port Elizabeth Provincial Hospital, Livingstone Hospital and Dora Nginza Hospital. In addition, district hospitals and specialised TB hospitals were selected amongst those recommended for intervention. The tests selected for intervention in these facilities included calcium, magnesium, inorganic phosphate, total protein, albumin, C-reactive protein, procalcitonin, Hepatitis B DNA and CA 15-3. The results of the analysis were compared with results from published literature, and it was found that some of the tests recommended for intervention were also highlighted by previous researchers, for example C-reactive protein tests. A comparison of the results from the K-means clustering, one-class SVM and isolation forest anomaly detection showed that the same wards, facilities and tests were recommended for intervention. Therefore, anomaly detection is a suitable method for identifying wards and facilities that violate test ordering rules more than others.
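The sketch below applies the three anomaly-detection methods named above to synthetic per-ward features (a violation rate and a test-ordering rate); the feature definitions and flagging thresholds are assumptions for illustration, not the project's.

```python
# Sketch: flag wards with anomalous eGK-violation profiles using the three
# methods named above. Features and data are synthetic stand-ins.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(2)
# Per-ward features: [violation rate, tests ordered per patient].
normal = rng.normal([0.05, 3.0], [0.02, 0.5], size=(95, 2))
outliers = rng.normal([0.30, 8.0], [0.05, 1.0], size=(5, 2))
X = StandardScaler().fit_transform(np.vstack([normal, outliers]))

iso = IsolationForest(contamination=0.05, random_state=0).fit_predict(X)
svm = OneClassSVM(nu=0.05).fit_predict(X)
# For K-means, treat points far from their cluster centre as anomalous.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
km_flag = dist > np.quantile(dist, 0.95)

print("isolation forest flags:", np.where(iso == -1)[0])
print("one-class SVM flags:   ", np.where(svm == -1)[0])
print("k-means distance flags:", np.where(km_flag)[0])
```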
- Item (Open Access): Collaborative Genre Tagging (2020). Leslie, James; Lacerda, Miguel. Recommender systems (RSs) are used extensively in online retail and on media streaming platforms to help users filter the plethora of options at their disposal. Their goal is to provide users with suggestions of products or artworks that they might like. Content-based RSs make use of user and/or item metadata to predict user preferences, while collaborative filtering (CF) has proven to be an effective approach in tasks such as predicting the movie or music preferences of users in the absence of any metadata. Latent factor models have been used to achieve state-of-the-art accuracy in many CF settings, playing an especially large role in beating the benchmark set in the Netflix Prize in 2008. These models learn latent features for users and items to predict the preferences of users. The first latent factor models made use of matrix factorisation to learn latent factors, but more recent approaches have made use of neural architectures with embedding layers. This master's dissertation outlines collaborative genre tagging (CGT), a transfer learning application of CF that makes use of latent factors to predict the genres of movies, using only explicit user ratings as model inputs.
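As a compact illustration of a latent factor model, the sketch below fits a matrix factorisation to a toy explicit-ratings table by stochastic gradient descent; the dissertation's CGT model instead uses neural embedding layers, so this is the older flavour of the same idea.

```python
# Sketch: learn user and item latent factors from explicit ratings via SGD.
# Toy data; the CGT model described above uses neural embedding layers.
import numpy as np

rng = np.random.default_rng(3)
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 4.0), (2, 2, 5.0)]   # (user, item, rating)
n_users, n_items, k = 3, 3, 2
P = 0.1 * rng.normal(size=(n_users, k))   # user latent factors
Q = 0.1 * rng.normal(size=(n_items, k))   # item latent factors

lr, reg = 0.05, 0.01
for _ in range(500):
    for u, i, r in ratings:
        err = r - P[u] @ Q[i]
        pu = P[u].copy()                  # use the pre-update user factors
        P[u] += lr * (err * Q[i] - reg * P[u])
        Q[i] += lr * (err * pu - reg * Q[i])

# Predict an unobserved preference: user 0 on item 2.
print("predicted rating:", round(float(P[0] @ Q[2]), 2))
```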
- Item (Open Access): Designing an event display for the Transition Radiation Detector in ALICE (2021). Perumal, Sameshan; Dietel, Thomas; Kuttel, Michelle. We document here a successful design study for an event display focused on the Transition Radiation Detector (TRD) within A Large Ion Collider Experiment (ALICE) at the European Organisation for Nuclear Research (CERN). Reviews of the fields of particle physics and visualisation are presented to motivate formally designing this display for two different audiences. We formulate a methodology, based on successful design studies in similar fields, that involves experimental physicists in the design process as domain experts. An iterative approach incorporating in-person interviews is used to define a series of visual components applying best practices from literature. Interactive event display prototypes are evaluated with potential users, and refined using elicited feedback. The primary artefact is a portable, functional, effective, validated event display – a series of case studies evaluate its use by both scientists and the general public. We further document use cases for, and hindrances preventing, the adoption of event displays, and propose novel data visualisations of experimental particle physics data. We also define a flexible intermediate JSON data format suitable for web-based displays, and a generic task to convert historical data to this format. This collection of artefacts can guide the design of future event displays. Our work makes the case for a greater use of high quality data visualisation in particle physics, across a broad spectrum of possible users, and provides a framework for the ongoing development of web-based event displays of TRD data.
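The thesis defines its own intermediate JSON format, which is not reproduced here; the snippet below only sketches what a web-friendly event record might look like under an assumed schema, with every field name and value invented for illustration.

```python
# Hypothetical event record in an intermediate JSON format for a web display.
# The field names below are assumptions, not the schema defined in the thesis.
import json

event = {
    "runNumber": 244918,          # illustrative identifiers only
    "eventNumber": 1073,
    "tracks": [
        {"pt": 2.31, "eta": -0.42, "phi": 1.87,
         "trdTracklets": [{"layer": 0, "localY": 12.4, "dyDx": 0.08}]},
    ],
}
print(json.dumps(event, indent=2))  # portable, renderable by a web front end
```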
- Item (Open Access): Forecasting and modelling the VIX using Neural Networks (2022). Netshivhambe, Nomonde; Huang, Chun-Sung. This study investigates the volatility forecasting ability of neural network models. In particular, we focus on the performance of Multi-Layer Perceptron (MLP) and Long Short-Term Memory (LSTM) neural networks in predicting the CBOE Volatility Index (VIX). The inputs into these models include the VIX, GARCH(1,1) fitted values and various financial and macroeconomic explanatory variables, such as S&P 500 returns and the oil price. In addition, this study segments the data into two sub-periods, namely a Calm and a Crisis Period in the financial market. The segmentation of the periods caters for changes in the predictive power of the aforementioned models under different market conditions. When forecasting the VIX, we show that the best forecasting performance is achieved in the Calm Period. In addition, we show that the MLP has more predictive power than the LSTM.
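A small sketch of one-step-ahead forecasting with an MLP on lagged values of a simulated mean-reverting series; the study's actual inputs (VIX levels, GARCH(1,1) fitted values, S&P 500 returns, oil prices) and its LSTM comparison are not reproduced here.

```python
# Sketch: one-step-ahead volatility forecasting with an MLP on lagged values.
# The series is simulated; the study's real explanatory inputs are omitted.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
n = 600
vol = np.zeros(n)
vol[0] = 20.0
for t in range(1, n):                        # mean-reverting toy "VIX"
    vol[t] = 20 + 0.9 * (vol[t - 1] - 20) + rng.normal(0, 1.5)

lags = 5
X = np.column_stack([vol[i:n - lags + i] for i in range(lags)])
y = vol[lags:]
split = int(0.8 * len(y))

mlp = MLPRegressor(hidden_layer_sizes=(32, 16), max_iter=2000,
                   random_state=0).fit(X[:split], y[:split])
pred = mlp.predict(X[split:])
rmse = np.sqrt(np.mean((pred - y[split:]) ** 2))
print(f"out-of-sample RMSE: {rmse:.2f}")
```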
- Item (Open Access): Log mining to develop a diagnostic and prognostic framework for the MeerLICHT telescope (2022). Roelf, Timothy Brian; Groot, Paul Joseph; Rakotonirainy, Rosephine Georgina. In this work we present the approach taken to address the problems of anomalous fault detection and system delays experienced by the MeerLICHT telescope. We make use of the abundantly available console logs, which record all aspects of the telescope's function, to obtain information. The MeerLICHT operational team must devote time to manually inspecting the logs during system downtime to discover faults. This task is laborious, time-inefficient given the large size of the logs, and does not suit the time-sensitive nature of many of the surveys the telescope takes part in. We used the novel approach of a hidden Markov model to address the problems of fault detection and system delays experienced by the MeerLICHT. We were able to train the model in three separate ways, showing some success at fault detection and none at addressing the system delays.
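To illustrate the hidden Markov model approach on a log stream, the toy below runs the forward algorithm over discretised log events with two hidden states, normal and faulty; the states, symbols and all probabilities are invented and are not the trained MeerLICHT model.

```python
# Toy HMM over console-log events: forward algorithm tracking P(faulty | logs).
# States, symbols and probabilities are invented for illustration.
import numpy as np

symbols = {"INFO": 0, "WARN": 1, "ERROR": 2}

start = np.array([0.95, 0.05])          # P(normal), P(faulty) at the start
trans = np.array([[0.98, 0.02],         # normal -> normal / faulty
                  [0.10, 0.90]])        # faulty -> normal / faulty
emit = np.array([[0.90, 0.09, 0.01],    # emissions from the normal state
                 [0.30, 0.40, 0.30]])   # emissions from the faulty state

log_stream = ["INFO", "INFO", "WARN", "ERROR", "ERROR"]

alpha = start * emit[:, symbols[log_stream[0]]]
alpha /= alpha.sum()
for obs in log_stream[1:]:
    alpha = (alpha @ trans) * emit[:, symbols[obs]]
    alpha /= alpha.sum()                # normalise to avoid underflow
    print(f"after {obs:5s}: P(faulty) = {alpha[1]:.3f}")
```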
- Item (Open Access): Natural Language Processing on Data Warehouses (2019). Maree, Stiaan; Durbach, Ian. The main problem addressed in this research was to use natural language to query data in a data warehouse. To this effect, two natural language processing models were developed and compared on a classic star-schema sales data warehouse with sales facts and date, location and item dimensions. Utterances are queries that people make in natural language, for example, "What is the sales value for mountain bikes in Georgia for 1 July 2005?" The first model, the heuristics model, implemented an algorithm that steps through the sequence of utterance words and matches the longest number of consecutive words at the highest grain of the hierarchy. In contrast, the embedding model implemented the word2vec algorithm to create different kinds of vectors from the data warehouse. These vectors are aggregated, and the cosine similarity between vectors is then used to identify concepts in the utterances that can be converted to a programming language. To understand question style, a survey was set up, which then helped shape the random utterances created for the evaluation of both methods. The first key insight, and the main premise for the embedding model to work, is a three-step process of creating three types of vectors. The first step is to train vectors (word vectors) for each individual word in the data warehouse; this is called word embeddings. For instance, the word 'bike' will have a vector. The next step is to average the word vectors for each unique column value (column vectors) in the data warehouse, thus leaving an entry like 'mountain bike' with one vector which is the average of the vectors for 'mountain' and 'bike'. Lastly, the utterance by the user is averaged (utterance vectors) using the word vectors created in step one, and then, using cosine similarity, the utterance vector is matched to the closest column vectors in order to identify data warehouse concepts in the utterance. The second key insight was to train word vectors first for location, then separately for item - in other words, per dimension (one set for location, and one set for item). Removing stop words was the third key insight, and the last key insight was to use Global Vectors (GloVe) to initialise the training of the word vectors. The results of the evaluation of the models indicated that the embedding model was ten times faster than the heuristics model. In terms of accuracy, the embedding model (95.6% accurate) also outperformed the heuristics model (70.1% accurate). The practical application of the research is that these models can be used as a component in a chatbot on data warehouses. Combined with a Structured Query Language (SQL) query generation component, and with Application Programming Interfaces built on top, this facilitates the quick and easy distribution of data; no knowledge of a programming language such as SQL is needed to query the data.
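The three-step process described above translates almost directly into code. In the sketch below the word vectors are hand-made stand-ins for trained word2vec vectors, but the column-vector averaging and cosine matching follow the steps as described.

```python
# Sketch of the three-step pipeline: word vectors -> column vectors (averages
# over column values) -> utterance vector matched by cosine similarity.
# Word vectors here are hand-made stand-ins for trained word2vec vectors.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Step 1: word vectors, one per word in the data warehouse.
word_vec = {
    "mountain": np.array([0.9, 0.1, 0.0]),
    "bike":     np.array([0.8, 0.3, 0.1]),
    "georgia":  np.array([0.0, 0.2, 0.9]),
}

# Step 2: column vectors, the average over each unique column value's words.
column_vec = {
    ("item", "mountain bike"): (word_vec["mountain"] + word_vec["bike"]) / 2,
    ("location", "georgia"):   word_vec["georgia"],
}

# Step 3: average the utterance's known words (stop words drop out naturally
# here, having no vector), then rank concepts by cosine similarity.
utterance = "what is the sales value for mountain bike in georgia".split()
known = [word_vec[w] for w in utterance if w in word_vec]
u = np.mean(known, axis=0)

for concept, vec in column_vec.items():
    print(concept, round(float(cosine(u, vec)), 3))
```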
- Item (Open Access): Small-scale distributed machine learning in R (2022). Taylor, Brenden; Britz, Stefan S; Pienaar, Etienne. Machine learning is increasing in popularity, both in applied and theoretical statistical fields. Machine learning models generally require large amounts of data to train and are thus computationally expensive, both in the absolute sense of actual compute time, and in the relative sense of the numerical complexity of the underlying calculations. Particularly for students of machine learning, appropriate computing power can be difficult to come by. Distributed machine learning, which involves sending tasks to a network of attached computers, can offer users access to significantly more computing power than otherwise by leveraging more processors than are in a single computer. This research outlines the core concepts of distributed computing and provides brief outlines of the more common approaches to parallel and distributed computing in R, with reference to the specific algorithms and aspects of machine learning that are investigated. One particular parallel backend, doRedis, offers particular advantages as it is easy to set up and implement, and allows for the elastic attaching and detaching of computers from a distributed network. This paper describes the core features of the doRedis package and shows, by applying certain aspects of the machine learning process, that it is both viable and beneficial to distribute these aspects. There is the potential for significant time savings when distributing machine learning model training. Particularly for students, the time required to set up a distributed network in which to use doRedis is far outweighed by the benefits. The implication that this research aims to explore is that students will be able to leverage the many computers often available in computer labs to train more complex machine learning models in less time than they otherwise could when using the built-in parallel packages that are already common in R. In fact, certain machine learning packages that already parallelise model training can be distributed to a network of computers, thereby further increasing the gains realised by parallelisation. In this way, more complex machine learning is made more accessible. This research outlines the benefits that lie in the distribution of machine learning problems in an accessible, small-scale environment. This small-scale 'proof of concept' performs well enough to be viable for students, while also creating a bridge, and introducing the knowledge required, to deploy large-scale distribution of machine learning problems.
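doRedis is an R package, so a faithful example would be written in R; as a language-consistent stand-in, the Python sketch below shows the same pattern of farming independent model-training tasks out to a pool of workers. It illustrates the concept only, not the doRedis API.

```python
# doRedis is an R package; this Python multiprocessing sketch is only a
# stand-in for the same idea: farm independent training tasks out to workers.
import multiprocessing as mp

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

def evaluate(n_trees):
    """One independent task: cross-validate a forest of a given size."""
    model = RandomForestClassifier(n_estimators=n_trees, random_state=0)
    return n_trees, cross_val_score(model, X, y, cv=3).mean()

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:          # four local "workers"
        for n_trees, score in pool.map(evaluate, [50, 100, 200, 400]):
            print(f"{n_trees:4d} trees: CV accuracy {score:.3f}")
```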
- Item (Open Access): Unsupervised Machine Learning Application for the Identification of Kimberlite Ore Facies using Convolutional Neural Networks and Deep Embedded Clustering (2021). Langton, Sean; Er, Sebnem. Mining is a key economic contributor to many regions globally, especially those in developing nations. The design and operation of the processing plants associated with each of these mines is highly dependent on the composition of the feed material. The aim of this research is to demonstrate the viability of implementing a computer vision solution to provide online information about the composition of material entering the plant, thus allowing the plant operators to adjust equipment settings and process parameters accordingly. Data is collected in the form of high-resolution images, captured every few seconds, of material on the main feed conveyor belt into the Kao Diamond Mine processing plant. The modelling phase of the research is implemented in two stages. The first stage involves the implementation of a Mask Region-based Convolutional Neural Network (Mask R-CNN) model with a ResNet-101 CNN backbone for instance segmentation of individual rocks in each image. These individual rock images are extracted and used for the second phase of the modelling pipeline - utilizing an unsupervised clustering method known as Convolutional Deep Embedded Clustering with Data Augmentation (ConvDEC-DA). The clustering phase of this research provides a method to group feed material rocks into their respective types or facies using features developed from the auto-encoder portion of the ConvDEC-DA modelling. While this research focuses on the clustering of kimberlite rocks according to their respective facies, similar implementations are possible for a wide range of mining and rock types.
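A small numpy sketch of the deep-embedded-clustering step that follows feature extraction: soft cluster assignments from a Student's t kernel over embedding vectors, plus the sharpened target distribution DEC uses to refine them. The embeddings below are synthetic stand-ins for the autoencoder features of individual rock images.

```python
# Sketch of the DEC soft-assignment step used after feature extraction.
# Feature vectors stand in for autoencoder embeddings of rock images.
import numpy as np

rng = np.random.default_rng(5)
z = np.vstack([rng.normal(0, 0.5, (20, 8)),      # embeddings near centroid 0
               rng.normal(3, 0.5, (20, 8))])     # embeddings near centroid 1
mu = np.array([np.zeros(8), np.full(8, 3.0)])    # cluster centroids

# Student's t kernel: q[i, j] = soft assignment of point i to cluster j.
d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)
q = 1.0 / (1.0 + d2)
q /= q.sum(axis=1, keepdims=True)

# DEC target distribution: sharpen q and re-normalise by cluster frequency.
f = q.sum(axis=0)
p = (q ** 2) / f
p /= p.sum(axis=1, keepdims=True)

print("mean confidence of hard assignments:", q.max(axis=1).mean().round(3))
```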
- Item (Open Access): Using state-space time series analysis on wetland bird species to formulate effective bioindicators in the Barberspan wetland (2022). Edwards, Gareth; Altwegg, Andreas; Erni, Birgit. The Coordinated Waterbird Count (CWAC) dataset contains waterbird counts from wetlands across South Africa, going as far back as 1970. These data contain valuable information on population sizes and their trends over time. This information could be used more widely if it were more easily accessible to users. The aim of this dissertation is to bridge the gap between the CWAC dataset and the end users (both experts and non-experts). In so doing, the report also provides valuable insight into the state of wetlands in South Africa using various biodiversity indices, starting with the Barberspan wetland as the pilot study site. A state-space time series model was applied to the waterbird counts in the CWAC dataset to determine waterbird population trends over the years. State-space models are able to separate observation error from true population process error, thus providing a more accurate estimation of true population size. This makes state-space models an ideal tool for modelling population dynamics. The state-space model produced estimates of the true population size for each waterbird species per year. Three different indices were applied to the estimates, namely exponentiated Shannon's index, Simpson's index and a modified Living Planet Index. These indices aggregate the count data into a measure of the effective number of waterbird species in an ecosystem, a measure of the evenness of an ecosystem, and an abundance index respectively. Using these three indices in conjunction with each other, and individual waterbird species as bioindicators for various wetland traits, the end user is presented with a broad overview of the state of the Barberspan wetland. The implications of this research are beneficial to various wetland conservation organisations globally (AEWA, Aichi, RAMSAR) and locally (Working for Wetlands), as it provides valuable insight into the state of the wetlands of South Africa. Furthermore, it helps managers at a local level in their decision-making, enabling more evidence-based approaches to protect South African wetlands and their waterbirds.
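Two of the three indices named above are simple functions of the estimated species proportions, as the sketch below shows on hypothetical counts; the modified Living Planet Index depends on the dissertation's specific modifications and is omitted here.

```python
# Sketch: compute two of the diversity indices named above from estimated
# counts. The counts are hypothetical; the dissertation used state-space
# estimates of true population size per species per year.
import numpy as np

counts = np.array([120, 80, 40, 10, 5])      # estimated birds per species
p = counts / counts.sum()

hill = np.exp(-(p * np.log(p)).sum())        # exponentiated Shannon's index:
                                             # effective number of species
simpson = 1.0 - (p ** 2).sum()               # Simpson's index (one common,
                                             # evenness-oriented form)

print(f"effective number of species: {hill:.2f}")
print(f"Simpson's index:             {simpson:.3f}")
```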