OpenUCT :: Browsing by Subject "Statistics"

Now showing 1 - 14 of 14

Open Access
A GLMM analysis of data from the Sinovuyo Caring Families Program (SCFP)
(2018) Nhapi, Raymond T; Little, Francesca; Kassanjee, R
We present an analysis of the data from a longitudinal randomized controlled trial that assesses the impact of an intervention program aimed at improving the quality of childcare within families. The SCFP was a group-based program implemented over two separate waves conducted in Khayelitsha and Nyanga. The data were collected at baseline, post-test and at one-year follow-up via questionnaires (self-assessment) and observational video coding. Multiple imputation (using chained equations) procedures were used to impute missing information. Generalized linear mixed-effects models (GLMMs) were used to assess the impact of the intervention program on the responses, adjusted for possible confounding variables. The responses were summed scores that were often right skewed with zero-inflation. All the effects (fixed and random) were estimated through the method of maximum likelihood. Primarily, an intention-to-treat analysis was done, after which a per-protocol analysis was also implemented with participants who attended a specified number of the group sessions. All these GLMMs were implemented in the imputation framework.
Open Access
Anomaly detection in a mobile data network
(2019) Salzwedel, Jason Paul; Ngwenya, Mzabalazo
The dissertation investigated the creation of an anomaly detection approach to identify anomalies in the SGW elements of an LTE network. Unsupervised techniques were compared and used to identify and remove anomalies in the training data set. This “cleaned” data set was then used to train an autoencoder in a semi-supervised approach. The resultant autoencoder was able to identify normal observations. A subsequent data set was then analysed by the autoencoder. The resultant reconstruction errors were then compared to the ground truth events to investigate the effectiveness of the autoencoder’s anomaly detection capability.
Open Access
Classification and visualisation of text documents using networks
(2018) Phaweni, Thembani; Durbach, Ian; Varughese, Melvin; Bassett, Bruce
In both text classification and text visualisation, graph/network theoretic methods can be applied effectively. For text classification we assessed the effectiveness of graph/network summary statistics to develop weighting schemes and features to improve test accuracy. For text visualisation we developed a framework using established visual cues from the graph visualisation literature to communicate information intuitively. The final output of the visualisation component of the dissertation was a tool that would allow members of the public to produce a visualisation from a text document. We represented a text document as a graph/network. The words were nodes and the edges were created when a pair of words appeared within a pre-specified distance (window) of words from each other. The text document model is a matrix representation of a document collection such that it can be integrated into a machine or statistical learning algorithm. The entries of this matrix can be weighted according to various schemes. We used the graph/network representation of a text document to create features and weighting schemes that could be applied to the text document model. This approach was not well developed for text classification; therefore, we applied different edge weighting methods, window sizes, weighting schemes and features. We also applied three machine learning algorithms: naïve Bayes, neural networks and support vector machines. We compared our various graph/network approaches to the traditional document model with term frequency inverse-document-frequency. We were interested in establishing whether or not the use of graph weighting schemes and graph features could increase test accuracy for text classification tasks. As far as we can tell from the literature, this is the first attempt to use graph features to weight bag-of-words features for text classification. These methods had been applied to information retrieval (Blanco & Lioma, 2012).
It seemed they could also be applied to text classification. The text visualisation field seemed divorced from the text summarisation and information retrieval fields, in that text co-occurrence relationships were not treated with equal importance. Developments in the graph/network visualisation literature could be taken advantage of for the purposes of text visualisation. We created a framework for text visualisation using the graph/network representation of a text document. We used force directed algorithms to visualise the document. We used established visual cues like colour, size and proximity in space to convey information through the visualisation. We also applied clustering and part-of-speech tagging to allow for filtering and isolating of specific information within the visualised document. We demonstrated this framework with four example texts. We found that total degree, a graph weighting scheme, outperformed term frequency on average. The effect of graph features depended heavily on the machine learning method used: for the problems we considered, graph features increased accuracy for SVM classifiers, had little effect for neural networks and decreased accuracy for naïve Bayes classifiers. Therefore the impact on test accuracy of adding graph features to the document model is dependent on the machine learning algorithm used. The visualisation of text graphs is able to convey meaningful information regarding the text at a glance through established visual cues. Related words are close together in visual space and often connected by thick edges. Large nodes often represent important words. Modularity clustering is able to extract thematically consistent clusters from text graphs. This allows for the clusters to be isolated and investigated individually to understand specific themes within a document.
The use of part-of-speech tagging is effective both in reducing the number of words being displayed and in increasing the relevance of the words being displayed. This was made clear through the use of part-of-speech tags applied to the Internal Resistance of Apartheid Wikipedia webpage. The webpage was reduced to its proper nouns, which contained much of the important information in the text. Training accuracy is important in text classification, which is a task that can often be performed on vast numbers of documents. Much of the research in text classification is aimed at increasing classification accuracy either through feature engineering or through optimising machine learning methods. The finding that total degree outperformed term frequency on average provides an alternative avenue for achieving higher test accuracy. The finding that the addition of graph features can increase test accuracy when matched with the right machine learning algorithm suggests some new research should be conducted regarding the role that graph features can have in text classification. Text visualisation is used as an exploratory tool and as a means of quickly and easily conveying text information. The framework we developed is able to create automated text visualisations that intuitively convey information for short and long text documents. This can greatly reduce the amount of time it takes to assess the content of a document, which can increase general access to information.
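The graph representation described in this abstract (words as nodes, an edge whenever two words fall within a fixed window of each other) can be illustrated with a minimal Python sketch. The function names, the undirected edges and the use of plain node degree as the "total degree" weight are assumptions made for illustration; the abstract does not give the dissertation's exact definitions.

```python
from collections import defaultdict


def cooccurrence_graph(tokens, window=2):
    """Build an undirected word graph from a token list.

    Two distinct words are linked when they occur within `window`
    positions of each other; the edge value counts how often that happens.
    """
    edges = defaultdict(int)
    for i, word in enumerate(tokens):
        # Look ahead at most `window` tokens so each pair is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                edges[frozenset((word, other))] += 1
    return dict(edges)


def total_degree(edges):
    """Number of distinct neighbours per word (assumed stand-in for total degree)."""
    degree = defaultdict(int)
    for pair in edges:
        for word in pair:
            degree[word] += 1
    return dict(degree)


if __name__ == "__main__":
    tokens = "the cat sat on the mat".split()
    graph = cooccurrence_graph(tokens, window=2)
    print(total_degree(graph))
```

In a weighting scheme of this kind, the degree of a word would take the place of its term frequency in the document matrix before the matrix is passed to a classifier.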
Open Access
Investigating efficiency in the emergency department at Groote Schuur Hospital
(2010) Mowbray, Allister; Stewart, Theodor J
Includes bibliographical references (p. 92-93).
Open Access
Low volatility alternative equity indices
(2015) Oladele, Oluwatosin Seun; Bradfield, David
In recent years, there has been an increasing interest in constructing low volatility portfolios. These portfolios have shown significant outperformance when compared with the market capitalization-weighted portfolios. This study analyses the low volatility portfolios in South Africa using sectors instead of individual stocks as building blocks for portfolio construction. The empirical results from back-testing these portfolios show significant outperformance when compared with their market capitalization-weighted equity benchmark counterpart (ALSI). In addition, a further analysis in this study delves into the construction of the low volatility portfolios using the Top 40 and Top 100 stocks. The results also show significant outperformance over the market capitalization-weighted portfolio (ALSI), with the portfolios constructed using the Top 100 stocks performing better than those constructed using the Top 40 stocks. Finally, the low volatility portfolios are also blended with typical portfolios (ALSI and the SWIX indices) in order to establish their usefulness as effective portfolio strategies. The results show that the Low volatility Single Index Model (SIM) and the Equally Weight low-beta portfolio (Lowbeta) were the superior performers based on their Sharpe ratios.
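For readers unfamiliar with the ranking criterion mentioned at the end of this abstract, the Sharpe ratio is the mean excess return divided by the standard deviation of excess returns. The sketch below is a generic per-period version; the function name, the constant risk-free rate and the sample returns are illustrative assumptions, not figures or methods taken from the study.

```python
import statistics


def sharpe_ratio(returns, risk_free=0.0):
    """Per-period Sharpe ratio: mean excess return over its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.stdev(excess)


if __name__ == "__main__":
    # Hypothetical monthly returns, for illustration only.
    print(sharpe_ratio([0.01, 0.03, 0.02]))
```

The value is not annualised here; comparing portfolios on this measure only requires that all of them use the same return frequency and risk-free rate.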
Open Access
Machine learning methods for individual acoustic recognition in a species of field cricket
(2018) Dlamini, Gciniwe; Durbach, Ian
Crickets, like other insects, play a vital role in maintaining a balance in the ecosystem. Therefore, the ability to identify individual crickets is crucial as it enables ecologists to estimate important population metrics such as population densities, which in turn are used to investigate ecological questions pertaining to these insects. In this research, classification models were developed to recognise individual field crickets of the species Plebeiogryllus guttiventris based solely on the audio recordings of their calls. Recent advances in technology have made data collection easier, and consequently, large volumes of data, including acoustic data, have become available to ecologists. The task of acoustic animal identification thus requires the utilisation of models that are well suited for training on large datasets. It is for this very reason that convolutional neural networks (CNN) and recurrent neural networks (RNN) were utilised in this research. The results of these models were compared to results of a baseline random forest (RF) model as RFs can also be used to make acoustic classifications. Mel-frequency cepstral coefficients (MFCC), raw acoustic samples as well as two temporal features were extracted from each chirp in the cricket recordings and used as inputs to train the machine learning models. The raw acoustic samples were only used in the deep neural network (DNN) models (CNNs and RNNs) as these models have been successful in training on other raw forms of data such as images (for example, Krizhevsky et al. (2012)). Training on the MFCC features was conducted in two ways: the DNN models were trained on MFCC matrices that each spanned a chirp, whereas the RF models were trained on the MFCC frame vectors. This is because RFs are only able to train on vector representations of observations, not matrices.
The frame-level MFCC predictions obtained from the RF model were then aggregated into chirp-level predictions to facilitate the comparison with the other classification models. The best classification performance was achieved by the RF model trained on the MFCC features with a score of 99.67%. The worst performance was observed from the RF model trained upon the temporal features, which scored 67%. The DNN models attained on average 98.6% classification accuracies when trained on both MFCC features and the raw acoustic samples. These results show that individual recognition of the crickets using acoustics can be achieved with great success through the use of machine learning. Moreover, the performance of the deep learning models when trained upon the raw acoustic samples indicates that the feature (MFCC) extraction step can be bypassed; the deep learning machine algorithms can be trained directly on the raw acoustic data and still achieve great results.
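The aggregation of frame-level predictions into chirp-level predictions can be sketched as follows. The abstract does not state which aggregation rule was used, so the majority vote below is an assumption, as are the function and argument names.

```python
from collections import Counter, defaultdict


def chirp_level_predictions(frame_chirp_ids, frame_predictions):
    """Collapse per-frame labels into one label per chirp by majority vote.

    `frame_chirp_ids[i]` is the chirp that frame i belongs to and
    `frame_predictions[i]` is the individual predicted for that frame.
    """
    votes = defaultdict(Counter)
    for chirp_id, label in zip(frame_chirp_ids, frame_predictions):
        votes[chirp_id][label] += 1
    # Keep the most frequently predicted individual for each chirp.
    return {chirp_id: counts.most_common(1)[0][0] for chirp_id, counts in votes.items()}


if __name__ == "__main__":
    chirp_ids = [0, 0, 0, 1, 1]
    frame_labels = ["cricket_a", "cricket_a", "cricket_b", "cricket_b", "cricket_b"]
    print(chirp_level_predictions(chirp_ids, frame_labels))
```

Averaging predicted class probabilities across frames would be an equally plausible rule; either way, the result is one prediction per chirp that can be compared directly with models trained on whole-chirp inputs.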
Open Access
Modelling Multivariate Nonlinear Vaccine Induced Immune Responses
(2020) Lapham, Brendon M; Little, Francesca
Interpretable statistical models for multivariate vaccine induced immune response data are important as they provide a rigorous means of deciding which vaccine candidates should be advanced in the clinical trials process. We consider applications of several different statistical models to a vaccine data set which contains multivariate immune responses for several novel Tuberculosis vaccines and the current BCG vaccine. The immune responses in the data set have several features which the models need to account for. In particular, the models need to account for the multivariate repeated measures for the subjects, the nonlinear profiles of the immune responses, and the zero-inflated skew distributions of the immune responses. We find that Tweedie multivariate generalised linear mixed effect and latent variable models with cubic B-splines perform well for this data set relative to linear, nonlinear, and univariate Tweedie generalised linear mixed effect models. In addition, the Tweedie multivariate generalised linear mixed effect and latent variable models have several advantages over the other models we consider and are also capable of interpretation; importantly, we are able to draw clinical conclusions about which novel TB vaccine candidates appear to be the most promising.
Open Access
Nelson Siegel parameterisation of the South African Sovereign Yield Curve: an exploration of its predictors, a link to the main asset classes and implementation of systematic trading strategies
(2014) Petousis, Thalia
The aims of this research are firstly to model the South African Local Government Bond Yield curve according to the Nelson Siegel Parameterisation framework, as implemented in the pivotal work of Diebold and Li (2006) in forecasting the US Treasury curve.
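For reference, the Nelson Siegel curve in the dynamic form used by Diebold and Li (2006) expresses the yield at maturity \tau on date t through three time-varying factors. The formula below is the standard published parameterisation; the abstract itself does not reproduce it, so the notation is an assumption about what the thesis uses.

```latex
y_t(\tau) = \beta_{1t}
          + \beta_{2t}\,\frac{1 - e^{-\lambda_t \tau}}{\lambda_t \tau}
          + \beta_{3t}\left(\frac{1 - e^{-\lambda_t \tau}}{\lambda_t \tau} - e^{-\lambda_t \tau}\right)
```

Here \beta_{1t}, \beta_{2t} and \beta_{3t} are commonly interpreted as the level, slope and curvature of the yield curve, and \lambda_t controls how quickly the slope and curvature loadings decay with maturity.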
Open Access
Non-Linear diffusion processes and applications
(2016) Pienaar, Etienne A D; Varughese, Melvin
Diffusion models are useful tools for quantifying the dynamics of continuously evolving processes. Using diffusion models it is possible to formulate compact descriptions for the dynamics of real-world processes in terms of stochastic differential equations. Despite the flexibility of these models, they can often be extremely difficult to work with. This is especially true for non-linear and/or time-inhomogeneous diffusion models where even basic statistical properties of the process can be elusive. As such, we explore various techniques for analysing non-linear diffusion models in contexts ranging from conducting inference under discrete observation and solving first passage time problems, to the analysis of jump diffusion processes and highly non-linear diffusion processes. We apply the methodology to a number of real-world ecological and financial problems of interest and demonstrate how non-linear diffusion models can be used to better understand such phenomena. In conjunction with the methodology, we develop a series of software packages that can be used to accurately and efficiently analyse various classes of non-linear diffusion models.
Open Access
Plasticity and partitioning of foraging behaviour among and within sympatric Pygoscelis penguin populations
(2024) de Kock, Leandri; Oosthuizen, Christiaan
Central-place foragers, such as breeding seabirds, need to adjust their foraging behaviours in response to the growth and development of their offspring. As a result, they need to return to their nests regularly. These breeding constraints limit their foraging ranges. In the Southern Ocean, sympatrically breeding penguin species often have overlapping foraging ranges and niches that may lead to competition. Interspecific (between species) and intraspecific (within species) competition are important processes that may shape the foraging behaviours of penguins. However, competition pressure may vary with changes in environmental conditions across several scales (e.g. between years or within breeding seasons as a function of fluctuating central-place foraging constraints). This dissertation aimed to determine how the foraging behaviour of two closely related and co-occurring seabird species – chinstrap (Pygoscelis antarcticus) and gentoo (P. papua) penguins – differs among and within populations. I analysed high-resolution location (GPS) and dive data from 221 individuals breeding at two sites (Nelson Island and Kopaitic Island) in the West Antarctic Peninsula during the 2018/19 austral summer. These sites are characterised by different environmental conditions and penguin population sizes, two factors that may influence foraging behaviours and niche partitioning. The first chapter includes a general background of the study and the dissertation's aims and objectives. In the second chapter, I investigated intraspecific phenotypic plasticity of foraging behaviours among and within these penguin populations. In a subsequent chapter, I quantified foraging niche separation and identified factors that modify interspecific niche separation between chinstrap and gentoo penguins at the two sites. To test whether penguins exhibited phenotypic plasticity in foraging trip distances and duration, and to partition diving behaviours (e.g. 
maximum dive depth) among and within populations, I fitted a series of generalized linear mixed-effects models with species, site, breeding stage (incubation, brood and crèche) and environmental variables as covariates. In addition to comparing foraging behaviours between populations, my analysis quantified how individuals differed in their average behavioural expression using a repeatability index. I used an autocorrelated kernel density estimate approach to quantify space use and overlap between species as breeding transitioned from incubation to brood and crèche. Sites greatly influenced both species' foraging behaviours, with the Kopaitic Island environment being a colder, saltier environment which may be more suitable for foraging. Chinstrap penguins, which prey almost exclusively on Antarctic krill (Euphausia superba), along with gentoo penguins (dietary generalists) showed plasticity in foraging trip and dive behaviours between sites and breeding stages. During brood and crèche, chinstrap penguins contracted their foraging ranges and dived deeper, increasing niche overlap and opportunity for interspecific competition with gentoo penguins. Foraging niche overlap was influenced by site-specific environmental conditions. For example, warmer sea surface temperatures (which correlate with increased diving depths) and shallower bathymetry (which limits diving depth) at Nelson Island reduced opportunity for niche separation between the two species, especially during the brood and crèche stages of the breeding season. My results show that chinstrap and gentoo penguin foraging behaviours are plastic depending on site and breeding stage. Furthermore, my results show that seasonal changes in central-place foraging constraints and environmental conditions can modulate niche separation between these co-occurring species. A continuation in climate change (e.g. 
further warming sea temperatures) in this region of the Southern Ocean is expected to impact penguin prey distribution, which will likely lead to changes in foraging behaviour and niche overlap of chinstrap and gentoo penguins. While chinstrap and gentoo penguins may adjust their foraging behaviour to adapt to changing environmental conditions, these changes may have consequences for population dynamics and the future distribution and abundance of these species.
Open Access
Recurrent neural network language models in the context of under-resourced South African languages
(2018) Scarcella, Alessandro; Lacerda, Miguel
Over the past five years neural network models have been successful across a range of computational linguistic tasks. However, these triumphs have been concentrated in languages with significant resources such as large datasets. Thus, many languages, which are commonly referred to as under-resourced languages, have received little attention and have yet to benefit from recent advances. This investigation aims to evaluate the implications of recent advances in neural network language modelling techniques for under-resourced South African languages. Rudimentary, single layered recurrent neural networks (RNN) were used to model four South African text corpora. The accuracy of these models was compared directly to legacy approaches. A suite of hybrid models was then tested. Across all four datasets, neural networks led to overall better performing language models either directly or as part of a hybrid model. A short examination of punctuation marks in text data revealed that performance metrics for language models are greatly overestimated when punctuation marks have not been excluded. The investigation concludes by appraising the sensitivity of RNN language models (RNNLMs) to the size of the datasets by artificially constraining the datasets and evaluating the accuracy of the models. It is recommended that future research endeavours within this domain are directed towards evaluating more sophisticated RNNLMs as well as measuring their impact on application focused tasks such as speech recognition and machine translation.
Open Access
soMLier: A South African Wine Recommender System
(2022) Redelinghuys, Joshua; Er, Sebnem
Though several commercial wine recommender systems exist, they are largely tailored to consumers outside of South Africa (SA). Consequently, these systems are of limited use to novice wine consumers in SA. To address this, the aim of this research is to develop a system for South African consumers that yields high-quality wine recommendations, maximises the accuracy of predicted ratings for those recommendations and provides insights into why those suggestions were made. To achieve this, a hybrid system “soMLier” (pronounced “sommelier”) is built in this thesis that makes use of two datasets. Firstly, a database containing several attributes of South African wines such as the chemical composition, style, aroma, price and description was supplied by wine.co.za (a SA wine retailer). Secondly, for each wine in that database, the numeric 5-star ratings and textual reviews made by users worldwide were further scraped from Vivino.com to serve as a dataset of user preferences. Together, these are used to develop and compare several systems, the most optimal of which are combined in the final system. Item-based collaborative filtering methods are investigated first along with model-based techniques (such as matrix factorisation and neural networks) when applied to the user rating dataset to generate wine recommendations through the ranking of rating predictions. Respectively, these methods are determined to excel at generating lists of relevant wine recommendations and producing accurate corresponding predicted ratings. Next, the wine attribute data is used to explore the efficacy of content-based systems. Numeric features (such as price) are compared along with categorical features (such as style) using various distance measures and the relationships between the textual descriptions of the wines are determined using natural language processing methods. These methods are found to be most appropriate for explaining wine recommendations. 
Hence, the final hybrid system makes use of collaborative filtering to generate recommendations, matrix factorisation to predict user ratings, and content-based techniques to rationalise the wine suggestions made. This thesis contributes the “soMLier” system that is of specific use to SA wine consumers as it bridges the gap between the technologies used by highly-developed existing systems and the SA wine market. Though this final system would benefit from more explicit user data to establish a richer model of user preferences, it can ultimately assist consumers in exploring unfamiliar wines, discovering wines they will likely enjoy, and understanding their preferences of SA wine.
Open Access
Towards adaptive management of high-altitude grasslands: Ingula as a case study
(2015) Maphisa, David Hlosi; Altwegg, Res; Underhill, Leslie G
Eastern high-altitude grasslands of South Africa are centres for endemism and harbour fauna and flora of regional and international conservation concern. This area also provides important ecological services such as provision of water to communities downstream. Sweet and sour veld support beef livestock farming during summer months. The aesthetic beauty of the region makes the area a prime tourist destination too. More recently the area is becoming a target of other agricultural projects such as man-made forests. Other new developments that need to be mitigated against are development of renewable energy projects such as pumped water schemes to generate electricity or wind farms. Additional habitat is lost when these projects are connected to the national grid. In this thesis, I use bird data and vegetation data to compare, contrast and suggest management tools to manage this area. I present data that I collected at Ingula Pumped Storage Scheme spanning five years from the beginning of the construction of the scheme to near its completion in 2012 as a case study to manage similar habitats. Chapter 1 presents a brief overview of the ecological importance of this area and the history behind the construction of the pumped storage scheme at Ingula. A literature review in Chapter 2 investigates management tools to manage these grasslands for avian diversity. Fire and grazing are key management tools cited to make habitat suitable for birds. While few studies from this type of grassland exist, studies from outside South Africa suggest that fire and grazing supplement each other as management tools to make habitat suitable for species with contrasting ecological requirements. A mosaic of grass heights and cover across the landscape translates to species habitat suitability. Chapter 3 explores species richness through years, seasons and impact of grass height and cover on bird species richness. 
Species richness was highest in summer suggesting that management should make habitat for species suitable in summer when most priority species are likely to use the habitat. The main disadvantage of using bird species richness is that fieldworkers must know their species well. Secondly, use of species richness must be treated with caution because this method does not account for species detectability in time and in space. In Chapter 4 I use hierarchical distance sampling models which take into account both the detection and the biological process. To demonstrate this I used common grassland bird species which can easily be identified during monitoring. The downside of this approach is that because these species are common and therefore occur almost everywhere, they may not easily respond to lack of habitat heterogeneity. The technical disadvantage of using this method is the need to accurately allocate species to within distance bands, making this method challenging for fieldworkers. Chapter 5 presents random plot occupancy which records only detection-nondetection of birds during repeated plot surveys. This method accounts for observational and biological processes too and in addition implements rigorous statistical inferences to predict how birds respond to habitat variables as influenced by management decisions on fire and grazing. Finally, adverse weather conditions may hamper surveying all plots in some years. Through occupancy modelling it is possible to predict species occupancy on plots that were not surveyed during some years and finally this method has been improved to include rare species. This is my preferred method to monitor management effect on habitat suitability for birds at Ingula. Adaptive management, a pillar of which is adaptive monitoring, is a new paradigm shift in conservation. 
In Chapter 6, I capture interactions between burning and grazing and effects on grass height and cover to predict habitat suitability for birds, including large threatened Ingula birds, using a simulation model. This model sets the stage for implementing adaptive management through experimental plots to capture a set of management uncertainties regarding the use of fire and grazing as management tools. Chapter 7 summarizes the thesis and acknowledges that Ingula consists of other equally important habitats and ecosystems such as cool moist mountain forest and matrix of grassland wetland that equally need to be conserved.
Open Access
Unravelling black box machine learning methods using biplots
(2019) Rowan, Adriaan; Little, Francesca; Lubbe, Sugnet
Following the development of new mathematical techniques, the improvement of computer processing power and the increased availability of possible explanatory variables, the financial services industry is moving toward the use of new machine learning methods, such as neural networks, and away from older methods such as generalised linear models. However, their use is currently limited because they are seen as “black box” models, which give predictions without justifications and which are therefore not understood and cannot be trusted. The goal of this dissertation is to expand on the theory and use of biplots to visualise the impact of the various input factors on the output of the machine learning black box. Biplots are used because they give an optimal two-dimensional representation of the data set on which the machine learning model is based. The biplot allows every point on the biplot plane to be converted back to the original p-dimensions – in the same format as is used by the machine learning model. This allows the output of the model to be represented by colour coding each point on the biplot plane according to the output of an independently calibrated machine learning model. The interaction of the changing prediction probabilities – represented by the coloured output – in relation to the data points and the variable axes and category level points represented on the biplot, allows the machine learning model to be globally and locally interpreted. By visualising the models and their predictions, this dissertation aims to remove the stigma of calling non-linear models “black box” models and encourage their wider application in the financial services industry.
