Browsing by Subject "Statistical Sciences"
Now showing 1 - 20 of 118
- ItemOpen AccessA comparative study of stochastic models in biology (1997) Brandão, Anabela de Gusmão; Zucchini, Walter; Underhill, Les. In many instances, problems that arise in biology do not fall into any category for which standard statistical techniques are available to analyse them. In these situations, specific methods have to be developed to answer the questions put forward by biologists. In this thesis four different problems occurring in biology are investigated. A stochastic model is built in each case to describe the problem at hand. These models are not only effective as description tools but also afford strategies, consistent with conventional model selection processes, for dealing with standard statistical hypothesis testing situations. The abstracts of the papers resulting from these problems are presented below.
- ItemOpen AccessA Machine Learning Model for Octane Number Prediction (2023) Spencer, Victor; Moller, Klaus; Nyirenda Juwa Chiza. Assessing the quality of gasoline blends in blending circuits is an important task in quality control. Gasoline quality, however, cannot be measured directly on a process stream; a quality indicator which can be determined from the stream composition is therefore required. Various quality indicators have been used in the existing body of literature, but the indicator in this study is the Research Octane Number (RON), which measures the ignition of gasoline relative to pure octane (Abdul-Gani et al. 2018). Previous research has used empirical models in the form of phenomenological and machine learning models (González 2019). Phenomenological models have been used in the past as a way of encoding an engineer's understanding of the process as a system of differential equations. Machine learning models are data driven, with primarily regression and deep learning methods being used in the literature as prediction models. This study aims to develop a parsimonious machine learning model which can be used to predict the RON from the molar composition of the gasoline product stream. Regression, ensemble learning and Artificial Neural Networks (ANN) are used in this study. The ensemble learning models trained are Bayesian Additive Regression Trees (BART) and Gradient Boosting Machines (GBM). The raw data were scraped from multiple journals online, and the data frame comprises the volume compositions of the reference compounds and the RON of each blend. The existing data frame was extended to include the molar composition of the structural groups present in each of the blends. Structural groups, also referred to as functional groups, are specific substituents within molecules which may be responsible for the characteristic chemical reactions of the respective molecules.
This addition of structural groups adds a layer of information to differentiate between blends with different compound compositions but similar RON. It was hypothesised that the molar compositions of the additives and their substituent structural groups would rank highest and that the molar composition of n-heptane would rank lowest. For the Multiple Linear Regression (MLR) models, two cases were trained: one with interaction parameters and another without. Both cases were trained with and without constraints on the compound compositions. For the ensemble learning case, a BART model with 200 trees and a GBM model with 1998 trees were trained. Four Single Layer Feed-forward Neural Network (SLFN) models were trained, with 3, 5, 10 and 15 nodes respectively. This choice of neural network architecture was made because the data frame was small, with only 12 input variables and 350 observations. Prior to training the models, an Exploratory Data Analysis was carried out to assess potential dimensionality reduction, correlations and outliers. The final regression model was the interaction model, with a test MSE of 7.54 and an adjusted R² of 0.986. The BART model obtained a test MSE of 13.74 and an adjusted R² of 0.983. The GBM model had a test MSE of 38.12 and an adjusted R² of 0.917. Lastly, the best performing ANN was the 10-node SLFN, which obtained a test MSE of 11.26 and an adjusted R² of 0.969. For each model, a variable importance analysis was carried out, and it was observed that the molar composition of n-heptane consistently ranked high. In addition to these predictive statistics, the parity plots, residual plots and Analysis of Variance (ANOVA) were analysed and taken into consideration in evaluating the performance of each trained model. It was concluded that the MLR model performed best, followed by the BART model. The ANN models ranked third and the GBM model last.
The hypothesis that the molar compositions of the additives and their substituent structural groups would rank highest and n-heptane lowest was disproved, as the molar composition of n-heptane and its substituent structural groups consistently ranked high. The recommendation for this study is to train the models on a more representative data set in future, and to use a hybrid model which comprises a phenomenological model and a machine learning model, both for best results and to reduce the bias of the model in regions with few data points. The next step of the study is the integration of the new model into the plant-wide Advanced Process Control (APC).
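The interaction-model idea above can be illustrated with a small sketch on synthetic data (not the thesis dataset; all coefficients and sample sizes below are invented for illustration): fit a linear model with a pairwise interaction term by ordinary least squares, then report the test MSE and adjusted R² used as performance measures in the abstract.

```python
import numpy as np

# Synthetic example: a response driven by two inputs and their interaction.
rng = np.random.default_rng(0)
n, n_train = 350, 280
x1, x2 = rng.uniform(0, 1, n), rng.uniform(0, 1, n)
y = 90 + 8 * x1 - 5 * x2 + 12 * x1 * x2 + rng.normal(0, 1, n)

# Design matrix: intercept, main effects, and the x1*x2 interaction term.
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
train, test = slice(0, n_train), slice(n_train, n)

beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
test_mse = np.mean((y[test] - X[test] @ beta) ** 2)

# Adjusted R-squared on the training fit (p predictors, intercept excluded).
resid = y[train] - X[train] @ beta
ss_res = np.sum(resid ** 2)
ss_tot = np.sum((y[train] - y[train].mean()) ** 2)
p = X.shape[1] - 1
adj_r2 = 1 - (ss_res / (n_train - p - 1)) / (ss_tot / (n_train - 1))
```

The adjusted R² penalises the residual sum of squares by the number of fitted parameters, which is what makes it a fairer yardstick than raw R² when comparing models of different complexity.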
- ItemOpen AccessA multivariate statistical approach to the assessment of nutrition status (1972) Fellingham, Stephen A; Troskie, Casper G. Attention is drawn to the confusion which surrounds the concept of nutrition status, and the problem of selecting an optimum subset of variables by which nutrition status can best be assessed is defined. Using a multidisciplinary data set of some 60 variables observed on 1898 school children from four racial groups, the study aims to identify statistically both those variables which are unrelated to nutrition status and those which, although related, are so highly correlated that the measurement of all would be an unnecessary extravagance. It is found that, while the somatometric variables provide a reasonably good (but non-specific) estimate of nutrition status, the disciplines form meaningful groups and the variables of the various disciplines tend to supplement rather than replicate each other. Certain variables from most of the disciplines are, therefore, necessary for an optimum and specific estimate of nutrition status. Both the potential and the shortcomings of a number of statistical techniques are demonstrated.
- ItemOpen AccessA web API service for calculating credit attributable to authors (2025) Burgess, Marc; Georg, Co-Pierre. The academic project “UniCoin” is designed to use the Ethereum blockchain to provide researchers with a way to licence their work. This provides a relatively economically efficient way to receive compensation for novel ideas, but is limited to research that is commercialisable. Foundational research is less commercialisable and is an environment where funding can be an issue. By considering the indirect impact of those providing the foundations upon which new research is built, it becomes possible to allocate income more fairly and incentivise ongoing research. This research focuses on providing a means to allocate credit to the authors of works cited by research on the UniCoin platform, by means of citation network analysis and the calculation of centrality measures. In particular, I present a web service that is able to generate citation networks for a given piece of research and, from them, to efficiently calculate measures of the contribution of other papers to the research. I demonstrate this functionality with case studies. This system can be integrated into UniCoin in order to provide a fair allocation of income to all contributors.
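The centrality idea behind such credit allocation can be sketched with a PageRank-style iteration on a tiny citation network in pure Python. This is a generic illustration; the actual UniCoin service, its API, and the specific centrality measures used in the thesis are not reproduced here.

```python
# Score papers in a citation network: credit flows along citation links,
# with a damping factor so every paper retains a baseline share.
def pagerank(edges, damping=0.85, iters=100):
    # edges: (citing, cited) pairs.
    nodes = sorted({n for e in edges for n in e})
    out = {n: [] for n in nodes}
    for citing, cited in edges:
        out[citing].append(cited)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n] or nodes   # dangling nodes spread uniformly
            for t in targets:
                new[t] += damping * rank[n] / len(targets)
        rank = new
    return rank

# Paper D is cited by A, B and C (and itself cites A), so it should
# accumulate the most credit.
credit = pagerank([("A", "D"), ("B", "D"), ("C", "D"), ("D", "A")])
```

Because the scores sum to one, they can be read directly as shares of a fixed pool of income to be allocated among cited works.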
- ItemOpen AccessAdapting Large-Scale Speaker-Independent Automatic Speech Recognition to Dysarthric Speech (2022) Houston, Charles; Britz, Stefan S; Durbach, Ian. Despite recent improvements in speaker-independent automatic speech recognition (ASR), the performance of large-scale speech recognition systems is still significantly worse on dysarthric speech than on standard speech. Both the inherent noise of dysarthric speech and the lack of large datasets add to the difficulty of solving this problem. This thesis explores different approaches to improving the performance of Deep Learning ASR systems on dysarthric speech. The primary goal was to find out whether a model trained on thousands of hours of standard speech could successfully be fine-tuned to dysarthric speech. Deep Speech – an open-source Deep Learning based speech recognition system developed by Mozilla – was used as the baseline model. The UASpeech dataset, composed of utterances from 15 speakers with cerebral palsy, was used as the source of dysarthric speech. In addition to fine-tuning, layer freezing, data augmentation and re-initialization were investigated. Data augmentation took the form of time and frequency masking, while layer freezing consisted of fixing the first three feature extraction layers of Deep Speech during fine-tuning. Re-initialization was achieved by randomly initializing the weights of Deep Speech and training from scratch. A separate encoder-decoder recurrent neural network consisting of far fewer parameters was also trained from scratch. The Deep Speech acoustic model obtained a word error rate (WER) of 141.53% on the UASpeech test set of commands, digits, the radio alphabet, common words, and uncommon words. Once fine-tuned to dysarthric speech, a WER of 70.30% was achieved, demonstrating the ability of fine-tuning to improve upon the performance of a model initially trained on standard speech.
While fine-tuning led to a substantial improvement in performance, the benefit of data augmentation was far more subtle, improving on the fine-tuned model by a mere 1.31%. Freezing the first three layers of Deep Speech and fine-tuning the remaining layers was slightly detrimental, increasing the WER by 0.89%. Finally, both re-initialization of Deep Speech's weights and the encoder-decoder model generated highly inaccurate predictions. The best performing model was Deep Speech fine-tuned to augmented dysarthric speech, which achieved a WER of 60.72% with the inclusion of a language model.
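The WER metric quoted throughout this abstract is the word-level edit distance (substitutions, insertions and deletions) divided by the number of reference words, which a short dynamic-programming sketch makes concrete:

```python
# Word error rate: edit distance between word sequences over reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Because insertions are counted against the reference length, WER can exceed 100%, which is how a baseline figure such as 141.53% arises.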
- ItemOpen AccessAgent-based model of the market penetration of a new product (2014) Magadla, Thandulwazi; Durbach, Ian; Scott, Leanne. This dissertation presents an agent-based model that is used to investigate the market penetration of a new product within a competitive market. The market consists of consumers who belong to a social network that serves as a substrate over which consumers exchange positive and negative word-of-mouth communication about the products that they use. Market dynamics are influenced by factors such as product quality; the level of satisfaction that consumers derive from using the products in the market; switching constraints that make it difficult for consumers to switch between products; the word-of-mouth that consumers exchange; and the structure of the social network that consumers belong to. Various scenarios are simulated in order to investigate the effect of these factors on the market penetration of a new product. The simulation results suggest that:
■ A new product reaches fewer new consumers and acquires a lower market share when consumers switch less frequently between products.
■ A new product reaches more new consumers and acquires a higher market share when it is of better quality than the existing products, because more positive word-of-mouth is disseminated about it.
■ When there are products with switching constraints in the market, launching a new product with switching constraints results in a higher market share than launching it without switching constraints. However, it reaches fewer new consumers, because switching constraints result in negative word-of-mouth being disseminated about it, which deters other consumers from using it.
Some factors, such as the fussiness of consumers; the shape and size of consumers' social networks; and the type of messages that consumers transmit and with whom and how often they communicate about a product, may be beyond the control of marketing managers.
However, these factors can potentially be influenced through a marketing strategy that encourages consumers to exchange positive word-of-mouth both with consumers that are familiar with a product and those who are not.
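The word-of-mouth mechanism described above can be caricatured with a toy threshold-adoption simulation on a random social network. This is not the dissertation's model; all parameter values (agent count, link count, adoption threshold) are illustrative.

```python
import random

# Agents adopt a new product once enough of their neighbours have adopted,
# mimicking positive word-of-mouth spreading over a social network.
def simulate_adoption(n_agents=100, n_links=300, threshold=2,
                      seed_adopters=5, steps=20, rng_seed=1):
    rng = random.Random(rng_seed)
    neighbours = {i: set() for i in range(n_agents)}
    for _ in range(n_links):
        a, b = rng.sample(range(n_agents), 2)
        neighbours[a].add(b)
        neighbours[b].add(a)
    adopted = set(rng.sample(range(n_agents), seed_adopters))
    for _ in range(steps):
        # Adopt once at least `threshold` neighbours have adopted.
        new = {i for i in range(n_agents) if i not in adopted
               and len(neighbours[i] & adopted) >= threshold}
        if not new:
            break
        adopted |= new
    return len(adopted) / n_agents   # market penetration

penetration = simulate_adoption()
```

Lowering the adoption threshold (consumers switching more readily) can only enlarge the adopted set at every step, which mirrors the first simulation finding above.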
- ItemOpen AccessAn Application of Generative Adversarial Networks to One-Dimensional Value-at-Risk (2024) Swallow, Rachel; Mahomed, Obeid. A generative adversarial network (GAN) is an implicit generative model made up of two neural networks. This minor dissertation applies GANs to recover target statistical distributions. GANs have a distinctive training architecture designed to create examples that reproduce target data samples. These models have been applied successfully in high-dimensional domains such as natural image generation and processing. Much less research has been reported on applications with low-dimensional distributions, where properties of GANs may be better identified and understood. One such area in finance is the use of GANs for estimating value-at-risk (VaR). Through this financial application, this dissertation introduces readers to the concepts and practical implementations of GAN variants to generate one-dimensional portfolio returns over a single period. Large portions of the discussion should be accessible to anyone who has taken an entry-level statistics course. It is aimed at data science or finance students looking to better their understanding of GANs and the potential of these models for other financial applications. Five GAN loss variants are introduced and three of these models are practically implemented to estimate VaR. The GAN estimates are compared to more traditional VaR estimation techniques and all models are backtested. Most GAN models trained in this dissertation are able to capture key features of each of the distributions; however, these models do not outperform historical VaR estimates.
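The historical VaR baseline that the GAN models are judged against is simply an empirical quantile of past returns, which a short numpy sketch shows (the return distribution below is simulated, not the dissertation's data):

```python
import numpy as np

# Simulated one-period portfolio returns.
rng = np.random.default_rng(42)
returns = rng.normal(loc=0.0005, scale=0.01, size=1000)

# 99% one-day historical VaR: the loss exceeded on roughly 1% of days.
var_99 = -np.quantile(returns, 0.01)

# A simple backtest: count exceptions, i.e. days whose loss exceeds the VaR.
exceptions = int(np.sum(-returns > var_99))
```

A well-calibrated 99% VaR estimate should see close to 1% of days breach it in the backtest; systematic deviations signal model misspecification.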
- ItemOpen AccessAn unsupervised approach to COVID-19 fake tweet detection (2024) Jarana, Bulungisa; Ngwenya, Mzabalazo. Context: With the ongoing COVID-19 pandemic, social media platforms have become a crucial source of information. However, not all information shared on these platforms is accurate. The dissemination of fake news, intentional or unintentional, can lead to panic among readers and further exacerbate the effects of the pandemic. Objectives: This research project aims to explore the potential of unsupervised machine learning algorithms in differentiating between genuine and fake COVID-19 news shared on Twitter. The methodology includes a literature review, experimental analysis, and the utilization of a Twitter dataset. Methods: The study used both the Mini-Batch K-means and K-means clustering algorithms to group the Twitter data into two clusters. Word embedding techniques such as TF-IDF, Word2Vec, and BERT were employed because machine learning models cannot process raw text directly. Results: The results on the test data show that K-means was the best performing algorithm in determining fake tweets about COVID-19: K-means using BERT word embeddings was the best performing model (76% accuracy), followed by Mini-Batch K-means using TF-IDF word embeddings (69% accuracy). Conclusions: The study demonstrates that clustering Twitter COVID-19 news as genuine or fake using the K-means and Mini-Batch K-means algorithms is feasible. Keywords: Clustering, Machine Learning, unsupervised learning, K-Means, Mini-Batch K-Means, TF-IDF, Word2Vec, BERT, Confusion Matrix, Truncated SVD (Singular Value Decomposition), t-distributed stochastic neighbourhood embedding (t-SNE)
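The K-means step at the heart of this approach can be sketched with plain Lloyd's iteration on toy two-dimensional points (a deterministic farthest-point initialisation is used here for reproducibility; the study clusters TF-IDF, Word2Vec and BERT embeddings of tweets instead):

```python
import numpy as np

def kmeans(X, k=2, iters=50):
    # Farthest-point initialisation: each new centre is the point farthest
    # from the centres chosen so far.
    centres = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centres], axis=0)
        centres.append(X[np.argmax(d)])
    centres = np.array(centres)
    for _ in range(iters):
        # Assign each point to its nearest centre, then recompute centres.
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centres = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centres[j]
                            for j in range(k)])
    return labels, centres

# Two well-separated blobs stand in for 'genuine' and 'fake' tweet embeddings.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels, centres = kmeans(X)
```

Mini-Batch K-means follows the same assign-and-update scheme but updates centres from small random subsets of the data, trading a little accuracy for speed on large corpora.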
- ItemOpen AccessThe analysis of some bivariate astronomical time series (1993) Koen, Marthinus Christoffel; Zucchini, Walter. In the first part of the thesis, a linear time domain transfer function is fitted to satellite observations of a variable galaxy, NGC5548. The transfer functions relate an input series (ultraviolet continuum flux) to an output series (emission line flux). The methodology for fitting transfer functions is briefly described. The autocorrelation structure of the observations of NGC5548 in different electromagnetic spectral bands is investigated, and appropriate univariate autoregressive moving average models given. The results of extensive transfer function fitting using respectively the λ1337 and λ1350 continuum variations as input series, are presented. There is little evidence for a dead time in the response of the emission line variations which are presumed driven by the continuum. Part 2 of the thesis is devoted to the estimation of the lag between two irregularly spaced astronomical time series. Lag estimation methods which have been used in the astronomy literature are reviewed. Some problems are pointed out, particularly the influence of autocorrelation and non-stationarity of the series. If the two series can be modelled as random walks, both these problems can be dealt with efficiently. Maximum likelihood estimation of the random walk and measurement error variances, as well as the lag between the two series, is discussed. Large-sample properties of the estimators are derived. An efficient computational procedure for the likelihood which exploits the sparseness of the covariance matrix, is briefly described. Results are derived for two example data sets: the variations in the two gravitationally lensed images of a quasar, and brightness changes of the active galaxy NGC3783 in two different wavelengths. The thesis is concluded with a brief consideration of other analysis methods which appear interesting.
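The lag-estimation problem can be illustrated on regularly spaced toy data with the sample cross-correlation function; the thesis handles the harder case of irregularly spaced series via a random-walk likelihood, but the sketch below shows why differencing matters when the series are non-stationary random walks:

```python
import numpy as np

rng = np.random.default_rng(0)
n, true_lag = 500, 7
base = np.cumsum(rng.normal(size=n + true_lag))     # a random walk
x = base[true_lag:]                                 # driving series
y = base[:n] + rng.normal(scale=0.1, size=n)        # delayed, noisy response

# Difference first: correlations between the raw (non-stationary) series
# would be dominated by shared trends rather than the lag.
dx, dy = np.diff(x), np.diff(y)

def xcorr(k):
    # Sample correlation between dx shifted by k and dy.
    a, b = (dx[:dx.size - k], dy[k:]) if k >= 0 else (dx[-k:], dy[:dy.size + k])
    return np.corrcoef(a, b)[0, 1]

best = max(range(-20, 21), key=xcorr)   # estimated lag
```

With irregular sampling this shifted-correlation trick breaks down, which is what motivates the maximum-likelihood random-walk formulation in the thesis.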
- ItemOpen AccessApplications of Machine Learning in Apple Crop Yield Prediction (2021) van den Heever, Deirdre; Britz, Stefan S. This study proposes the application of machine learning techniques to predict yield in the apple industry. Crop yield prediction is important because it impacts resource and capacity planning. It is, however, challenging because yield is affected by multiple interrelated factors such as climate conditions and orchard management practices. Machine learning methods have the ability to model complex relationships between input and output features. This study considers the following machine learning methods for apple yield prediction: multiple linear regression, artificial neural networks, random forests and gradient boosting. The models are trained, optimised, and evaluated using both a random and chronological data split, and the out-of-sample results are compared to find the best-suited model. The methodology is based on a literature analysis that aims to provide a holistic view of the field of study by including research in the following domains: smart farming, machine learning, apple crop management and crop yield prediction. The models are built using apple production data and environmental factors, with the modelled yield measured in metric tonnes per hectare. The results show that the random forest model is the best performing model overall, with a Root Mean Square Error (RMSE) of 21.52 and 14.14 using the chronological and random data splits respectively. The final machine learning model outperforms simple estimator models, showing that a data-driven approach using machine learning methods has the potential to benefit apple growers.
- ItemOpen AccessApplying imputation and statistical learning to predict gamma-glutamyl transferase in underwriting data (2023) Perumal, Yevashan; Britz, Stefan. Insurance underwriting can be time-consuming and costly for both insurers and customers. However, the insight gained is of critical importance in addressing the information asymmetry between insurers and customers in terms of establishing a customer's risk profile. Consequently, any test that assists in providing a risk assessment is critical in allowing insurance companies to manage risk and price their products appropriately. Gamma-glutamyl Transferase (GGT) is an enzyme which has been used by insurers in underwriting medical tests as an indicator of potential adverse outcomes. However, due to complexities such as differing underwriting strategies, data collection and data storage issues, not every customer on an insurer's books will have a GGT value or even a complete data profile. This research investigates whether statistical techniques such as imputation and supervised learning can be used in conjunction with available medical, demographic, underwriting and policy data to accurately predict GGT values. A combination of multivariate imputation by chained equations (MICE) and extreme gradient boosted trees (XGBoost) offers a 31% improvement in accuracy compared to a naïve prediction. However, there does appear to be a limit to the performance achieved from all implemented techniques with the analysed dataset, with various model combinations yielding root mean squared error (RMSE) values within a narrow range. In addition, when comparing the predictions from a separate, unlabelled dataset to actual data, it appears as though predictions from the models cannot be reliably deemed to be from the same distribution. This indicates that further research is required before insurers can reliably switch out blood-work based GGT results for those from a supervised learning model.
Keywords: insurance, underwriting, gamma-glutamyl transferase, imputation, supervised learning
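The chained-equation idea can be sketched on synthetic data. The thesis pairs MICE with XGBoost; below, a single incomplete column is instead filled by repeatedly regressing it on the complete ones with ordinary least squares, and the variable names (age, bmi, ggt) and coefficients are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
age = rng.uniform(20, 60, n)
bmi = 20 + 0.1 * age + rng.normal(0, 1, n)
ggt = 5 + 0.5 * age + 2.0 * bmi + rng.normal(0, 2, n)

mask = rng.random(n) < 0.2            # knock out ~20% of the GGT values
ggt_obs = ggt.copy()
ggt_obs[mask] = np.nan

# Initialise missing entries with the column mean, then iterate
# regression-based fills. With several incomplete columns, the chained
# cycle would visit each column in turn.
mean_fill = np.nanmean(ggt_obs)
ggt_imp = np.where(mask, mean_fill, ggt_obs)
A = np.column_stack([np.ones(n), age, bmi])
for _ in range(10):
    beta, *_ = np.linalg.lstsq(A[~mask], ggt_obs[~mask], rcond=None)
    ggt_imp[mask] = A[mask] @ beta

rmse_imputed = np.sqrt(np.mean((ggt_imp[mask] - ggt[mask]) ** 2))
rmse_mean = np.sqrt(np.mean((mean_fill - ggt[mask]) ** 2))
```

The regression-based fill should beat the naïve mean fill whenever the incomplete variable is genuinely predictable from the observed ones, which is the premise of the thesis.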
- ItemOpen AccessAutomated detection and classification of red roman in unconstrained underwater environments using Mask R-CNN (2021) Conrady, Christopher; Er, Sebnem; Attwood, Colin G. The availability of relatively cheap, high-resolution digital cameras has led to an exponential increase in the capture of natural environments and their inhabitants. Video-based surveys are particularly useful in the underwater domain where observation by humans can be expensive, dangerous, inaccessible, or destructive to the natural environment. Moreover, video-based surveys offer an unedited record of biodiversity at a given point in time – one that is not reliant on human recall or susceptible to observer bias. In addition, secondary data that is useful in scientific study (date, time, location, etc.) are by default stored in almost all digital formats as metadata. When analysed effectively, this growing body of digital data offers the opportunity for robust and independently reproducible scientific study of marine biodiversity (and how this might change over time, for example). However, the manual review of image and video data by humans is slow, expensive, and not scalable. A large majority of marine data has never been analysed by human experts. This necessitates computer-based (or automated) methods of analysis that can be deployed at a fraction of the time and cost, at a comparable accuracy. Mask R-CNN, a deep learning object recognition framework, has outperformed all previous state-of-the-art results on competitive benchmarking tasks. Despite this success, Mask R-CNN and other state-of-the-art object recognition techniques have not been widely applied in the underwater domain, and not at all within the context of South Africa.
To address this gap in the literature, this thesis contributes (i) a novel image dataset of red roman (Chrysoblephus laticeps), a fish species endemic to Southern Africa, and (ii) a Mask R-CNN framework for the automated localisation, classification, counting, and tracking of red roman in unconstrained underwater environments. The model, trained on an 80:10:10 split, accurately detected and classified red roman on the training dataset (mAP50 = 80.29%), validation dataset (mAP50 = 80.35%), as well as on previously unseen footage (test dataset) (mAP50 = 81.45%). The fact that the model performs equally well on unseen footage suggests that it is capable of generalising to new streams of data not used in this research – this is critical for the utility of any statistical model outside of “laboratory conditions”. This research serves as a proof-of-concept that machine learning based methods of video analysis of marine data can replace or at least supplement human analysis.
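The mAP50 figures above rest on an intersection-over-union (IoU) test: a detection counts as correct when its IoU with a ground-truth box is at least 0.5. A minimal sketch of that overlap measure, with boxes given as (x1, y1, x2, y2):

```python
# Intersection-over-union of two axis-aligned boxes.
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)
```

Averaging the precision of detections that pass this 0.5 threshold across recall levels and classes gives mAP50; the Mask R-CNN framework additionally scores pixel masks, which the box-only sketch omits.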
- ItemOpen AccessBayesian analysis of historical functional linear models with application to air pollution forecasting (2022) Junglee, Yovna; Erni, Birgit; Clark, Allan. Historical functional linear models are used to analyse the relationship between a functional response and a functional predictor whereby only the past of the predictor process can affect the current outcome. In this work, we develop a Bayesian framework for the analysis of the historical functional linear model with multiple predictors. Different from existing Bayesian approaches to historical functional linear models, our proposed methodology is able to handle multiple functional covariates with measurement error and sparseness. The proposed model utilises the well-established connection between non-parametric smoothing and Bayesian methods to reduce sensitivity to the number of basis functions which are used to model the functional regression coefficients. We investigate two methods of estimation within the Bayesian framework. We first propose to smooth the functional predictors independently from the regression model in a two-stage analysis, and secondly, jointly with the regression model. The efficiency of the MCMC algorithms is increased by implementing a Cholesky decomposition to sample from high-dimensional Gaussian distributions and by taking advantage of the orthogonal properties of the functional principal components used to model the functional covariates. Our extensive simulation study shows substantial improvements in both the recovery of the functional regression surface and the true underlying functional response, with higher coverage probabilities, when compared to a classical model under which the measurement error is unaccounted for. We further found that the Bayesian two-stage analysis outperforms the joint model under certain conditions. A major challenge with the collection of environmental data is that they are prone to measurement error, both random and systematic.
Hence, our methodology provides a reliable functional data analytic framework for modelling environmental data. Our focus is on the application of our method to forecast the level of daily atmospheric pollutants using meteorological information such as hourly records of temperature, humidity and wind speed from data collected by the City of Cape Town, South Africa. The forecasts provided by the proposed Bayesian two-stage model are highly competitive against the functional autoregressive models which are traditionally used for functional time series.
- ItemOpen AccessBehavioural, microhabitat, and phylogenetic dimensions of intrasexual contest competition in combatant monkey beetles (Scarabaeidae: Hopliini) (2021) Rink, Ariella N; Altwegg, Res; Colville, Jonathan F; Bowie, Rauri C K. The importance of sexual selection as a driver of evolution, from microevolution to speciation, has overwhelmingly been studied in the context of female choice, but there is evidence that male-male competition can also drive evolution. Recent reviews of the intrasexual competition literature have developed several hypotheses of weapon divergence in both allopatry and sympatry and have suggested means by which weapon divergence may cause reproductive isolation and speciation, both alone and together with mate choice and ecological selection. Here, I assess the role of sexual selection, in the context of environmental variation at the level of the contest substrate and the developmental environment, in contributing to microevolution within the monkey beetles (Coleoptera: Scarabaeidae: Hopliini), a taxonomically and phenotypically diverse group of pollinating insects in the Greater Cape Floristic Region (GCFR) that shows a high degree of sexual dimorphism and mating behaviour driven by male-male competition. I build on previous observations of hind leg use in intrasexual male-male contest for reproductive access to females by showing that, in Heterochelus chiragricus, contests occur in the context of a significantly male-skewed sex-ratio and consist of vigorous wrestling and pushing between two males on the flower heads occupied by embedded, feeding females, who apparently exert no mate choice. Contest outcomes are influenced by hind femur size and residency effects, and I apply hypotheses informed by evolutionary game theory to assess how males make decisions regarding persistence versus retreat.
I proceed to assess the evidence for the ‘divergent fighting contexts' hypothesis, which predicts weapon divergence driven by intrasexual contest competition in the context of variation in the contest substrate. I find that hind leg size in another combatant monkey beetle, the species complex Scelophysa trimeni, varies across gradients of flower size among several spatially distributed populations, suggesting that variation in flower size (the contest substrate) mediates selection for weapon morphologies that maximise performance under different fighting styles necessitated by differences in the contest substrate. I also find that male elytral colour varies both across gradients in the developmental environment and with variation in flower colour, suggesting that this trait may function as an honest signal of male fitness, but also that it may be under selection to maximise signal transmission against variable backgrounds of contest substrates. Finally, I quantify the extent to which integration, modularity, multivariate allometry, and phylogenetic effects influence the evolutionary lability of male monkey beetles' hind legs, and so mediate the pace of their evolutionary diversification in response to these varying contest substrates. My findings support a two-module pattern of modularity at both static and evolutionary levels, and I find that allometric scaling relationships are conserved within S. trimeni. These findings indicate that monkey beetle weapons are relatively unconstrained in their evolutionary diversification across divergent fighting substrates. I conclude by discussing these findings within the broader field of sexual selection and monkey beetle ecology and suggest directions for further work. The findings presented here support a role for sexual selection, interacting with variation in the flower contest substrate, as an important driver of the diversification of monkey beetles in the GCFR.
- ItemOpen AccessBioacoustic classification of Hainan gibbon call types using deep learning (2023) Luphade, Nonhlanhla; Durbach, Ian; Britz, Stefan; Dufourq, Emmanuel. In Bawangling National Nature Reserve (BNNR), Hainan, China, there exists a critically endangered primate known as the Hainan gibbon Nomascus hainanus. Many species, including the Hainan gibbon, are at high risk of extinction due to factors such as unsustainable hunting, climate change, and deforestation. Hainan gibbons live in social groups, and the ability to discriminate between the groups is useful for tracking migration patterns, population management, and identification of new groups. To date, no study has attempted to distinguish between the groups. More recently, researchers have begun using deep learning to answer ecological questions, in a similar way that deep learning has been used successfully in computer vision and audio classification tasks. This study is the first attempt at investigating how deep learning can be used to distinguish between the Hainan gibbon social groups using only the acoustic data recorded in BNNR. Two convolutional neural networks (CNNs) were developed: the first was a binary classification model to distinguish gibbon calls from non-gibbon calls, and the second was a group classifier to distinguish between the social groups in BNNR. The audio data were converted into mel-scale spectrograms, resulting in images used as input to train the CNNs. Two steps were taken to train reliable models: firstly, data augmentation techniques were explored to increase the amount of training data, and secondly, hyperparameter tuning was conducted. The binary classifier obtained a testing accuracy of 86%. The findings reveal that the model is able to distinguish between gibbon calls and non-gibbon calls.
The social group model was not able to distinguish between the social groups, as it predicted the majority of the calls as belonging to a single group. The results of this study demonstrate the usefulness of deep learning in addressing ecological questions that would otherwise be very challenging for a human to answer.
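The preprocessing step described in this abstract, converting audio into a mel-scale spectrogram for CNN input, can be sketched in plain numpy. This is not the dissertation's code, and the parameter choices (16 kHz sample rate, 512-point FFT, 40 mel bands) are illustrative assumptions rather than the study's actual settings; a practical implementation would typically use a library such as librosa.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters with centres spaced evenly on the mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):
            fb[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fb[i - 1, k] = (right - k) / max(right - centre, 1)
    return fb

def mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    # Frame the signal, apply a Hann window, take the FFT power,
    # then project the power spectrum onto the mel filterbank
    frames = [signal[s:s + n_fft] * np.hanning(n_fft)
              for s in range(0, len(signal) - n_fft + 1, hop)]
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log1p(power @ mel_filterbank(n_mels, n_fft, sr).T)
```

The resulting log-mel matrix (time frames by mel bands) can be treated as a single-channel image, which is the form of input the CNNs described above would consume.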
- ItemOpen AccessBiplot graphical display techniques(1991) Iloni, Karen; Underhill, Leslie GThe thesis deals with graphical display techniques based on the singular value decomposition. These techniques, known as biplots, are used to find low-dimensional representations of multidimensional data matrices. The aim of the thesis is to provide a review of biplots for a practising statistician who is not familiar with the area. It therefore focuses on the underlying theory, assuming a statistician's standard knowledge of matrix algebra, and on the interpretation of the various plots. The topic falls in the realm of descriptive statistics. As such, the methods are chiefly exploratory; they are a means of summarising the data. The data matrix is represented in a reduced number of dimensions, usually two, for simplicity of display. The aim is to summarise the information in the matrix and to present a visual representation of it, in the hope that the "gain in interpretability far exceeds the loss in information" (Greenacre, 1984). A graphical description is often easier to understand than a numerical one. Histograms and pie charts are familiar forms of data representation to many people with only a rudimentary statistical understanding, but these are applicable to univariate data; for multivariate data sets, univariate methods do not reveal interesting relationships in the data set as a whole. In addition, a biplot can be presented in a manner which can be readily understood by non-statistically minded individuals. Greenacre (1984) comments that only in recent years has the value of statistical graphics been recognised. Young (1989) notes that there has recently been a shift in emphasis among statisticians towards exploratory data analysis methods. This school of thought was given momentum by the publication of the book "Exploratory Data Analysis" (Tukey, 1977).
The trend has been facilitated by advances in computer technology which have increased both the power and the accessibility of computers. Biplot techniques include the popular correspondence analysis. The original proponents of correspondence analysis (among them Benzecri) reject probabilistic modelling. At the other extreme, some view graphical display techniques as a mere preliminary to the more traditional statistical approaches; under the latter view, graphical display techniques are used to suggest models and hypotheses. The emphasis in exploratory data techniques such as graphical displays is on 'getting a feel' for the data rather than on building models and testing hypotheses. These methods do not replace model building and hypothesis testing, but supplement them. The essence of the philosophy is that models are suggested by the data, rather than the frequently followed route of first fitting a model. Some work has gone into developing inferential methods, with hypothesis tests and associated p-values, for biplot-type techniques (Lebart et al., 1984; Greenacre, 1984). However, this aspect is not important if the techniques are viewed merely as exploratory. Chapter Two provides the mathematical concepts necessary for understanding biplots. Chapter Three explains exactly what a biplot is, and lays the theoretical framework for the biplot techniques that follow; the goal of this chapter is to provide a framework in which biplot techniques can be classified and described. Correlation biplots are described in Chapter Four. Chapter Five discusses the principal component biplot, and the link between this and principal component analysis is drawn. In Chapter Six, correspondence analysis is presented. In Chapter Seven practical issues such as the choice of centre are discussed. Practical examples are presented in Chapter Eight; the aim is that these examples illustrate techniques commonly applicable in practice.
Evaluation and choice of biplot is discussed in Chapter Nine.
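The core construction this thesis reviews, a biplot derived from the singular value decomposition, can be sketched in a few lines of numpy. This is an illustrative sketch, not the thesis's own code: the function name, the choice of column-centring, and the default alpha are assumptions. The parameter alpha controls how the singular values are split between the row and column markers, which is what distinguishes the different biplot variants the thesis classifies.

```python
import numpy as np

def biplot_coords(X, alpha=1.0):
    # Column-centre the data matrix, then factor its rank-2 SVD
    # approximation as X ≈ G H', splitting the singular values
    # between row markers G and column markers H via alpha
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    G = U[:, :2] * s[:2] ** alpha          # row (observation) coordinates
    H = Vt[:2].T * s[:2] ** (1 - alpha)    # column (variable) coordinates
    return G, H
```

Plotting the rows of G as points and the rows of H as arrows from the origin on the same axes gives the two-dimensional display described above; inner products between row and column markers approximate the entries of the centred data matrix.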
- ItemOpen AccessBreeding production of Cape gannets Morus capensis at Malgas Island, 2002-03(2006) Staverees, Linda; Underhill, Les; Crawford, RJMIncludes bibliographical references.
- ItemOpen AccessBuilding a question answering system for the introduction to statistics course using supervised learning techniques(2020) Leonhardt, Waldo; Er, Sebnem; Scott, LeanneQuestion Answering (QA) is the task of automatically generating an answer to a question asked by a human in natural language. Open-domain QA is still a difficult problem to solve even after 60 years of research in this field, as trying to answer questions which cover a wide range of subjects is a complex matter. Closed-domain QA is, on the other hand, more achievable, as the context for asking questions is restricted and allows for more accurate interpretation. This dissertation explores how a QA system could be built for the Introduction to Statistics course taught online at the University of Cape Town (UCT), for the purpose of answering administrative queries. This course runs twice a year and students tend to ask similar administrative questions each time the course is run. If a QA system can successfully answer these questions automatically, it would save lecturers the time of answering them manually and enable students to receive answers immediately. For a machine to be able to interpret natural language questions, methods are needed to transform text into numbers while still preserving the meaning of the text. The field of Natural Language Processing (NLP) offers the building blocks for such methods, which have been used in this study. After predicting the category of a new question using Multinomial Logistic Regression (MLR), the past question that is most similar to the new question is retrieved and its answer is used for the new question. Five classifiers were compared to see which provides the best results for categorising a new question: Naive Bayes, Logistic Regression, Support Vector Machines, Stochastic Gradient Descent and Random Forests. The cosine similarity method was used to find the most similar past question.
The Round-Trip Translation (RTT) technique was explored as an augmentation method for text, in an attempt to increase the dataset size. Methods were compared using the initial base dataset of 744 questions against the extended dataset of 6 614 questions, which was generated as a result of the RTT technique. In addition to these two datasets, features from Bag-of-Words (BoW), Term Frequency times Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDiA), pre-trained Global Vector (GloVe) word embeddings and custom-engineered features were also compared. This study found that a model using an MLR classifier with TF-IDF unigram and bigram features (built on the smaller dataset of 744 questions) performed the best, with a test F1-measure of 84.8%. Models using a Stochastic Gradient Descent classifier also performed very well with a variety of features, indicating that Stochastic Gradient Descent is the most versatile classifier to use. No significant improvements were found using the extended RTT dataset of 6 614 questions, although this dataset was used by the model that ranked eighth. A simulator was also built to illustrate and test how a bot (an autonomous program on a network that is able to interact with users) can be used to facilitate the auto-answering of student questions. This simulator proved very useful and helped to identify the fact that questions relating to the Course Information Pack had been excluded from the data initially sourced, as students had been asking such questions through other platforms. Building a QA system using a small dataset proved to be very challenging. Restricting the domain of questions and focusing only on administrative queries was helpful. A great deal of data cleaning was needed, and all past answers had to be rewritten and standardised, as the raw answers were too specific and did not generalise well.
The features that performed the best for cosine similarity, and hence for extracting the most similar past question, were LSA topics built from TF-IDF unigram features. Using LSA topics as the input for cosine similarity, instead of the raw TF-IDF features, resolved the “curse of dimensionality”. Issues with cosine similarity were observed in cases where it favoured short documents, which often led to the selection of the wrong past question. As an alternative, the use of more advanced language-modelling-based similarity measures is suggested for future study. Either pre-trained word embeddings such as GloVe could be used as a language model, or a custom language model could be trained. A generic UCT language model could be valuable, and it would be preferable to build such a language model using the entire digital content of Vula across all faculties where students converse, ask questions or post comments. Building a QA system using this UCT language model is foreseen to offer better results, as terms like “Vula”, “DP”, “SciLab” and “jdlt1” would be endowed with more meaning.
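The retrieval step described in this abstract, TF-IDF features combined with cosine similarity over past questions, can be illustrated with a dependency-free sketch. This is not the dissertation's code: the study used unigram-and-bigram TF-IDF and LSA topics, whereas this toy version uses whitespace tokens and unigrams only, and the function names are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Term frequency times inverse document frequency, one sparse
    # dict per document; a smoothed "+1" keeps common terms nonzero
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    df = Counter(term for toks in tokenised for term in set(toks))
    idf = {t: math.log(n / c) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenised]

def cosine(u, v):
    # Cosine similarity between two sparse term-weight dicts
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def most_similar(new_q, past_qs):
    # Vectorise the corpus plus the new question together, then
    # return the index of the closest past question
    vecs = tfidf_vectors(past_qs + [new_q])
    query, past = vecs[-1], vecs[:-1]
    return max(range(len(past)), key=lambda i: cosine(query, past[i]))
```

In the full system, the stored answer of the retrieved past question would then be returned to the student; the dissertation's observation that cosine similarity can favour short documents is visible here too, since shorter past questions have smaller norms.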
- ItemOpen AccessBusiness process modelling and simulation with application to a start-up actuarial firm(2015) Gweshe, Tatenda Mark; Scott, LeanneIn our research, we set out to model, understand and evaluate the business process at a start-up actuarial firm which employs Report Writers (RWers) who specialise in quantifying actuarial matters. We simulated various "what-if" and extreme scenarios relating to (1) the impact of qualitative variables (stress, morale and health) on RWer productivity, (2) hiring policies for RWers who have various skill sets, (3) the allocation of RWers to various roles within the process, (4) the impact that a high turnover of experienced RWers has on productivity, and (5) the impact of introducing a flexible working arrangement (flexitime). This was done through business process modelling and simulation. The business process we modelled was governed by numerous potentially complex inter-relationships between variables, which we believed could give rise to significant feedback loops. The models we built were then simulated over a period of 3 to 7 years to gain insights into the behavioural trends of the firm's business process over time when subject to "what-if" scenarios and policy implementations. The model simulations allowed us to understand the behaviour of the processes over time, and to identify the key variables and relationships involved in bringing about such behaviour as certain variables were subjected to changes in levels, as set out in our objectives. We made use of relevant literature, expert opinion, past data, questionnaires and cognitive mapping techniques to build the simulation models. Guided by methodologies used in the literature on modelling qualitative variables, and bearing in mind the dangers of modelling them, we modelled the complex inter-relationships between qualitative and quantitative variables.
- ItemOpen AccessCalculation of calibration factors from the comparative fishing trial between FRS Africana and RV Dr Fridtjof Nansen(2008) Antony, Luyanda Lennox; Dunne, Tim; Leslie, Rob WIncludes abstract. Includes bibliographical references (leaves 153-157).